Hi Hailong,
Most probably your server is overloaded and cannot serve any more requests.
What is the utilization of the server and how many back-end handlers are you
running?
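(If it helps: one quick way to count the back-end handler threads, assuming you
have shell access to the back-end node, is to dump the search server's thread
stacks and count the Hadoop IPC handlers, e.g.

    jstack <backend-server-pid> | grep -c "IPC Server handler"

The thread names should match the "IPC Server handler N on 8890" lines that
appear in your hadoop.log.)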
Regards,
-Stavros.
On Oct 23, 2012, at 8:13 PM, Hailong Yang wrote:
Hi Stavros,
I think you are right. Now I can see all the getSummary RPC calls. Thank you
very much for clearing things up. One more question: how can I increase the
queries/sec sent from the Faban client so that I can saturate the backend server?
Right now I have 5 Faban agents simulating 600 users (<fa:scale>600</fa:scale>),
and I only get 400 queries/sec. I have tried increasing the fa:scale as well
as the number of Faban agents; however, as soon as I go beyond 600 users, I
get the following error during the benchmark execution:
Oct 23, 2012 10:33:04 AM com.sun.faban.driver.engine.AgentThread logError
WARNING: SearchDriverAgent[3].282.doGet: Connection refused
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.Socket.connect(Socket.java:529)
at com.sun.faban.driver.transport.util.TimedSocket.connect(TimedSocket.java:292)
at java.net.Socket.connect(Socket.java:478)
at com.sun.faban.driver.transport.sunhttp.HttpClient.doConnect(HttpClient.java:171)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
at com.sun.faban.driver.transport.sunhttp.HttpClient.<init>(HttpClient.java:119)
at com.sun.faban.driver.transport.sunhttp.HttpClient.New(HttpClient.java:100)
at com.sun.faban.driver.transport.sunhttp.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:99)
at com.sun.faban.driver.transport.sunhttp.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:41)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1172)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
at com.sun.faban.driver.transport.sunhttp.SunHttpTransport.fetchResponse(SunHttpTransport.java:617)
at com.sun.faban.driver.transport.sunhttp.SunHttpTransport.fetchURL(SunHttpTransport.java:383)
at com.sun.faban.driver.transport.sunhttp.SunHttpTransport.fetchURL(SunHttpTransport.java:402)
at com.sun.faban.driver.transport.sunhttp.SunHttpTransport.fetchURL(SunHttpTransport.java:440)
at sample.searchdriver.SearchDriver.doGet(SearchDriver.java:140)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.sun.faban.driver.engine.TimeThread.doRun(TimeThread.java:169)
at com.sun.faban.driver.engine.AgentThread.run(AgentThread.java:202)
Does that mean I have already reached the upper limit of connections that the
Faban driver can support? Or have I missed some important configuration needed
to support more concurrent connections?
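In case it is relevant, my (possibly naive) guesses at where the limit could be
are the open-file limit and ephemeral-port pressure on the agent machines, e.g.
checked with

    ulimit -n                        # per-process open-file limit on each agent
    netstat -an | grep -c TIME_WAIT  # sockets stuck in TIME_WAIT during the run

or, assuming the front end is a stock Tomcat, the maxThreads and acceptCount
attributes of the Connector in its conf/server.xml. Are any of these the right
knobs, or is the limit inside the Faban driver itself?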
Best
Hailong
On Tue, Oct 23, 2012 at 1:21 AM, Volos Stavros
<[email protected]> wrote:
Hi Hailong,
Try to set log4j.logger.org.apache.hadoop=INFO.
The class ipc.RPC is part of the Hadoop distribution; hence, the flag above
controls the logging level of this class.
The Faban client sends its queries to the same web interface, so there
shouldn't be any disparity.
I expect that you will see all the ipc.RPC calls as soon as you set the
logger.hadoop flag to INFO.
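That is, in the conf/log4j.properties of the back-end Nutch instance, the two
relevant lines would read:

    log4j.logger.org.apache.nutch=INFO
    log4j.logger.org.apache.hadoop=INFO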
Please let me know the outcome.
Regards,
-Stavros.
________________________________________
From: Hailong Yang [[email protected]]
Sent: Tuesday, October 23, 2012 1:12 AM
To: Volos Stavros
Cc: [email protected]<mailto:[email protected]>; Lingjia Tang;
Jason Mars
Subject: Re: How to fit the index into the memory for the web search benchmark
Yes, it is a part of the hadoop.log. I have attached the log4j.properties file.
I noticed the following two lines might be suspicious.
log4j.logger.org.apache.nutch=INFO
log4j.logger.org.apache.hadoop=WARN
So which one controls the log level of the backend during the search? If it is
the latter, maybe we can change its level to INFO so that we can get more
information.
BTW, are you using the built-in web portal to issue the queries? I looked
through the source code of the front page (search.jsp) and found the
following snippet.
Hits hits;
try {
    query.getParams().initFrom(start + hitsToRetrieve, hitsPerSite, "site",
                               sort, reverse);
    hits = bean.search(query);
} catch (IOException e) {
    hits = new Hits(0, new Hit[0]);
}
int end = (int) Math.min(hits.getLength(), start + hitsPerPage);
int length = end - start;
int realEnd = (int) Math.min(hits.getLength(), start + hitsToRetrieve);
Hit[] show = hits.getHits(start, realEnd - start);
HitDetails[] details = bean.getDetails(show);
Summary[] summaries = bean.getSummary(details, query);
bean.LOG.info("total hits: " + hits.getTotal());
So that means if you are using the web portal to perform the search, you will
trigger the getSummary method. However, I am only using the search driver
provided by the Faban client, which contains no logic to access the raw data of
the indexes. If that is the case, then it explains the disparity.
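For context, my understanding of the driver's operation is roughly the
following (a paraphrased sketch of sample.searchdriver.SearchDriver.doGet; the
URL, query string, and helper names are my own placeholders, not the actual
source):

    @BenchmarkOperation(name = "GET", max90th = 0.5, timing = Timing.AUTO)
    public void doGet() throws Exception {
        // The driver only issues an HTTP GET for the results page and reads
        // the response body; it never calls getDetails/getSummary itself, so
        // any summary traffic would have to originate inside the portal
        // (search.jsp).
        StringBuilder response =
            http.fetchURL(hostUrl + "/search.jsp?query=" + nextQuery());
    }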
Best
Hailong
On Mon, Oct 22, 2012 at 3:30 PM, Volos Stavros
<[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>>
wrote:
Dear Hailong,
Is this a part of the hadoop.log content?
It looks like the logging parameters in the log4j.properties file are not
properly set. Can you please attach the conf/log4j.properties file of the Nutch
distribution you are using for search (the instance that is running on the
backend node)?
This is what my hadoop.log looks like. As you can see, ipc.RPC logs all the
calls (search, getDetails, getSummary):
2012-05-25 20:05:26,614 INFO ipc.RPC - Call: search(complete)
2012-05-25 20:05:26,614 INFO searcher.NutchBean - searching for 20 raw hits
2012-05-25 20:05:26,614 INFO searcher.NutchBean - re-searching for 40 raw
hits, query: complete
-site:"ajc.stats.com"
2012-05-25 20:05:26,615 INFO searcher.NutchBean - found 14155 raw hits
2012-05-25 20:05:26,615 DEBUG ipc.Server - Served: search queueTime= 186
procesingTime= 1
2012-05-25 20:05:26,615 INFO ipc.RPC - Return:
org.apache.nutch.searcher.Hits@1c210f97
2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder: responding to
#462940 from 192.168.9.135:51920
2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder: responding to
#462940 from 192.168.9.135:51920 Wrote 409 bytes.
2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server handler 2 on 8890: has
#462941 from 192.168.9.135:44300
2012-05-25 20:05:26,615 INFO ipc.RPC - Call:
getDetails([Lorg.apache.nutch.searcher.Hit;@3e869...
2012-05-25 20:05:26,615 DEBUG ipc.Server - got #463094
2012-05-25 20:05:26,615 DEBUG ipc.Server - Served: getDetails queueTime= 187
procesingTime= 0
2012-05-25 20:05:26,615 DEBUG ipc.Server - Served: getSummary queueTime= 187
procesingTime= 4
2012-05-25 20:05:26,615 INFO ipc.RPC - Return:
[Lorg.apache.nutch.searcher.HitDetails;@7495195...
2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder: responding to
#462941 from 192.168.9.135:44300
2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder: responding to
#462941 from 192.168.9.135:44300 Wrote 3379 bytes.
2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server handler 2 on 8890: has
#462942 from 192.168.9.135:51921
2012-05-25 20:05:26,615 INFO ipc.RPC - Call:
getSummary([Lorg.apache.nutch.searcher.HitDetails...
Regards,
-Stavros.
On Oct 22, 2012, at 8:14 PM, Hailong Yang wrote:
Dear Stavros,
I could not verify what you said at the backend. What I saw at the backend was
almost the same as at the frontend. There was no indication in the log file that
the getSummary method was triggered. I also looked into the source code of the
driver; for each query it sends an HTTP request to the backend. However, the
response may not necessarily include the contents in its message body. Could
you refer me to the logs where you see the detailed contents for each query?
2012-10-22 10:03:47,995 INFO searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:47,996 INFO searcher.NutchBean - re-searching for 40 raw
hits, query: personal "at a"
2012-10-22 10:03:47,997 INFO searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:47,997 INFO searcher.NutchBean - re-searching for 80 raw
hits, query: personal "at a"
2012-10-22 10:03:47,997 INFO searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:47,997 INFO searcher.NutchBean - re-searching for 160 raw
hits, query: personal "at a"
2012-10-22 10:03:47,998 INFO searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:48,037 INFO searcher.NutchBean - searching for 20 raw hits
2012-10-22 10:03:48,038 INFO searcher.NutchBean - re-searching for 40 raw
hits, query: personal "at a"
2012-10-22 10:03:48,038 INFO searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:48,038 INFO searcher.NutchBean - re-searching for 80 raw
hits, query: personal "at a"
2012-10-22 10:03:48,039 INFO searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:48,039 INFO searcher.NutchBean - re-searching for 160 raw
hits, query: personal "at a"
2012-10-22 10:03:48,040 INFO searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:48,041 INFO searcher.NutchBean - searching for 20 raw hits
2012-10-22 10:03:48,042 INFO searcher.NutchBean - re-searching for 40 raw
hits, query: personal "at a"
2012-10-22 10:03:48,042 INFO searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:48,042 INFO searcher.NutchBean - re-searching for 80 raw
hits, query: personal "at a"
2012-10-22 10:03:48,043 INFO searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:48,043 INFO searcher.NutchBean - re-searching for 160 raw
hits, query: personal "at a"
2012-10-22 10:03:48,044 INFO searcher.NutchBean - found 1 raw hits
Best
Hailong
On Mon, Oct 22, 2012 at 9:43 AM, Hailong Yang
<[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>>
wrote:
Dear Stavros,
Thank you very much for your reply. I will go through that log right away.
Best
Hailong
On Mon, Oct 22, 2012 at 4:07 AM, Volos Stavros
<[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>>
wrote:
Dear Hailong,
The frontend asks for summaries of the top documents. A backend node will
receive a getSummary request for every top document it owns. You can go through
the logs of the backend node and verify that the node does receive getSummary
requests.
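For example, something along these lines against the back-end's log should list
them (the log path may differ in your setup):

    grep "ipc.RPC - Call: getSummary" logs/hadoop.log | head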
Regards,
-Stavros.
________________________________________
From: Hailong Yang
[[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>]
Sent: Monday, October 22, 2012 10:38 AM
To: Volos Stavros
Cc: [email protected];
Lingjia Tang; Jason Mars
Subject: Re: How to fit the index into the memory for the web search benchmark
Dear Stavros,
I am confused about why we need to bring the segments into memory. I examined
the log file from the front-end server, which recorded the queries sent to and
the responses received from the Nutch server. The log file showed that the
Nutch server only replied with how many hits were found in the crawled dataset,
without being asked for the details of the page contents. So that means when
orchestrating the search, the NutchBean object never needs to call the
getSummary method that accesses the segments to retrieve the page contents.
That is also to say we don't need to care whether the segments can fit into
memory for this specific web search workload in CloudSuite, right?
Please correct me if I am wrong.
Best
Hailong
On Sun, Oct 21, 2012 at 9:04 AM, Volos Stavros
<[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>><mailto:[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>>>
wrote:
Dear Hailong,
The reason you get I/O activity is that the segments don't fit into memory.
I would recommend reducing the size of your index so that indexes+segments
occupy roughly 16GB.
This is relatively easy to do if you used multiple reducer tasks (during the
crawling phase) to create multiple partitions.
(See Notes at http://parsa.epfl.ch/cloudsuite/search.html: the
mapred.reduce.tasks property determines how many index and segment partitions
will be created.)
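For example (illustrative value; mapred.reduce.tasks can be set in
conf/nutch-site.xml, or on the command line, before the crawl/indexing step):

    <property>
      <name>mapred.reduce.tasks</name>
      <value>4</value>
    </property>

With, say, 4 partitions you can then serve only the subset of index+segment
partitions that fits within the 16GB.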
Regards,
-Stavros.
________________________________________
From: Hailong Yang
[[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>><mailto:[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>>]
Sent: Friday, October 19, 2012 8:03 PM
To: Volos Stavros
Cc: [email protected];
Lingjia Tang; Jason Mars
Subject: Re: How to fit the index into the memory for the web search benchmark
Dear Stavros,
Thank you for your reply. I understand the data structures required during the
search. The 6GB is only the size of the actual index (the indexes directory).
The whole dataset, including the segments, accounts for 30GB.
Best
Hailong
On Fri, Oct 19, 2012 at 9:03 AM, Volos Stavros
<[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>><mailto:[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>><mailto:[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>><mailto:[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>>>>
wrote:
Dear Hailong,
There are two components that are used when performing a query against the
index serving node:
(a) the actual index (under indexes)
(b) segments (under segments)
What exactly does the 6GB figure cover? Are you including the segments as well?
Regards,
-Stavros.
________________________________________
From: Hailong Yang
[[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>><mailto:[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>><mailto:[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>><mailto:[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>>>]
Sent: Wednesday, October 17, 2012 4:51 AM
To: [email protected]
Cc: Lingjia Tang; Jason Mars
Subject: How to fit the index into the memory for the web search benchmark
Hi CloudSuite,
I am experimenting with the web search benchmark. However, I am wondering how
to fit the index into memory in order to avoid unnecessary disk access. I
have a 6GB index crawled from Wikipedia and the RAM is 16GB. During the
workload execution, I noticed periodic 2% increases in I/O utilization, and
the memory used by the Nutch server was always less than 500MB. So I guess the
whole index is not brought into memory by default before serving the search
queries, right? Could you tell me how to do that exactly as you did in the
Clearing the Clouds paper? Thanks!
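For instance, would simply pre-reading the data into the OS page cache before
the run be the intended approach, something like

    find crawl/indexes crawl/segments -type f -exec cat {} + > /dev/null

(the crawl/ paths are from my setup), or does the Nutch server need to be
configured to load the index itself?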
Best
Hailong