Hi Stavros,

I think you are right. Now I can see all the getSummary RPC calls. Thank
you very much for clearing things up. One more question: how can I
increase the queries/sec sent from the faban client so that I can saturate
the backend server? Right now I have 5 faban agents simulating 600
users (<fa:scale>600</fa:scale>), and I only get 400 queries/sec. I have
tried to increase fa:scale as well as the number of faban agents;
however, as soon as I go beyond 600 users, I get the following error
during the benchmark execution:

Oct 23, 2012 10:33:04 AM com.sun.faban.driver.engine.AgentThread logError
WARNING: SearchDriverAgent[3].282.doGet: Connection refused
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
        at
java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
        at java.net.Socket.connect(Socket.java:529)
        at
com.sun.faban.driver.transport.util.TimedSocket.connect(TimedSocket.java:292)
        at java.net.Socket.connect(Socket.java:478)
        at
com.sun.faban.driver.transport.sunhttp.HttpClient.doConnect(HttpClient.java:171)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
        at
com.sun.faban.driver.transport.sunhttp.HttpClient.<init>(HttpClient.java:119)
        at
com.sun.faban.driver.transport.sunhttp.HttpClient.New(HttpClient.java:100)
        at
com.sun.faban.driver.transport.sunhttp.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:99)
        at
com.sun.faban.driver.transport.sunhttp.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:41)
        at
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
        at
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
        at
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1172)
        at
java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
        at
com.sun.faban.driver.transport.sunhttp.SunHttpTransport.fetchResponse(SunHttpTransport.java:617)
        at
com.sun.faban.driver.transport.sunhttp.SunHttpTransport.fetchURL(SunHttpTransport.java:383)
        at
com.sun.faban.driver.transport.sunhttp.SunHttpTransport.fetchURL(SunHttpTransport.java:402)
        at
com.sun.faban.driver.transport.sunhttp.SunHttpTransport.fetchURL(SunHttpTransport.java:440)
        at sample.searchdriver.SearchDriver.doGet(SearchDriver.java:140)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.sun.faban.driver.engine.TimeThread.doRun(TimeThread.java:169)
        at com.sun.faban.driver.engine.AgentThread.run(AgentThread.java:202)

Does that mean I have already reached the upper limit of connections that
the faban driver can support? Or have I missed some important
configuration needed to support more concurrent connections?
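
To check whether the refusals come from the driver side or from the
server side, I also put together a small standalone probe (a rough
sketch for my setup; the frontend URL, port, and thread count below are
placeholders, not anything taken from the Faban distribution):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    // Opens many concurrent connections against the search frontend and
    // counts how many are refused, independently of the faban driver.
    public class ConnectProbe {
        public static void main(String[] args) throws Exception {
            final String target =
                "http://frontend:8080/search.jsp?query=test"; // placeholder URL
            final int threads = 700; // just beyond where the errors start
            final AtomicInteger refused = new AtomicInteger();
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < threads; i++) {
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            HttpURLConnection conn = (HttpURLConnection)
                                new URL(target).openConnection();
                            conn.getResponseCode(); // forces the TCP connect
                            InputStream in = conn.getInputStream();
                            while (in.read() != -1) { } // drain the response
                            in.close();
                        } catch (java.net.ConnectException e) {
                            refused.incrementAndGet(); // same error as above
                        } catch (Exception e) {
                            // timeouts etc. are not what we are measuring
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.MINUTES);
            System.out.println("refused " + refused.get() + " of " + threads);
        }
    }

My thinking is: if this probe also sees Connection refused at roughly the
same concurrency, the limit is on the server side (the accept backlog or
the connector thread pool of the frontend) rather than in the faban
driver itself.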

Best

Hailong

On Tue, Oct 23, 2012 at 1:21 AM, Volos Stavros <[email protected]> wrote:

> Hi Hailong,
>
> Try to set log4j.logger.org.apache.hadoop=INFO.
>
> The class ipc.RPC is part of the hadoop distribution; hence, the above
> flag affects the logging level of this class.
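>
> Concretely, in the conf/log4j.properties you attached, this just flips
> the existing hadoop line from WARN to INFO:
>
>   log4j.logger.org.apache.nutch=INFO
>   log4j.logger.org.apache.hadoop=INFO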
>
> The faban client issues queries to the web interface; therefore, there
> shouldn't be any disparity.
>
> I expect that you will see all the ipc.RPC calls as soon as you set the
> logger.hadoop flag to INFO.
>
> Please let me know the outcome.
>
> Regards,
> -Stavros.
> ________________________________________
> From: Hailong Yang [[email protected]]
> Sent: Tuesday, October 23, 2012 1:12 AM
> To: Volos Stavros
> Cc: [email protected]; Lingjia Tang; Jason Mars
> Subject: Re: How to fit the index into the memory for the web search
> benchmark
>
> Yes, it is part of the hadoop.log. I have attached the log4j.properties
> file. I noticed the following two lines might be suspicious:
>
> log4j.logger.org.apache.nutch=INFO
> log4j.logger.org.apache.hadoop=WARN
>
> Which one controls the log level of the backend during the search? If it
> is the latter, maybe we can change the level to INFO so that we can get
> more information.
>
> BTW, are you using the built-in web portal to issue the queries? I looked
> through the source code of the front page (search.jsp) and found the
> following snippet:
>
> Hits hits;
> try {
>   query.getParams().initFrom(start + hitsToRetrieve, hitsPerSite,
>       "site", sort, reverse);
>   hits = bean.search(query);
> } catch (IOException e) {
>   hits = new Hits(0, new Hit[0]);
> }
> int end = (int) Math.min(hits.getLength(), start + hitsPerPage);
> int length = end - start;
> int realEnd = (int) Math.min(hits.getLength(), start + hitsToRetrieve);
>
> Hit[] show = hits.getHits(start, realEnd - start);
> HitDetails[] details = bean.getDetails(show);
> Summary[] summaries = bean.getSummary(details, query);
> bean.LOG.info("total hits: " + hits.getTotal());
>
> So that means if you are using the web portal to perform the search, you
> will trigger the getSummary method. However, I am only using the search
> driver provided by the faban client, which contains no logic to access the
> raw data of the indexes. If that is the case, then it explains the
> disparity.
>
> Best
>
> Hailong
>
> On Mon, Oct 22, 2012 at 3:30 PM, Volos Stavros <[email protected]> wrote:
> Dear Hailong,
>
> Is this a part of the hadoop.log content?
>
> It looks like the logging parameters in the log4j.properties file are not
> properly set. Can you please attach the conf/log4j.properties file of the
> Nutch distribution you are using for search (the instance that is running
> on the backend node)?
>
> This is what my hadoop.log looks like. As you can see, ipc.RPC logs all
> the calls (search, getDetails, getSummary):
>
> 2012-05-25 20:05:26,614 INFO  ipc.RPC - Call: search(complete)
> 2012-05-25 20:05:26,614 INFO  searcher.NutchBean - searching for 20 raw
> hits
> 2012-05-25 20:05:26,614 INFO  searcher.NutchBean - re-searching for 40 raw
> hits, query: complete -site:"ajc.stats.com"
> 2012-05-25 20:05:26,615 INFO  searcher.NutchBean - found 14155 raw hits
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - Served: search queueTime= 186
> procesingTime= 1
> 2012-05-25 20:05:26,615 INFO  ipc.RPC - Return:
> org.apache.nutch.searcher.Hits@1c210f97
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder:
> responding to #462940 from 192.168.9.135:51920
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder:
> responding to #462940 from 192.168.9.135:51920
> Wrote 409 bytes.
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server handler 2 on 8890:
> has #462941 from 192.168.9.135:44300
> 2012-05-25 20:05:26,615 INFO  ipc.RPC - Call:
> getDetails([Lorg.apache.nutch.searcher.Hit;@3e869...
> 2012-05-25 20:05:26,615 DEBUG ipc.Server -  got #463094
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - Served: getDetails queueTime=
> 187 procesingTime= 0
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - Served: getSummary queueTime=
> 187 procesingTime= 4
> 2012-05-25 20:05:26,615 INFO  ipc.RPC - Return:
> [Lorg.apache.nutch.searcher.HitDetails;@7495195...
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder:
> responding to #462941 from 192.168.9.135:44300
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder:
> responding to #462941 from 192.168.9.135:44300
> Wrote 3379 bytes.
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server handler 2 on 8890:
> has #462942 from 192.168.9.135:51921
> 2012-05-25 20:05:26,615 INFO  ipc.RPC - Call:
> getSummary([Lorg.apache.nutch.searcher.HitDetails...
>
> Regards,
> -Stavros.
>
> On Oct 22, 2012, at 8:14 PM, Hailong Yang wrote:
>
> Dear Stavros,
>
> I could not verify what you said at the backend. What I saw at the backend
> was almost the same as at the frontend. There was no indication in the log
> file that the getSummary method was triggered. I also looked into the
> source code of the driver; for each query it sends an HTTP request to the
> backend. However, the response may not necessarily include the contents in
> its message body. Could you refer me to the logs where you see the detailed
> contents for each query?
>
> 2012-10-22 10:03:47,995 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:47,996 INFO  searcher.NutchBean - re-searching for 40 raw
> hits, query: personal "at a"
> 2012-10-22 10:03:47,997 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:47,997 INFO  searcher.NutchBean - re-searching for 80 raw
> hits, query: personal "at a"
> 2012-10-22 10:03:47,997 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:47,997 INFO  searcher.NutchBean - re-searching for 160
> raw hits, query: personal "at a"
> 2012-10-22 10:03:47,998 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:48,037 INFO  searcher.NutchBean - searching for 20 raw
> hits
> 2012-10-22 10:03:48,038 INFO  searcher.NutchBean - re-searching for 40 raw
> hits, query: personal "at a"
> 2012-10-22 10:03:48,038 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:48,038 INFO  searcher.NutchBean - re-searching for 80 raw
> hits, query: personal "at a"
> 2012-10-22 10:03:48,039 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:48,039 INFO  searcher.NutchBean - re-searching for 160
> raw hits, query: personal "at a"
> 2012-10-22 10:03:48,040 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:48,041 INFO  searcher.NutchBean - searching for 20 raw
> hits
> 2012-10-22 10:03:48,042 INFO  searcher.NutchBean - re-searching for 40 raw
> hits, query: personal "at a"
> 2012-10-22 10:03:48,042 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:48,042 INFO  searcher.NutchBean - re-searching for 80 raw
> hits, query: personal "at a"
> 2012-10-22 10:03:48,043 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:48,043 INFO  searcher.NutchBean - re-searching for 160
> raw hits, query: personal "at a"
> 2012-10-22 10:03:48,044 INFO  searcher.NutchBean - found 1 raw hits
>
> Best
>
> Hailong
>
> On Mon, Oct 22, 2012 at 9:43 AM, Hailong Yang <[email protected]> wrote:
> Dear Stavros,
>
> Thank you very much for your reply. I will go through that log right away.
>
> Best
>
> Hailong
>
>
> On Mon, Oct 22, 2012 at 4:07 AM, Volos Stavros <[email protected]> wrote:
> Dear Hailong,
>
> The frontend will ask for the summaries of the top documents. A backend
> node will receive a getSummary request for every top document it owns. You
> can go through the logs of the backend node and verify that the node does
> receive getSummary requests.
>
> Regards,
> -Stavros.
> ________________________________________
> From: Hailong Yang [[email protected]]
> Sent: Monday, October 22, 2012 10:38 AM
> To: Volos Stavros
> Cc: [email protected]<mailto:[email protected]>; Lingjia
> Tang; Jason Mars
> Subject: Re: How to fit the index into the memory for the web search
> benchmark
>
> Dear Stavros,
>
> I am confused about why we need to bring the segments into memory. I
> examined the log file from the frontend server, which recorded the queries
> sent to and the responses received from the nutch server. The log file
> showed that the nutch server only reported how many hits were found in the
> crawled dataset and was never asked for the details of the page contents.
> That means when orchestrating the search, the NutchBean object never needs
> to call the getSummary method that accesses the segments to retrieve the
> page contents. In other words, we don't need to care whether the segments
> fit into memory for this specific web search workload in CloudSuite,
> right? Please correct me if I am wrong.
>
> Best
>
> Hailong
>
>
> On Sun, Oct 21, 2012 at 9:04 AM, Volos Stavros <[email protected]> wrote:
> Dear Hailong,
>
> The I/O activity you see is due to the fact that the segments don't fit
> into memory.
>
> I would recommend reducing the size of your index so that indexes+segments
> occupy roughly 16GB.
>
> This is relatively easy to do if you used multiple reducer tasks (during
> the crawling phase) to create multiple partitions.
>
> (see Notes at http://parsa.epfl.ch/cloudsuite/search.html: The
> mapred.reduce.tasks property
> determines how many index and segment partitions will be created.)
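>
> For example, a hadoop-style configuration snippet along these lines (the
> value 8 is only a placeholder; choose as many partitions as you need):
>
>   <property>
>     <name>mapred.reduce.tasks</name>
>     <value>8</value>
>   </property>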
>
> Regards,
> -Stavros.
> ________________________________________
> From: Hailong Yang [[email protected]]
> Sent: Friday, October 19, 2012 8:03 PM
> To: Volos Stavros
> Cc: [email protected]<mailto:[email protected]><mailto:
> [email protected]<mailto:[email protected]>>; Lingjia
> Tang; Jason Mars
> Subject: Re: How to fit the index into the memory for the web search
> benchmark
>
> Dear Stavros,
>
> Thank you for your reply. I understand the data structures required during
> the search. The 6GB is only the size of the actual index (the indexes
> directory). The whole dataset, including the segments, amounts to 30GB.
>
> Best
>
> Hailong
>
> On Fri, Oct 19, 2012 at 9:03 AM, Volos Stavros <[email protected]> wrote:
> Dear Hailong,
>
> There are two components that are used when performing a query against the
> index serving node:
> (a) the actual index (under indexes)
> (b) segments (under segments)
>
> What exactly is 6GB? Are you including the segments as well?
>
> Regards,
> -Stavros.
>
>
> ________________________________________
> From: Hailong Yang [[email protected]]
> Sent: Wednesday, October 17, 2012 4:51 AM
> To: [email protected]<mailto:[email protected]><mailto:
> [email protected]<mailto:[email protected]>><mailto:
> [email protected]<mailto:[email protected]><mailto:
> [email protected]<mailto:[email protected]>>>
> Cc: Lingjia Tang; Jason Mars
> Subject: How to fit the index into the memory for the web search benchmark
>
> Hi CloudSuite,
>
> I am experimenting with the web search benchmark. However, I am wondering
> how to fit the index into memory in order to avoid unnecessary disk
> access. I have a 6GB index crawled from wikipedia, and the RAM is 16GB.
> During the workload execution, I noticed periodic 2% increases in I/O
> utilization, and the memory used by the nutch server was always less than
> 500MB. So I guess the whole index is not brought into memory by default
> before serving the search queries, right? Could you tell me how to do that
> exactly as you did in the Clearing the Clouds paper? Thanks!
>
>
> Best
>
> Hailong
>
