Yes, it is part of the hadoop.log. I have attached the log4j.properties
file. The following two lines look suspicious to me:
log4j.logger.org.apache.nutch=INFO
log4j.logger.org.apache.hadoop=WARN
Which one controls the log level of the backend during the search? If
it is the latter, maybe we can raise the level to INFO so that we get
more information.
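
For example, something like the following in conf/log4j.properties
might surface the RPC-level activity you see. This is only a guess on
my side; I am not sure these are the exact logger names your backend
build uses:

    # show the ipc.RPC Call/Return lines (search, getDetails, getSummary)
    log4j.logger.org.apache.hadoop=INFO
    # and, if needed, the ipc.Server queueTime/processing lines as well
    log4j.logger.org.apache.hadoop.ipc=DEBUG
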
By the way, are you using the built-in web portal to issue the queries?
I looked through the source code of the front page (search.jsp) and
found the following snippet:

Hits hits;
try {
  // initialize paging/dedup parameters, then run the query
  query.getParams().initFrom(start + hitsToRetrieve, hitsPerSite,
                             "site", sort, reverse);
  hits = bean.search(query);
} catch (IOException e) {
  hits = new Hits(0, new Hit[0]);  // fall back to an empty result set
}
int end = (int) Math.min(hits.getLength(), start + hitsPerPage);
int length = end - start;
int realEnd = (int) Math.min(hits.getLength(), start + hitsToRetrieve);

// fetch the details and summaries for the hits to be displayed;
// getSummary is the call that reads the segments
Hit[] show = hits.getHits(start, realEnd - start);
HitDetails[] details = bean.getDetails(show);
Summary[] summaries = bean.getSummary(details, query);
bean.LOG.info("total hits: " + hits.getTotal());

So if you perform the search through the web portal, you will trigger
the getSummary method. I, however, am only using the search driver
provided by the Faban client, which contains no logic to access the raw
data of the indexes. If that is the case, it would explain the
disparity.
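
To illustrate, here is roughly how the two paths differ. This is just a
sketch built from the NutchBean calls in the snippet above; the driver
part reflects my reading of the Faban client, not a verbatim copy of
its code:

    // Portal path (from search.jsp above): summaries are requested,
    // so the backend must read the segments.
    Hits portalHits = bean.search(query);
    Hit[] show = portalHits.getHits(start, realEnd - start);
    HitDetails[] details = bean.getDetails(show);
    Summary[] summaries = bean.getSummary(details, query);  // segment reads

    // Driver path (my understanding): only the hit count is consumed,
    // so getSummary is never invoked and the segments are never read.
    Hits driverHits = bean.search(query);
    bean.LOG.info("total hits: " + driverHits.getTotal());  // index only
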

Best

Hailong

On Mon, Oct 22, 2012 at 3:30 PM, Volos Stavros <[email protected]> wrote:

>  Dear Hailong,
>
>  Is this a part of the hadoop.log content?
>
>  It looks like the logging parameters in the log4j.properties file are
> not properly set. Could you please attach the conf/log4j.properties file
> of the Nutch distribution you are using for search (the instance running
> on the backend node)?
>
>  This is what my hadoop.log looks like. As you can see, ipc.RPC logs all
> the calls (search, getDetails, getSummary):
>
>   *2012-05-25 20:05:26,614 INFO  ipc.RPC - Call: search(complete)*
> 2012-05-25 20:05:26,614 INFO  searcher.NutchBean - searching for 20 raw
> hits
> 2012-05-25 20:05:26,614 INFO  searcher.NutchBean - re-searching for 40 raw
> hits, query: complete -site:"ajc.stats.com"
> 2012-05-25 20:05:26,615 INFO  searcher.NutchBean - found 14155 raw hits
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - Served: search queueTime= 186
> procesingTime= 1
> 2012-05-25 20:05:26,615 INFO  ipc.RPC - Return:
> org.apache.nutch.searcher.Hits@1c210f97
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder:
> responding to #462940 from 192.168.9.135:51920
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder:
> responding to #462940 from 192.168.9.135:51920 Wrote 409 bytes.
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server handler 2 on 8890:
> has #462941 from 192.168.9.135:44300
> *2012-05-25 20:05:26,615 INFO  ipc.RPC - Call:
> getDetails([Lorg.apache.nutch.searcher.Hit;@3e869...*
> 2012-05-25 20:05:26,615 DEBUG ipc.Server -  got #463094
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - Served: getDetails queueTime=
> 187 procesingTime= 0
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - Served: getSummary queueTime=
> 187 procesingTime= 4
> 2012-05-25 20:05:26,615 INFO  ipc.RPC - Return:
> [Lorg.apache.nutch.searcher.HitDetails;@7495195...
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder:
> responding to #462941 from 192.168.9.135:44300
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder:
> responding to #462941 from 192.168.9.135:44300 Wrote 3379 bytes.
> 2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server handler 2 on 8890:
> has #462942 from 192.168.9.135:51921
> *2012-05-25 20:05:26,615 INFO  ipc.RPC - Call:
> getSummary([Lorg.apache.nutch.searcher.HitDetails...*
>
>  Regards,
> -Stavros.
>
>  On Oct 22, 2012, at 8:14 PM, Hailong Yang wrote:
>
> Dear Stavros,
>
>  I could not verify what you said at the backend. What I saw at the
> backend was almost the same as at the frontend: there was no indication
> in the log file that the getSummary method was invoked. I also looked
> into the source code of the driver; for each query it sends an HTTP
> request to the backend, but the response does not necessarily include
> the page contents in its message body. Could you point me to the logs
> where you see the detailed contents for each query?
>
>  2012-10-22 10:03:47,995 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:47,996 INFO  searcher.NutchBean - re-searching for 40 raw
> hits, query: personal "at a"
> 2012-10-22 10:03:47,997 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:47,997 INFO  searcher.NutchBean - re-searching for 80 raw
> hits, query: personal "at a"
> 2012-10-22 10:03:47,997 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:47,997 INFO  searcher.NutchBean - re-searching for 160
> raw hits, query: personal "at a"
> 2012-10-22 10:03:47,998 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:48,037 INFO  searcher.NutchBean - searching for 20 raw
> hits
> 2012-10-22 10:03:48,038 INFO  searcher.NutchBean - re-searching for 40 raw
> hits, query: personal "at a"
> 2012-10-22 10:03:48,038 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:48,038 INFO  searcher.NutchBean - re-searching for 80 raw
> hits, query: personal "at a"
> 2012-10-22 10:03:48,039 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:48,039 INFO  searcher.NutchBean - re-searching for 160
> raw hits, query: personal "at a"
> 2012-10-22 10:03:48,040 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:48,041 INFO  searcher.NutchBean - searching for 20 raw
> hits
> 2012-10-22 10:03:48,042 INFO  searcher.NutchBean - re-searching for 40 raw
> hits, query: personal "at a"
> 2012-10-22 10:03:48,042 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:48,042 INFO  searcher.NutchBean - re-searching for 80 raw
> hits, query: personal "at a"
> 2012-10-22 10:03:48,043 INFO  searcher.NutchBean - found 1 raw hits
> 2012-10-22 10:03:48,043 INFO  searcher.NutchBean - re-searching for 160
> raw hits, query: personal "at a"
> 2012-10-22 10:03:48,044 INFO  searcher.NutchBean - found 1 raw hits
>
>  Best
>
>  Hailong
>
> On Mon, Oct 22, 2012 at 9:43 AM, Hailong Yang <[email protected]> wrote:
>
>> Dear Stavros,
>>
>>  Thank you very much for your reply. I will go through that log right
>> away.
>>
>>  Best
>>
>>  Hailong
>>
>>
>> On Mon, Oct 22, 2012 at 4:07 AM, Volos Stavros <[email protected]> wrote:
>>
>>> Dear Hailong,
>>>
>>> The frontend will ask for the summaries of the top documents. A backend
>>> node will receive a getSummary request for every top document it owns.
>>> You can go through the logs of the backend node and verify that it does
>>> receive getSummary requests.
>>>
>>> Regards,
>>> -Stavros.
>>> ________________________________________
>>> From: Hailong Yang [[email protected]]
>>>  Sent: Monday, October 22, 2012 10:38 AM
>>> To: Volos Stavros
>>> Cc: [email protected]; Lingjia Tang; Jason Mars
>>> Subject: Re: How to fit the index into the memory for the web search
>>> benchmark
>>>
>>> Dear Stavros,
>>>
>>>  I am confused about why we need to bring the segments into memory. I
>>> examined the log file from the frontend server, which recorded the
>>> queries sent to and the responses received from the Nutch server. The
>>> log showed that the Nutch server only replied with how many hits were
>>> found in the crawled dataset, without being asked for the details of
>>> the page contents. That means that, when orchestrating the search, the
>>> NutchBean object never needs to call the getSummary method, which
>>> accesses the segments to retrieve the page contents. In other words, we
>>> don't need to care about whether the segments can fit into memory for
>>> this specific web search workload in CloudSuite, right? Please correct
>>> me if I am wrong.
>>>
>>> Best
>>>
>>> Hailong
>>>
>>>
>>>  On Sun, Oct 21, 2012 at 9:04 AM, Volos Stavros <[email protected]> wrote:
>>> Dear Hailong,
>>>
>>>  The reason you see I/O activity is that the segments don't fit into
>>> memory.
>>>
>>> I would recommend reducing the size of your index so that
>>> indexes+segments occupy roughly 16GB.
>>>
>>> This is relatively easy to do if you used multiple reducer tasks
>>> (during the crawling phase) to create multiple partitions.
>>>
>>> (see Notes at http://parsa.epfl.ch/cloudsuite/search.html: The
>>> mapred.reduce.tasks property
>>> determines how many index and segment partitions will be created.)
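>>>
>>> For example, you could set it before crawling in conf/nutch-site.xml
>>> (or whichever Hadoop config file your crawl job reads; the exact file
>>> depends on your setup):
>>>
>>>   <property>
>>>     <name>mapred.reduce.tasks</name>
>>>     <value>4</value>
>>>   </property>
>>>
>>> The idea is that with several partitions you can then serve only a
>>> subset of them, so that indexes+segments fit in your 16GB of RAM.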
>>>
>>> Regards,
>>> -Stavros.
>>> ________________________________________
>>>  From: Hailong Yang [[email protected]]
>>> Sent: Friday, October 19, 2012 8:03 PM
>>> To: Volos Stavros
>>>  Cc: [email protected]<mailto:[email protected]>;
>>> Lingjia Tang; Jason Mars
>>> Subject: Re: How to fit the index into the memory for the web search
>>> benchmark
>>>
>>> Dear Stavros,
>>>
>>> Thank you for your reply. I understand the data structures required
>>> during the search. The 6GB is only the size of the actual index (the
>>> indexes directory). The whole dataset, including the segments, amounts
>>> to 30GB.
>>>
>>> Best
>>>
>>> Hailong
>>>
>>>  On Fri, Oct 19, 2012 at 9:03 AM, Volos Stavros <[email protected]> wrote:
>>> Dear Hailong,
>>>
>>> Two components are used when performing a query against the
>>> index-serving node:
>>> (a) the actual index (under indexes)
>>> (b) the segments (under segments)
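>>>
>>> (In a typical crawl directory this looks roughly like the following,
>>> among other subdirectories:
>>>
>>>   crawl/
>>>     indexes/    <- (a) the actual index
>>>     segments/   <- (b) the raw page data used for summaries
>>> )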
>>>
>>> What exactly is 6GB? Are you including the segments as well?
>>>
>>> Regards,
>>> -Stavros.
>>>
>>>
>>> ________________________________________
>>>  From: Hailong Yang [[email protected]]
>>> Sent: Wednesday, October 17, 2012 4:51 AM
>>>  To: [email protected]
>>>  Cc: Lingjia Tang; Jason Mars
>>> Subject: How to fit the index into the memory for the web search
>>> benchmark
>>>
>>> Hi CloudSuite,
>>>
>>> I am experimenting with the web search benchmark, and I am wondering
>>> how to fit the index into memory in order to avoid unnecessary disk
>>> access. I have a 6GB index crawled from Wikipedia, and the machine has
>>> 16GB of RAM. During the workload execution, I noticed periodic 2%
>>> increases in I/O utilization, and the memory used by the Nutch server
>>> was always less than 500MB. So I guess the whole index is not brought
>>> into memory by default before the search queries are served, right?
>>> Could you tell me how to do that exactly as you did in the Clearing the
>>> Clouds paper? Thanks!
>>>
>>>
>>> Best
>>>
>>> Hailong
>>>
>>>
>>>
>>
>
>

Attachment: log4j.properties