Hi George,

I'm hoping something comes out of this discussion too, as our solr instance
is not running fast enough to query.

You can also see if you can use the stats util to delete some of the bots in
the logs. I've also noticed that the default spiders config doesn't include
matches for msnbot/bingbot. So we might need to build something for a
user-agent based search-and-destroy for those entries.
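
As a starting point for that kind of user-agent matching, here is a minimal sketch. The pattern list and the function name `is_spider_agent` are assumptions for illustration only; they are not part of the DSpace spiders config or the stats util.

```python
import re

# Hypothetical list of user-agent substrings to flag; extend as needed.
BOT_PATTERNS = ["msnbot", "bingbot"]

# One case-insensitive regex that matches any of the substrings above.
_BOT_RE = re.compile("|".join(re.escape(p) for p in BOT_PATTERNS), re.IGNORECASE)


def is_spider_agent(user_agent):
    """Return True if the user-agent string contains a known bot token."""
    return bool(user_agent) and _BOT_RE.search(user_agent) is not None
```

Records whose userAgent field matches could then be collected for deletion, however the delete itself ends up being done.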


You can get a count of how many records are in solr:
http://localhost:8080/solr/statistics/select?q=type:0+OR+type:2+OR+type:3+OR+type:4
Look at: <result name="response" numFound="9307348" start="0">
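
If you would rather script that check, the sketch below pulls numFound out of the XML response. The function names are made up for illustration, and the `rows=0` parameter is my addition so that Solr returns only the count and no documents; the URL is otherwise the one above.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Same query as above; rows=0 (an addition) asks Solr for the count only.
STATS_URL = ("http://localhost:8080/solr/statistics/select"
             "?q=type:0+OR+type:2+OR+type:3+OR+type:4&rows=0")


def parse_num_found(xml_text):
    """Extract numFound from a Solr XML response body."""
    root = ET.fromstring(xml_text)
    result = root.find(".//result[@name='response']")
    if result is None:
        raise ValueError("no <result name='response'> element found")
    return int(result.attrib["numFound"])


def count_records(url=STATS_URL):
    """Fetch the query URL and return the number of matching records."""
    with urlopen(url) as resp:
        return parse_num_found(resp.read())
```

Calling count_records() needs a running Solr at that address; parse_num_found() can be tried on a saved response.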

I don't know if solr has a performance break-off, where after a certain
number of documents the performance trails off, but I've noticed that giving
more memory always helps.

Our production server runs on a VM with 2.5GB of memory, and that's not
enough for all of the existing webapps and SOLR to work well. SOLR queries
are abysmally slow; however, the usage events are still being logged/inserted
just fine.

To test things out, I've copied our production /dspace/solr directory to my
workstation which has 6GB of memory, and queries run much faster, as in they
finish before a user turns the computer off. So the bitter pill could be
that you either will want to boost your production server with more memory,
or have a dedicated SOLR server.

One thing I've noticed about the solr index data files is that some of them
are really, really big. In our instance we have:
Size Filename
933M _3dd02.fdt
158M _3j75t.fdt
146M _3dd02.frq
136M _3dyoh.fdt
132M _3dtx2.fdt
132M _3dmlz.fdt
131M _3iyu2.fdt
...
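
A listing like the one above can be produced with a few lines of Python. This is only a sketch: the function name `largest_files` is made up, and you would point it at wherever your statistics index directory lives.

```python
import os


def largest_files(index_dir, top=7):
    """Return the `top` largest files in index_dir as (size_in_bytes, name)."""
    sizes = []
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if os.path.isfile(path):
            sizes.append((os.path.getsize(path), name))
    # Largest first
    sizes.sort(reverse=True)
    return sizes[:top]
```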

So it could be the pain that solr has to open an especially large 933MB file
from disk, and load it into memory. On memory limited machines this can
cause java out of memory errors, or just poor performance. I'm wondering if
there is some performance enhancement that can go into our implementation of
our Solr Client in dspace to create a more optimized index.

Another approach could be to limit the data we store in solr. We currently
store a bunch of things, and it all adds up, especially when you have 9
million of them.
Currently an example hit is:
<doc>
<str name="city">
Columbus
</str>
<str name="continent">
NA
</str>
<str name="countryCode">
US
</str>
<str name="dns">
ack5859s3.lib.ohio-state.edu.
</str>
<int name="id">
1256
</int>
<str name="ip">
128.146.175.194
</str>
<bool name="isBot">
false
</bool>
<float name="latitude">
40.029007
</float>
<float name="longitude">
-83.0809
</float>
<arr name="owningComm">
<int>
154
</int>
<int>
19
</int>
<int>
19
</int>
</arr>
<date name="time">
2010-06-14T16:39:33.586Z
</date>
<int name="type">
3
</int>
<str name="userAgent">
Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/533.4 (KHTML, like
Gecko) Chrome/5.0.366.2 Safari/533.4
</str>
</doc>

Perhaps after a certain time period (6 months) we could have a
super-optimize where we squash results per
community/collection/item/bitstream down to monthly/daily results. So instead
of determining that collection:1256 has 945 hits by finding all 945 records,
it might be more efficient to count 6 monthly aggregate records that each
have a value of a couple hundred hits. This approach would lose some of
the fine-grained quality of the search results that SOLR gives us, but it
would make the process much faster. For instance, we could run a query that
returned all the hits you've had from visitors running 64-bit linux in the
US, and gave the result per hour.
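
To make the squashing idea concrete, here is a rough sketch of a monthly roll-up. It assumes each event is a dict with the "type", "id" and "time" fields from the example hit above; the function name and the shape of the aggregate records are made up, and nothing like this exists in DSpace as far as I know.

```python
from collections import Counter
from datetime import datetime


def monthly_rollup(events):
    """Collapse individual hits into one record per (type, id, month).

    Assumes 'time' looks like 2010-06-14T16:39:33.586Z, as in the example doc.
    """
    counts = Counter()
    for ev in events:
        ts = datetime.strptime(ev["time"], "%Y-%m-%dT%H:%M:%S.%fZ")
        counts[(ev["type"], ev["id"], ts.strftime("%Y-%m"))] += 1
    return [
        {"type": t, "id": i, "month": m, "hits": n}
        for (t, i, m), n in sorted(counts.items())
    ]
```

Everything else on the record (ip, userAgent, city and so on) is dropped here, which is exactly the fine-grained detail the aggregate records would give up.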

I do really like all the data in SOLR, especially when we upgrade our system
to 1.7, which has discovery, for then we can combine statistics and real
data for the more interesting queries.
I hope we can make progress on our Solr implementation, however, I'm still
looking at things like Elastic Search and services like Loggly for the time
being.

--
Peter Dietz
Systems Developer/Engineer
Ohio State University Libraries



On Fri, Jan 14, 2011 at 11:53 AM, George Stanley Kozak <g...@cornell.edu> wrote:

> Hi…
>
>
>
> Last month I wrote about a problem that I was having with the Solr
> Statistics (I am using DSpace 1.6.2).  Starting in December, I was noticing
> that we were getting high CPU usage,  and I learned that the statistics were
> not appearing for users.  When they clicked the “View Statistics” button,
>  the browser seemed to “churn” and the stats never appeared.
>
>
>
> After the advice that I received from various people (thank you, all), I
> added the Solr patch and made the changes to the Solr Config file.  I
> recompiled things and then ran the “dspace stats-util –optimize” step.
> However, what I now have is just a partial fix.  I don’t have the out of
> control CPU usage, but the statistics never appear.  If a user presses the
> “View Statistics” button, the browser just seems to “churn” and the
> statistics never appear.
>
>
>
> For the time being, I have temporarily removed the “View Statistics” button
> from community-home.jsp, collection-home.jsp and display-item.jsp.
>
>
>
> Outside of what I have already done, does anyone have any other suggestions?
> By the way, the Statistics work OK on my test system (which has a smaller
> database and less traffic).
>
>
>
> George Kozak
>
> Digital Library Specialist
>
> Cornell University Library Information Technologies (CUL-IT)
>
> 501 Olin Library
>
> Cornell University
>
> Ithaca, NY 14853
>
> 607-255-8924
>
>
>
>
> ------------------------------------------------------------------------------
> Protect Your Site and Customers from Malware Attacks
> Learn about various malware tactics and how to avoid them. Understand
> malware threats, the impact they can have on your business, and how you
> can protect your company and customers by using code signing.
> http://p.sf.net/sfu/oracle-sfdevnl
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>
>