Re: [Dspace-general] How solr works with dpsace

Mark Diggory Mon, 19 Jul 2010 13:18:43 -0700

Hello Sauleha, Chris and everyone else interested in this topic,

I will comment that a number of individuals contacted me offline to offer
words of support on the Discovery activities presented at OR10. Chris, I
want to let you know that the work on Discovery is about using Solr as the
"service" for search and browse capabilities in DSpace. This does not
alleviate the need for good practices and detailed strategies for how to
organize search and browse fields for faceting in DSpace; it only
designates where such work should go on.

At this time, one thing I did not present on (that I wish I had) is that
Discovery takes a minimalist approach to indexing DSpace objects,
mapping the metadata fields and content verbatim, a strategy quite
"dissimilar" to that of the current Lucene implementation. The old DSpace
way of indexing was to create a set of properties in dspace.cfg that
mapped things like

Lucene Author Field == dc.contributor.author + dc.creator

While we could have done the same for Discovery, we chose instead to use
Solr to abstract this process away from DSpace entirely.  Thus DSpace just
issues the fields verbatim:

Item dc.contributor.author == Lucene dc.contributor.author

It is then left as an exercise for the Solr configuration to handle the
merging, which it manages quite well without our having to hardcode such
activities into the DSpace codebase itself.
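
To make the contrast concrete, here is a minimal Python sketch of the two
indexing strategies. It is only an illustration: the function names, the
"author" index field and the sample metadata are hypothetical, and this is
not DSpace or Discovery code.

```python
# Hypothetical legacy-style mapping: index field -> metadata fields folded
# into it (mirrors "Lucene Author Field == dc.contributor.author + dc.creator").
LEGACY_MAPPING = {"author": ["dc.contributor.author", "dc.creator"]}


def legacy_document(metadata, mapping):
    """Old approach: the client merges metadata into predefined index fields."""
    doc = {}
    for index_field, sources in mapping.items():
        values = [v for source in sources for v in metadata.get(source, [])]
        if values:
            doc[index_field] = values
    return doc


def verbatim_document(metadata):
    """Verbatim approach: every metadata field becomes an index field of
    the same name; any merging is left to the search server."""
    return {field: list(values) for field, values in metadata.items()}


# Made-up sample item metadata, for illustration only.
metadata = {
    "dc.contributor.author": ["Smith, J."],
    "dc.creator": ["Jones, A."],
    "dc.title": ["An example item"],
}

legacy = legacy_document(metadata, LEGACY_MAPPING)
verbatim = verbatim_document(metadata)
```

In the first function the merge rule lives in the client, so changing it
means touching client configuration or code; in the second the client knows
nothing about merging at all.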

Thus we attain, with just Solr configuration, the merging requirements for
Dublin Core such as:

dc.contributor = dc.contributor.*
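
As a rough illustration of what such a wildcard rule does to a document,
here is a small Python sketch. It only emulates the effect of a
pattern-based copy rule (in the spirit of Solr's copyField); the
merge_fields function, the rule format and the sample values are
assumptions made for this example, not Solr's actual implementation.

```python
from fnmatch import fnmatchcase


def merge_fields(doc, rules):
    """Return a copy of doc in which each destination field also holds the
    values of every source field whose name matches the rule's pattern."""
    merged = {name: list(values) for name, values in doc.items()}
    for pattern, destination in rules:
        for name, values in doc.items():
            if fnmatchcase(name, pattern):
                merged.setdefault(destination, []).extend(values)
    return merged


# Made-up verbatim fields as the indexing client might send them.
item = {
    "dc.contributor.author": ["Smith, J."],
    "dc.contributor.editor": ["Jones, A."],
    "dc.title": ["An example item"],
}

# Hypothetical rule equivalent to: dc.contributor = dc.contributor.*
rules = [("dc.contributor.*", "dc.contributor")]

merged = merge_fields(item, rules)
```

The original fields stay in place, so one can still search or facet on
dc.contributor.author alone while also having the merged dc.contributor.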

The maintainer can also do more complex customizations, such as configuring
analyzers capable of parsing/tokenizing specific field values, for
instance if one wanted to index multilingual field values based on a
dictionary lookup, or appropriate label values for an authority key stored
in a controlled metadata field.

The ultimate objective of Discovery is to enable the complete replacement of
significant portions of the hardcoded DSpace codebase with direct usage of
Solr. This replaces a resource-strapped, developer-centric activity with a
small community (DSpace Search/Browse) with a more configurable process
that has a much larger and more experienced enterprise community of support
(Solr).

This said, there is still a need to improve how we organize our DSpace
Items and the metadata therein into Solr indexes, and to decide on which
side of the indexing process (DSpace Indexing Client vs Solr Request
Handlers) it is more appropriate to do such activities.  Chris, I would be
very interested to see contributions on how to map such features as
hierarchical controlled vocabularies and other well-defined / normalized
preexisting taxonomies/vocabularies together with Solr Facets to allow more
complex faceting features.  I will add that one tool we are considering to
enhance Solr faceting further is the Bobo Browse Integration with Solr (
http://code.google.com/p/bobo-browse/wiki/SolrIntegration and
http://snaprojects.jira.com/wiki/display/BOBO/Home).  The intention here is
to provide (1) sorting of facet values, (2) grouping of search results
by multiple sort fields and (3) performance enhancements on top of Solr
faceting.
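
For readers new to faceting, here is a minimal Python sketch of what
"sorting of facet values" means: count how often each value of a field
occurs in a result set, then order the values by count or alphabetically.
The facet_counts function and the sample documents are hypothetical; this
is not the Bobo Browse or Solr API.

```python
from collections import Counter


def facet_counts(docs, field, order="count"):
    """Count the values of `field` across docs and return (value, count)
    pairs, sorted by descending count or by value."""
    counts = Counter(value for doc in docs for value in doc.get(field, []))
    if order == "count":
        # Most frequent first; ties broken alphabetically.
        return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    if order == "value":
        return sorted(counts.items())
    raise ValueError("order must be 'count' or 'value'")


# Made-up search results with a multi-valued "subject" field.
results = [
    {"subject": ["toxicology", "music"]},
    {"subject": ["music"]},
    {"subject": ["chemistry"]},
    {},  # documents without the field are simply skipped
]

by_count = facet_counts(results, "subject")
by_value = facet_counts(results, "subject", order="value")
```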

So, finally, I challenge both Chris and Sauleha: you will get "more bang for
your buck" if you target Solr for such features as auto-classification and
auto-hierarchy building rather than hardcoding them into DSpace itself.  You
will ultimately target a community of users much larger than DSpace alone
and possibly attain greater buy-in, peer review, contribution and reuse of
such enhancements, all of which will feed back to benefit DSpace in the end.

If you hardwire such tooling to DSpace for a "quick win", you not only limit
the exposure and success of your own work, but, if it is contributed into
the core DSpace implementation, you will also be obliging other DSpace
stakeholders to assist in maintaining it over the long term.  This is
the same problem that arose with the original Search/Browse implementation
in DSpace.  The DSpace community should always work to reuse more popular
third-party solutions with large cross-market open source communities rather
than inventing its own custom solutions.  This is because the application
targets a specific, narrow vertical market for Institutional Repositories
that is resource limited. Reuse avoids DSpace stakeholders being stuck with
a stagnating codebase (only known by developers who have left the project)
while the larger open source community continues to evolve.

Sincerely,
Mark

On Tue, Jul 13, 2010 at 4:38 AM, Christophe Dupriez <
[email protected]> wrote:

>  Hi Sauleha!
>
> Solr in DSpace 1.6, for now, is used to manage the generation of
> statistical reports.
> Mark Diggory (@mire) is experimenting with integrating Solr as an
> indexing/search engine for DSpace Items
> (project called DSpace Discovery).
>
> Thank you for bringing CASTANET to my attention: it seems a refreshing way
> to cope with indexing.
> I must learn more!
>
> http://www.powershow.com/view/1e363-NWU3N/Castanet_Using_WordNet_to_Build_Facet_Hierarchies
> I just learned about Flamenco and ordered the printed copy of the book:
> http://searchuserinterfaces.com
> which is probably a "must read" for all DSpace developers!
>
> Personally, I went through extensive improvements of Lucene integration for
> DSpace 1.42.
> I was wondering for much too long how to integrate Solr to provide faceting
> to my users.
> Finally, I have done it with Lucene alone (no Solr added!).
> It is rather simple (a few days of work) IF and ONLY IF your faceting data is
> perfectly controlled and normalized upfront.
> Our approach to control and normalization is described here:
> http://dsug09.ub.gu.se/index.php/dsug/dsug09/paper/viewFile/22/3
>
> I attach a JPG of the current result (query about "MUSIC*" in a database of
> 90 thousand articles about toxicology).
> If it gets scrubbed, I can send it separately.
>
> It requires some changes in these classes:
> * DSQuery, to analyse the current search
> * Faceter, a new class to gather faceting information and to generate
> the desired output
> * and a modification in search\results.jsp to include a call to Faceter in
> the right column of the page.
>
> Much simpler than integrating Solr.
> BUT, Solr in DSpace would bring many other benefits...
> if DSpace committers take it on their shoulders (too many modifications
> everywhere in the DSpace code for an "outsider").
>
> Good luck!
>
> Christophe Dupriez
>
>
> On 13/07/2010 11:03, Sauleha Durrani wrote:
>
>
> Dear all,
>
> I am trying to integrate multifaceted search with DSpace. I am facing
> several issues.
>
>    - Apache Solr provides faceted search over Lucene but I am unable to
>    understand how it works. Can anyone guide me on how Solr works? And will
>    it help us in integrating multifaceted search with DSpace?
>    - My other question is that I am also working on a multifaceted
>    algorithm. We have derived its idea from "CASTANET". Does anybody have
>    another idea?
>
> I shall be anxiously waiting for a reply.
> Thank you.
> Take care
> Best Regards,
> SAULEHA

-- 
Mark R. Diggory
Head of U.S. Operations - @mire

http://www.atmire.com - Institutional Repository Solutions
http://www.togather.eu - Before getting together, get t...@ther

_______________________________________________
Dspace-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-general