Correct me if I don't have this right: you had an existing instance of DSpace
where search worked properly. You cloned the instance to a new server, and after
the transfer, media filter wasn't able to extract full text properly from PDFs
with special characters in them. When you re-submit the PDFs to the new
instance, media filter (and thus search) works as it should?

It's possible the PDFs were damaged in the transfer. How did you transfer them?
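
One way to rule that out is to compare checksums of the same bitstream file on both servers. Below is a minimal sketch in Python. It is not part of DSpace, and the function name `sha256_of` is made up for illustration. Run it against the same assetstore file on each server and compare the digests.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the digests differ for a file that should be identical, the copy was altered in transit (an FTP transfer in text mode is one common way this happens, though that is only a guess here).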

I assume you're not seeing any errors in the media filter output, right?
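
For what it's worth, the "?" characters in the Server 2 sample quoted below look like what you get when text is encoded into a character set that has no Ä or Ö. The sketch below only illustrates that effect in Python. It is not DSpace code, the helper name `encode_lossy` is made up, and whether the default encoding on the new server is actually the cause is an assumption that would need checking.

```python
# Illustration only: encoding non-ASCII text into a charset that lacks
# those characters replaces each of them with "?".
original = "3. PUNAISEN KIRJAN SISÄLTÖ"


def encode_lossy(text, charset):
    """Round-trip `text` through `charset`, substituting "?" for unmappable characters."""
    return text.encode(charset, errors="replace").decode(charset)


ascii_result = encode_lossy(original, "ascii")   # Ä and Ö cannot be represented
utf8_result = encode_lossy(original, "utf-8")    # every character survives

print(ascii_result)
print(utf8_result)
```

The first line printed matches the damaged form in the Server 2 sample, while the second is unchanged.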

Cheers,

Bill

Bill Anderson
Software Developer
Digital Library Development
Georgia Tech Library

----- "mikan.d.dspace listmail" <mikan.dsp...@gmail.com> wrote:

| Hi Stuart,
| As I mentioned in my earlier post, running filter-media with the --force
| (-f) switch didn't fix the problem.
| 
| -Mika
| 
| 2009/6/16 Stuart Lewis <s.le...@auckland.ac.nz>:
| > Hi Mika,
| >
| > Since running filter-media on new items seems OK, have you tried running:
| >
| > [dspace]/bin/filter-media -f
| >
| > -f forces all the bitstreams to be re-filtered.
| >
| > Thanks,
| >
| >
| > Stuart Lewis
| > Digital Services Programmer
| > Te Tumu Herenga The University of Auckland Library
| > Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
| > Ph: 64 9 373-7599 x81928
| > http://www.library.auckland.ac.nz/
| >
| >
| >
| > -----Original Message-----
| > From: mikan.d.dspace listmail [mailto:mikan.dsp...@gmail.com]
| > Sent: Tuesday, 16 June 2009 1:05 a.m.
| > To: Terrance Davis
| > Cc: Dspace Tech
| > Subject: Re: [Dspace-tech] DSpace search weirdness
| >
| > Nope.
| > Server 1 has Debian 5 with Java version "1.6.0_12", and server 2
| > has RHEL and Java version "1.5.0_18". Could this cause the problem?
| >
| > Another strange thing I noticed is that if I re-submit the entire
| > item & file and then run filter-media, the text is extracted
| > correctly. So, to me it seems that the old data in the transferred
| > assetstore is handled incorrectly. Strange, eh?
| >
| > -Mika
| >
| >
| >
| >
| > 2009/6/15 Terrance Davis <terrance.da...@utah.edu>:
| >> Hi Mika,
| >>
| >> Are both systems using the same OS version and the same version of
| >> Java?
| >>
| >> Best regards,
| >>
| >> Terrance
| >>
| >> --
| >> Web Applications Programmer
| >> Institute for Clean and Secure Energy
| >> University of Utah
| >> http://www.ices.utah.edu
| >>
| >>
| >> On Jun 15, 2009, at 2:01 AM, mikan.d.dspace listmail wrote:
| >>
| >>> Hi Terrance,
| >>>
| >>> I double-checked the indexes in the configuration and they do match.
| >>> What I noticed, though, is that the text extracted from the PDF files
| >>> differs, which might be the cause of this problem. It seems that when
| >>> filter-media extracts the text on the other server, it messes up some
| >>> special characters, thus making them unsearchable. What might be
| >>> causing this? Both databases were set to UNICODE when created. Is
| >>> there some other system setting that might be causing this?
| >>>
| >>> Example of extracted text is below:
| >>>
| >>> Server 1: (correct encoding)
| >>> 3. PUNAISEN KIRJAN SISÄLTÖ
| >>> Jaettiin punaisen kirjan sisällön päivitystä varten vastuuhenkilöt
| >>> seuraavaksi:
| >>> 3.1 Yleisasu ja kirjan sisällön järjestys miettii ja tarkastelee Tiina
| >>> Sairanen
| >>>
| >>> Server 2: (Messed up characters)
| >>>
| >>> 3. PUNAISEN KIRJAN SIS?LT?
| >>> Jaettiin punaisen kirjan sis?ll?n p?ivityst? varten vastuuhenkil?t
| >>> seuraavaksi:
| >>> 3.1 Yleisasu ja kirjan sis?ll?n j?rjestys miettii ja tarkastelee Tiina
| >>> Sairanen
| >>>
| >>>
| >>> Thanks for any help,
| >>> Mika
| >>>
| >>>
| >>> 2009/6/12 Terrance Davis <terrance.da...@utah.edu>:
| >>>>
| >>>> Hi Mika,
| >>>> My first guess is that your config files don't match. You might want to
| >>>> check the server that is returning 40 results. If the configured search
| >>>> indexes have any white space (such as a tab) after the properties, they
| >>>> might not be matching up with the Dublin Core and not indexing properly.
| >>>> No trim() is happening on the configured search index properties from the
| >>>> 1.5.2 dspace.cfg, so they may look the same, but be thrown off by extra
| >>>> unwanted white space.
| >>>> Best regards,
| >>>> Terrance Davis
| >>>> --
| >>>> Web Applications Programmer
| >>>> Institute for Clean and Secure Energy
| >>>> University of Utah
| >>>> http://www.ices.utah.edu/
| >>>>
| >>>>
| >>>>
| >>>> On Jun 12, 2009, at 5:24 AM, mikan.d.dspace listmail wrote:
| >>>>
| >>>> I'm confused by the way DSpace search works. I cloned our DSpace 1.5.2
| >>>> instance to another server. They both have the same config, same items,
| >>>> etc. However, when I run a search I get different results?! With the same
| >>>> search term, one server shows 40 results and the other 72. I've
| >>>> forced reindexing and media-filters but nothing changes. What could be
| >>>> the cause of this?
| >>>>
| >>>> Thanks,
| >>>> Mika
| >>>>
| >>>>
| >>>>
| ------------------------------------------------------------------------------
| >>>> Crystal Reports - New Free Runtime and 30 Day Trial
| >>>> Check out the new simplified licensing option that enables
| unlimited
| >>>> royalty-free distribution of the report engine for externally
| facing
| >>>> server and web deployment.
| >>>> http://p.sf.net/sfu/businessobjects
| >>>> _______________________________________________
| >>>> DSpace-tech mailing list
| >>>> DSpace-tech@lists.sourceforge.net
| >>>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
| >>>>
| >>>>
| >>
| >>
| >
| >
| >
| 
