Hi Bill,

You have got it right. I cloned the new DSpace instance roughly by:

1. Doing a fresh install from the DSpace source.
2. Importing a database dump from the other server (taken with pg_dump; I also tried pg_restore, btw).
3. Creating assetstore.tar.gz on my old server and copying it to the new server.

When I run filter-media or filter-media --force, the extracted text doesn't get the special characters (say ä, ö, å) right, but has a '?' mark in place of each. On my original server everything works fine. And on my new server, new submissions work fine after filter-media.

I just re-ran filter-media -f and no error messages come up. Maybe I should dig into the assetstore to see what the files look like from the command line? How could I find out the assetstore path for a specific item?

Thanks,
Mika

2009/6/16 <bill.ander...@library.gatech.edu>:
> Correct me if I don't have this right: you had an existing instance of
> dspace, where search worked properly. You cloned the instance to new server,
> and after the transfer, media filter wasn't able to extract full text
> properly from PDFs with special characters in them. When you re-submit the
> PDFs to the new instance, media filter (and thus search) works as it should?
>
> It's possible the pdfs were damaged in the transfer. How did you transfer
> them?
>
> I assume you're not seeing any errors in the media filter output, right?
>
> Cheers,
>
> Bill
>
> Bill Anderson
> Software Developer
> Digital Library Development
> Georgia Tech Library
>
> ----- "mikan.d.dspace listmail" <mikan.dsp...@gmail.com> wrote:
>
> | Hi Stuart,
> | As I mentioned in my earlier post, running filter-media with the --force
> | (-f) switch didn't fix the problem.
> |
> | -Mika
> |
> | 2009/6/16 Stuart Lewis <s.le...@auckland.ac.nz>:
> | > Hi Mika,
> | >
> | > Since running filter-media on new items seems OK, have you tried
> | > running:
> | >
> | > [dspace]/bin/filter-media -f
> | >
> | > -f forces all the bitstreams to be re-filtered.
> | >
> | > Thanks,
> | >
> | >
> | > Stuart Lewis
> | > Digital Services Programmer
> | > Te Tumu Herenga The University of Auckland Library
> | > Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> | > Ph: 64 9 373-7599 x81928
> | > http://www.library.auckland.ac.nz/
> | >
> | >
> | >
> | > -----Original Message-----
> | > From: mikan.d.dspace listmail [mailto:mikan.dsp...@gmail.com]
> | > Sent: Tuesday, 16 June 2009 1:05 a.m.
> | > To: Terrance Davis
> | > Cc: Dspace Tech
> | > Subject: Re: [Dspace-tech] DSpace search weirdness
> | >
> | > Nope.
> | > Server 1 has Debian 5 with Java version "1.6.0_12", and server 2
> | > has RHEL and Java version "1.5.0_18". Could this cause the problem?
> | >
> | > Another strange thing I noticed is that if I re-submit the entire
> | > item & file and then run filter-media, the text is extracted
> | > correctly. So, to me it seems that the old data in the transferred
> | > assetstore is handled incorrectly. Strange, eh?
> | >
> | > -Mika
> | >
> | >
> | >
> | >
> | > 2009/6/15 Terrance Davis <terrance.da...@utah.edu>:
> | >> Hi Mika,
> | >>
> | >> Are both systems using the same OS version and the same version of Java?
> | >>
> | >> Best regards,
> | >>
> | >> Terrance
> | >>
> | >> --
> | >> Web Applications Programmer
> | >> Institute for Clean and Secure Energy
> | >> University of Utah
> | >> http://www.ices.utah.edu
> | >>
> | >>
> | >> On Jun 15, 2009, at 2:01 AM, mikan.d.dspace listmail wrote:
> | >>
> | >>> Hi Terrance,
> | >>>
> | >>> I double-checked the indexes in configuration and they do match. What
> | >>> I noticed, though, is that the text extracted from pdf files differs,
> | >>> which might be the cause of this problem. It seems that when
> | >>> filter-media extracts the text on the other server, it messes up some
> | >>> special characters, thus making them unsearchable. What might be
> | >>> causing this? Both databases are set to UNICODE when created.
> | >>> Is there some other system setting that might be causing this?
> | >>>
> | >>> Example of extracted text is below:
> | >>>
> | >>> Server 1: (correct encoding)
> | >>> 3. PUNAISEN KIRJAN SISÄLTÖ
> | >>> Jaettiin punaisen kirjan sisällön päivitystä varten vastuuhenkilöt
> | >>> seuraavaksi:
> | >>> 3.1 Yleisasu ja kirjan sisällön järjestys miettii ja tarkastelee Tiina
> | >>> Sairanen
> | >>>
> | >>> Server 2: (Messed up characters)
> | >>>
> | >>> 3. PUNAISEN KIRJAN SIS?LT?
> | >>> Jaettiin punaisen kirjan sis?ll?n p?ivityst? varten vastuuhenkil?t
> | >>> seuraavaksi:
> | >>> 3.1 Yleisasu ja kirjan sis?ll?n j?rjestys miettii ja tarkastelee Tiina
> | >>> Sairanen
> | >>>
> | >>>
> | >>> Thanks for any help,
> | >>> Mika
> | >>>
> | >>>
> | >>> 2009/6/12 Terrance Davis <terrance.da...@utah.edu>:
> | >>>>
> | >>>> Hi Mika,
> | >>>> My first guess is that your config files don't match. You might want to
> | >>>> check the server that is returning 40 results. If the configured search
> | >>>> indexes have any white space (such as a tab) after the properties, they
> | >>>> might not be matching up with the dublin core and not indexing properly.
> | >>>> No trim() is happening on the configured search index properties from the
> | >>>> 1.5.2 dspace.cfg, so they may look the same, but be thrown off by extra
> | >>>> unwanted white space.
> | >>>> Best regards,
> | >>>> Terrance Davis
> | >>>> --
> | >>>> Web Applications Programmer
> | >>>> Institute for Clean and Secure Energy
> | >>>> University of Utah
> | >>>> http://www.ices.utah.edu/
> | >>>>
> | >>>>
> | >>>>
> | >>>> On Jun 12, 2009, at 5:24 AM, mikan.d.dspace listmail wrote:
> | >>>>
> | >>>> I'm confused by the way DSpace search works. I cloned our Dspace 1.5.2
> | >>>> instance to another server. They both have the same config, same items
> | >>>> etc. However when I run search I get different results?!
> | >>>> With the same search term, one search shows 40 results and the
> | >>>> other 72. I've forced reindexing and media filtering but nothing
> | >>>> changes. What could be the cause of this?
> | >>>>
> | >>>> Thanks,
> | >>>> Mika
> | >>>>
> | >>>>
> | >>>>
> | >>>> ------------------------------------------------------------------------------
> | >>>> Crystal Reports - New Free Runtime and 30 Day Trial
> | >>>> Check out the new simplified licensing option that enables unlimited
> | >>>> royalty-free distribution of the report engine for externally facing
> | >>>> server and web deployment.
> | >>>> http://p.sf.net/sfu/businessobjects
> | >>>> _______________________________________________
> | >>>> DSpace-tech mailing list
> | >>>> DSpace-tech@lists.sourceforge.net
> | >>>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
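
For reference, the '?' pattern in the server 2 extract is what a lossy conversion to a character set lacking ä/ö/å typically looks like. The sketch below is a small Python illustration of that symptom only. It is not DSpace's code, and the idea that something on the new server (for example the JVM's default encoding or the locale) falls back to an ASCII-like charset is an assumption that this thread does not confirm. The helper name `lossy_encode` is made up for the example.

```python
# Illustration only: reproduces the symptom from the thread, not DSpace's own code.
# Assumption: the text passes through a charset that lacks Ä/Ö (plain ASCII here).

def lossy_encode(text: str, charset: str) -> str:
    """Encode text to charset, replacing unmappable characters with '?',
    then decode it back so the damage is visible as a string."""
    return text.encode(charset, errors="replace").decode(charset)

original = "3. PUNAISEN KIRJAN SISÄLTÖ"

ascii_result = lossy_encode(original, "ascii")  # ASCII cannot represent Ä or Ö
utf8_result = lossy_encode(original, "utf-8")   # UTF-8 can represent both

print(ascii_result)
print(utf8_result)
```

If the ASCII line matches what filter-media produces on the new server, comparing the locale and default encoding settings of the two machines would be a reasonable next thing to check.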