Re: [Dspace-tech] DSpace a memory hog?
This is good news for us -- the batch importing and indexing are cron jobs kicked off at midnight, so at least the domestic users won't feel them.

-Pan

On 4/19/07, Robert Tansley <[EMAIL PROTECTED]> wrote:

Hi Pan,

The Web server aspect (i.e. Tomcat) should have fairly constant memory use -- the vast majority of operations are very short and work on a very small number of objects, and as soon as the request is over any memory used is returned to the heap. How much memory you need to give it largely depends on the load, i.e. how many of these requests the server will be servicing at a given instant.

The areas where I think folks have run into memory use issues are batch importing, indexing and the media filters (thumbnail generation, text extraction for indexing) -- these operate on a large number of objects at once, and some of the DSpace code isn't so great at freeing up objects in these operations. But we're finding the problems and fixing them, as Cory mentions.

Getting technical below:

Developers: a quick scan of the code shows that:

batch export (classic): needs fixing
batch import (classic): needs fixing
browse indexer: needs fixing
search (lucene indexer): needs fixing
media filter: OK
history system: problems recording collection state (loads all items into memory)
Sitemap generator: OK
checksum checker: fine but only because it has its own DB access routines and doesn't use the APIs (!)

The new-style packager (with plug-ins) only appears to be able to operate on one Item at a time.

Also found: BitstreamStorageManager appears to reach up into the business logic layer and use the checker API -- this needs fixing. This is probably because the checksum checker includes its own DB access API :-O

The above could probably be fixed for 1.4.2, with the potential exception of the checksum checker, which needs to be changed to use the correct APIs.

Rob

On 18/04/07, Pan Family <[EMAIL PROTECTED]> wrote:
> Thank you all for giving your opinion!
>
> Technically, is it the web application or the indexer that requires
> most of the memory? What data is kept in memory all the time
> (even when nobody is searching)? Is the memory usage proportional
> to the number of concurrent sessions?
>
> Thanks again,
>
> Pan
>
> On 4/18/07, Cory Snavely <[EMAIL PROTECTED]> wrote:
> > Well, as I said at first, it all depends on your definition of what a
> > memory hog is. Today's hog fits in tomorrow's pocket. We better all
> > already be used to that.
> >
> > Also, I don't think for a *minute* that the original developers of
> > DSpace made a casual choice about their development environment--in
> > fact, I think they made a responsible choice given the alternatives.
> > Let's give our colleagues credit that's due. Their choice permits
> > scaling and fits well for an open-source project. Putting the general
> > problem of memory bloat in their laps seems pretty angsty to me.
> >
> > Lastly, dedicating a server to DSpace is a choice, not a necessity. We
> > as implementors have complete freedom to separate out the database and
> > storage tiers, and mechanisms exist for scaling Tomcat horizontally as
> > well. In the other direction, I suspect people are running DSpace on
> > VMware or xen virtual machines, too.
> >
> > Cory Snavely
> > University of Michigan Library IT Core Services
> >
> > On Wed, 2007-04-18 at 13:40 -0500, Brad Teale wrote:
> > > Pan,
> > >
> > > Dspace is a memory hog considering the functionality the application
> > > provides. This is mainly due to the technological choices made by the
> > > founders of the Dspace project, and not the functional requirements the
> > > Dspace project fulfills.
> > >
> > > Application and memory bloat are pervasive in the IT industry. Each
> > > individual organization should look at their requirements whether they
> > > are hardware, software or both. Having to dedicate a machine to an
> > > application, especially a relatively simple application like Dspace, is
> > > wasteful for hardware resources and people resources.
> > >
> > > Web applications should _not_ need 2G of memory to "run comfortably".
>
> -
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> ___
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https
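The point about batch jobs holding a large number of objects at once can be illustrated with a small sketch. This is not DSpace code (DSpace is written in Java); it is a generic Python illustration, and the names `in_batches` and `index_all` are hypothetical. It shows one common way to keep memory bounded: consume the items lazily and hold only one fixed-size batch at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def index_all(item_ids: Iterable[int], size: int = 100) -> int:
    """Hypothetical indexing loop that handles one batch at a time."""
    indexed = 0
    for batch in in_batches(item_ids, size):
        # A real indexer would load and index these items here. Once the
        # loop moves on, nothing refers to the previous batch any more,
        # so the garbage collector is free to reclaim it.
        indexed += len(batch)
    return indexed
```

With this shape, peak memory depends on the batch size rather than on the size of the whole collection, which is the property the batch tools listed above would need.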
[Dspace-tech] the URI handle: what is it supposed to represent?
*Please use this identifier to cite or link to this item: http://hdl.handle.net/123456789/545
*URI: http://hdl.handle.net/123456789/545

What's up with the URI handle? It is a broken link for me, but what is it supposed to represent? Is there a way to customize it out if my users don't care about this info?

Thanks,

Pan
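For readers wondering how such a URI is put together: a handle URI consists of a resolver address, a prefix naming the repository, and a suffix naming the item. The sketch below splits one apart. It assumes that 123456789 is the placeholder prefix a default DSpace configuration ships with; if so, the prefix is not registered with the resolver at hdl.handle.net, which would explain the broken link. The function names are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: the prefix used by a default, unregistered DSpace install.
PLACEHOLDER_PREFIX = "123456789"


def split_handle(uri: str):
    """Split a handle URI into its (prefix, suffix) parts."""
    path = urlparse(uri).path.lstrip("/")
    prefix, sep, suffix = path.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a handle URI: {uri!r}")
    return prefix, suffix


def uses_placeholder_prefix(uri: str) -> bool:
    """True if the handle still carries the assumed placeholder prefix."""
    return split_handle(uri)[0] == PLACEHOLDER_PREFIX
```

For the URI quoted in the message, `split_handle` gives the prefix `123456789` and the suffix `545`.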
Re: [Dspace-tech] DSpace a memory hog?
Thank you all for giving your opinion!

Technically, is it the web application or the indexer that requires most of the memory? What data is kept in memory all the time (even when nobody is searching)? Is the memory usage proportional to the number of concurrent sessions?

Thanks again,

Pan

On 4/18/07, Cory Snavely <[EMAIL PROTECTED]> wrote:

Well, as I said at first, it all depends on your definition of what a memory hog is. Today's hog fits in tomorrow's pocket. We better all already be used to that.

Also, I don't think for a *minute* that the original developers of DSpace made a casual choice about their development environment--in fact, I think they made a responsible choice given the alternatives. Let's give our colleagues credit that's due. Their choice permits scaling and fits well for an open-source project. Putting the general problem of memory bloat in their laps seems pretty angsty to me.

Lastly, dedicating a server to DSpace is a choice, not a necessity. We as implementors have complete freedom to separate out the database and storage tiers, and mechanisms exist for scaling Tomcat horizontally as well. In the other direction, I suspect people are running DSpace on VMware or xen virtual machines, too.

Cory Snavely
University of Michigan Library IT Core Services

On Wed, 2007-04-18 at 13:40 -0500, Brad Teale wrote:
> Pan,
>
> Dspace is a memory hog considering the functionality the application
> provides. This is mainly due to the technological choices made by the
> founders of the Dspace project, and not the functional requirements the
> Dspace project fulfills.
>
> Application and memory bloat are pervasive in the IT industry. Each
> individual organization should look at their requirements whether they
> are hardware, software or both. Having to dedicate a machine to an
> application, especially a relatively simple application like Dspace, is
> wasteful for hardware resources and people resources.
>
> Web applications should _not_ need 2G of memory to "run comfortably".
[Dspace-tech] DSpace a memory hog?
Hi,

There is a rumor that says DSpace is a memory hog. I don't know where this comes from, and that may not be important. What is important is that it makes my management nervous. So I'd like to hear from those who know anything about this issue. Is it really a memory hog? Under what circumstances might it become a memory hog? Or is there no need to worry about memory usage at all?

Thanks a lot in advance!

-Pan
[Dspace-tech] another customization question
Hi,

If I click on Titles, by default I see the first 21 items available. How do I change the default value so I can show more items per page?

Also, the default look and feel does not use space efficiently. How can I move the texts "DEV DSpace at XXX," "Browse by Title," "Jump to: 0-9 ...," "or enter first few ...," and "Showing items ..." to a different location (e.g., the bottom) or make them use less space?

Thanks!

-Pan

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
___
DSpace-tech mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
[Dspace-tech] customize the interface?
Hi,

I'd like to customize DSpace by replacing the texts "DSpace TM About DSpace Software" on the upper left and "DSpace Software Copyright ..." at the bottom with texts describing my institution and my project. Could someone please show me where in the html/java code, and under which directories, I can make these changes?

Thanks a lot for your help!

-Pan
Re: [Dspace-tech] how can I find out the collectionID?
Thanks, Stephen!

I used --add --resume and it worked: if the items under my archive_dir are the same, nothing is added. But if I add new items under the archive_dir, only the new items are added. I assume that I can keep using the same mapfile in this way, and as I grow the number of items under the archive_dir, my mapfile will list more and more items. Correct?

--replace did not work for me. I got a NullPointerException, as shown below. What is the right way of using --replace?

Thanks,

-Pan

error from --replace
-
dsrun org.dspace.app.itemimport.ItemImport --replace --eperson= [EMAIL PROTECTED] --collection=123456789/2 --source=/Users/pan/tmp/ --mapfile=/Users/pan/matfile2.txt
Destination collections:
Owning Collection: PODAAC collection
Replacing: 123456789/18
java.lang.NullPointerException
    at org.dspace.app.itemimport.ItemImport.deleteItem(ItemImport.java:692)
    at org.dspace.app.itemimport.ItemImport.replaceItems(ItemImport.java:567)
    at org.dspace.app.itemimport.ItemImport.main(ItemImport.java:411)
java.lang.NullPointerException

On 2/27/07, Stephen De Gabrielle <[EMAIL PROTECTED]> wrote:

Hi.

I think you can use the mapfile and --resume to import only items not in the mapfile. (mapfile is just a list of handle/folder pairs - one for each item imported)

--replace may also be useful for updating items

dsrun org.dspace.app.itemimport.ItemImport --replace [EMAIL PROTECTED] --collection=collectID --source=items_dir --mapfile=mapfile

"Replacing items uses the map file to replace the old items and still retain their handles."

See http://dspace.org/technology/system-docs/application.html#itemimporter

I hope this helps.

Cheers,
Stephen

On 2/27/07, Pan Family <[EMAIL PROTECTED]> wrote:
> Yes, I can import items in batch mode now. Thanks!
> I have also tried to import two items under two directories,
> item_001 and item_002, and DSpace imported them all
> at once, which is what I wanted. But DSpace does not
> seem to know that the items are already in its database
> and it will import them as many times as I asked it to.
> So it looks that for automatically importing only the delta
> of a document collection spread out under directories and
> sub-directories, I'll need to write some code.
> Has anyone done this before?
>
> FYI, I am using DSpace for a distributed data center
> at JPL, a Caltech laboratory.
>
> Thanks,
>
> -Pan
>
> On 2/23/07, Jayan Chirayath Kurian <[EMAIL PROTECTED]> wrote:
> >
> > Your import is fine now ?
> >
> > (1) It's fine if u have used none. I edited the metadata registry and added
> > the conference qualifier for a second creator element. You can refer
> > w3schools.com for basic XML.
> > (2) No problem.
> >
> > (1) mapfile stores the details of files imported using batch import. You
> > can note that in case u need to remove those imported files this mapfile is
> > required.
> > (2) For each item we have created a directory structure in
> > archive_directory. i.e item_001, item_002 etc.
> >
> > You are using Dspace for individual use or corporate organization.
> >
> > Jayan
> >
> > From: Pan Family [mailto:[EMAIL PROTECTED]
> > Sent: Sat 2/24/2007 12:27 PM
> > To: Jayan Chirayath Kurian
> > Cc: dspace-tech@lists.sourceforge.net
> > Subject: Re: [Dspace-tech] how can I find out the collectionID?
> >
> > Yes, it did help!!!
> >
> > Still two problems:
> > (1) ... element="creator" qualifier="conference" or qualifier="email" ...
> > caused some exception until I changed qualifier="none"
> > But in your example, "conference" was the qualifier.
> > Where can I find more info. on how to write good Dublin_core.xml?
> > (2) what is this about? Can I ignore it?
> > Processing handle file: handle
> > It appears there is no handle file -- generating one
> >
> > Questions:
> > (1) A map file is generated, but what is it for?
> > (2) What if I have several documents, each is an item,
> > under one directory, say Items_001? Do I prepare
> > multiple corresponding .xml files? Do I list all the
> > file names in the file contents?
> >
> > Thanks!
> >
> > -Pan
> >
> > On 2/23/07, Jayan Chirayath Kurian < [EMAIL PROTECTED]> wrote:
> > >
> > > i have Dspace 1.4.1 on windows 2003.
> > >
> > > (1)My directory structure is C:\D
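The "delta" that --add --resume works out can also be computed ahead of time, for example to log what a nightly cron job is about to import. The following is a rough Python sketch, not part of DSpace. It assumes each mapfile line holds an item directory name and a handle separated by whitespace (the thread only says the mapfile lists handle/folder pairs), and the function names are hypothetical.

```python
import os


def read_mapfile(path):
    """Return {item_directory: handle} for every complete line of a mapfile."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:  # skip blank or incomplete lines
                mapping[parts[0]] = parts[1]
    return mapping


def pending_items(archive_dir, mapfile):
    """List item directories under archive_dir that the mapfile does not mention."""
    imported = read_mapfile(mapfile) if os.path.exists(mapfile) else {}
    return sorted(
        name
        for name in os.listdir(archive_dir)
        if os.path.isdir(os.path.join(archive_dir, name)) and name not in imported
    )
```

Calling `pending_items("/path/to/archive_dir", "/path/to/mapfile")` before the import gives the list of item directories that have no mapfile entry yet.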
Re: [Dspace-tech] how can I find out the collectionID?
Yes, it did help!!!

Still two problems:
(1) ... element="creator" qualifier="conference" or qualifier="email" ... caused some exception until I changed qualifier="none". But in your example, "conference" was the qualifier. Where can I find more info. on how to write good Dublin_core.xml?
(2) what is this about? Can I ignore it?
Processing handle file: handle
It appears there is no handle file -- generating one

Questions:
(1) A map file is generated, but what is it for?
(2) What if I have several documents, each is an item, under one directory, say Items_001? Do I prepare multiple corresponding .xml files? Do I list all the file names in the file contents?

Thanks!

-Pan

On 2/23/07, Jayan Chirayath Kurian <[EMAIL PROTECTED]> wrote:

i have Dspace 1.4.1 on windows 2003.

(1) My directory structure is C:\DSpace\bin\archive_directory
(2) The "archive_directory" contains the folder Item_001
(3) Item_001 folder contains (1) Dublin_core.XML (2) contents file and (3) test.pdf
please check the name of the file. It should be contents and not contents.txt
To rename contents.txt to contents, i used REN contents.txt contents at command prompt.
(4) dsrun org.dspace.app.itemimport.ItemImport -a [EMAIL PROTECTED]/2 -s= C:\DSpace\bin\archive_directory -m=mapfile10

I hope this helps.

Jayan

--
*From:* Pan Family [mailto:[EMAIL PROTECTED]
*Sent:* Sat 2/24/2007 11:02 AM
*To:* Jayan Chirayath Kurian
*Cc:* dspace-tech@lists.sourceforge.net; [EMAIL PROTECTED]
*Subject:* Re: [Dspace-tech] how can I find out the collectionID?

Hi Jayan (or anyone who knows how to do batch submission):

I am still unable to do batch submission. Here is what I did:

(1) Created a directory, /Users/pan/tmp and put 3 files under it: Content (a text file, attached); Dublin_core.xml (attached); and batch_import.pdf (the doc I wanted to submit to DSpace);
(2) Ran:
pan$ dsrun org.dspace.app.itemimport.ItemImport --add --eperson= [EMAIL PROTECTED] --collection=123456789/2 --source=/Users/pan/tmp --mapfile=/Users/pan/test_map
Destination collections:
Owning Collection: PODAAC collection
Adding items from directory: /Users/pan/tmp
Generating mapfile: /Users/pan/test_map

No error message was shown, but the pdf file was not imported. An empty test_map file was generated. I also ran filter-media and found that all bitstreams were skipped because no new doc has been added.

I found out from 1.4.1 beta 1 System Doc (pp. 22) that there are batch tools and registration is an alternate means to upload bitstreams, but no details or examples are provided. Can you provide links to more details or examples please?

Thanks a lot for your help!

-Pan

On 2/1/07, Jayan Chirayath Kurian <[EMAIL PROTECTED]> wrote:
>
> You solved your problem in importing documents or are u using the
> interface to upload documents into the repository.
>
> Jayan
>
> --
> *From:* Pan Family [mailto:[EMAIL PROTECTED]
> *Sent:* Friday, February 02, 2007 5:19 AM
> *To:* Jayan Chirayath Kurian
> *Subject:* Re: [Dspace-tech] how can I find out the collectionID?
>
> Thanks a lot!
>
> -Pan
>
> On 1/31/07, *Jayan Chirayath Kurian* <[EMAIL PROTECTED]> wrote:
>
> [Sample dublin_core.xml; the XML tags were stripped by the list archive. Surviving dcvalue contents:]
> AMIC-Chiangmai University Refresher Course on Communication Research Methodology : Chiangmai, Oct 29-Nov 2, 1984.
> The Logic of Social Science Research.
> Atal, Yogesh.
> 1984-10-29
>
> --
> *From:* Pan Family [mailto: [EMAIL PROTECTED]
> *Sent:* Thursday, February 01, 2007 3:52 AM
> *To:* Jayan Chirayath Kurian
> *Cc:* dspace-tech@lists.sourceforge.net
> *Subject:* Re: [Dspace-tech] how can I find out the collectionID?
>
> Could you please kindly provide a sample Dublin_core.xml?
>
> I assumed that dsrun would recursively go through the
> directories and index all the files under them. Apparently
> I was wrong. The requirement of Dublin_core.xml and
> the content file makes the process much less automatic.
> Is there a way around this?
>
> Thanks a lot!
>
> -Pan
>
> On 1/30/07, *Jayan Chirayath Kurian* <[EMAIL PROTECTED]> wrote:
>
> --
> *From:* Pan Family [mailto: [EMAIL PROTECTED]
> *Sent:* Wednesday, January 31, 2007 1:15 PM
> *To:* Jayan Chirayath Kurian
> *Cc:* Dorothea Salo; dspace-tech@lists.sourceforge.net
> *Subject:* Re: [Dspace-tech] how can I find out the collectionID?
>
> Ok. I will give this a try.
>
> Still two questions:
> (1) Where can I get the file Dublin_core.XML?
>
> Dublin_core.xml contains the me
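Since the sample dublin_core.xml lost its tags in the archive, here is a minimal Python sketch that writes a file of that general shape. It assumes the item importer expects a `dublin_core` root element containing `dcvalue` children with `element` and `qualifier` attributes, which is consistent with the attribute names quoted earlier in the thread and the leftover `dcvalue>` fragment. The mapping of the surviving values to `title`, `creator` and `date`/`issued` is a guess for illustration, and `dublin_core_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def dublin_core_xml(fields):
    """Build dublin_core.xml text from (element, qualifier, value) tuples."""
    root = ET.Element("dublin_core")
    for element, qualifier, value in fields:
        node = ET.SubElement(root, "dcvalue", element=element, qualifier=qualifier)
        node.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(root, encoding="unicode")


# Assumed element/qualifier choices; the values come from the sample above.
sample = dublin_core_xml([
    ("title", "none", "The Logic of Social Science Research."),
    ("creator", "none", "Atal, Yogesh."),
    ("date", "issued", "1984-10-29"),
])
```

Writing `sample` to a file named dublin_core.xml inside an item directory would give the importer one metadata record. As noted in the thread, a qualifier that is not in the metadata registry (such as "conference") causes an exception unless the registry is extended first.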
[Dspace-tech] Fwd: need help in indexing and harvesting
Hi Krishna:

Could you please share your Java program with me? I am looking for automatic ways of inserting items from file directories into a collection.

If for any reason you cannot share your code, could you please point me to information on how to program using the DSpace API (preferably with simple examples)? I am a new user of DSpace and am not yet familiar with using the APIs to customize DSpace. Sample programs would really help.

Thanks a lot!

-Pan

On 2/22/07, Krishna <[EMAIL PROTECTED]> wrote:

Hi everyone,

I have developed a java program which could insert items and files in a collection. Now i would like to index all the items that i have inserted and the items that i am going to index. And also can you please tell me how to implement the harvest functionality. I have gone through the DSpace API and could not exactly figure out how to use them.

public static void indexContent(Context c, DSpaceObject dso) throws java.sql.SQLException, java.io.IOException
IndexItem() adds a single item to the index

Thanking you,
Krishna
Re: [Dspace-tech] how can I find out the collectionID?
Hi Jayan (or anyone who knows how to do batch submission):

I am still unable to do batch submission. Here is what I did:

(1) Created a directory, /Users/pan/tmp and put 3 files under it: Content (a text file, attached); Dublin_core.xml (attached); and batch_import.pdf (the doc I wanted to submit to DSpace);
(2) Ran:
pan$ dsrun org.dspace.app.itemimport.ItemImport --add --eperson= [EMAIL PROTECTED] --collection=123456789/2 --source=/Users/pan/tmp --mapfile=/Users/pan/test_map
Destination collections:
Owning Collection: PODAAC collection
Adding items from directory: /Users/pan/tmp
Generating mapfile: /Users/pan/test_map

No error message was shown, but the pdf file was not imported. An empty test_map file was generated. I also ran filter-media and found that all bitstreams were skipped because no new doc has been added.

I found out from 1.4.1 beta 1 System Doc (pp. 22) that there are batch tools and registration is an alternate means to upload bitstreams, but no details or examples are provided. Can you provide links to more details or examples please?

Thanks a lot for your help!

-Pan

On 2/1/07, Jayan Chirayath Kurian <[EMAIL PROTECTED]> wrote:

You solved your problem in importing documents or are u using the interface to upload documents into the repository.

Jayan

--
*From:* Pan Family [mailto:[EMAIL PROTECTED]
*Sent:* Friday, February 02, 2007 5:19 AM
*To:* Jayan Chirayath Kurian
*Subject:* Re: [Dspace-tech] how can I find out the collectionID?

Thanks a lot!

-Pan

On 1/31/07, *Jayan Chirayath Kurian* <[EMAIL PROTECTED]> wrote:

[Sample dublin_core.xml; the XML tags were stripped by the list archive. Surviving dcvalue contents:]
AMIC-Chiangmai University Refresher Course on Communication Research Methodology : Chiangmai, Oct 29-Nov 2, 1984.
The Logic of Social Science Research.
Atal, Yogesh.
1984-10-29

--
*From:* Pan Family [mailto:[EMAIL PROTECTED]
*Sent:* Thursday, February 01, 2007 3:52 AM
*To:* Jayan Chirayath Kurian
*Cc:* dspace-tech@lists.sourceforge.net
*Subject:* Re: [Dspace-tech] how can I find out the collectionID?

Could you please kindly provide a sample Dublin_core.xml?

I assumed that dsrun would recursively go through the directories and index all the files under them. Apparently I was wrong. The requirement of Dublin_core.xml and the content file makes the process much less automatic. Is there a way around this?

Thanks a lot!

-Pan

On 1/30/07, *Jayan Chirayath Kurian* <[EMAIL PROTECTED]> wrote:

--
*From:* Pan Family [mailto: [EMAIL PROTECTED]
*Sent:* Wednesday, January 31, 2007 1:15 PM
*To:* Jayan Chirayath Kurian
*Cc:* Dorothea Salo; dspace-tech@lists.sourceforge.net
*Subject:* Re: [Dspace-tech] how can I find out the collectionID?

Ok. I will give this a try.

Still two questions:

(1) Where can I get the file Dublin_core.XML?

Dublin_core.xml contains the meta data descriptions of the resource (e.g. title, date published etc). You have to create the xml file using a notepad.

(2) Let's say I only want to index one file named: foo.pdf, and I put it under /Users/pan/tmp/foo.pdf and pass src=/Users/pan to dsrun. Is foo.pdf considered the content file or the resource? And which is the third type of file?

foo.pdf is the resource (i.e. pdf or ppt or jpeg...). Content file is a text file that just contains the name of the resource i.e. foo.pdf

Thanks a lot!

-Pan

On 1/30/07, *Jayan Chirayath Kurian* <[EMAIL PROTECTED]> wrote:

I feel the tmp directory should have (1) the Dublin_core.XML (2) contents file and (3) actual resource. The tmp directory should have all these files without any more subdirectories for these files.

Can you try with source=/Users/pan/ and removing all subdirectories under tmp and having only these 3 files listed above. Hope it works.

My structure is src = C:\DSpace\bin\archive_directory
The archive_directory contains the directory Item_001
Item_001 contains (1) Dublin_core.XML (2) contents file and (3) actual resource. There are no more subdirectories under Item_001.

Thanks,
Jayan

--
*From:* Pan Family [mailto: [EMAIL PROTECTED]
*Sent:* Wednesday, January 31, 2007 4:06 AM
*To:* Jayan Chirayath Kurian
*Cc:* Dorothea Salo; dspace-tech@lists.sourceforge.net
*Subject:* Re: [Dspace-tech] how can I find out the collectionID?

Thanks for your help!

I am working on Mac OS X. Yes, "pan" contains "tmp". It seems that for me the dir that I give to source= cannot contain any subdirs. For example, if I give it "/Users/pan/" I got an error complaining about the missing file ".fvwm/dublin_core.xml". .fvwm is a subdir under "Users/pan/". If I give it "/Users/pan/tmp/" then it complains about the same missing file under the subdirs of "tmp" until I removed all the subdirs under "tmp".

But I still don't get t
Re: [Dspace-tech] how can I find out the collectionID?
Not yet. I am still working on it. I would like to avoid using the GUI to submit. Instead, I would like to be able to recursively go through a dir and its sub-dirs and automatically crawl. Has anybody done this before?

Thanks,

-Lei

On 2/1/07, Jayan Chirayath Kurian <[EMAIL PROTECTED]> wrote:

You solved your problem in importing documents or are u using the interface to upload documents into the repository.

Jayan

--
*From:* Pan Family [mailto:[EMAIL PROTECTED]
*Sent:* Friday, February 02, 2007 5:19 AM
*To:* Jayan Chirayath Kurian
*Subject:* Re: [Dspace-tech] how can I find out the collectionID?

Thanks a lot!

-Pan

On 1/31/07, *Jayan Chirayath Kurian* <[EMAIL PROTECTED]> wrote:

[Sample dublin_core.xml; the XML tags were stripped by the list archive. Surviving dcvalue contents:]
AMIC-Chiangmai University Refresher Course on Communication Research Methodology : Chiangmai, Oct 29-Nov 2, 1984.
The Logic of Social Science Research.
Atal, Yogesh.
1984-10-29

--
*From:* Pan Family [mailto:[EMAIL PROTECTED]
*Sent:* Thursday, February 01, 2007 3:52 AM
*To:* Jayan Chirayath Kurian
*Cc:* dspace-tech@lists.sourceforge.net
*Subject:* Re: [Dspace-tech] how can I find out the collectionID?

Could you please kindly provide a sample Dublin_core.xml?

I assumed that dsrun would recursively go through the directories and index all the files under them. Apparently I was wrong. The requirement of Dublin_core.xml and the content file makes the process much less automatic. Is there a way around this?

Thanks a lot!

-Pan

On 1/30/07, *Jayan Chirayath Kurian* <[EMAIL PROTECTED]> wrote:

--
*From:* Pan Family [mailto: [EMAIL PROTECTED]
*Sent:* Wednesday, January 31, 2007 1:15 PM
*To:* Jayan Chirayath Kurian
*Cc:* Dorothea Salo; dspace-tech@lists.sourceforge.net
*Subject:* Re: [Dspace-tech] how can I find out the collectionID?

Ok. I will give this a try.

Still two questions:

(1) Where can I get the file Dublin_core.XML?

Dublin_core.xml contains the meta data descriptions of the resource (e.g. title, date published etc). You have to create the xml file using a notepad.

(2) Let's say I only want to index one file named: foo.pdf, and I put it under /Users/pan/tmp/foo.pdf and pass src=/Users/pan to dsrun. Is foo.pdf considered the content file or the resource? And which is the third type of file?

foo.pdf is the resource (i.e. pdf or ppt or jpeg...). Content file is a text file that just contains the name of the resource i.e. foo.pdf

Thanks a lot!

-Pan

On 1/30/07, *Jayan Chirayath Kurian* <[EMAIL PROTECTED]> wrote:

I feel the tmp directory should have (1) the Dublin_core.XML (2) contents file and (3) actual resource. The tmp directory should have all these files without any more subdirectories for these files.

Can you try with source=/Users/pan/ and removing all subdirectories under tmp and having only these 3 files listed above. Hope it works.

My structure is src = C:\DSpace\bin\archive_directory
The archive_directory contains the directory Item_001
Item_001 contains (1) Dublin_core.XML (2) contents file and (3) actual resource. There are no more subdirectories under Item_001.

Thanks,
Jayan

--
*From:* Pan Family [mailto: [EMAIL PROTECTED]
*Sent:* Wednesday, January 31, 2007 4:06 AM
*To:* Jayan Chirayath Kurian
*Cc:* Dorothea Salo; dspace-tech@lists.sourceforge.net
*Subject:* Re: [Dspace-tech] how can I find out the collectionID?

Thanks for your help!

I am working on Mac OS X. Yes, "pan" contains "tmp". It seems that for me the dir that I give to source= cannot contain any subdirs. For example, if I give it "/Users/pan/" I got an error complaining about the missing file ".fvwm/dublin_core.xml". .fvwm is a subdir under "Users/pan/". If I give it "/Users/pan/tmp/" then it complains about the same missing file under the subdirs of "tmp" until I removed all the subdirs under "tmp".

But I still don't get the files under "tmp" imported to my collection, even if no error shows after I removed all subdirs.

bubba:$ dsrun org.dspace.app.itemimport.ItemImport --add --eperson= [EMAIL PROTECTED] --collection=123456789/2 --source=/Users/pan/ --mapfile=/Users/pan/test_map --test
**Test Run** - not actually importing items.
Destination collections:
Owning Collection: PODAAC collection
Adding items from directory: /Users/pan/
Generating mapfile: /Users/pan/test_map
Adding item from directory .fvwm
java.io.FileNotFoundException: /Users/pan/.fvwm/dublin_core.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at java.io.FileInputStream.<init>(FileInputStream.java:66)
    at sun.net.www.protocol.file.FileURLConnection.connect( FileURLConnec
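The importer does not crawl a directory tree by itself, so the crawling has to happen before the import. The sketch below is one possible way to do it in Python, under assumptions drawn from this thread: every item lives in its own subdirectory of the source directory, and each item directory holds a `dublin_core.xml`, a `contents` file naming the bitstream, and the bitstream itself. The `item_NNN` naming, the use of the file name as the title, and the function name `build_archive` are all illustrative choices, not DSpace requirements.

```python
import os
import shutil
from xml.sax.saxutils import escape


def build_archive(source_root, archive_dir):
    """Create one item_NNN directory under archive_dir per file in source_root.

    archive_dir should live outside source_root so the crawl does not
    pick up its own output on a later run.
    """
    files = []
    for dirpath, dirnames, filenames in os.walk(source_root):
        # Skip hidden directories (such as .fvwm) and fix the visiting order.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        for name in sorted(filenames):
            if not name.startswith("."):
                files.append(os.path.join(dirpath, name))

    os.makedirs(archive_dir, exist_ok=True)
    created = []
    for number, path in enumerate(files, start=1):
        item_dir = os.path.join(archive_dir, f"item_{number:03d}")
        os.makedirs(item_dir)
        name = os.path.basename(path)
        shutil.copy2(path, os.path.join(item_dir, name))
        # contents: one bitstream file name per line.
        with open(os.path.join(item_dir, "contents"), "w", encoding="utf-8") as f:
            f.write(name + "\n")
        # Minimal metadata: the file name stands in for a real title.
        with open(os.path.join(item_dir, "dublin_core.xml"), "w", encoding="utf-8") as f:
            f.write("<dublin_core>\n")
            f.write(f'  <dcvalue element="title" qualifier="none">{escape(name)}</dcvalue>\n')
            f.write("</dublin_core>\n")
        created.append(item_dir)
    return created
```

The resulting `archive_dir` could then be passed to the importer's --source option. Real metadata (author, date and so on) would still have to come from somewhere, which is the part that cannot be fully automated.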
[Dspace-tech] DSpace not indexing MS Powerpoint files?
Hi,

I submitted an MS ppt file to my collection, but filter-media does not index it. I tried shutting down the database (PostgreSQL), restarting it, and running filter-media several times, but it did not help. I made sure that this ppt file is indeed in the collection by opening it using View/Open. I have no problem indexing MS Word, text, html, or pdf files. Do I need to do anything special for ppt files?

Thanks a lot!

-Pan
Re: [Dspace-tech] how can I find out the collectionID?
Could you please provide a sample Dublin_core.xml? I assumed that dsrun would recursively go through the directories and index all the files under them. Apparently I was wrong. Requiring a Dublin_core.xml and a contents file for every item makes the process much less automatic. Is there a way around this? Thanks a lot! -Pan

On 1/30/07, Jayan Chirayath Kurian <[EMAIL PROTECTED]> wrote:

> From: Pan Family [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, January 31, 2007 1:15 PM
> To: Jayan Chirayath Kurian
> Cc: Dorothea Salo; dspace-tech@lists.sourceforge.net
> Subject: Re: [Dspace-tech] how can I find out the collectionID?
>
> Ok. I will give this a try. Still two questions:
> (1) Where can I get the file Dublin_core.XML?

Dublin_core.xml contains the metadata describing the resource (e.g. title, date published). You have to create the XML file yourself in a text editor such as Notepad.

> (2) Let's say I only want to index one file named foo.pdf, and I put it under /Users/pan/tmp/foo.pdf and pass src=/Users/pan to dsrun. Is foo.pdf considered the content file or the resource? And which is the third type of file?

foo.pdf is the resource (i.e. the pdf, ppt, jpeg, and so on). The contents file is a text file that just contains the name of the resource, i.e. foo.pdf.

> Thanks a lot! -Pan
>
> On 1/30/07, Jayan Chirayath Kurian <[EMAIL PROTECTED]> wrote:
> > I feel the tmp directory should have (1) the Dublin_core.XML, (2) the contents file and (3) the actual resource. The tmp directory should have all these files without any more subdirectories for these files. Can you try with source=/Users/pan/, removing all subdirectories under tmp and keeping only the 3 files listed above? Hope it works.
> > My structure is src = C:\DSpace\bin\archive_directory. The archive_directory contains the directory Item_001. Item_001 contains (1) Dublin_core.XML, (2) the contents file and (3) the actual resource. There are no more subdirectories under Item_001.
> > Thanks, Jayan
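The layout Jayan describes (one directory per item holding the metadata file, a contents file and the resource itself) can be generated rather than typed by hand. The following is only a minimal sketch under stated assumptions: `build_item_dir` is a hypothetical helper, not part of DSpace, and the `<dublin_core>`/`<dcvalue element=... qualifier=...>` layout is assumed from DSpace's simple archive format as commonly documented, so check it against the documentation for your version. The file name is written in lowercase because that is how it appears in the FileNotFoundException in this thread.

```python
import shutil
import tempfile
from pathlib import Path
from xml.sax.saxutils import escape


def build_item_dir(archive_dir, item_name, resource, metadata):
    """Create archive_dir/item_name with dublin_core.xml, contents and the resource.

    metadata maps (element, qualifier) pairs to values, e.g.
    {("title", "none"): "My document"}. Hypothetical helper for illustration.
    """
    item_dir = Path(archive_dir) / item_name
    item_dir.mkdir(parents=True, exist_ok=True)

    # Metadata file: one <dcvalue> per field (layout is an assumption, see lead-in).
    lines = ["<dublin_core>"]
    for (element, qualifier), value in metadata.items():
        lines.append(
            f'  <dcvalue element="{element}" qualifier="{qualifier}">'
            f"{escape(value)}</dcvalue>"
        )
    lines.append("</dublin_core>")
    (item_dir / "dublin_core.xml").write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Copy the resource in and list its file name in the contents file.
    resource = Path(resource)
    shutil.copy(resource, item_dir / resource.name)
    (item_dir / "contents").write_text(resource.name + "\n", encoding="utf-8")
    return item_dir


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing real is touched.
    work = Path(tempfile.mkdtemp())
    pdf = work / "foo.pdf"
    pdf.write_bytes(b"%PDF-1.4 placeholder")
    item = build_item_dir(work / "archive_directory", "Item_001", pdf,
                          {("title", "none"): "Sample title"})
    print(sorted(p.name for p in item.iterdir()))
```

With a layout like this, the directory passed to --source would be `archive_directory`, not `Item_001`. Calling the helper once per document in a loop is one way to take some of the manual work out of a batch import.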
Re: [Dspace-tech] how can I find out the collectionID?
Ok. I will give this a try. Still two questions: (1) Where can I get the file Dublin_core.XML? (2) Let's say I only want to index one file named foo.pdf, and I put it under /Users/pan/tmp/foo.pdf and pass src=/Users/pan to dsrun. Is foo.pdf considered the content file or the resource? And which is the third type of file? Thanks a lot! -Pan

On 1/30/07, Jayan Chirayath Kurian <[EMAIL PROTECTED]> wrote:
> I feel the tmp directory should have (1) the Dublin_core.XML, (2) the contents file and (3) the actual resource. The tmp directory should have all these files without any more subdirectories for these files. Can you try with source=/Users/pan/, removing all subdirectories under tmp and keeping only the 3 files listed above? Hope it works.
> My structure is src = C:\DSpace\bin\archive_directory. The archive_directory contains the directory Item_001. Item_001 contains (1) Dublin_core.XML, (2) the contents file and (3) the actual resource. There are no more subdirectories under Item_001.
> Thanks, Jayan
[Dspace-tech] indexing a website?
Hi, Can we use DSpace to index a website? We have a website that contains tons of documents in HTML, PDF, and DOC formats, and we'd like to use DSpace to index these to build a Knowledge Base so that our customers can search our website. Of course, we could import those documents in batch mode, but it would be better if search results pointed to our website rather than to files stored in DSpace. I hope my question makes sense to you. Thanks a lot! -Pan
Re: [Dspace-tech] how can I find out the collectionID?
Thanks for your help! I am working on Mac OS X. Yes, "pan" contains "tmp". It seems that, for me, the directory that I give to source= cannot contain any subdirectories. For example, if I give it "/Users/pan/" I get an error complaining about the missing file ".fvwm/dublin_core.xml"; .fvwm is a subdirectory under "/Users/pan/". If I give it "/Users/pan/tmp/", it complains about the same missing file under each subdirectory of "tmp" until I remove all of them. But the files under "tmp" are still not imported into my collection, even though no error shows once the subdirectories are gone.

bubba:$ dsrun org.dspace.app.itemimport.ItemImport --add --eperson=[EMAIL PROTECTED] --collection=123456789/2 --source=/Users/pan/ --mapfile=/Users/pan/test_map --test
**Test Run** - not actually importing items.
Destination collections:
Owning Collection: PODAAC collection
Adding items from directory: /Users/pan/
Generating mapfile: /Users/pan/test_map
Adding item from directory .fvwm
java.io.FileNotFoundException: /Users/pan/.fvwm/dublin_core.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at java.io.FileInputStream.<init>(FileInputStream.java:66)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161)
    at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
    at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
    at org.dspace.app.itemimport.ItemImport.loadXML(ItemImport.java:1269)
    at org.dspace.app.itemimport.ItemImport.loadDublinCore(ItemImport.java:795)
    at org.dspace.app.itemimport.ItemImport.loadMetadata(ItemImport.java:780)
    at org.dspace.app.itemimport.ItemImport.addItem(ItemImport.java:626)
    at org.dspace.app.itemimport.ItemImport.addItems(ItemImport.java:498)
    at org.dspace.app.itemimport.ItemImport.main(ItemImport.java:407)
java.io.FileNotFoundException: /Users/pan/.fvwm/dublin_core.xml (No such file or directory)
***End of Test Run***

On 1/29/07, Jayan Chirayath Kurian <[EMAIL PROTECTED]> wrote:
> Can you please try with source=/Users/pan/? I encountered the same problem on the Windows platform. It was rectified by giving the main folder name with the import command. I assume that "pan" contains the subfolder "tmp", which in fact contains the pdf file. Please let me know if this works for you.
> Thanks, Jayan
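The output in this message ("Adding item from directory .fvwm") suggests that the importer treats each subdirectory of --source as one item and looks for dublin_core.xml inside it, which would also explain why loose files directly under the source directory produce neither an import nor an error. That is a reading of the log, not a statement about ItemImport's implementation. Under that assumption, a small pre-flight check can show what a run is likely to pick up. `check_source_dir` and the `REQUIRED` tuple are hypothetical names for illustration; the requirement for a `contents` file comes from Jayan's description elsewhere in the thread.

```python
from pathlib import Path

# Files each item directory is expected to hold, per the discussion in this thread.
REQUIRED = ("dublin_core.xml", "contents")


def check_source_dir(source):
    """Inspect the immediate subdirectories of `source`.

    Returns (ready, problems): `ready` lists directories that contain every
    required file, `problems` maps a directory name to the files it lacks.
    Loose files directly under `source` are ignored, mirroring the behaviour
    the log appears to show.
    """
    ready, problems = [], {}
    for child in sorted(Path(source).iterdir()):
        if not child.is_dir():
            continue
        missing = [name for name in REQUIRED if not (child / name).is_file()]
        if missing:
            problems[child.name] = missing
        else:
            ready.append(child.name)
    return ready, problems


if __name__ == "__main__":
    import sys

    ready, problems = check_source_dir(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("looks importable:", ready)
    for name, missing in problems.items():
        print(f"{name}: missing {', '.join(missing)}")
```

Run against a home directory, a check like this would flag hidden directories such as .fvwm before the importer does. Run against a directory that holds only loose files, it reports nothing importable, which matches the empty mapfile described in the thread.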
Re: [Dspace-tech] how can I find out the collectionID?
Hi Dorothea: Thanks a lot for your help! In my case, the handle is 123456789/2. So I used the following command to add a pdf file under /Users/pan/tmp, but the pdf file was not added to the collection and the file test_map is empty. No error message was shown either. I wonder what I did wrong. Could you give me some ideas on how to debug? Thanks again, -Pan

bubba:~/dspace-1.4.1-source/bin pan$ dsrun org.dspace.app.itemimport.ItemImport --add [EMAIL PROTECTED]/2 --source=/Users/pan/tmp/ --mapfile=/Users/pan/tmp/test_map
Destination collections:
Owning Collection: PODAAC collection
Adding items from directory: /Users/pan/tmp/
Generating mapfile: /Users/pan/tmp/test_map

On 1/29/07, Dorothea Salo <[EMAIL PROTECTED]> wrote:
> Pan Family wrote:
> > dsrun org.dspace.app.itemimport.ItemImport --add
> > [EMAIL PROTECTED] --collection=collectionID --source=items_dir
> > --mapfile=mapfile
> >
> > Hi,
> >
> > The above command for batch import requires
> > the collectionID as input. I wonder how
> > I can find out this ID? Is it the string
> > that I used to name my collection, or an ID
> > that DSpace uses internally?
>
> You can use the collection's handle for this; go to the collection's home page and use the numbers after "handle/" in the URL.
>
> If you should need the internal DSpace collection ID for some reason, though, log in, surf to the collection page, and then use the "Edit" button under Admin Tools. From there, choose "Collection's Authorizations," and DSpace will pop up the "DB ID" in the title of the page. (I hope there's an easier way to do this! There certainly should be.)
>
> Dorothea
>
> --
> Dorothea Salo, Digital Repository Services Librarian (703)993-3742 [EMAIL PROTECTED] AIM: gmumars MSN 2FL, Fenwick Library George Mason University 4400 University Drive, Fairfax VA 22031
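Dorothea's tip (take the numbers after "handle/" in the collection page's URL) is easy to script when many collections are involved. The sketch below is illustrative only: `handle_from_url` is a hypothetical helper, the example URL is made up, and it assumes the handle always has the two-part prefix/suffix form seen in this thread (123456789/2).

```python
from urllib.parse import urlparse


def handle_from_url(url):
    """Return the 'prefix/suffix' handle that follows 'handle/' in a URL path.

    Raises ValueError if the path has no 'handle' segment or if fewer than
    two segments follow it.
    """
    parts = [segment for segment in urlparse(url).path.split("/") if segment]
    if "handle" not in parts:
        raise ValueError(f"no 'handle/' segment in {url!r}")
    start = parts.index("handle") + 1
    tail = parts[start:start + 2]
    if len(tail) != 2:
        raise ValueError(f"incomplete handle in {url!r}")
    return "/".join(tail)


if __name__ == "__main__":
    # Hypothetical collection home page URL, for illustration only.
    print(handle_from_url("http://repository.example.org/dspace/handle/123456789/2"))
```

The returned string is the value that goes after --collection= in the import command shown in this thread.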
[Dspace-tech] how can I find out the collectionID?
dsrun org.dspace.app.itemimport.ItemImport --add [EMAIL PROTECTED] --collection=collectionID --source=items_dir --mapfile=mapfile Hi, The above command for batch import requires the collectionID as input. I wonder how I can find out this ID? Is it the string that I used to name my collection, or an ID that DSpace uses internally? Thanks a lot! -Pan