Re: [Dspace-tech] media filter question

2013-09-25 Thread Bill Tantzen
Solved.

In v3.2, bitstreamformatregistry.short_description for mimetype
application/pdf is 'Adobe PDF'.  However, in my installation (for some
long-lost reason) the short_description is simply 'PDF'.

Therefore in MediaFilterManager.java::filterBitstream(), the test at line
556:

  if (fmts.contains(myBitstream.getFormat().getShortDescription()))

never returns true, so no pdf files are ever processed.

As a workaround, in dspace.cfg, I changed

  filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF

to

  filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF, PDF

and voila!  Everything works.  I could just as easily have updated
bitstreamformatregistry, but I was wary of breaking something else.
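
For anyone hitting the same symptom, here is a minimal Python sketch of why the test quoted above never matched.  It is not the DSpace code: the function names are made up for illustration, and it assumes the inputFormats value is split on commas and trimmed before an exact, case-sensitive string comparison.

```python
def parse_input_formats(config_value):
    """Split a comma-separated inputFormats value into trimmed names (assumed behaviour)."""
    return [name.strip() for name in config_value.split(",") if name.strip()]


def is_filterable(short_description, config_value):
    """Exact, case-sensitive membership test, analogous to contains() on a list of strings."""
    return short_description in parse_input_formats(config_value)


# Original setting: a registry entry named 'PDF' does not match 'Adobe PDF'.
print(is_filterable("PDF", "Adobe PDF"))             # False
# Workaround setting: both spellings are accepted.
print(is_filterable("PDF", "Adobe PDF, PDF"))        # True
print(is_filterable("Adobe PDF", "Adobe PDF, PDF"))  # True
```

The comparison is on the whole string, so 'PDF' is not treated as a match for 'Adobe PDF' even though one contains the other.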

Cheers!
Bill



Re: [Dspace-tech] media filter question

2013-09-23 Thread Bill Tantzen
Ivan,

Thanks for checking in...

dspace filter-media returns with exit status 0.  The dspace log shows no
errors, just entries of the form:

2013-09-23 10:37:41,012 INFO  org.dspace.search.DSIndexer @ Writing
Community: 2408/104859 to Index

or:

2013-09-23 10:37:40,336 INFO  org.dspace.search.DSIndexer @ Writing
Collection: 2408/55874 to Index

The output from the command line is short.  Normally, I would expect to see
a log line for each bitstream examined, beginning with 'FILTERED' or 'SKIPPED'.
Instead I see only a few errors for .doc files (Invalid Format) followed
by a couple of SKIPPED entries for bitstreams with an existing .txt file.
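
A rough way to quantify "short" is to tally the output lines by their leading keyword.  The Python sketch below is only an illustration: the function name is made up, the sample lines are placeholders rather than real filter-media output, and the only thing taken from the description above is that lines begin with 'FILTERED' or 'SKIPPED'.

```python
from collections import Counter


def count_outcomes(output_text):
    """Count non-empty lines by leading keyword; anything else is tallied as OTHER."""
    counts = Counter()
    for line in output_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("FILTERED"):
            counts["FILTERED"] += 1
        elif stripped.startswith("SKIPPED"):
            counts["SKIPPED"] += 1
        else:
            counts["OTHER"] += 1
    return counts


# Placeholder lines, NOT real filter-media output.
sample = "SKIPPED placeholder one\nSKIPPED placeholder two\nsome other line\n"
print(count_outcomes(sample))
```

A FILTERED count of zero alongside a large number of PDF bitstreams would point at the same problem described in this message.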

All the .pdf files are in the ORIGINAL bundle.  For instance:

dspace=> select * from item2bundle where item_id = 34950;
-[ RECORD 1 ]----
id        | 39982
item_id   | 34950
bundle_id | 39983
-[ RECORD 2 ]----
id        | 39983
item_id   | 34950
bundle_id | 39984

dspace=> select * from bundle where bundle_id in ( 39983, 39984 );
-[ RECORD 1 ]--------+---------
bundle_id            | 39983
name                 | LICENSE
primary_bitstream_id |
-[ RECORD 2 ]--------+---------
bundle_id            | 39984
name                 | ORIGINAL
primary_bitstream_id |

dspace=> select * from bundle2bitstream where bundle_id = 39984;
-[ RECORD 1 ]---+------
id              | 40042
bundle_id       | 39984
bitstream_id    | 40065
bitstream_order | 2

dspace=> select * from bitstream where bitstream_id = 40065;
-[ RECORD 1 ]-----------+------------------------------------------------
bitstream_id            | 40065
bitstream_format_id     | 3
name                    | 8175706.pdf
size_bytes              | 6587102
checksum                | 164de17195af1d0de45cd17a431fc2b9
checksum_algorithm      | MD5
description             |
user_format_description |
source                  | /dspace/assetstore/dspace-sr/upload/8175706.pdf
internal_id             | 104968051252620967298398595849898250327
deleted                 | f
store_number            | 0
sequence_id             | 2

This bitstream however is neither FILTERED nor SKIPPED.
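
One check not shown above is resolving bitstream_format_id to its registry entry, which is where the cause turned out to be (see the 2013-09-25 message at the top of the thread).  The Python sketch below uses an in-memory sqlite3 database as a stand-in for the DSpace PostgreSQL database.  The two-table layout is trimmed down, the join column in bitstreamformatregistry is an assumption, and so is the idea that format id 3 is the application/pdf row.

```python
import sqlite3

# Stand-in database: only the columns needed for the lookup are modelled.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bitstreamformatregistry (
    bitstream_format_id INTEGER PRIMARY KEY,  -- assumed join column
    mimetype            TEXT,
    short_description   TEXT
);
CREATE TABLE bitstream (
    bitstream_id        INTEGER PRIMARY KEY,
    bitstream_format_id INTEGER,
    name                TEXT
);
-- Assumption: format id 3 is the application/pdf entry, named 'PDF' here.
INSERT INTO bitstreamformatregistry VALUES (3, 'application/pdf', 'PDF');
INSERT INTO bitstream VALUES (40065, 3, '8175706.pdf');
""")

QUERY = """
SELECT b.name, f.mimetype, f.short_description
FROM bitstream b
JOIN bitstreamformatregistry f
  ON f.bitstream_format_id = b.bitstream_format_id
WHERE b.bitstream_id = ?
"""

row = conn.execute(QUERY, (40065,)).fetchone()
print(row)  # ('8175706.pdf', 'application/pdf', 'PDF')
```

The short_description returned by such a query is the value to compare against the inputFormats setting of the filter in dspace.cfg.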

This database has been recently updated from v1.42 to v3, and I suspect the
problem is somewhere in the db rather than a bug in the code, but
everything *looks* right to me.  I can trace the relations from the
community to collection to item, but for some reason the bitstreams are
simply not checked.

What do you think?
Bill



Re: [Dspace-tech] media filter question

2013-09-22 Thread helix84
Hi Bill, please remember to keep dspace-tech in CC.

Can you please tell me what the result of each of my suggestions was?
1) What was the errorlevel of your filter-media command?
2) Did you look at the log while it was running using tail -f?
3) Were all the bitstreams you expected to be filtered in the ORIGINAL
bundle? (check at least a few)


On Fri, Sep 20, 2013 at 10:09 PM, Bill Tantzen wile...@gmail.com wrote:
 Hi Ivan!

 I've tried all these suggestions, and still, no success.

 There are no errors in the log, only entries of the form:

 2013-09-20 15:00:24,802 INFO  org.dspace.search.DSIndexer @ Writing
 Community: 2408/36293 to Index

 And

 2013-09-20 15:00:17,990 INFO  org.dspace.search.DSIndexer @ Writing
 Collection: 2408/35292 to Index

 One for each community and collection.  The bundles are ORIGINAL, nothing
 special here...

 The database seems OK, I am able to follow the communities to collections to
 items just fine, but no bitstreams are being filtered.

 I'll keep debugging on my end, but if you have any other ideas, do pass them
 my way!
 Bill



Regards,
~~helix84

Compulsory reading: DSpace Mailing List Etiquette
https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette

--
LIMITED TIME SALE - Full Year of Microsoft Training For Just $49.99!
1,500+ hours of tutorials including VisualStudio 2012, Windows 8, SharePoint
2013, SQL 2012, MVC 4, more. BEST VALUE: New Multi-Library Power Pack includes
Mobile, Cloud, Java, and UX Design. Lowest price ever! Ends 9/22/13. 
http://pubads.g.doubleclick.net/gampad/clk?id=64545871iu=/4140/ostg.clktrk
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette


Re: [Dspace-tech] media filter question

2013-09-19 Thread Bill Tantzen
Still working on this media filter issue -- maybe this might point me in
the right direction:  how are bitstreams selected for filtering?  Is it
something like SELECT * FROM bitstream WHERE ???
What is in the WHERE clause?  Or is there some other basis for selection?

Thanks,
Bill



Re: [Dspace-tech] media filter question

2013-09-19 Thread Jose Blanco
Bill, when you view an item as an admin, you should be able to see
the txt file created from the pdf file.  I suppose you can see
these for the pdf files media-filter actually got to, but not for the
others, right?  I also wonder if media filter choked along the way,
but you said you did not get any error messages.  What about in the
logs? Look at some items as admin and see if this gives you any clue.

-Jose



Re: [Dspace-tech] media filter question

2013-09-19 Thread Jose Blanco
One more thing.  Do this:

./dspace filter-media -h

and see what is available.  I have version 3 so I'm not sure what is
in your version, but mine has these options, and one of them is to
process only a particular item, so you could try that and see what happens.

 ./dspace filter-media -h
usage: MediaFilterManager

 -p,--plugins      ONLY run the specified Media Filter plugin(s)
                   listed from 'filter.plugins' in dspace.cfg.
                   Separate multiple with a comma (,)
                   (e.g. MediaFilterManager -p
                   Word Text Extractor,PDF Text Extractor)
 -s,--skip         SKIP the bitstreams belonging to identifier
                   Separate multiple identifiers with a comma (,)
                   (e.g. MediaFilterManager -s
                   123456789/34,123456789/323)
 -f,--force        force all bitstreams to be processed
 -h,--help         help
 -i,--identifier   ONLY process bitstreams belonging to identifier
 -m,--maximum      process no more than maximum items
 -n,--noindex      do NOT update the search index after filtering
                   bitstreams
 -q,--quiet        do not print anything except in the event of errors.
 -v,--verbose      print all extracted text and other details to STDOUT
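
To try a single item, the options above can be combined into one command line.  The Python sketch below only builds the argument list; the helper name is made up, the handle is the example value from the help text, and it assumes the -i, -p, -f and -v options behave as the help output describes.

```python
def filter_media_command(identifier=None, plugins=None, force=False, verbose=False):
    """Build an argument list for ./dspace filter-media from the options in the help text."""
    argv = ["./dspace", "filter-media"]
    if identifier:
        argv += ["-i", identifier]           # ONLY process bitstreams belonging to identifier
    if plugins:
        argv += ["-p", ",".join(plugins)]    # plugin names separated by a comma
    if force:
        argv.append("-f")                    # force all bitstreams to be processed
    if verbose:
        argv.append("-v")                    # print extracted text and other details
    return argv


cmd = filter_media_command(
    identifier="123456789/34",               # example handle from the help text
    plugins=["PDF Text Extractor"],
    verbose=True,
)
print(cmd)
# ['./dspace', 'filter-media', '-i', '123456789/34', '-p', 'PDF Text Extractor', '-v']
```

Keeping the arguments as a list (for example when passing them to subprocess.run) avoids shell quoting problems with plugin names that contain spaces.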


Re: [Dspace-tech] media filter question

2013-09-19 Thread helix84
Hi Bill,

Jose's suggestion to look at the logs for errors is a good one. First
of all, we should determine whether the filtering failed during
processing some item or whether it completed with nothing else to
process.

Also check the errorlevel of the command. 1 means error, 0 means success.
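
As a sketch of that check in Python: the helper name is made up, and the two commands below are stand-ins that simply exit with 0 and 1, so the snippet runs anywhere.  In practice the argument list would be the real filter-media invocation.

```python
import subprocess
import sys


def run_and_report(argv):
    """Run a command and map its exit status to 'success' (0) or 'error' (non-zero)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    status = "success" if result.returncode == 0 else "error"
    return result.returncode, status


# Stand-in commands (NOT dspace): a Python one-liner that exits with a fixed status.
ok = [sys.executable, "-c", "import sys; sys.exit(0)"]
bad = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(run_and_report(ok))   # (0, 'success')
print(run_and_report(bad))  # (1, 'error')
```

Note that, as reported later in this thread, an exit status of 0 does not by itself show that any bitstream was actually filtered.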


On Thu, Sep 19, 2013 at 3:03 PM, Bill Tantzen wile...@gmail.com wrote:
 Still working on this media filter issue -- maybe this might point me in the
 right direction:  how are bitstreams selected for filtering?  Is it
 something like SELECT * FROM bitstream WHERE ???
 What is in the WHERE clause?  Or is there some other basis for selection?

No, it's not SQL. It's a recursive call down the hierarchy, as you can
see in this method and the few following it: [1]

However your WHERE suggestion got me thinking which bitstreams are
being processed and the answer is bitstreams in the ORIGINAL bundle.
So please check that your content bundles are called ORIGINAL and not
something else (e.g. THUMBNAIL or something custom).
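
Roughly, the selection can be pictured like the Python sketch below.  This is a toy model, not the code behind [1]: the classes are invented for illustration, the license file name is a placeholder, and the only rules taken from the description above are the walk down the hierarchy and the ORIGINAL bundle name.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bundle:
    name: str
    bitstreams: List[str] = field(default_factory=list)


@dataclass
class Item:
    bundles: List[Bundle] = field(default_factory=list)


@dataclass
class Collection:
    items: List[Item] = field(default_factory=list)


@dataclass
class Community:
    subcommunities: List["Community"] = field(default_factory=list)
    collections: List[Collection] = field(default_factory=list)


def candidates(community):
    """Yield bitstream names found in ORIGINAL bundles anywhere below a community."""
    for sub in community.subcommunities:
        yield from candidates(sub)            # recurse into sub-communities first
    for collection in community.collections:
        for item in collection.items:
            for bundle in item.bundles:
                if bundle.name == "ORIGINAL":  # other bundles are ignored
                    yield from bundle.bitstreams


item = Item(bundles=[
    Bundle("LICENSE", ["license.txt"]),       # placeholder file name
    Bundle("ORIGINAL", ["8175706.pdf"]),
])
top = Community(subcommunities=[Community(collections=[Collection(items=[item])])])
print(list(candidates(top)))  # ['8175706.pdf']
```

A bitstream that survives this walk can still be passed over later if its format is not one the filter accepts.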

[1] 
https://github.com/DSpace/DSpace/blob/dspace-3.2/dspace-api/src/main/java/org/dspace/app/mediafilter/MediaFilterManager.java#L393
[2] 
https://github.com/DSpace/DSpace/blob/dspace-3.2/dspace-api/src/main/java/org/dspace/app/mediafilter/MediaFilterManager.java#L502

Regards,
~~helix84

Compulsory reading: DSpace Mailing List Etiquette
https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette



Re: [Dspace-tech] media filter question

2013-09-18 Thread Bill Tantzen
Here's a snip from my dspace.cfg:

#Names of the enabled MediaFilter or FormatFilter plugins
filter.plugins = \
  PDF Text Extractor, \
  PDF Thumbnail, \
  HTML Text Extractor, \
  Word Text Extractor, \
  JPEG Thumbnail, \
  Branded Preview JPEG, \
  PowerPoint Text Extractor

# [To enable Branded Preview]: remove last line above, and uncomment 2 lines below
#Word Text Extractor, JPEG Thumbnail, \
#Branded Preview JPEG

#Assign 'human-understandable' names to each filter
plugin.named.org.dspace.app.mediafilter.FormatFilter = \
  org.dspace.app.mediafilter.XPDF2Text = PDF Text Extractor, \
  org.dspace.app.mediafilter.XPDF2Thumbnail = PDF Thumbnail, \
  org.dspace.app.mediafilter.HTMLFilter = HTML Text Extractor, \
  org.dspace.app.mediafilter.WordFilter = Word Text Extractor, \
  org.dspace.app.mediafilter.JPEGFilter = JPEG Thumbnail, \
  org.dspace.app.mediafilter.BrandedPreviewJPEGFilter = Branded Preview JPEG, \
  org.dspace.app.mediafilter.PowerPointFilter = PowerPoint Text Extractor

Specifically, I *think* the pdf filter should be enabled...  As I said, the
majority of the files are .pdf...
Bill
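
To double-check which plugin names such a snippet enables, the continuation lines can be joined and the list split on commas.  The Python sketch below is a simplified reader for this one purpose, not a full properties parser and not how DSpace loads dspace.cfg; the function names are made up.

```python
def read_property(text, key):
    """Return the value of key, joining lines that end with a backslash continuation."""
    logical, buffer = [], ""
    for raw in text.splitlines():
        line = raw.strip()
        if not buffer and (not line or line.startswith("#")):
            continue                      # skip blank lines and comments between entries
        if line.endswith("\\"):
            buffer += line[:-1]           # drop the backslash, keep accumulating
            continue
        logical.append(buffer + line)
        buffer = ""
    if buffer:
        logical.append(buffer)
    for entry in logical:
        name, sep, value = entry.partition("=")
        if sep and name.strip() == key:
            return value.strip()
    return None


def enabled_plugins(text):
    """Split the filter.plugins value into a list of trimmed plugin names."""
    value = read_property(text, "filter.plugins") or ""
    return [name.strip() for name in value.split(",") if name.strip()]


SAMPLE = r"""
#Names of the enabled MediaFilter or FormatFilter plugins
filter.plugins = \
  PDF Text Extractor, \
  PDF Thumbnail, \
  HTML Text Extractor, \
  Word Text Extractor, \
  JPEG Thumbnail, \
  Branded Preview JPEG, \
  PowerPoint Text Extractor
"""

for name in enabled_plugins(SAMPLE):
    print(name)
```

A list like this confirms that a plugin is enabled, but not that the formats of the stored bitstreams match what that plugin accepts.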



Re: [Dspace-tech] media filter question

2013-09-18 Thread helix84
Hi Bill,

check your configuration to see which media filters you actually have enabled:
https://wiki.duraspace.org/pages/viewpage.action?pageId=32474041#TransformingDSpaceContent(MediaFilters)-AvailableMediaFilters

It's possible that you have only a mediafilter for one file type
enabled and thus it skips the majority of your files.


Regards,
~~helix84

Compulsory reading: DSpace Mailing List Etiquette
https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
