Booooo

On Mon, Nov 23, 2015 at 5:09 PM, Chris Mattmann <[email protected]> wrote:
> yep, agreed.
>
> --
> Chris Mattmann
> [email protected]
>
> -----Original Message-----
> From: Tom Barber <[email protected]>
> Reply-To: <[email protected]>
> Date: Monday, November 23, 2015 at 9:06 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Crawling / Archiving binary data with Solr backend
>
>> Dumping a .met file and calling the filemgr client ingest routine works
>> fine, so it appears to be either something broken in the crawler or
>> something I'm doing wrong with it.
>>
>> Tom
>>
>> On Mon, Nov 23, 2015 at 3:45 PM, Tom Barber <[email protected]> wrote:
>>
>>> I'll give it a go. Thanks.
>>>
>>> On Mon, Nov 23, 2015 at 3:44 PM, Chris Mattmann <[email protected]> wrote:
>>>
>>>> Doesn’t look weird. Hmm. Can you generate a metadata file
>>>> using TikaCmdLine extractor and then use that metadata file
>>>> to ingest into File Manager by hand? Does that work?
>>>>
>>>> -----Original Message-----
>>>> From: Tom Barber <[email protected]>
>>>> Reply-To: <[email protected]>
>>>> Date: Monday, November 23, 2015 at 7:43 AM
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>>
>>>>> Author: Alun Davis - Loudmouth
>>>>> Content-Length: 3273160
>>>>> Content-Type: audio/mpeg
>>>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>>>>> X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
>>>>> X-TIKA:digest:SHA256: 34d8bf9da8feb848922138eb7807c0d71ed92376422fb28c8cbbffe788574ab0
>>>>> channels: 2
>>>>> creator: Alun Davis - Loudmouth
>>>>> dc:creator: Alun Davis - Loudmouth
>>>>> dc:title: Teenage Baghead
>>>>> meta:author: Alun Davis - Loudmouth
>>>>> resourceName: Teenage Baghead.mp3
>>>>> samplerate: 44100
>>>>> title: Teenage Baghead
>>>>> version: MPEG 3 Layer III Version 1
>>>>> xmpDM:album:
>>>>> xmpDM:artist: Alun Davis - Loudmouth
>>>>> xmpDM:audioChannelType: Stereo
>>>>> xmpDM:audioCompressor: MP3
>>>>> xmpDM:audioSampleRate: 44100
>>>>> xmpDM:duration: 204577.046875
>>>>> xmpDM:genre: Pop
>>>>> xmpDM:logComment: www.maimthattune.com for more!
>>>>> xmpDM:releaseDate: 2001
>>>>>
>>>>> Nothing that should scare a parser in the mp3 at least.
>>>>>
>>>>> On Mon, Nov 23, 2015 at 3:33 PM, Chris Mattmann <[email protected]> wrote:
>>>>>
>>>>>> Yeah, check the metadata. Any weird UTF-8 encoding?
>>>>>>
>>>>>> (That is, run Tika on the file outside of OODT; what do you see?)
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Tom Barber <[email protected]>
>>>>>> Reply-To: <[email protected]>
>>>>>> Date: Monday, November 23, 2015 at 7:23 AM
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>>>>
>>>>>>> ./crawler/bin/crawler_launcher --filemgrUrl http://localhost:9000
>>>>>>> --operation --launchMetCrawler --clientTransferer
>>>>>>> org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
>>>>>>> --productPath $OODT_HOME/data/staging --metExtractor
>>>>>>> org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
>>>>>>> --metExtractorConfig /home/bugg/Projects/surrey100/oodt/data/met/tika.conf
>>>>>>>
>>>>>>> I'm running that. It runs fine with the default Lucene setup and
>>>>>>> also runs fine with a txt file, but it fails on a random picture
>>>>>>> I took and on an mp3 I tested it with.
>>>>>>>
>>>>>>> On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
>>>>>>>
>>>>>>>> Encoding issues with the extracted metadata? What are you getting
>>>>>>>> just running Tika on the files?
>>>>>>>>
>>>>>>>> The actual data shouldn’t matter since it’s not being ingested
>>>>>>>> (are you doing it in place, or what data transferer are you using)?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>>
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>> Chris Mattmann, Ph.D.
>>>>>>>> Chief Architect
>>>>>>>> Instrument Software and Science Data Systems Section (398)
>>>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>>>> Email: [email protected]
>>>>>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>> Adjunct Associate Professor, Computer Science Department
>>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Tom Barber <[email protected]>
>>>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>>>> Date: Monday, November 23, 2015 at 6:36 AM
>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>> Subject: Crawling / Archiving binary data with Solr backend
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Looks like I've never tried it before with binary data. If I swap
>>>>>>>>> the filemgr defaults to use Solr and then try to crawl my staging
>>>>>>>>> directory using the Tika extractor, I get a lot of errors like this:
>>>>>>>>>
>>>>>>>>> org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
>>>>>>>>> org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
>>>>>>>>> ingesting product [org.apache.oodt.cas.filemgr.structs.Product@62b19476] : null
>>>>>>>>> at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
>>>>>>>>> at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
>>>>>>>>> at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>>
>>>>>>>>> Tom
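For anyone repeating the "any weird UTF-8 encoding?" check from this thread, here is a minimal Python sketch. It assumes you have saved Tika's "key: value" metadata output (in the format quoted above) as text. The function names parse_tika_metadata and find_suspect_values are hypothetical helpers, not part of Tika or OODT, and the three things flagged (empty values, control characters that XML 1.0 forbids, non-ASCII characters) are only candidates worth eyeballing. Nothing in the thread confirms that any of them causes the CatalogException.

```python
import re

# Control characters that may not appear in an XML 1.0 document.
# XML-RPC payloads are XML, so these are worth a look (an assumption,
# not a confirmed cause of the error in the thread).
XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def parse_tika_metadata(text):
    """Parse 'key: value' lines, as printed in the thread, into a dict."""
    metadata = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first ': ' so keys like 'X-TIKA:digest:MD5' stay whole.
        key, sep, value = line.partition(": ")
        if not sep:
            # A key with no value, e.g. 'xmpDM:album:'.
            key, value = line.rstrip().rstrip(":"), ""
        metadata[key.strip()] = value.strip()
    return metadata


def find_suspect_values(metadata):
    """Return (key, reason) pairs for values that deserve a closer look."""
    findings = []
    for key, value in metadata.items():
        if value == "":
            findings.append((key, "empty value"))
        if XML_ILLEGAL.search(value):
            findings.append((key, "control character not allowed in XML 1.0"))
        if not value.isascii():
            findings.append((key, "non-ASCII character"))
    return findings


# A few lines copied from the Tika output quoted in the thread.
SAMPLE = """\
Content-Type: audio/mpeg
dc:title: Teenage Baghead
xmpDM:album:
xmpDM:genre: Pop
"""

if __name__ == "__main__":
    for key, reason in find_suspect_values(parse_tika_metadata(SAMPLE)):
        print(f"{key}: {reason}")
```

On the sample lines, the only entry flagged is xmpDM:album, because its value is empty. That is an observation about the metadata, not a diagnosis of the ingest failure.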
