Begin forwarded message:

From: MengYing Wang <[email protected]>
Date: October 16, 2014 at 4:45:56 PM PDT
To: "Verma, Rishi (398J)" <[email protected]>
Cc: Christian Alan Mattmann <[email protected]>, "Mcgibbney, Lewis J (398J)" <[email protected]>, "Bryant, Ann C (398J-Affiliate)" <[email protected]>, "Ramirez, Paul M (398J)" <[email protected]>, "Mattmann, Chris A (3980)" <[email protected]>, Tyler Palsulich <[email protected]>
Subject: Re: Directed Research Weekly Report from 2014/09/29 - 2014/10/05

Dear Rishi,

When I try to build OODT RADiX with the command "mvn clean package -Pfm-solr-catalog", I get a "profile with id: 'fm-solr-catalog' has not been activated" warning. Have you by any chance seen this error before? Also, after the installation, no solr directory exists on my machine either. Thank you!

$ mvn clean package -Pfm-solr-catalog
[INFO] Scanning for projects...
[WARNING] Profile with id: 'fm-solr-catalog' has not been activated.
[INFO] Reactor build order:
[INFO]   Data Management System
......

Best,
Mengying Wang

On Sun, Oct 5, 2014 at 3:05 PM, Verma, Rishi (398J) <[email protected]> wrote:

Hi Mengying,

For integrating the OODT File Manager with Solr, you have a couple of options depending on the type of deployment you are doing and what stage your software is at:

If you're starting from scratch:

1. Use Vagrant virtual machine technology to get a pre-built OODT deployment connected to Solr in one command: https://cwiki.apache.org/confluence/display/OODT/Vagrant+Powered+OODT

2. Use OODT RADiX for a pre-built deployment directory containing the OODT File Manager, Workflow, Resource, etc., with Solr pre-integrated.
RADiX provides pre-configured OODT deployments, so you don't have to check out individual OODT modules from source and build them yourself. See: https://cwiki.apache.org/confluence/display/OODT/RADiX+Powered+By+OODT#RADiXPoweredByOODT-TheCommands

Make sure to build with the command: mvn -Pfm-solr-catalog package (see the README: http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/archetype-resources/README.txt)

3. Connect the OODT File Manager with Solr manually; see: https://cwiki.apache.org/confluence/display/OODT/Integrating+Solr+with+OODT+RADiX

If you already have a deployed OODT File Manager:

1. Follow these directions: https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide

2. If the above doesn't work, use OODT RADiX to create a File Manager and Solr deployment that works, and copy those directories into your currently deployed production directory.

Thanks - hope that helps!
Rishi

On Oct 5, 2014, at 10:56 AM, MengYing Wang <[email protected]> wrote:

Dear Prof. Mattmann and Rishi,

Attached are the nutch and solr directories: nutch_solr.zip <https://docs.google.com/file/d/0B7PYVKDpy0jlSnI3U1lFcGY0WnM/edit?usp=drive_web>

As for problem (6), I could use the SolrIndexer instead. The following is my File Manager directory: https://drive.google.com/file/d/0B7PYVKDpy0jlVTk2NWFFY2sycW8/view?usp=sharing

Thank you!

Best,
Mengying Wang

On Sun, Oct 5, 2014 at 9:25 AM, Christian Alan Mattmann <[email protected]> wrote:

Thanks Angela. Great work! Some comments/feedback:

(1) According to https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide, use the Apache OODT PushPull to crawl data files from a remote server to the local machine [Failed, no data files downloaded at all].

- This problem is not so urgent. Maybe I should use an FTP client tool, e.g., FileZilla, to download data files from the remote FTP servers.
MY COMMENT: Please send me your PushPull directory zipped up. I will take a look - Tyler, can you also look?

(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch, use Apache Nutch and Solr to crawl and index local data files [Failed, no data is indexed in Solr].

- This problem is not so urgent. Maybe this feature only works for Nutch 2.x; my Nutch version is 1.9. Also, I could use the OODT Crawler to ingest local files.

MY COMMENT: Please send me your nutch + solr directories, zipped up. I will take a look.

(6) According to https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide, integrate the Apache OODT File Manager with Apache Solr [Failed, no product information available in Solr].

- It doesn't work out. However, I could use (5) to integrate the OODT File Manager and Solr.

MY COMMENT: Rishi, can you guys help Angela with OODT + Solr + FM? It's not working for her. Thanks!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department
University of Southern California
Los Angeles, CA 90089 USA
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: MengYing Wang <[email protected]>
Date: Saturday, October 4, 2014 at 9:12 PM
To: Chris Mattmann <[email protected]>, "Mcgibbney, Lewis J (398J)" <[email protected]>
Cc: Annie Bryant <[email protected]>, "Ramirez, Paul M (398J)" <[email protected]>, Chris Mattmann <[email protected]>
Subject: Directed Research Weekly Report from 2014/09/29 - 2014/10/05

>Dear Prof.
>Mattmann,

>New status on the previously failed problems:

>(1) According to https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide, use the Apache OODT PushPull to crawl data files from a remote server to the local machine [Failed, no data files downloaded at all].

>- This problem is not so urgent. Maybe I should use an FTP client tool, e.g., FileZilla, to download data files from the remote FTP servers.

>(2) Use the Apache OODT PushPull to crawl webpages [Succeeded].

>(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch, use Apache Nutch and Solr to crawl and index local data files [Failed, no data is indexed in Solr].

>- This problem is not so urgent. Maybe this feature only works for Nutch 2.x; my Nutch version is 1.9. Also, I could use the OODT Crawler to ingest local files.

>(4) Integrate the Tika parser with Apache Nutch [Failed, no Tika fields available in Solr].

>- Still in progress.

>(5) According to https://cwiki.apache.org/confluence/display/OODT/Using+the+SolrIndexer+to+dump+a+File+Manager+Catalog, use the SolrIndexer to dump all product information from the Apache OODT File Manager to Apache Solr [Succeeded].

>(6) According to https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide, integrate the Apache OODT File Manager with Apache Solr [Failed, no product information available in Solr].

>- It doesn't work out. However, I could use (5) to integrate the OODT File Manager and Solr.

>So far, I have two ways to crawl remote data and build indexes in Solr:

>(1) Move data to the local machine using FileZilla -> develop a metadata extractor using Tika -> crawl the data directory using the OODT Crawler -> migrate product information to Solr using the SolrIndexer.

>(2) Crawl websites using Nutch -> index some basic metadata in Solr.

>Thanks.
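For reference, pipeline (1) above can be sketched as shell commands. This is a rough sketch, not a verified recipe: the install prefix, staging path, ports, config file names, and flag spellings are assumptions about a typical RADiX-style OODT deployment, and should be checked against `./crawler_launcher -h` and the SolrIndexer wiki page for your OODT version.

```shell
# Sketch of pipeline (1); all paths, ports, and flag names below are
# assumptions for a typical RADiX-style deployment -- verify against the
# tools' own --help output before running.

# 1. Crawl the staging directory, letting the Tika command-line extractor
#    produce metadata and ingesting the resulting products into the File
#    Manager (assumed to be running on its default port, 9000).
cd /usr/local/oodt/crawler/bin
./crawler_launcher \
  --operation --launchMetCrawler \
  --filemgrUrl http://localhost:9000 \
  --productPath /data/staging \
  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
  --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
  --metExtractorConfig ../etc/metextractor.config

# 2. Dump the File Manager catalog into Solr with the SolrIndexer
#    (Solr assumed on its default port, 8983).
cd /usr/local/oodt/filemgr/bin
java -cp "../lib/*" org.apache.oodt.cas.filemgr.tools.SolrIndexer \
  --solrUrl http://localhost:8983/solr \
  --fmUrl http://localhost:9000 \
  --all
```

Both steps talk to locally running File Manager and Solr services, so they are only meaningful once those are up.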
>Best,
>Mengying (Angela) Wang

>On Mon, Sep 29, 2014 at 12:22 PM, MengYing Wang <[email protected]> wrote:

>Dear Prof. Mattmann,

>In the previous two weeks, I was trying to solve the following problems:

>(1) According to https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide, use the Apache OODT PushPull to crawl data files from a remote server to the local machine [Failed, couldn't find the data files].

>(2) Use the Apache OODT PushPull to crawl webpages [Failed, HttpClient ClassNotFoundException].

>(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch, use Apache Nutch and Solr to crawl and index local data files [Failed, no data files found in Solr].

>(4) According to https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide, search and delete redundant products in the Apache OODT File Manager [Succeeded].

>(5) According to https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help, use the Apache OODT Crawler and Tika to extract metadata and then query the metadata in the Apache OODT File Manager [Succeeded].

>(6) According to https://wiki.apache.org/nutch/IndexMetatags, use the plugins to parse HTML meta tags into separate fields in the Solr index [Succeeded].

>(7) Integrate the Tika parser with Apache Nutch to extract metadata to be indexed in Solr [Failed, no Tika fields available in Solr].

>(8) According to https://cwiki.apache.org/confluence/display/OODT/Using+the+SolrIndexer+to+dump+a+File+Manager+Catalog, use the SolrIndexer to dump all product information from the Apache OODT File Manager to Apache Solr [Failed, no product information available in Solr].
>(9) According to https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide, integrate the Apache OODT File Manager with Apache Solr [Failed, no product information available in Solr].

>(10) According to https://lucene.apache.org/solr/4_10_0/tutorial.html, explore a simple command-line tool for posting, deleting, updating, and querying raw XML against the Solr server [Succeeded].

>Thank you.

>Best,
>Mengying Wang

>On Wed, Sep 17, 2014 at 11:44 AM, MengYing Wang <[email protected]> wrote:

>Dear Prof. Mattmann,

>For the last week, I was learning the various Apache tool tutorials and trying to figure out how to crawl data files on the web and then build a metadata index for future queries. So far, I have found the following two approaches:

>1: Use the Apache OODT PushPull to crawl a bunch of data files from some remote server to localhost -> use Apache Tika to extract the metadata for each data file -> use the Apache OODT File Manager to ingest the metadata files -> use the query_tool script to query the metadata stored in the Apache OODT File Manager.

>We could also achieve the above process by employing the Apache OODT CAS-Curator to automatically call Apache Tika and the Apache OODT File Manager; for details, see http://oodt.apache.org/components/maven/curator/user/basic.html

>2: Use Apache Nutch to crawl a number of webpages -> use Apache Solr to do the text queries.

>However, there are some problems that I am still trying to solve:

>(1) According to the Apache OODT PushPull user guide (https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide), data files should be downloaded to the staging area. However, when I started the pushpull script, I waited for at least 15 minutes but nothing was downloaded.
>I have checked the remote FTP server; there indeed are some data files. -_-!

>*************************************************************************************
>guest-wireless-207-151-035-013:bin AngelaWang$ ./pushpull
>TRANSFER: org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
>^C
>*************************************************************************************

>Also, the url-downloader script does not work because of a Java NoClassDefFoundError.

>*************************************************************************************
>guest-wireless-207-151-035-013:bin AngelaWang$ ./url-downloader http://pds-imaging.jpl.nasa.gov/data/msl/MSLHAZ_0XXX/CATALOG/CATINFO.TXT .
>Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/oodt/cas/pushpull/protocol/http/HttpClient
>Caused by: java.lang.ClassNotFoundException: org.apache.oodt.cas.pushpull.protocol.http.HttpClient
>at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>at java.security.AccessController.doPrivileged(Native Method)
>at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>*************************************************************************************

>2: According to the Apache OODT Crawler Help (https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help), the Apache OODT Crawler can be integrated with Apache Tika. However, there is no org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor class in my Apache OODT Crawler package.

>3: How can I dump the metadata in the Apache OODT File Manager to Apache Solr using the Apache OODT Workflow Manager? I have no clear answer yet.
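One way to chase the url-downloader NoClassDefFoundError above: the class org.apache.oodt.cas.pushpull.protocol.http.HttpClient normally ships inside the cas-pushpull jar itself, so a ClassNotFoundException for it usually means the launcher script's classpath is not picking up that jar. A quick check, assuming a standard binary-distribution layout (the paths and jar name pattern are guesses; adjust them to the actual install):

```shell
# Sketch only: directory layout and jar name assumed from a typical
# cas-pushpull binary distribution -- adjust paths to your install.

cd pushpull/bin

# The launcher scripts typically build their classpath from ../lib;
# confirm the pushpull jar is actually there:
ls ../lib/cas-pushpull-*.jar

# ...and that the jar really bundles the missing class:
unzip -l ../lib/cas-pushpull-*.jar | grep 'protocol/http/HttpClient'

# If the jar is absent, copy it in from the Maven build output, e.g.:
# cp ~/src/oodt/pushpull/target/cas-pushpull-*.jar ../lib/
```

If the jar is present and contains the class, the next suspect is the classpath line inside the url-downloader script itself.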
>4: According to the Apache Solr tutorial (https://lucene.apache.org/solr/4_10_0/tutorial.html), users should be able to add/delete/update documents using the post.jar script. However, it doesn't work on my machine.

>*************************************************************************************
>guest-wireless-207-151-035-013:exampledocs AngelaWang$ java -jar post.jar solr.xml
>SimplePostTool version 1.5
>Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
>POSTing file solr.xml
>SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/update
>SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
><response>
><lst name="responseHeader"><int name="status">400</int><int name="QTime">1</int></lst><lst name="error"><str name="msg">ERROR: [doc=SOLR1000] unknown field 'name'</str><int name="code">400</int></lst>
></response>
>SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/update
>1 files indexed.
>COMMITting Solr index changes to http://localhost:8983/solr/update..
>Time spent: 0:00:00.032
>*************************************************************************************

>Solr logs:

>*************************************************************************************
>6506114 [qtp1314570047-14] ERROR org.apache.solr.core.SolrCore org.apache.solr.common.SolrException: ERROR: [doc=SOLR1000] unknown field 'name'
>at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
>at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
>at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
>at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
>at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
>.......
>*************************************************************************************

>I will continue working on the above problems this week. Could we discuss the two approaches this Thursday after the class? Many thanks! Have a good day!

>Best,
>Mengying (Angela) Wang

>On Mon, Sep 8, 2014 at 10:32 PM, MengYing Wang <[email protected]> wrote:

>Dear Prof.
>Mattmann,

>For the previous week, I successfully installed the following software packages on my personal computer:

>1: Apache OODT Catalog and Archive File Management Component: http://oodt.apache.org/components/maven/filemgr/user/basic.html
>2: Apache OODT Catalog and Archive Crawling Framework: http://oodt.apache.org/components/maven/crawler/user/
>3: Apache OODT Catalog and Archive Workflow Management Component: http://oodt.apache.org/components/maven/workflow/user/basic.html
>4: Apache Solr: https://cwiki.apache.org/confluence/display/solr/Installing+Solr
>5: Apache Nutch: http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website
>6: Apache Tika: http://tika.apache.org/0.9/gettingstarted.html

>This week I will continue playing with this software to figure out the following three questions:
>(1) How do I get the metadata using Apache OODT or Apache Nutch?
>(2) How do I dump the metadata from Apache OODT to Apache Solr?
>(3) How do I query the metadata stored in Solr?

>Best,
>Mengying (Angela) Wang

---
Rishi Verma
NASA Jet Propulsion Laboratory
California Institute of Technology
4800 Oak Grove Drive, M/S 158-248
Pasadena, CA 91109
Tel: 1-818-393-5826

--
Best,
Mengying (Angela) Wang
