Hi Zichuan, Please see:
http://oodt.apache.org/components/maven/crawler/user/

And also see the UpdateWorkflowStatusToIngest crawler action.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Zichuan Wang <zichu...@usc.edu>
Date: Tuesday, October 28, 2014 at 9:37 PM
To: Luke <shuai...@usc.edu>
Cc: Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>, "dev@oodt.apache.org" <dev@oodt.apache.org>, Chris Mattmann <mattm...@usc.edu>, "zhouj...@usc.edu" <zhouj...@usc.edu>, "xiaoy...@usc.edu" <xiaoy...@usc.edu>
Subject: Re: Question about OODT file manager

>Dear Professor,
>
>We are stuck in OODT. The most critical problem we have now is
>
>“How to make crawler work with workflow”?
>
>--
>Zichuan Wang
>University of Southern California, Department of Computer Science
>
>On Tuesday, October 28, 2014, at 12:52 PM, Luke wrote:
>
>Dear Professor Mattmann,
>
>Thanks a lot, Professor Mattmann, for the kind help; it is appreciated, and
>sorry for getting back to you with my appreciation only now. I have been
>conducting tests with OODT based on your advice, but unfortunately I am
>having another problem.
>
>I am following the steps
>(https://cwiki.apache.org/confluence/display/OODT/CAS-PGE+Learn+by+Example)
>to get a sense of how to get workflow to work.
>The problem is that the File-Concatenator-PGE (run through the wmgr-client
>command line) does not seem to be invoked or executed. I am seeing the
>tasks stack up in the workflow manager with status either "RSUBMIT" or
>"QUEUED", but they are not getting executed (please find attached:
>workflow_monitor.jpg). Please note that by default the workflow min pool
>size is 6, so here comes another problem: I have 6 submitted tasks with
>status RSUBMIT, and any new incoming task is forwarded to the waiting queue
>with status "QUEUED". Please refer to workflow_monitor.jpg for details,
>where I have 3 QUEUED workflow tasks and 6 RSUBMIT tasks.
>
>Question 1): I am not sure why the workflow is not being executed and is
>hanging in the "RSUBMIT" state. After enabling the log level, I am seeing
>the following entry in the log. I am not sure whether it has anything to do
>with the problem of the workflow hanging in the "RSUBMIT" state.
>
>Oct 28, 2014 3:35:07 AM
>org.apache.oodt.cas.workflow.engine.IterativeWorkflowProcessorThread
>safeCheckJobComplete
>WARNING: Exception checking completion status for job:
>[2014-10-28T01:59:32.813-07:00]: Messsage: java.lang.Exception:
>java.lang.NullPointerException
>
>Question 2): I think that, on my side, any new workflow task I send with
>the following command is currently being directed to the waiting queue
>because of the min pool size (i.e. 6), although I can increase this to a
>larger number:
>
>./wmgr-client --url http://localhost:9200 --operation --sendEvent
>--eventName fileconcatenator-pge --metaData --key RunID testNumber1
>
>If possible, I would like to know whether there is a way to purge the queue
>and get rid of the workflow tasks in "RSUBMIT" and "QUEUED" that I have
>already sent. Please kindly help.
>
>Very sorry for troubling you with this. To be honest, I find OODT a bit
>challenging to grasp within a short time frame, probably because there is
>no book like "OODT in Action", as there is for Solr, and what I am doing is
>just trial and error blended with guesswork. I don’t want to make a blind
>guess, so it would be appreciated if you could also shed some light on
>where I can get more logging information, or on other ways I can
>troubleshoot. I think it might be worth tracking what is happening when a
>workflow reaches the status "RSUBMIT", and how to get logging information
>specific to it.
>
>Again, your advice and kind help will be appreciated as usual.
>
>Thanks
>Luke
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>Sent: October 26, 2014 22:18
>To: Luke; 'Zichuan Wang'
>Cc: 'Christian Alan Mattmann'; zhouj...@usc.edu; xiaoy...@usc.edu; dev@oodt.apache.org
>Subject: Re: re: Question about OODT file manager
>
>Hi Luke,
>
>Thanks and sorry it’s taken me a while to reply. Here are some details
>below:
>
>-----Original Message-----
>From: Luke <shuai...@usc.edu>
>Date: Sunday, October 26, 2014 at 6:19 PM
>To: Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>, 'Zichuan Wang' <zichu...@usc.edu>
>Cc: Chris Mattmann <mattm...@usc.edu>, "zhouj...@usc.edu" <zhouj...@usc.edu>, "xiaoy...@usc.edu" <xiaoy...@usc.edu>, "dev@oodt.apache.org" <dev@oodt.apache.org>
>Subject: RE: re: Question about OODT file manager
>
>Hi Professor Mattmann and OODT DEV,
>
>Sorry to trouble you with this email. Our team has been struggling to get
>OODT to send JSON files to Solr. One of the difficulties is still getting
>the OODT workflow to call poster.py in etllib.
>
>Sorry that you’re having difficulty, let me try and help.
>
>I am not sure if my understanding of the OODT requirement is correct. I
>hope you can kindly advise and help with our confusion.
>
>The set of goals I have in mind with OODT is as follows; please kindly
>confirm and clarify:
>
>1)
>Get the File-Manager up and running.
>
>Yep, hopefully as installed via OODT RADIX.
>
>2)
>Send all JSON files to the File-Manager server with the wmgr-client
>command. (I believe we can achieve this with a bash script, or probably
>Python, that calls the command line sequentially with each JSON file name
>as an argument?!)
>
>Suggestion:
>
>1. Use the OODT crawler and file manager to crawl/index the JSON files (in
>place data transfer).
>2. Take a look at CAS-PGE, it will help you write a workflow task that
>will wrap ETLlib and the poster command.
>3. Once you are confident with #2, whip up a script that pages through all
>of your indexed JSON files, and then for each one, submits a workflow event
>(you may need to look into aggregating them) that calls your CAS-PGE
>wrapped poster task from ETLlib.
>
>3)
>Once we have the JSON files sent and stored in the File-Manager, we need to
>get the workflow-manager up and running, and we can create a workflow that
>sends those JSON files from the file manager to Solr.
>
>See above.
>
>4)
>Create a workflow according to the Workflow2 User Guide
><https://cwiki.apache.org/confluence/display/OODT/Workflow2+User+Guide>
>
>Here comes the problem:
>
>I am not sure how to create a workflow task that can call poster.py in the
>Python etllib. It looks like we need to create our own Java class that
>extends <TaskInstance>, which is an abstract Java class with one abstract
>method that has the following signature:
>
>protected abstract ResultsState performExecution(ControlMetadata crtlMetadata);
>
>However, the details of where to find the corresponding libs and where to
>put our implementation in the workflow manager are neglected on that page.
>I am not sure whether we should use TaskInstance, but it seems the workflow
>has to have an interface through which it can call the Python code, i.e.
>poster.py, and it looks like we need to implement
>TaskInstance::performExecution by injecting the code that calls poster.py
>and returns the result state.
>
>It would be greatly appreciated if you could shed some light and advise how
>we can get a task instance to call poster.py. BTW, I am also not sure if my
>understanding is correct; please kindly correct it if it is not. Your help
>will be appreciated as usual.
>
>Thanks
>Luke
>
>Thanks Luke, see above. Let me know if it helps.
>
>Cheers!
>
>Chris
>
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>Sent: October 25, 2014 13:34
>To: Zichuan Wang
>Cc: Christian Alan Mattmann; Luke; zhouj...@usc.edu; xiaoy...@usc.edu
>Subject: Re: Re: Question about OODT file manager
>
>Please cc dev@oodt.apache.org <mailto:dev@oodt.apache.org> I will reply in
>detail soon
>
>Sent from my iPhone
>
>On Oct 25, 2014, at 1:26 PM, "Zichuan Wang" <zichu...@usc.edu> wrote:
>
>Dear Professor,
>
>Could you please also explain how I can crawl all JSON file names under a
>specific directory using CAS-PGE?
>I’ll work through this example
>https://cwiki.apache.org/confluence/display/OODT/CAS-PGE+Learn+by+Example
>but it doesn’t mention anything about crawling; instead it manually sets
>the input file paths...
>
>--
>Zichuan Wang
>University of Southern California, Department of Computer Science
>
>On Saturday, October 25, 2014, at 12:10 PM, Zichuan Wang wrote:
>
>Dear Professor,
>
>In the assignment 2 specification I noticed that you mentioned OODT File
>Manager, but from my understanding, we are using the ETLLib poster, which
>talks directly to Solr. So how can we use OODT File Manager in this
>assignment?
>
>--
>Zichuan Wang
>University of Southern California, Department of Computer Science
>
>Attachments:
>- workflow_monitor.jpg
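
As a rough illustration of the scripting step discussed in the thread (find the JSON files, then submit one workflow event per file), here is a minimal Python sketch. It is not taken from OODT or from anyone in the thread. The wmgr-client path, URL, event name, and RunID key are copied from the command quoted in the thread; the function names, the plain directory walk (standing in for paging through the file manager's index), and the use of each file's base name as the RunID are assumptions made purely for illustration. The sketch only builds and formats the commands; it does not execute them.

```python
import shlex
from pathlib import Path

# Values copied from the wmgr-client command quoted in the thread.
WMGR_CLIENT = "./wmgr-client"
WMGR_URL = "http://localhost:9200"
EVENT_NAME = "fileconcatenator-pge"


def find_json_files(root):
    """Return every *.json file under root (recursively), in sorted order."""
    return sorted(p for p in Path(root).rglob("*.json") if p.is_file())


def build_send_event_command(run_id, url=WMGR_URL, event_name=EVENT_NAME):
    """Build the argument list for one sendEvent call, mirroring the quoted command."""
    return [
        WMGR_CLIENT,
        "--url", url,
        "--operation", "--sendEvent",
        "--eventName", event_name,
        "--metaData", "--key", "RunID", run_id,
    ]


def plan_events(root):
    """Pair each JSON file with the command that would be sent for it.

    Assumption: the file's base name (without .json) is used as the RunID.
    """
    return [(path, build_send_event_command(path.stem))
            for path in find_json_files(root)]


def format_plan(plan):
    """Render the planned commands as shell-quoted lines for review."""
    return "\n".join(shlex.join(cmd) for _, cmd in plan)
```

Calling `print(format_plan(plan_events("/path/to/json")))` (the path is a placeholder) lists the commands for review. Once they look right, each argument list could be handed to `subprocess.run(cmd, check=True)`, keeping in mind the pool-size and queueing behaviour described in the thread.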