Some sample code to monitor multiple directories is now available at: https://github.com/DataTorrent/examples/tree/master/tutorials/fileIO-multiDir
It shows how to use a custom implementation of definePartitions() to create
multiple partitions of the file input operator and group them into "slices",
where each slice monitors a single directory.

Ram

On Wed, May 25, 2016 at 9:55 AM, Munagala Ramanath <r...@datatorrent.com> wrote:

> I'm hoping to have a sample sometime next week.
>
> Ram
>
> On Wed, May 25, 2016 at 9:30 AM, Mukkamula, Suryavamshivardhan (CWM-NR) <
> suryavamshivardhan.mukkam...@rbc.com> wrote:
>
>> Thank you so much, Ram, for your advice. Option (a) would be ideal for my
>> requirement.
>>
>> Do you have sample usage for partitioning with individual configuration
>> setups for different partitions?
>>
>> Regards,
>> Surya Vamshi
>>
>> *From:* Munagala Ramanath [mailto:r...@datatorrent.com]
>> *Sent:* 2016, May, 25 12:11 PM
>> *To:* users@apex.apache.org
>> *Subject:* Re: Multiple directories
>>
>> You have 2 options: (a) AbstractFileInputOperator (b)
>> FileSplitter/BlockReader
>>
>> For (a), each partition (i.e. replica of the operator) can scan only a
>> single directory, so if you have 100 directories, you can simply start
>> with 100 partitions; since each partition is scanning its own directory,
>> you don't need to worry about which files the lines came from. This
>> approach, however, needs a custom definePartitions() implementation in
>> your subclass to assign the appropriate directory and XML parsing config
>> file to each partition; it also needs adequate cluster resources to be
>> able to spin up the required number of partitions.
>>
>> For (b), there is some documentation in the Operators section at
>> http://docs.datatorrent.com/ including sample code. These operators
>> support scanning multiple directories out of the box but have more
>> elaborate configuration options. Check this out and see if it works in
>> your use case.
>>
>> Ram
>>
>> On Wed, May 25, 2016 at 8:17 AM, Mukkamula, Suryavamshivardhan (CWM-NR) <
>> suryavamshivardhan.mukkam...@rbc.com> wrote:
>>
>> Hello Ram/Team,
>>
>> My requirement is to read input feeds from different locations on HDFS
>> and parse those files by reading XML configuration files (each input feed
>> has a configuration file which defines the fields inside the input feed).
>>
>> My approach: I would like to define a mapping file which contains the
>> individual feed identifier, feed location, and configuration file
>> location. I would like to read this mapping file at initial load within
>> the setup() method and define my DirectoryScan.acceptFiles. My challenge
>> is that when I read the files, I should parse the lines by reading the
>> individual configuration files. How do I know which file a line came
>> from? If I know this, I can read the corresponding configuration file
>> before parsing the line.
>>
>> Please let me know how to handle this.
>>
>> Regards,
>> Surya Vamshi
>>
>> *From:* Munagala Ramanath [mailto:r...@datatorrent.com]
>> *Sent:* 2016, May, 24 5:49 PM
>> *To:* Mukkamula, Suryavamshivardhan (CWM-NR)
>> *Subject:* Multiple directories
>>
>> One way of addressing the issue is to use some sort of external tool
>> (like a script) to copy all the input files to a common directory (making
>> sure that the file names are unique to prevent one file from overwriting
>> another) before the Apex application starts.
>>
>> The Apex application then starts and processes files from this directory.
>>
>> If you set the partition count of the file input operator to N, it will
>> create N partitions and the files will be automatically distributed among
>> the partitions. The partitions will work in parallel.
>>
>> Ram
>>
>> _______________________________________________________________________
>>
>> This [email] may be privileged and/or confidential, and the sender does
>> not waive any related rights and obligations. Any distribution, use or
>> copying of this [email] or the information it contains by other than an
>> intended recipient is unauthorized. If you received this [email] in error,
>> please advise the sender (by return [email] or otherwise) immediately. You
>> have consented to receive the attached electronically at the above-noted
>> address; please retain a copy of this confirmation for future reference.
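
For readers following the thread, the "slice" idea from the top message can be illustrated roughly as below. This is only a sketch under stated assumptions: it is not the code in the linked repository and it does not use the Apex API. The names SliceAssigner, Feed, Assignment, and accepts() are hypothetical, and the hash-based split of files inside a slice is an assumed policy. It models just the bookkeeping a custom definePartitions() would need: one slice of partitions per directory, each partition carrying its directory and its XML config file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of the Apex API: it only models how
// partitions could be grouped into per-directory "slices".
final class SliceAssigner {

    /** One input feed: a directory to scan plus the XML config used to parse it. */
    static final class Feed {
        final String directory;
        final String configFile;

        Feed(String directory, String configFile) {
            this.directory = directory;
            this.configFile = configFile;
        }
    }

    /** What a single partition is responsible for. */
    static final class Assignment {
        final int partitionId;   // global index across all partitions
        final Feed feed;         // the one directory (and config) this partition handles
        final int indexInSlice;  // 0-based position within the slice
        final int sliceSize;     // number of partitions sharing this directory

        Assignment(int partitionId, Feed feed, int indexInSlice, int sliceSize) {
            this.partitionId = partitionId;
            this.feed = feed;
            this.indexInSlice = indexInSlice;
            this.sliceSize = sliceSize;
        }

        /** Assumed policy: split files within a slice by hash of the file name. */
        boolean accepts(String fileName) {
            return Math.floorMod(fileName.hashCode(), sliceSize) == indexInSlice;
        }
    }

    /** Creates one slice of sliceSize partitions per feed, in input order. */
    static List<Assignment> assign(List<Feed> feeds, int sliceSize) {
        if (sliceSize < 1) {
            throw new IllegalArgumentException("sliceSize must be >= 1");
        }
        List<Assignment> result = new ArrayList<>();
        for (Feed feed : feeds) {
            for (int i = 0; i < sliceSize; i++) {
                result.add(new Assignment(result.size(), feed, i, sliceSize));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Feed> feeds = Arrays.asList(
                new Feed("/feeds/a", "/conf/a.xml"),
                new Feed("/feeds/b", "/conf/b.xml"));
        for (Assignment a : assign(feeds, 2)) {
            System.out.println("partition " + a.partitionId + " scans " + a.feed.directory
                    + " with " + a.feed.configFile
                    + " (" + a.indexInSlice + " of " + a.sliceSize + ")");
        }
    }
}
```

Because each partition knows its own directory and config file, the question of which file a line came from does not arise within a partition. In a real operator these values would be set on each partition instance inside definePartitions(); the repository linked at the top of the thread is the place to look for how that is actually done.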