Some sample code to monitor multiple directories is now available at:
https://github.com/DataTorrent/examples/tree/master/tutorials/fileIO-multiDir

It shows how to use a custom implementation of definePartitions() to create
multiple partitions of the file input operator and group them
into "slices" where each slice monitors a single directory.
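
As a rough illustration of the slicing idea only (this is not the code from
the repository above, and it uses no Apex classes), the plain-Java sketch
below assigns each planned partition to a slice, with every partition in a
slice scanning that slice's directory. The names SlicePlanner, Assignment and
plan(), the partitionsPerSlice parameter, and the /feeds/... paths are all
hypothetical and chosen just for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Apex or of the linked example.
final class SlicePlanner {

    /** One planned partition: its id, the slice it belongs to, and the directory it scans. */
    static final class Assignment {
        public final int partitionId;
        public final int sliceId;
        public final String directory;

        Assignment(int partitionId, int sliceId, String directory) {
            this.partitionId = partitionId;
            this.sliceId = sliceId;
            this.directory = directory;
        }

        @Override
        public String toString() {
            return "partition " + partitionId + " -> slice " + sliceId + " (" + directory + ")";
        }
    }

    /**
     * Creates one slice per directory and partitionsPerSlice partitions per slice.
     * Partition ids are assigned consecutively across slices.
     */
    public static List<Assignment> plan(List<String> directories, int partitionsPerSlice) {
        if (partitionsPerSlice < 1) {
            throw new IllegalArgumentException("partitionsPerSlice must be >= 1");
        }
        List<Assignment> assignments = new ArrayList<>();
        int partitionId = 0;
        for (int sliceId = 0; sliceId < directories.size(); sliceId++) {
            String directory = directories.get(sliceId);
            for (int i = 0; i < partitionsPerSlice; i++) {
                assignments.add(new Assignment(partitionId++, sliceId, directory));
            }
        }
        return assignments;
    }

    public static void main(String[] args) {
        // Example directories are made up for illustration.
        List<String> directories = Arrays.asList("/feeds/a", "/feeds/b", "/feeds/c");
        for (Assignment a : plan(directories, 2)) {
            System.out.println(a);
        }
    }
}
```

In an actual operator, a plan like this would presumably be built inside the
custom definePartitions(), with each new partition configured to scan its
slice's directory; the repository shows how the example really does it.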

Ram

On Wed, May 25, 2016 at 9:55 AM, Munagala Ramanath <r...@datatorrent.com>
wrote:

> I'm hoping to have a sample sometime next week.
>
> Ram
>
> On Wed, May 25, 2016 at 9:30 AM, Mukkamula, Suryavamshivardhan (CWM-NR) <
> suryavamshivardhan.mukkam...@rbc.com> wrote:
>
>> Thank you so much, Ram, for your advice. Option (a) would be ideal for my
>> requirement.
>>
>>
>>
>> Do you have a sample showing how to partition with an individual
>> configuration setup for each partition?
>>
>>
>>
>> Regards,
>>
>> Surya Vamshi
>>
>>
>>
>> *From:* Munagala Ramanath [mailto:r...@datatorrent.com]
>> *Sent:* 2016, May, 25 12:11 PM
>> *To:* users@apex.apache.org
>> *Subject:* Re: Multiple directories
>>
>>
>>
>> You have 2 options: (a) AbstractFileInputOperator (b)
>> FileSplitter/BlockReader
>>
>>
>>
>> For (a), each partition (i.e. replica of the operator) can scan only a
>> single directory, so if you have 100 directories, you can simply start
>> with 100 partitions; since each partition is scanning its own directory,
>> you don't need to worry about which files the lines came from. This
>> approach, however, needs a custom definePartitions() implementation in
>> your subclass to assign the appropriate directory and XML parsing config
>> file to each partition; it also needs adequate cluster resources to be
>> able to spin up the required number of partitions.
>>
>>
>>
>> For (b), the Operators section at http://docs.datatorrent.com/ has some
>> documentation, including sample code. These operators support scanning
>> multiple directories out of the box but have more elaborate configuration
>> options. Check this out and see if it works in your use case.
>>
>>
>>
>> Ram
>>
>>
>>
>> On Wed, May 25, 2016 at 8:17 AM, Mukkamula, Suryavamshivardhan (CWM-NR) <
>> suryavamshivardhan.mukkam...@rbc.com> wrote:
>>
>> Hello Ram/Team,
>>
>>
>>
>> My requirement is to read input feeds from different locations on HDFS
>> and parse those files by reading XML configuration files (each input feed
>> has a configuration file that defines the fields inside the feed).
>>
>>
>>
>> My approach: I would like to define a mapping file that contains, for
>> each feed, the feed identifier, feed location, and configuration file
>> location. I would like to read this mapping file at initial load within
>> the setup() method and define my DirectoryScan.acceptFiles. My challenge
>> is that when I read the files, I need to parse each line using that
>> feed's configuration file. How do I know which file a line came from? If
>> I know this, I can read the corresponding configuration file before
>> parsing the line.
>>
>>
>>
>> Please let me know how I can handle this.
>>
>>
>>
>> Regards,
>>
>> Surya Vamshi
>>
>>
>>
>> *From:* Munagala Ramanath [mailto:r...@datatorrent.com]
>> *Sent:* 2016, May, 24 5:49 PM
>> *To:* Mukkamula, Suryavamshivardhan (CWM-NR)
>> *Subject:* Multiple directories
>>
>>
>>
>> One way of addressing the issue is to use some sort of external tool
>> (like a script) to copy all the input files to a common directory (making
>> sure that the file names are unique to prevent one file from overwriting
>> another) before the Apex application starts.
>>
>>
>>
>> The Apex application then starts and processes files from this directory.
>>
>>
>>
>> If you set the partition count of the file input operator to N, it will
>> create N partitions and the files will be automatically distributed among
>> the partitions. The partitions will work in parallel.
>>
>>
>>
>> Ram
>>
>> _______________________________________________________________________
>>
>> This [email] may be privileged and/or confidential, and the sender does
>> not waive any related rights and obligations. Any distribution, use or
>> copying of this [email] or the information it contains by other than an
>> intended recipient is unauthorized. If you received this [email] in error,
>> please advise the sender (by return [email] or otherwise) immediately. You
>> have consented to receive the attached electronically at the above-noted
>> address; please retain a copy of this confirmation for future reference.
>>
>>
>>
>
