How about also allowing a vararg of multiple file names for the input
format?

We'd then have the option of

 - File or directory
 - List of files or directories
 - Base directory + regex that matches contained file paths
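The "base directory + regex" option could be enumerated roughly like this. This is only a plain-Java sketch of the file-matching step, not an existing Flink input format; the class name `RegexFileLister`, the method `listMatchingFiles`, and the convention of matching the regex against the path relative to the base directory are all assumptions made for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RegexFileLister {

    // Collect all regular files under baseDir whose path, taken relative
    // to baseDir, matches the given regex. A real input format would then
    // turn this list into splits for a single parallel source.
    static List<Path> listMatchingFiles(Path baseDir, String regex) throws IOException {
        Pattern pattern = Pattern.compile(regex);
        try (Stream<Path> paths = Files.walk(baseDir)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> pattern.matcher(baseDir.relativize(p).toString()).matches())
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

The point of doing the matching up front is that the plan contains one source over the matched file list instead of one source per file.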



On Wed, Jul 1, 2015 at 10:13 AM, Flavio Pompermaier <pomperma...@okkam.it>
wrote:

> +1 :)
>
> On Wed, Jul 1, 2015 at 10:08 AM, chan fentes <chanfen...@gmail.com> wrote:
>
>> Thank you all for your help and for pointing out different possibilities.
>> It would be nice to have an input format that takes a directory and a
>> regex pattern (for file names) to create one data source instead of 1500.
>> This would have helped me to avoid the problem. Maybe this can be included
>> in one of the future releases. ;)
>>
>> 2015-06-30 19:02 GMT+02:00 Stephan Ewen <se...@apache.org>:
>>
>>> I agree with Aljoscha and Ufuk.
>>>
>>> As said, the system will (currently) have a hard time handling 1500
>>> sources, but a single parallel source reading 1500 files will be very
>>> efficient.
>>> This is possible if all sources (files) deliver the same data type and
>>> can be unioned.
>>>
>>> If that is true, you can
>>>
>>>  - Specify the input as a directory.
>>>
>>>  - If you cannot do that, because there is no common parent directory,
>>> you can "union" the files into one data source with a simple trick, as
>>> described here:
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/open-multiple-file-from-list-of-uri-tp1804p1807.html
>>>
>>>
>>>
>>> On Tue, Jun 30, 2015 at 5:36 PM, Aljoscha Krettek <aljos...@apache.org>
>>> wrote:
>>>
>>>> Hi Chan,
>>>> Flink sources support giving a directory as an input path. If you do
>>>> this, the source will read each of the files in that directory. The way
>>>> you do it leads to a very big plan, because the plan is replicated 1500
>>>> times; this could cause the OutOfMemoryError.
>>>>
>>>> Is there a specific reason why you create 1500 separate sources?
>>>>
>>>> Regards,
>>>> Aljoscha
>>>>
>>>> On Tue, 30 Jun 2015 at 17:17 chan fentes <chanfen...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> how many data sources can I use in one Flink plan? Is there any limit?
>>>>> I get a
>>>>> java.lang.OutOfMemoryError: unable to create new native thread
>>>>> when I have approx. 1500 files. What I basically do is the following:
>>>>> DataSource ->Map -> Map -> GroupBy -> GroupReduce per file
>>>>> and then
>>>>> Union -> GroupBy -> Sum in a tree-like reduction.
>>>>>
>>>>> I have checked the workflow. It runs on a cluster without any problem
>>>>> if I only use a few files. Does Flink use a thread per operator? It
>>>>> seems as if I am limited in the number of threads I can use. How can I
>>>>> avoid the exception mentioned above?
>>>>>
>>>>> Best regards
>>>>> Chan
>>>>>
>>>>
>>>
>>
>
>
