Re: Lookup objects in distributed cache

vivek thakre Thu, 04 Apr 2013 20:52:31 -0700

Thanks, Jan, for your reply. This is helpful.

Vivek



On Thu, Apr 4, 2013 at 12:11 AM, Jan Dolinár <dolik....@gmail.com> wrote:

> Hello Vivek,
>
> GenericUDTF has a method, initialize(), which is called only once per
> task. So if you read your files in this method and store the structures
> in memory, the overhead is relatively small (reading 15 MB per mapper is
> negligible compared to several GB of processed data).
>
> Best regards,
> Jan
>
>
> On Wed, Apr 3, 2013 at 10:35 PM, vivek thakre <vivek.tha...@gmail.com> wrote:
>
>> Hello,
>>
>> I want to implement some functionality as a UDTF. It involves reading
>> 7 different text files and creating lookup structures (Map, Set, List,
>> Map of String to List, etc.) to be used in the logic.
>>
>> These files are small, averaging 15 MB each.
>>
>> I can add these files to the distributed cache, access them in the UDTF,
>> read them, and create the necessary lookup data structures, but this
>> would mean that the files are opened, read, and closed every time the
>> UDTF is invoked.
>>
>> Is there a way to read the files just once, create the data structures
>> needed, put them in the distributed cache, and access them from the UDTF?
>>
>> I don't think creating Hive tables from these files and doing a map-side
>> join is possible, as the functionality that I want to implement is fairly
>> complex, and I am not sure it can be done with just a Hive query and a
>> join, without a UDTF.
>>
>> Thanks in advance.
>>
>
>
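
As a rough illustration of the read-once pattern Jan describes, here is a
minimal plain-Java sketch. It deliberately does not extend Hive's
GenericUDTF, so it compiles without the Hive jars. The class name
LookupOnceSketch, its method signatures, the loadCount() helper, and the
tab-separated "key<TAB>value" file layout are all assumptions made for
illustration, not anything from the thread. In a real UDTF the loading
would presumably live in initialize() and the lookup in process().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical stand-in for a GenericUDTF subclass. It only demonstrates
 * "load the lookup file once, reuse it for every row".
 */
class LookupOnceSketch {
    // Assumed: local path of a file shipped via the distributed cache.
    private final Path lookupFile;
    // Built once in initialize(), then only read by process().
    private Map<String, List<String>> lookup;
    // Counts file loads so the read-once behaviour can be checked.
    private int loadCount = 0;

    LookupOnceSketch(Path lookupFile) {
        this.lookupFile = lookupFile;
    }

    /** Plays the role of initialize(): parse the file into memory once. */
    void initialize() throws IOException {
        Map<String, List<String>> m = new HashMap<>();
        try (BufferedReader r =
                 Files.newBufferedReader(lookupFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                // Assumed layout: key<TAB>value; malformed lines are skipped.
                String[] parts = line.split("\t", 2);
                if (parts.length < 2) {
                    continue;
                }
                m.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
            }
        }
        lookup = m;
        loadCount++;
    }

    /** Plays the role of process(): per-row work touches memory only. */
    List<String> process(String key) {
        if (lookup == null) {
            throw new IllegalStateException("initialize() was not called");
        }
        return lookup.getOrDefault(key, Collections.<String>emptyList());
    }

    int loadCount() {
        return loadCount;
    }
}
```

With this shape, the file I/O cost is paid once per instance, and each
per-row call is a hash lookup.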
