That is a very interesting off-topic discussion :) I think I will reinvestigate HCatalog some day and come up with specific questions.
Thanks a lot for explaining.

On Wed, Jul 4, 2012 at 4:37 AM, Dmitriy Ryaboy <[email protected]> wrote:
> Imagine increasing the number of datasets by a couple of orders of
> magnitude. "ls" stops being a good browsing tool pretty quickly.
>
> Then, add the need to manage quotas and retention policies for
> different data producers, to find resources across multiple teams, to
> have a web UI for easy metadata search...
>
> (and now we are totally and thoroughly off-topic. Sorry.)
>
> D
>
> On Tue, Jul 3, 2012 at 2:56 AM, Ruslan Al-Fakikh <[email protected]> wrote:
>> Dmitriy,
>>
>> In our organization we use file paths for this purpose, like this:
>> /incoming/datasetA
>> /incoming/datasetB
>> /reports/datasetC
>> etc.
>>
>> On Mon, Jul 2, 2012 at 9:37 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>> "It would give me the list of datasets in one place accessible from all
>>> tools,"
>>>
>>> And that's exactly why you want it.
>>>
>>> D
>>>
>>> On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh <[email protected]> wrote:
>>>> Hey Alan,
>>>>
>>>> I am not familiar with Apache processes, so I could be wrong in my
>>>> point 1, I am sorry.
>>>> Basically my impression was that Cloudera is pushing the Avro format for
>>>> intercommunication between Hadoop tools like Pig, Hive and MapReduce:
>>>> https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
>>>> http://www.cloudera.com/blog/2011/07/avro-data-interop/
>>>> And if I decide to use Avro, then HCatalog becomes a little redundant.
>>>> It would give me the list of datasets in one place accessible from all
>>>> tools, but all the columns (names and types) would be stored in Avro
>>>> schemas, and the Hive metastore becomes just a stub for those Avro schemas:
>>>> https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
>>>> And having those Avro schemas, I could access the data from Pig and
>>>> MapReduce without HCatalog. Though I haven't figured out how to do
>>>> without Hive partitions yet.
>>>>
>>>> Best Regards,
>>>> Ruslan
>>>>
>>>> On Fri, Jun 29, 2012 at 9:13 PM, Alan Gates <[email protected]> wrote:
>>>>> On a different topic, I'm interested in why you refuse to use a project
>>>>> in the incubator. Incubation is the Apache process by which a community is
>>>>> built around the code. It says nothing about the maturity of the code.
>>>>>
>>>>> Alan.
>>>>>
>>>>> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:
>>>>>
>>>>>> Hi Markus,
>>>>>>
>>>>>> Currently I am doing almost the same task, but in Hive.
>>>>>> In Hive you can use the native Avro+Hive integration:
>>>>>> https://issues.apache.org/jira/browse/HIVE-895
>>>>>> Or the haivvreo project if you are not using the latest version of Hive.
>>>>>> Also, there is a Dynamic Partition feature in Hive that can separate
>>>>>> your data by a column value.
>>>>>>
>>>>>> As for HCatalog - I refused to use it after some investigation, because:
>>>>>> 1) It is still incubating
>>>>>> 2) It is not supported by Cloudera (the distribution provider we are
>>>>>> currently using)
>>>>>>
>>>>>> I think it would be perfect if MultiStorage were generic in the
>>>>>> way you described, but I am not familiar with it.
>>>>>>
>>>>>> Ruslan
>>>>>>
>>>>>> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[email protected]> wrote:
>>>>>>> I am not aware of any work on adding those features to MultiStorage.
>>>>>>>
>>>>>>> I think the best way to do this is to use HCatalog. (It makes the Hive
>>>>>>> metastore available to all of Hadoop, so you get metadata for your
>>>>>>> data as well.)
>>>>>>> You can associate an outputformat+serde with a table (instead of a
>>>>>>> file-name ending), and HCatStorer will automatically pick the right
>>>>>>> format.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thejas
>>>>>>>
>>>>>>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>>>>>>>
>>>>>>>> Thanks Thejas,
>>>>>>>>
>>>>>>>> This _really_ helped a lot :)
>>>>>>>> Some additional questions on this:
>>>>>>>> As far as I can see, MultiStorage is currently only capable of writing
>>>>>>>> CSV output, right? Is there any ongoing attempt to make this
>>>>>>>> storage more generic regarding the format of the output data? For our
>>>>>>>> needs we would require Avro output as well as a special proprietary
>>>>>>>> binary encoding for which we have already created our own storage. I'm
>>>>>>>> thinking about a storage that selects a certain writer method
>>>>>>>> depending on the file name's ending.
>>>>>>>>
>>>>>>>> Do you know of such efforts?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Markus
>>>>>>>>
>>>>>>>> On Friday, 22.06.2012, at 11:23 -0700, Thejas Nair wrote:
>>>>>>>>>
>>>>>>>>> You can use the MultiStorage store func:
>>>>>>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>>>>>>>>>
>>>>>>>>> Or if you want something more flexible, with metadata as well, use
>>>>>>>>> HCatalog. Specify the keys on which you want to partition as your
>>>>>>>>> partition keys in the table. Then use HCatStorer() to store the data.
>>>>>>>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Thejas
>>>>>>>>>
>>>>>>>>> On 6/22/12 4:54 AM, Markus Resch wrote:
>>>>>>>>>>
>>>>>>>>>> Hey everyone,
>>>>>>>>>>
>>>>>>>>>> We're doing some aggregation. The result contains a key, and we want
>>>>>>>>>> a single output file for each key. Is it possible to store files
>>>>>>>>>> like this, especially adjusting the path by the key's value?
>>>>>>>>>>
>>>>>>>>>> Example:
>>>>>>>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
>>>>>>>>>> [.... doing stuff....]
>>>>>>>>>> Output = GROUP AggregatesValues BY Key;
>>>>>>>>>> FOREACH Output STORE * INTO '/my/output/path/by/$Output.Key/Result.avro'
>>>>>>>>>>
>>>>>>>>>> I know this example does not work. But is anything similar
>>>>>>>>>> possible? And if, as I assume, not: is there some framework in the
>>>>>>>>>> Hadoop world that can do such stuff?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Markus
>>
>> --
>> Best Regards,
>> Ruslan Al-Fakikh
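For reference, a minimal sketch of the MultiStorage approach Thejas suggests in the thread above. It assumes piggybank is on the classpath, that the Avro schema provides a field named Key, and that the aggregated relation carries the key in its first field (index 0); the relation, field, and path names follow Markus's example and are otherwise hypothetical. Note that MultiStorage writes delimited text, not Avro, which is exactly the limitation Markus raises:

REGISTER /usr/lib/pig/piggybank.jar;  -- piggybank location is install-specific

-- Load with piggybank's AvroStorage; the schema is read from the file.
Input = LOAD 'my/data.avro'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- [.... doing stuff ....] e.g. an aggregation that leaves the key in field 0:
Grouped    = GROUP Input BY Key;
Aggregated = FOREACH Grouped GENERATE group AS Key, COUNT(Input) AS cnt;

-- No explicit split is needed: MultiStorage(parentPath, '0') routes each
-- tuple to a subdirectory named after the value of field 0, e.g.
-- /my/output/path/keyA/keyA-0000, written as tab-delimited text.
STORE Aggregated INTO '/my/output/path'
      USING org.apache.pig.piggybank.storage.MultiStorage('/my/output/path', '0');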
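And a sketch of the HCatalog route, assuming the HCatalog jars are on Pig's classpath (e.g. via pig.additional.jars; see the r0.4.0 docs linked above) and that a table, here the hypothetical mydb.my_table, was created beforehand with 'key' declared as a partition column. The table's outputformat+serde, not the script, determines the on-disk format:

-- With no partition spec in the constructor, HCatStorer partitions
-- dynamically: each distinct value of the 'key' column is written to its
-- own partition (and hence its own directory) under the table's location.
STORE Aggregated INTO 'mydb.my_table'
      USING org.apache.hcatalog.pig.HCatStorer();

-- Static alternative when a run writes exactly one known key value:
-- STORE Aggregated INTO 'mydb.my_table'
--       USING org.apache.hcatalog.pig.HCatStorer('key=valueA');

This is what makes the format pluggable per table: switching the table's serde (for example to an Avro-backed one) changes the output format without touching the Pig script, which is the flexibility Thejas describes.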
