Imagine increasing the number of datasets by a couple of orders of magnitude. "ls" stops being a good browsing tool pretty quickly.
Then, add the need to manage quotas and retention policies for different data producers, to find resources across multiple teams, to have a web UI for easy metadata search... (and now we are totally and thoroughly offtopic. Sorry.)

D

On Tue, Jul 3, 2012 at 2:56 AM, Ruslan Al-Fakikh <[email protected]> wrote:
> Dmitriy,
>
> In our organization we use file paths for this purpose, like this:
> /incoming/datasetA
> /incoming/datasetB
> /reports/datasetC
> etc.
>
> On Mon, Jul 2, 2012 at 9:37 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> "It would give me the list of datasets in one place accessible from all
>> tools,"
>>
>> And that's exactly why you want it.
>>
>> D
>>
>> On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh <[email protected]>
>> wrote:
>>> Hey Alan,
>>>
>>> I am not familiar with Apache processes, so I could be wrong in my
>>> point 1, I am sorry.
>>> Basically my impression was that Cloudera is pushing the Avro format for
>>> intercommunication between Hadoop tools like Pig, Hive and MapReduce:
>>> https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
>>> http://www.cloudera.com/blog/2011/07/avro-data-interop/
>>> And if I decide to use Avro, then HCatalog becomes a little redundant.
>>> It would give me the list of datasets in one place accessible from all
>>> tools, but all the columns (names and types) would be stored in Avro
>>> schemas, and the Hive metastore becomes just a stub for those Avro schemas:
>>> https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
>>> And having those Avro schemas I could access the data from Pig and
>>> MapReduce without HCatalog. Though I haven't figured out how to deal
>>> without Hive partitions yet.
>>>
>>> Best Regards,
>>> Ruslan
>>>
>>> On Fri, Jun 29, 2012 at 9:13 PM, Alan Gates <[email protected]> wrote:
>>>> On a different topic, I'm interested in why you refuse to use a project in
>>>> the incubator. Incubation is the Apache process by which a community is
>>>> built around the code.
>>>> It says nothing about the maturity of the code.
>>>>
>>>> Alan.
>>>>
>>>> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:
>>>>
>>>>> Hi Markus,
>>>>>
>>>>> Currently I am doing almost the same task, but in Hive.
>>>>> In Hive you can use the native Avro+Hive integration:
>>>>> https://issues.apache.org/jira/browse/HIVE-895
>>>>> Or the haivvreo project if you are not using the latest version of Hive.
>>>>> There is also a Dynamic Partition feature in Hive that can separate
>>>>> your data by a column value.
>>>>>
>>>>> As for HCatalog, I decided against it after some investigation, because:
>>>>> 1) It is still incubating
>>>>> 2) It is not supported by Cloudera (the distribution provider we are
>>>>> currently using)
>>>>>
>>>>> I think it would be perfect if MultiStorage were generic in the
>>>>> way you described, but I am not familiar with it.
>>>>>
>>>>> Ruslan
>>>>>
>>>>> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[email protected]>
>>>>> wrote:
>>>>>> I am not aware of any work on adding those features to MultiStorage.
>>>>>>
>>>>>> I think the best way to do this is to use HCatalog. (It makes the Hive
>>>>>> metastore available to all of Hadoop, so you get metadata for your data
>>>>>> as well.)
>>>>>> You can associate an outputformat+serde with a table (instead of a file
>>>>>> name ending), and HCatStorer will automatically pick the right format.
>>>>>>
>>>>>> Thanks,
>>>>>> Thejas
>>>>>>
>>>>>>
>>>>>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>>>>>>
>>>>>>> Thanks Thejas,
>>>>>>>
>>>>>>> This _really_ helped a lot :)
>>>>>>> Some additional questions on this:
>>>>>>> As far as I can see, MultiStorage is currently only capable of writing
>>>>>>> CSV output, right? Is there currently any ongoing attempt to make this
>>>>>>> storage more generic regarding the format of the output data? For our
>>>>>>> needs we would require Avro output as well as some special proprietary
>>>>>>> binary encoding for which we have already created our own storage.
>>>>>>> I'm thinking about a storage that will select a certain writer method
>>>>>>> depending on the file name's ending.
>>>>>>>
>>>>>>> Do you know of such efforts?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Markus
>>>>>>>
>>>>>>>
>>>>>>> On Friday, 2012-06-22 at 11:23 -0700, Thejas Nair wrote:
>>>>>>>>
>>>>>>>> You can use the MultiStorage store func:
>>>>>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>>>>>>>>
>>>>>>>> Or if you want something more flexible, and metadata as well, use
>>>>>>>> HCatalog. Specify the keys on which you want to partition as your
>>>>>>>> partition keys in the table. Then use HCatStorer() to store the data.
>>>>>>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Thejas
>>>>>>>>
>>>>>>>>
>>>>>>>> On 6/22/12 4:54 AM, Markus Resch wrote:
>>>>>>>>>
>>>>>>>>> Hey everyone,
>>>>>>>>>
>>>>>>>>> We're doing some aggregation. The result contains a key, and we want
>>>>>>>>> to have a single output file for each key. Is it possible to store
>>>>>>>>> files like this? Especially adjusting the path by the key's value.
>>>>>>>>>
>>>>>>>>> Example:
>>>>>>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
>>>>>>>>> [.... doing stuff....]
>>>>>>>>> Output = GROUP AggregatesValues BY Key;
>>>>>>>>> FOREACH Output STORE * INTO
>>>>>>>>> '/my/output/path/by/$Output.Key/Result.avro'
>>>>>>>>>
>>>>>>>>> I know this example does not work. But is anything similar
>>>>>>>>> possible? And if, as I assume, not: is there some framework in the
>>>>>>>>> Hadoop world that can do such stuff?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> Markus
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>
>
> --
> Best Regards,
> Ruslan Al-Fakikh
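For reference, Thejas's MultiStorage suggestion might look roughly like this in Pig Latin. The paths, jar location, and field index here are illustrative assumptions, not values from the thread; the constructor arguments are the parent output path, the zero-based index of the split field, the compression codec, and the field delimiter, per the piggybank javadoc linked above.

```
REGISTER /path/to/piggybank.jar;

-- Assume the first field (index 0) of each tuple is the key to split on.
Input = LOAD 'my/data.avro'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- MultiStorage writes one output directory per distinct key value,
-- e.g. /my/output/path/<keyvalue>/. As Markus notes, the output is
-- delimited text, not Avro.
STORE Input INTO '/my/output/path'
      USING org.apache.pig.piggybank.storage.MultiStorage(
          '/my/output/path', '0', 'none', '\\t');
```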
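The HCatalog route Thejas describes assumes a table already defined in the Hive metastore with the split key as a partition column; a minimal sketch, with a hypothetical table name and partition column:

```
-- 'web_events' is assumed to exist in the metastore with a partition
-- column 'event_key'; HCatStorer picks the storage format from the
-- table's metadata (outputformat + serde), not from a file name ending.
Input = LOAD 'my/data.avro'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- With no arguments, HCatStorer partitions dynamically: each distinct
-- event_key value in the data lands in its own table partition.
STORE Input INTO 'web_events'
      USING org.apache.hcatalog.pig.HCatStorer();
```

A static partition can instead be named explicitly, e.g. HCatStorer('datestamp=20120622'); see the HCatalog 0.4.0 docs linked in the thread.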
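Ruslan's remark about the Hive metastore becoming "just a stub" for Avro schemas refers to haivvreo-style table definitions along these lines (table name and schema URL are placeholders; the linked haivvreo README is the authoritative syntax):

```
CREATE TABLE my_avro_table
  ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
  WITH SERDEPROPERTIES (
    'schema.url' = 'hdfs:///schemas/my_avro_table.avsc')
  STORED AS
    INPUTFORMAT  'com.linkedin.haivvreo.AvroContainerInputFormat'
    OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat';
-- No column list: Hive derives the columns from the Avro schema,
-- so the .avsc file remains the single source of truth.
```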
