Re: Best Practice: store depending on data content

Ruslan Al-Fakikh Tue, 03 Jul 2012 02:56:47 -0700

Dmitriy,

In our organization we use file paths for this purpose like this:
/incoming/datasetA
/incoming/datasetB
/reports/datasetC
etc
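
A minimal sketch of how a layout like this could be produced from Pig with the piggybank MultiStorage store function mentioned further down the thread. The relation name, field names, jar location, and the '/staging/all_datasets' input path are hypothetical, chosen only for illustration. MultiStorage writes delimited text, so this does not cover the Avro case.

```pig
-- Sketch only: relation, field, and path names are assumptions.
REGISTER piggybank.jar;  -- adjust to wherever the piggybank jar lives

records = LOAD '/staging/all_datasets' USING PigStorage('\t')
          AS (dataset:chararray, payload:chararray);

-- MultiStorage(parentPath, splitFieldIndex): one subdirectory is written
-- per distinct value of field 0, e.g. /incoming/datasetA, /incoming/datasetB.
STORE records INTO '/incoming'
      USING org.apache.pig.piggybank.storage.MultiStorage('/incoming', '0');
```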


On Mon, Jul 2, 2012 at 9:37 PM, Dmitriy Ryaboy <[email protected]> wrote:
> "It would give me the list of datasets in one place accessible from all
> tools,"
>
> And that's exactly why you want it.
>
> D
>
> On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh <[email protected]> wrote:
>> Hey Alan,
>>
>> I am not familiar with Apache processes, so I could be wrong in my
>> point 1, I am sorry.
>> Basically my impression was that Cloudera is pushing Avro format for
>> intercommunications between hadoop tools like pig, hive and mapreduce.
>> https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
>> http://www.cloudera.com/blog/2011/07/avro-data-interop/
>> And if I decide to use Avro then HCatalog becomes a little redundant.
>> It would give me the list of datasets in one place accessible from all
>> tools, but all the columns (names and types) would be stored in Avro
>> schemas and Hive metastore becomes just a stub for those Avro schemas:
>> https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
>> And having those Avro schemas I could access data from pig and
>> mapreduce without HCatalog. Though I haven't figured out how to manage
>> without Hive partitions yet.
>>
>> Best Regards,
>> Ruslan
>>
>> On Fri, Jun 29, 2012 at 9:13 PM, Alan Gates <[email protected]> wrote:
>>> On a different topic, I'm interested in why you refuse to use a project in
>>> the incubator.  Incubation is the Apache process by which a community is
>>> built around the code.  It says nothing about the maturity of the code.
>>>
>>> Alan.
>>>
>>> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:
>>>
>>>> Hi Markus,
>>>>
>>>> Currently I am doing almost the same task. But in Hive.
>>>> In Hive you can use the native Avro+Hive integration:
>>>> https://issues.apache.org/jira/browse/HIVE-895
>>>> Or the haivvreo project if you are not using the latest version of Hive.
>>>> Also there is a Dynamic Partition feature in Hive that can separate
>>>> your data by a column value.
>>>>
>>>> As for HCatalog - I refused to use it after some investigation, because:
>>>> 1) It is still incubating
>>>> 2) It is not supported by Cloudera (the distribution provider we are
>>>> currently using)
>>>>
>>>> I think it would be perfect if MultiStorage were generic in the
>>>> way you described, but I am not familiar with it.
>>>>
>>>> Ruslan
>>>>
>>>> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[email protected]> wrote:
>>>>> I am not aware of any work on adding those features to MultiStorage.
>>>>>
>>>>> I think the best way to do this is to use HCatalog. (It makes the Hive
>>>>> metastore available to all of Hadoop, so you get metadata for your data
>>>>> as well.)
>>>>> You can associate an outputformat+serde with a table (instead of a file
>>>>> name ending), and HCatStorer will automatically pick the right format.
>>>>>
>>>>> Thanks,
>>>>> Thejas
>>>>>
>>>>>
>>>>>
>>>>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>>>>>
>>>>>> Thanks Thejas,
>>>>>>
>>>>>> This _really_ helped a lot :)
>>>>>> An additional question on this:
>>>>>> As far as I can see, MultiStorage is currently only capable of writing
>>>>>> CSV output, right? Is there any ongoing effort to make this
>>>>>> storage more generic regarding the format of the output data? For our
>>>>>> needs we would require Avro output as well as a special proprietary
>>>>>> binary encoding for which we already created our own storage. I'm
>>>>>> thinking about a storage that selects a writer method
>>>>>> depending on the file name's ending.
>>>>>>
>>>>>> Do you know of such efforts?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Markus
>>>>>>
>>>>>>
>>>>>> On Friday, 22.06.2012, at 11:23 -0700, Thejas Nair wrote:
>>>>>>>
>>>>>>> You can use MultiStorage store func -
>>>>>>>
>>>>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>>>>>>>
>>>>>>> Or if you want something more flexible, and have metadata as well, use
>>>>>>> HCatalog. Specify the keys on which you want to partition as your
>>>>>>> partition keys in the table. Then use HCatStorer() to store the data.
>>>>>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thejas
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 6/22/12 4:54 AM, Markus Resch wrote:
>>>>>>>>
>>>>>>>> Hey everyone,
>>>>>>>>
>>>>>>>> We're doing some aggregation. The result contains a key where we want
>>>>>>>> to have a single output file for each key. Is it possible to store
>>>>>>>> files like this? Especially adjusting the path by the key's value.
>>>>>>>>
>>>>>>>> Example:
>>>>>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
>>>>>>>> [.... doing stuff....]
>>>>>>>> Output = GROUP AggregatesValues BY Key;
>>>>>>>> FOREACH Output Store * into
>>>>>>>> '/my/output/path/by/$Output.Key/Result.avro'
>>>>>>>>
>>>>>>>> I know this example does not work. But is anything similar
>>>>>>>> possible? And if, as I assume, it is not: is there some framework in
>>>>>>>> the hadoop world that can do this?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Markus
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>



-- 
Best Regards,
Ruslan Al-Fakikh
