That is a very interesting off-topic discussion :) I think I will reinvestigate HCatalog some day and come up with specific questions.
Thanks a lot for explaining.

On Wed, Jul 4, 2012 at 4:37 AM, Dmitriy Ryaboy <[email protected]> wrote:
> Imagine increasing the number of datasets by a couple of orders of
> magnitude. "ls" stops being a good browsing tool pretty quickly.
>
> Then, add the need to manage quotas and retention policies for
> different data producers, to find resources across multiple teams, to
> have a web UI for easy metadata search...
>
> (and now we are totally and thoroughly off-topic. Sorry.)
>
> D
>
> On Tue, Jul 3, 2012 at 2:56 AM, Ruslan Al-Fakikh <[email protected]> wrote:
>> Dmitriy,
>>
>> In our organization we use file paths for this purpose, like this:
>> /incoming/datasetA
>> /incoming/datasetB
>> /reports/datasetC
>> etc.
>>
>> On Mon, Jul 2, 2012 at 9:37 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>> "It would give me the list of datasets in one place accessible from all
>>> tools,"
>>>
>>> And that's exactly why you want it.
>>>
>>> D
>>>
>>> On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh <[email protected]> wrote:
>>>> Hey Alan,
>>>>
>>>> I am not familiar with Apache processes, so I could be wrong in my
>>>> point 1, I am sorry.
>>>> Basically my impression was that Cloudera is pushing the Avro format for
>>>> intercommunication between Hadoop tools like Pig, Hive and MapReduce:
>>>> https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
>>>> http://www.cloudera.com/blog/2011/07/avro-data-interop/
>>>> And if I decide to use Avro, then HCatalog becomes a little redundant.
>>>> It would give me the list of datasets in one place accessible from all
>>>> tools, but all the columns (names and types) would be stored in Avro
>>>> schemas, and the Hive metastore becomes just a stub for those Avro schemas:
>>>> https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
>>>> And having those Avro schemas, I could access the data from Pig and
>>>> MapReduce without HCatalog. Though I haven't figured out how to do
>>>> without Hive partitions yet.
>>>>
>>>> Best Regards,
>>>> Ruslan
>>>>
>>>> On Fri, Jun 29, 2012 at 9:13 PM, Alan Gates <[email protected]> wrote:
>>>>> On a different topic, I'm interested in why you refuse to use a project
>>>>> in the incubator. Incubation is the Apache process by which a community is
>>>>> built around the code. It says nothing about the maturity of the code.
>>>>>
>>>>> Alan.
>>>>>
>>>>> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:
>>>>>
>>>>>> Hi Markus,
>>>>>>
>>>>>> Currently I am doing almost the same task, but in Hive.
>>>>>> In Hive you can use the native Avro+Hive integration:
>>>>>> https://issues.apache.org/jira/browse/HIVE-895
>>>>>> Or the haivvreo project if you are not using the latest version of Hive.
>>>>>> Also, there is a Dynamic Partition feature in Hive that can separate
>>>>>> your data by a column value.
>>>>>>
>>>>>> As for HCatalog - I refused to use it after some investigation, because:
>>>>>> 1) It is still incubating
>>>>>> 2) It is not supported by Cloudera (the distribution provider we are
>>>>>> currently using)
>>>>>>
>>>>>> I think it would be perfect if MultiStorage were generic in the
>>>>>> way you described, but I am not familiar with it.
>>>>>>
>>>>>> Ruslan
>>>>>>
>>>>>> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[email protected]> wrote:
>>>>>>> I am not aware of any work on adding those features to MultiStorage.
>>>>>>>
>>>>>>> I think the best way to do this is to use HCatalog. (It makes the Hive
>>>>>>> metastore available to all of Hadoop, so you get metadata for your
>>>>>>> data as well.)
>>>>>>> You can associate an outputformat+serde with a table (instead of a
>>>>>>> file-name ending), and HCatStorer will automatically pick the right
>>>>>>> format.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thejas
>>>>>>>
>>>>>>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>>>>>>>
>>>>>>>> Thanks Thejas,
>>>>>>>>
>>>>>>>> This _really_ helped a lot :)
>>>>>>>> Some additional questions on this:
>>>>>>>> As far as I can see, MultiStorage is currently only capable of writing
>>>>>>>> CSV output, right? Is there any ongoing attempt to make this
>>>>>>>> storage more generic regarding the format of the output data? For our
>>>>>>>> needs we would require Avro output as well as a special proprietary
>>>>>>>> binary encoding for which we have already created our own storage. I'm
>>>>>>>> thinking about a storage that selects a certain writer method
>>>>>>>> depending on the file name's ending.
>>>>>>>>
>>>>>>>> Do you know of such efforts?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Markus
>>>>>>>>
>>>>>>>> On Friday, 22.06.2012, at 11:23 -0700, Thejas Nair wrote:
>>>>>>>>>
>>>>>>>>> You can use the MultiStorage store func:
>>>>>>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>>>>>>>>>
>>>>>>>>> Or if you want something more flexible, with metadata as well, use
>>>>>>>>> HCatalog. Specify the keys on which you want to partition as your
>>>>>>>>> partition keys in the table. Then use HCatStorer() to store the data.
>>>>>>>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Thejas
>>>>>>>>>
>>>>>>>>> On 6/22/12 4:54 AM, Markus Resch wrote:
>>>>>>>>>>
>>>>>>>>>> Hey everyone,
>>>>>>>>>>
>>>>>>>>>> We're doing some aggregation. The result contains a key, and we want
>>>>>>>>>> a single output file for each key. Is it possible to store files
>>>>>>>>>> like this, especially adjusting the path by the key's value?
>>>>>>>>>>
>>>>>>>>>> Example:
>>>>>>>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
>>>>>>>>>> [.... doing stuff....]
>>>>>>>>>> Output = GROUP AggregatesValues BY Key;
>>>>>>>>>> FOREACH Output STORE * INTO '/my/output/path/by/$Output.Key/Result.avro'
>>>>>>>>>>
>>>>>>>>>> I know this example does not work. But is anything similar
>>>>>>>>>> possible? And if, as I assume, not: is there some framework in the
>>>>>>>>>> Hadoop world that can do such stuff?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Markus
>>
>> --
>> Best Regards,
>> Ruslan Al-Fakikh
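For reference, a minimal sketch of the MultiStorage approach Thejas suggests in the thread above. It assumes piggybank is on the classpath, that the Avro schema provides a field named Key, and that the aggregated relation carries the key in its first field (index 0); the relation, field, and path names follow Markus's example and are otherwise hypothetical. Note that MultiStorage writes delimited text, not Avro, which is exactly the limitation Markus raises:

REGISTER /usr/lib/pig/piggybank.jar;  -- piggybank location is install-specific

-- Load with piggybank's AvroStorage; the schema is read from the file.
Input = LOAD 'my/data.avro'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- [.... doing stuff ....] e.g. an aggregation that leaves the key in field 0:
Grouped    = GROUP Input BY Key;
Aggregated = FOREACH Grouped GENERATE group AS Key, COUNT(Input) AS cnt;

-- No explicit split is needed: MultiStorage(parentPath, '0') routes each
-- tuple to a subdirectory named after the value of field 0, e.g.
-- /my/output/path/keyA/keyA-0000, written as tab-delimited text.
STORE Aggregated INTO '/my/output/path'
      USING org.apache.pig.piggybank.storage.MultiStorage('/my/output/path', '0');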
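And a sketch of the HCatalog route, assuming the HCatalog jars are on Pig's classpath (e.g. via pig.additional.jars; see the r0.4.0 docs linked above) and that a table, here the hypothetical mydb.my_table, was created beforehand with 'key' declared as a partition column. The table's outputformat+serde, not the script, determines the on-disk format:

-- With no partition spec in the constructor, HCatStorer partitions
-- dynamically: each distinct value of the 'key' column is written to its
-- own partition (and hence its own directory) under the table's location.
STORE Aggregated INTO 'mydb.my_table'
      USING org.apache.hcatalog.pig.HCatStorer();

-- Static alternative when a run writes exactly one known key value:
-- STORE Aggregated INTO 'mydb.my_table'
--       USING org.apache.hcatalog.pig.HCatStorer('key=valueA');

This is what makes the format pluggable per table: switching the table's serde (for example to an Avro-backed one) changes the output format without touching the Pig script, which is the flexibility Thejas describes.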
