Imagine increasing the number of datasets by a couple of orders of magnitude. "ls" stops being a good browsing tool pretty quickly.
Then, add the need to manage quotas and retention policies for different data producers, to find resources across multiple teams, to have a web UI for easy metadata search... (and now we are totally and thoroughly offtopic. Sorry.)

D

On Tue, Jul 3, 2012 at 2:56 AM, Ruslan Al-Fakikh <[email protected]> wrote:
> Dmitriy,
>
> In our organization we use file paths for this purpose, like this:
> /incoming/datasetA
> /incoming/datasetB
> /reports/datasetC
> etc.
>
> On Mon, Jul 2, 2012 at 9:37 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> "It would give me the list of datasets in one place accessible from all
>> tools,"
>>
>> And that's exactly why you want it.
>>
>> D
>>
>> On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh <[email protected]>
>> wrote:
>>> Hey Alan,
>>>
>>> I am not familiar with Apache processes, so I could be wrong in my
>>> point 1, I am sorry.
>>> Basically my impression was that Cloudera is pushing the Avro format for
>>> intercommunication between Hadoop tools like Pig, Hive and MapReduce:
>>> https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
>>> http://www.cloudera.com/blog/2011/07/avro-data-interop/
>>> And if I decide to use Avro, then HCatalog becomes a little redundant.
>>> It would give me the list of datasets in one place accessible from all
>>> tools, but all the columns (names and types) would be stored in Avro
>>> schemas, and the Hive metastore becomes just a stub for those Avro schemas:
>>> https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
>>> And having those Avro schemas I could access the data from Pig and
>>> MapReduce without HCatalog. Though I haven't figured out how to deal
>>> without Hive partitions yet.
>>>
>>> Best Regards,
>>> Ruslan
>>>
>>> On Fri, Jun 29, 2012 at 9:13 PM, Alan Gates <[email protected]> wrote:
>>>> On a different topic, I'm interested in why you refuse to use a project in
>>>> the incubator. Incubation is the Apache process by which a community is
>>>> built around the code.
>>>> It says nothing about the maturity of the code.
>>>>
>>>> Alan.
>>>>
>>>> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:
>>>>
>>>>> Hi Markus,
>>>>>
>>>>> Currently I am doing almost the same task, but in Hive.
>>>>> In Hive you can use the native Avro+Hive integration:
>>>>> https://issues.apache.org/jira/browse/HIVE-895
>>>>> Or the haivvreo project if you are not using the latest version of Hive.
>>>>> There is also a Dynamic Partition feature in Hive that can separate
>>>>> your data by a column value.
>>>>>
>>>>> As for HCatalog, I decided against it after some investigation, because:
>>>>> 1) It is still incubating
>>>>> 2) It is not supported by Cloudera (the distribution provider we are
>>>>> currently using)
>>>>>
>>>>> I think it would be perfect if MultiStorage were generic in the
>>>>> way you described, but I am not familiar with it.
>>>>>
>>>>> Ruslan
>>>>>
>>>>> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[email protected]>
>>>>> wrote:
>>>>>> I am not aware of any work on adding those features to MultiStorage.
>>>>>>
>>>>>> I think the best way to do this is to use HCatalog. (It makes the Hive
>>>>>> metastore available to all of Hadoop, so you get metadata for your data
>>>>>> as well.)
>>>>>> You can associate an outputformat+serde with a table (instead of a file
>>>>>> name ending), and HCatStorer will automatically pick the right format.
>>>>>>
>>>>>> Thanks,
>>>>>> Thejas
>>>>>>
>>>>>>
>>>>>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>>>>>>
>>>>>>> Thanks Thejas,
>>>>>>>
>>>>>>> This _really_ helped a lot :)
>>>>>>> Some additional questions on this:
>>>>>>> As far as I can see, MultiStorage is currently only capable of writing
>>>>>>> CSV output, right? Is there currently any ongoing attempt to make this
>>>>>>> storage more generic regarding the format of the output data? For our
>>>>>>> needs we would require Avro output as well as some special proprietary
>>>>>>> binary encoding for which we have already created our own storage.
>>>>>>> I'm thinking about a storage that will select a certain writer method
>>>>>>> depending on the file name's ending.
>>>>>>>
>>>>>>> Do you know of such efforts?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Markus
>>>>>>>
>>>>>>>
>>>>>>> On Friday, 2012-06-22 at 11:23 -0700, Thejas Nair wrote:
>>>>>>>>
>>>>>>>> You can use the MultiStorage store func:
>>>>>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>>>>>>>>
>>>>>>>> Or if you want something more flexible, and metadata as well, use
>>>>>>>> HCatalog. Specify the keys on which you want to partition as your
>>>>>>>> partition keys in the table. Then use HCatStorer() to store the data.
>>>>>>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Thejas
>>>>>>>>
>>>>>>>>
>>>>>>>> On 6/22/12 4:54 AM, Markus Resch wrote:
>>>>>>>>>
>>>>>>>>> Hey everyone,
>>>>>>>>>
>>>>>>>>> We're doing some aggregation. The result contains a key, and we want
>>>>>>>>> to have a single output file for each key. Is it possible to store
>>>>>>>>> files like this? Especially adjusting the path by the key's value.
>>>>>>>>>
>>>>>>>>> Example:
>>>>>>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
>>>>>>>>> [.... doing stuff....]
>>>>>>>>> Output = GROUP AggregatesValues BY Key;
>>>>>>>>> FOREACH Output STORE * INTO
>>>>>>>>> '/my/output/path/by/$Output.Key/Result.avro'
>>>>>>>>>
>>>>>>>>> I know this example does not work. But is anything similar
>>>>>>>>> possible? And if, as I assume, not: is there some framework in the
>>>>>>>>> Hadoop world that can do such stuff?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> Markus
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>
>
> --
> Best Regards,
> Ruslan Al-Fakikh
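For reference, Thejas's MultiStorage suggestion might look roughly like this in Pig Latin. The paths, jar location, and field index here are illustrative assumptions, not values from the thread; the constructor arguments are the parent output path, the zero-based index of the split field, the compression codec, and the field delimiter, per the piggybank javadoc linked above.

```
REGISTER /path/to/piggybank.jar;

-- Assume the first field (index 0) of each tuple is the key to split on.
Input = LOAD 'my/data.avro'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- MultiStorage writes one output directory per distinct key value,
-- e.g. /my/output/path/<keyvalue>/. As Markus notes, the output is
-- delimited text, not Avro.
STORE Input INTO '/my/output/path'
      USING org.apache.pig.piggybank.storage.MultiStorage(
          '/my/output/path', '0', 'none', '\\t');
```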
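The HCatalog route Thejas describes assumes a table already defined in the Hive metastore with the split key as a partition column; a minimal sketch, with a hypothetical table name and partition column:

```
-- 'web_events' is assumed to exist in the metastore with a partition
-- column 'event_key'; HCatStorer picks the storage format from the
-- table's metadata (outputformat + serde), not from a file name ending.
Input = LOAD 'my/data.avro'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- With no arguments, HCatStorer partitions dynamically: each distinct
-- event_key value in the data lands in its own table partition.
STORE Input INTO 'web_events'
      USING org.apache.hcatalog.pig.HCatStorer();
```

A static partition can instead be named explicitly, e.g. HCatStorer('datestamp=20120622'); see the HCatalog 0.4.0 docs linked in the thread.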
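Ruslan's remark about the Hive metastore becoming "just a stub" for Avro schemas refers to haivvreo-style table definitions along these lines (table name and schema URL are placeholders; the linked haivvreo README is the authoritative syntax):

```
CREATE TABLE my_avro_table
  ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
  WITH SERDEPROPERTIES (
    'schema.url' = 'hdfs:///schemas/my_avro_table.avsc')
  STORED AS
    INPUTFORMAT  'com.linkedin.haivvreo.AvroContainerInputFormat'
    OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat';
-- No column list: Hive derives the columns from the Avro schema,
-- so the .avsc file remains the single source of truth.
```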
