In our case we have:

/result/CustomerId1
/result/CustomerId2
/result/CustomerId3
/result/CustomerId4
[...]
As we have a _lot_ of customers ;) we don't want to add an extra line of
code for each one. I think MultiStorage is perfect for our use case, but
we would need to extend it to support Avro.

Best,
Markus

On Tuesday, 03.07.2012, at 13:56 +0400, Ruslan Al-Fakikh wrote:
> Dmitriy,
>
> In our organization we use file paths for this purpose, like this:
> /incoming/datasetA
> /incoming/datasetB
> /reports/datasetC
> etc.
>
> On Mon, Jul 2, 2012 at 9:37 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > "It would give me the list of datasets in one place accessible from all
> > tools,"
> >
> > And that's exactly why you want it.
> >
> > D
> >
> > On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh <[email protected]>
> > wrote:
> >> Hey Alan,
> >>
> >> I am not familiar with Apache processes, so I could be wrong in my
> >> point 1; I am sorry.
> >> Basically my impression was that Cloudera is pushing the Avro format
> >> for communication between Hadoop tools like Pig, Hive and MapReduce:
> >> https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
> >> http://www.cloudera.com/blog/2011/07/avro-data-interop/
> >> And if I decide to use Avro, then HCatalog becomes a little redundant.
> >> It would give me the list of datasets in one place, accessible from all
> >> tools, but all the columns (names and types) would be stored in Avro
> >> schemas, and the Hive metastore would become just a stub for those Avro
> >> schemas:
> >> https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
> >> Having those Avro schemas, I could access the data from Pig and
> >> MapReduce without HCatalog, though I haven't figured out how to get by
> >> without Hive partitions yet.
> >>
> >> Best Regards,
> >> Ruslan
> >>
> >> On Fri, Jun 29, 2012 at 9:13 PM, Alan Gates <[email protected]> wrote:
> >>> On a different topic, I'm interested in why you refuse to use a
> >>> project in the incubator. Incubation is the Apache process by which a
> >>> community is built around the code.
> >>> It says nothing about the maturity of the code.
> >>>
> >>> Alan.
> >>>
> >>> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:
> >>>
> >>>> Hi Markus,
> >>>>
> >>>> Currently I am doing almost the same task, but in Hive.
> >>>> In Hive you can use the native Avro+Hive integration:
> >>>> https://issues.apache.org/jira/browse/HIVE-895
> >>>> or the haivvreo project if you are not using the latest version of
> >>>> Hive.
> >>>> There is also a Dynamic Partition feature in Hive that can separate
> >>>> your data by a column value.
> >>>>
> >>>> As for HCatalog, I decided against it after some investigation,
> >>>> because:
> >>>> 1) it is still incubating, and
> >>>> 2) it is not supported by Cloudera (the distribution provider we are
> >>>> currently using).
> >>>>
> >>>> I think it would be perfect if MultiStorage were generic in the way
> >>>> you described, but I am not familiar with it.
> >>>>
> >>>> Ruslan
> >>>>
> >>>> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[email protected]>
> >>>> wrote:
> >>>>> I am not aware of any work on adding those features to MultiStorage.
> >>>>>
> >>>>> I think the best way to do this is to use HCatalog. (It makes the
> >>>>> Hive metastore available to all of Hadoop, so you get metadata for
> >>>>> your data as well.)
> >>>>> You can associate an outputformat+serde with a table (instead of a
> >>>>> file name ending), and HCatStorer will automatically pick the right
> >>>>> format.
> >>>>>
> >>>>> Thanks,
> >>>>> Thejas
> >>>>>
> >>>>> On 6/28/12 2:17 AM, Markus Resch wrote:
> >>>>>>
> >>>>>> Thanks Thejas,
> >>>>>>
> >>>>>> This _really_ helped a lot :)
> >>>>>> Some additional questions on this:
> >>>>>> As far as I can see, MultiStorage is currently only capable of
> >>>>>> writing CSV output, right? Is there any ongoing effort to make this
> >>>>>> storage more generic regarding the format of the output data?
> >>>>>> For our needs we would require Avro output, as well as a special
> >>>>>> proprietary binary encoding for which we have already created our
> >>>>>> own storage. I'm thinking of a storage that selects a certain
> >>>>>> writer method depending on the file name's ending.
> >>>>>>
> >>>>>> Do you know of such efforts?
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> Markus
> >>>>>>
> >>>>>> On Friday, 22.06.2012, at 11:23 -0700, Thejas Nair wrote:
> >>>>>>>
> >>>>>>> You can use the MultiStorage store func:
> >>>>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
> >>>>>>>
> >>>>>>> Or, if you want something more flexible and want metadata as well,
> >>>>>>> use HCatalog. Specify the keys on which you want to partition as
> >>>>>>> the partition keys of your table, then use HCatStorer() to store
> >>>>>>> the data.
> >>>>>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Thejas
> >>>>>>>
> >>>>>>> On 6/22/12 4:54 AM, Markus Resch wrote:
> >>>>>>>>
> >>>>>>>> Hey everyone,
> >>>>>>>>
> >>>>>>>> We're doing some aggregation. The result contains a key, and we
> >>>>>>>> want a single output file for each key. Is it possible to store
> >>>>>>>> files like this, in particular adjusting the path by the key's
> >>>>>>>> value?
> >>>>>>>>
> >>>>>>>> Example:
> >>>>>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
> >>>>>>>> [... doing stuff ...]
> >>>>>>>> Output = GROUP AggregatesValues BY Key;
> >>>>>>>> FOREACH Output STORE * INTO
> >>>>>>>> '/my/output/path/by/$Output.Key/Result.avro'
> >>>>>>>>
> >>>>>>>> I know this example does not work. But is anything similar
> >>>>>>>> possible? And if, as I assume, not: is there a framework in the
> >>>>>>>> Hadoop world that can do such things?
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>> Markus
