In our case we have:

/result/CustomerId1
/result/CustomerId2
/result/CustomerId3
/result/CustomerId4
[...]
As we have a _lot_ of customers ;) we don't want to add an extra line of
code for each one. I think MultiStorage is perfect for our use case, but
we would need to extend it to support Avro.

Best,
Markus

On Tuesday, 03.07.2012, at 13:56 +0400, Ruslan Al-Fakikh wrote:
> Dmitriy,
>
> In our organization we use file paths for this purpose, like this:
> /incoming/datasetA
> /incoming/datasetB
> /reports/datasetC
> etc.
>
> On Mon, Jul 2, 2012 at 9:37 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > "It would give me the list of datasets in one place accessible from all
> > tools,"
> >
> > And that's exactly why you want it.
> >
> > D
> >
> > On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh <[email protected]>
> > wrote:
> >> Hey Alan,
> >>
> >> I am not familiar with Apache processes, so I could be wrong in my
> >> point 1; I am sorry.
> >> Basically my impression was that Cloudera is pushing the Avro format
> >> for communication between Hadoop tools like Pig, Hive and MapReduce:
> >> https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
> >> http://www.cloudera.com/blog/2011/07/avro-data-interop/
> >> And if I decide to use Avro, then HCatalog becomes a little redundant.
> >> It would give me the list of datasets in one place, accessible from all
> >> tools, but all the columns (names and types) would be stored in Avro
> >> schemas, and the Hive metastore would become just a stub for those Avro
> >> schemas:
> >> https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
> >> Having those Avro schemas, I could access the data from Pig and
> >> MapReduce without HCatalog, though I haven't figured out how to get by
> >> without Hive partitions yet.
> >>
> >> Best Regards,
> >> Ruslan
> >>
> >> On Fri, Jun 29, 2012 at 9:13 PM, Alan Gates <[email protected]> wrote:
> >>> On a different topic, I'm interested in why you refuse to use a
> >>> project in the incubator. Incubation is the Apache process by which a
> >>> community is built around the code.
> >>> It says nothing about the maturity of the code.
> >>>
> >>> Alan.
> >>>
> >>> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:
> >>>
> >>>> Hi Markus,
> >>>>
> >>>> Currently I am doing almost the same task, but in Hive.
> >>>> In Hive you can use the native Avro+Hive integration:
> >>>> https://issues.apache.org/jira/browse/HIVE-895
> >>>> or the haivvreo project if you are not using the latest version of
> >>>> Hive.
> >>>> There is also a Dynamic Partition feature in Hive that can separate
> >>>> your data by a column value.
> >>>>
> >>>> As for HCatalog, I decided against it after some investigation,
> >>>> because:
> >>>> 1) it is still incubating, and
> >>>> 2) it is not supported by Cloudera (the distribution provider we are
> >>>> currently using).
> >>>>
> >>>> I think it would be perfect if MultiStorage were generic in the way
> >>>> you described, but I am not familiar with it.
> >>>>
> >>>> Ruslan
> >>>>
> >>>> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[email protected]>
> >>>> wrote:
> >>>>> I am not aware of any work on adding those features to MultiStorage.
> >>>>>
> >>>>> I think the best way to do this is to use HCatalog. (It makes the
> >>>>> Hive metastore available to all of Hadoop, so you get metadata for
> >>>>> your data as well.)
> >>>>> You can associate an outputformat+serde with a table (instead of a
> >>>>> file name ending), and HCatStorer will automatically pick the right
> >>>>> format.
> >>>>>
> >>>>> Thanks,
> >>>>> Thejas
> >>>>>
> >>>>> On 6/28/12 2:17 AM, Markus Resch wrote:
> >>>>>>
> >>>>>> Thanks Thejas,
> >>>>>>
> >>>>>> This _really_ helped a lot :)
> >>>>>> Some additional questions on this:
> >>>>>> As far as I can see, MultiStorage is currently only capable of
> >>>>>> writing CSV output, right? Is there any ongoing effort to make this
> >>>>>> storage more generic regarding the format of the output data?
> >>>>>> For our needs we would require Avro output, as well as a special
> >>>>>> proprietary binary encoding for which we have already created our
> >>>>>> own storage. I'm thinking of a storage that selects a certain
> >>>>>> writer method depending on the file name's ending.
> >>>>>>
> >>>>>> Do you know of such efforts?
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> Markus
> >>>>>>
> >>>>>> On Friday, 22.06.2012, at 11:23 -0700, Thejas Nair wrote:
> >>>>>>>
> >>>>>>> You can use the MultiStorage store func:
> >>>>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
> >>>>>>>
> >>>>>>> Or, if you want something more flexible and want metadata as well,
> >>>>>>> use HCatalog. Specify the keys on which you want to partition as
> >>>>>>> the partition keys of your table, then use HCatStorer() to store
> >>>>>>> the data.
> >>>>>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Thejas
> >>>>>>>
> >>>>>>> On 6/22/12 4:54 AM, Markus Resch wrote:
> >>>>>>>>
> >>>>>>>> Hey everyone,
> >>>>>>>>
> >>>>>>>> We're doing some aggregation. The result contains a key, and we
> >>>>>>>> want a single output file for each key. Is it possible to store
> >>>>>>>> files like this, in particular adjusting the path by the key's
> >>>>>>>> value?
> >>>>>>>>
> >>>>>>>> Example:
> >>>>>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
> >>>>>>>> [... doing stuff ...]
> >>>>>>>> Output = GROUP AggregatesValues BY Key;
> >>>>>>>> FOREACH Output STORE * INTO
> >>>>>>>> '/my/output/path/by/$Output.Key/Result.avro'
> >>>>>>>>
> >>>>>>>> I know this example does not work. But is anything similar
> >>>>>>>> possible? And if, as I assume, not: is there a framework in the
> >>>>>>>> Hadoop world that can do such things?
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>> Markus
