On Tue, Mar 27, 2012 at 5:46 PM, Bejoy KS <bejoy.had...@gmail.com> wrote:
> Hi Franc
>
> Adding on to Harsh's response. If you partition your data accordingly in
> Hive, you can easily switch full data scans on and off. Partitions and
> sub-partitions (multi-level partitions) help you hit only the required
> data set. How to partition depends entirely on your use cases and the
> queries intended for the data set. If you are looking at sampling, you
> may need to incorporate buckets as well.

Thanks, that's another vote for looking at Hive!

cheers

> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> -----Original Message-----
> From: Franc Carter <franc.car...@sirca.org.au>
> Date: Tue, 27 Mar 2012 17:26:49
> To: <common-user@hadoop.apache.org>
> Reply-To: common-user@hadoop.apache.org
> Subject: Re: Parts of a file as input
>
> On Tue, Mar 27, 2012 at 5:22 PM, Franc Carter <franc.car...@sirca.org.au> wrote:
>
> > On Tue, Mar 27, 2012 at 5:09 PM, Harsh J <ha...@cloudera.com> wrote:
> >
> >> Franc,
> >>
> >> With the given info, all we can tell is that it is possible, but we
> >> can't tell how, as we have no idea how your data/dimensions/etc. are
> >> structured. Being a little more specific would help.
> >
> > Thanks, I'll go into more detail.
> >
> > We have data for a large number of entities (tens of millions) for 15+
> > years, with fairly fine-grained timestamps (though we could use just
> > day granularity).
> >
> > At the extremes, some queries will need to select a small number of
> > entities for all 15 years, and some queries will need most of the
> > entities for a small time range.
> >
> > Our current architecture (which we are reviewing) stores the data in
> > 'day files' with a sort that increases the chance that the data we
> > want will be close together. We can then seek inside the files and
> > only retrieve/process the parts we need.
> >
> > I'd like to avoid Hadoop having to read and process all of every file
> > to answer queries that don't need all the data.
> >
> > Is that clearer?
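[Archive note: the partition-pruning idea Bejoy describes can be sketched in plain Python. The layout below (a table partitioned by a `dt=` day directory, under a hypothetical `/warehouse/ticks` base path) is an assumption for illustration, not from the thread; Hive does the equivalent automatically when a query's WHERE clause constrains the partition columns.]

```python
from datetime import date, timedelta

def partition_paths(start, end, base="/warehouse/ticks"):
    """Return only the day partitions a date-bounded query needs.

    Mimics Hive partition pruning: with a table partitioned by dt=,
    the planner skips every directory outside the queried range, so
    the full 15+ years of day files are never scanned.
    """
    day, paths = start, []
    while day <= end:
        paths.append(f"{base}/dt={day.isoformat()}")
        day += timedelta(days=1)
    return paths

# A 3-day query touches 3 partitions, not 15+ years of data.
paths = partition_paths(date(2012, 3, 1), date(2012, 3, 3))
```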
>
> I should also add that we know the entities and time range we are
> interested in at query submission time.
>
> >> It is possible to select and pass the right set of inputs per job,
> >> and also to implement record readers that read only what is
> >> specifically needed. This all depends on how your files are
> >> structured.
> >>
> >> Taking a wild guess, Apache Hive with its columnar storage (RCFile)
> >> format may also be what you are looking for.
> >
> > Thanks, I'll have a look into that.
> >
> > cheers
> >
> >> On Tue, Mar 27, 2012 at 11:32 AM, Franc Carter
> >> <franc.car...@sirca.org.au> wrote:
> >> > Hi,
> >> >
> >> > I'm very new to Hadoop and am working through how we may be able
> >> > to apply it to our data set.
> >> >
> >> > One of the things I am struggling with is understanding whether it
> >> > is possible to tell Hadoop that only parts of the input file will
> >> > be needed for a specific job. The reason I believe I may need this
> >> > is that we have two big dimensions in our data set. Queries may
> >> > want only one of these dimensions, and while some unneeded reading
> >> > is unavoidable, there are cases where reading the entire data set
> >> > presents a very significant overhead.
> >> > Or have I just misunderstood something ;-(
> >> >
> >> > thanks
> >> >
> >> > --
> >> >
> >> > *Franc Carter* | Systems architect | Sirca Ltd
> >> > <marc.zianideferra...@sirca.org.au>
> >> >
> >> > franc.car...@sirca.org.au | www.sirca.org.au
> >> >
> >> > Tel: +61 2 9236 9118
> >> >
> >> > Level 9, 80 Clarence St, Sydney NSW 2000
> >> >
> >> > PO Box H58, Australia Square, Sydney NSW 1215
> >>
> >> --
> >> Harsh J
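[Archive note: the "seek inside the day files and read only the parts we need" approach Franc describes, which Harsh's custom RecordReader suggestion generalizes, can be sketched in plain Python. The record format and byte offsets below are invented for illustration; a real Hadoop RecordReader would get its start/end offsets from the input split or from a per-file index over the sorted keys.]

```python
import io

def read_slice(f, start_offset, end_offset):
    """Sketch of a range-limited record reader: seek to the first byte
    of interest and stop at the last, instead of scanning the whole
    file. Works because the day files are sorted, so the wanted
    records are contiguous.
    """
    f.seek(start_offset)
    data = f.read(end_offset - start_offset)
    return [line for line in data.decode().splitlines() if line]

# Toy 'day file': sorted, newline-delimited records.
day_file = io.BytesIO(b"entity01,100\nentity02,200\nentity03,300\n")

# Suppose an index says entity02's record occupies bytes 13..26;
# only those 13 bytes are read, not the whole file.
records = read_slice(day_file, 13, 26)
```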