On Tue, Mar 27, 2012 at 5:46 PM, Bejoy KS <bejoy.had...@gmail.com> wrote:
> Hi Franc
>
> Adding on to Harsh's response. If you partition your data accordingly in
> Hive, you can easily switch full data scans on and off. Partitions and
> sub-partitions (multi-level partitions) help you hit only the required
> data set. How to partition depends entirely on your use cases and the
> queries intended for the data set. If you are looking at sampling, you
> may need to incorporate buckets as well.

Thanks, that's another vote for looking at Hive!

cheers

> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> -----Original Message-----
> From: Franc Carter <franc.car...@sirca.org.au>
> Date: Tue, 27 Mar 2012 17:26:49
> To: <common-user@hadoop.apache.org>
> Reply-To: common-user@hadoop.apache.org
> Subject: Re: Parts of a file as input
>
> On Tue, Mar 27, 2012 at 5:22 PM, Franc Carter <franc.car...@sirca.org.au> wrote:
>
> > On Tue, Mar 27, 2012 at 5:09 PM, Harsh J <ha...@cloudera.com> wrote:
> >
> >> Franc,
> >>
> >> With the given info, all we can tell is that it is possible, but we
> >> can't tell how, as we have no idea how your data/dimensions/etc. are
> >> structured. Being a little more specific would help.
> >
> > Thanks, I'll go into more detail.
> >
> > We have data for a large number of entities (tens of millions) for 15+
> > years, with fairly fine-grained timestamps (though we could use just
> > day granularity).
> >
> > At the extremes, some queries will need to select a small number of
> > entities for all 15 years, and some queries will need most of the
> > entities for a small time range.
> >
> > Our current architecture (which we are reviewing) stores the data in
> > 'day files' with a sort that increases the chance that the data we
> > want will be close together. We can then seek inside the files and
> > only retrieve/process the parts we need.
> >
> > I'd like to avoid Hadoop having to read and process all of every file
> > to answer queries that don't need all the data.
> >
> > Is that clearer?
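[Archive note: the partition-pruning idea Bejoy describes can be sketched in plain Python. The layout below (a table partitioned by a `dt=` day directory, under a hypothetical `/warehouse/ticks` base path) is an assumption for illustration, not from the thread; Hive does the equivalent automatically when a query's WHERE clause constrains the partition columns.]

```python
from datetime import date, timedelta

def partition_paths(start, end, base="/warehouse/ticks"):
    """Return only the day partitions a date-bounded query needs.

    Mimics Hive partition pruning: with a table partitioned by dt=,
    the planner skips every directory outside the queried range, so
    the full 15+ years of day files are never scanned.
    """
    day, paths = start, []
    while day <= end:
        paths.append(f"{base}/dt={day.isoformat()}")
        day += timedelta(days=1)
    return paths

# A 3-day query touches 3 partitions, not 15+ years of data.
paths = partition_paths(date(2012, 3, 1), date(2012, 3, 3))
```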
>
> I should also add that we know the entities and time range we are
> interested in at query submission time.
>
> >> It is possible to select and pass the right set of inputs per job,
> >> and also to implement record readers that read only what is
> >> specifically needed. This all depends on how your files are
> >> structured.
> >>
> >> Taking a wild guess, Apache Hive with its columnar storage (RCFile)
> >> format may also be what you are looking for.
> >
> > Thanks, I'll have a look into that.
> >
> > cheers
> >
> >> On Tue, Mar 27, 2012 at 11:32 AM, Franc Carter
> >> <franc.car...@sirca.org.au> wrote:
> >> > Hi,
> >> >
> >> > I'm very new to Hadoop and am working through how we may be able
> >> > to apply it to our data set.
> >> >
> >> > One of the things I am struggling with is understanding whether it
> >> > is possible to tell Hadoop that only parts of the input file will
> >> > be needed for a specific job. The reason I believe I may need this
> >> > is that we have two big dimensions in our data set. Queries may
> >> > want only one of these dimensions, and while some unneeded reading
> >> > is unavoidable, there are cases where reading the entire data set
> >> > presents a very significant overhead.
> >> > Or have I just misunderstood something ;-(
> >> >
> >> > thanks
> >> >
> >> > --
> >> >
> >> > *Franc Carter* | Systems architect | Sirca Ltd
> >> > <marc.zianideferra...@sirca.org.au>
> >> >
> >> > franc.car...@sirca.org.au | www.sirca.org.au
> >> >
> >> > Tel: +61 2 9236 9118
> >> >
> >> > Level 9, 80 Clarence St, Sydney NSW 2000
> >> >
> >> > PO Box H58, Australia Square, Sydney NSW 1215
> >>
> >> --
> >> Harsh J
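[Archive note: the "seek inside the day files and read only the parts we need" approach Franc describes, which Harsh's custom RecordReader suggestion generalizes, can be sketched in plain Python. The record format and byte offsets below are invented for illustration; a real Hadoop RecordReader would get its start/end offsets from the input split or from a per-file index over the sorted keys.]

```python
import io

def read_slice(f, start_offset, end_offset):
    """Sketch of a range-limited record reader: seek to the first byte
    of interest and stop at the last, instead of scanning the whole
    file. Works because the day files are sorted, so the wanted
    records are contiguous.
    """
    f.seek(start_offset)
    data = f.read(end_offset - start_offset)
    return [line for line in data.decode().splitlines() if line]

# Toy 'day file': sorted, newline-delimited records.
day_file = io.BytesIO(b"entity01,100\nentity02,200\nentity03,300\n")

# Suppose an index says entity02's record occupies bytes 13..26;
# only those 13 bytes are read, not the whole file.
records = read_slice(day_file, 13, 26)
```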