Thank you!

On Fri, Dec 7, 2012 at 9:51 AM, Billie Rinaldi <[email protected]> wrote:
> On Thu, Dec 6, 2012 at 2:32 PM, Aji Janis <[email protected]> wrote:
>
>> Thank you for the clarification. You mentioned that "Using the input
>> format, unless you override the autosplitting in it, you will get 1 mapper
>> per tablet." in your initial response. Again, pardon me for the newbie
>> question, but how do I find out if autosplitting is overridden or not?
>
> You override autosplitting with the command
> AccumuloInputFormat.disableAutoAdjustRanges(Configuration). So if you
> haven't done that, it will fit mappers to tablets.
>
> Billie
>
>> Aji
>>
>> On Tue, Dec 4, 2012 at 8:36 PM, John Vines <[email protected]> wrote:
>>
>>> Your first two presumptions are correct. You will get 3 mappers, and
>>> each mapper will have data for only one tablet.
>>>
>>> Each mapper will function exactly as a scanner for the range of the
>>> tablet, so you will get things in lexicographical order. So the mapper
>>> for tablet A will get all items for rowA in order before getting items
>>> for rowC.
>>>
>>> John
>>>
>>> On Tue, Dec 4, 2012 at 6:55 PM, Aji Janis <[email protected]> wrote:
>>>
>>>> Thank you, John, for your response. I do have a few followup questions.
>>>> Let me use a better example. Let's say my table and tablet server
>>>> distributions are as follows:
>>>>
>>>> ---------------------------------------------
>>>> MyTable:
>>>>
>>>> rowA | f1 | q1 | v1
>>>> rowA | f2 | q2 | v2
>>>> rowA | f3 | q3 | v3
>>>>
>>>> rowB | f1 | q1 | v1
>>>> rowB | f1 | q2 | v2
>>>>
>>>> rowC | f1 | q1 | v1
>>>>
>>>> rowD | f1 | q1 | v1
>>>> rowD | f1 | q2 | v2
>>>>
>>>> rowE | f1 | q1 | v1
>>>>
>>>> ---------------------------------------------
>>>>
>>>> TabletServer1: Tablet A: rowA, rowC
>>>> TabletServer2: Tablet B: rowB
>>>> TabletServer2: Tablet C: rowD
>>>>
>>>> --------------------------------------------
>>>>
>>>> In this example, suppose I have a map reduce job that reads from the
>>>> table above and writes to table MyTable2, using
>>>> org.apache.accumulo.core.client.mapreduce.*AccumuloInputFormat*
>>>> and org.apache.accumulo.core.client.mapreduce.*AccumuloOutputFormat*.
>>>>
>>>> Let's not focus on what the map reduce job itself is. From your
>>>> explanation below, it sounds like if autosplitting is not overridden,
>>>> then we get *three mappers* total. Is that right?
>>>>
>>>> Further, am I right in assuming that a mapper will NOT get data from
>>>> multiple tablets?
>>>>
>>>> I am also very confused about what the *order of input to the mapper*
>>>> will be. Would mapper@tabletA get
>>>> - all data from rowA before it gets all data from rowC, or
>>>> - all data from rowC before it gets all data from rowA, or
>>>> - something like:
>>>> rowA | f1 | q1 | v1
>>>> rowA | f2 | q2 | v2
>>>> rowC | f1 | q1 | v1
>>>> rowA | f3 | q3 | v3
>>>>
>>>> I know these are a lot of questions, but I'd really like to get a good
>>>> understanding of the architecture. Thank you!
>>>> Aji
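To make the setup Aji describes concrete, here is a minimal sketch of such a job, assuming the Accumulo 1.4-era Configuration-based setter methods discussed in this thread; the instance name, ZooKeeper hosts, credentials, and the CopyMapper class are placeholders, not details from the thread:

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
    import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class CopyTableJob {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "MyTable -> MyTable2");
        job.setJarByClass(CopyTableJob.class);

        // Read from MyTable. With autosplitting left at its default,
        // ranges are fit to tablets: one mapper per tablet, so three
        // mappers for the three-tablet example above.
        job.setInputFormatClass(AccumuloInputFormat.class);
        AccumuloInputFormat.setZooKeeperInstance(job.getConfiguration(),
            "myInstance", "zkhost1,zkhost2");            // placeholders
        AccumuloInputFormat.setInputInfo(job.getConfiguration(), "user",
            "secret".getBytes(), "MyTable", new Authorizations());

        // Write Mutations to MyTable2.
        job.setOutputFormatClass(AccumuloOutputFormat.class);
        AccumuloOutputFormat.setZooKeeperInstance(job.getConfiguration(),
            "myInstance", "zkhost1,zkhost2");
        AccumuloOutputFormat.setOutputInfo(job.getConfiguration(), "user",
            "secret".getBytes(), true, "MyTable2");

        job.setMapperClass(CopyMapper.class);   // sketched further down
        job.setNumReduceTasks(0);               // map-only copy
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Mutation.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }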
>>>> On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[email protected]> wrote:
>>>>
>>>>> A tablet consists of both an in-memory portion and zero or more files
>>>>> in HDFS. Each file may be one or many HDFS blocks. Accumulo gets a
>>>>> performance boost from the natural locality you get when you write
>>>>> data to HDFS, but if a tablet migrates, that locality can be lost
>>>>> until the data is compacted (rewritten). Locality may also be retained
>>>>> through HDFS replication, but Accumulo does not go to extraordinary
>>>>> lengths to chase that bit of locality, since the data will eventually
>>>>> be rewritten and locality restored.
>>>>>
>>>>> As for your example, if all data for a given row is inserted at the
>>>>> same time, then it is guaranteed to be in the same file. There is no
>>>>> atomicity guarantee regarding HDFS blocks, though, so depending on the
>>>>> block size and the amount of data in the file (and its distribution),
>>>>> it is possible for a few entries to span block boundaries even though
>>>>> they are adjacent.
>>>>>
>>>>> Using the input format, unless you override the autosplitting in it,
>>>>> you will get 1 mapper per tablet. If you disable auto-splitting, then
>>>>> you get one mapper per range you specify.
>>>>>
>>>>> Hope this helps; let me know if you have other questions or need
>>>>> clarification.
>>>>>
>>>>> John
>>>>>
>>>>> On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <[email protected]> wrote:
>>>>>
>>>>>> NOTE: I am fairly sure this hasn't been asked on here yet - my
>>>>>> apologies if it was already asked, in which case please forward me a
>>>>>> link to the answers. Thank you.
>>>>>>
>>>>>> If my environment setup is as follows:
>>>>>> - 64MB HDFS blocks
>>>>>> - 5 tablet servers
>>>>>> - 10 tablets of size 1GB each per tablet server
>>>>>>
>>>>>> If I have a table like below:
>>>>>> rowA | f1 | q1 | v1
>>>>>> rowA | f1 | q2 | v2
>>>>>>
>>>>>> rowB | f1 | q1 | v3
>>>>>>
>>>>>> rowC | f1 | q1 | v4
>>>>>> rowC | f2 | q1 | v5
>>>>>> rowC | f3 | q3 | v6
>>>>>>
>>>>>> From the little documentation I have read, I know all data about rowA
>>>>>> will go to one tablet, which may or may not contain data about other
>>>>>> rows, i.e. it's all or none. So my questions are:
>>>>>>
>>>>>> How are the tablets mapped to a datanode or HDFS block? Obviously,
>>>>>> one tablet is split into multiple HDFS blocks (16 in this case, at
>>>>>> 64MB per block for a 1GB tablet), so would they be stored on the same
>>>>>> or different datanode(s), or does it not matter?
>>>>>>
>>>>>> In the example above, would all data about rowC (or A or B) go onto
>>>>>> the same HDFS block or different HDFS blocks?
>>>>>>
>>>>>> When executing a map reduce job, how many mappers would I get? (One
>>>>>> per HDFS block? Per tablet? Per server?)
>>>>>>
>>>>>> Thank you in advance for any and all suggestions.
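To make John's point about input order concrete, here is a matching sketch of the hypothetical CopyMapper named in the job sketch above. Because each mapper functions as a scanner over a single tablet's range, entries arrive in sorted order (visibility and timestamp handling omitted for brevity):

    import java.io.IOException;

    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CopyMapper extends Mapper<Key, Value, Text, Mutation> {
      private static final Text OUTPUT_TABLE = new Text("MyTable2");

      @Override
      protected void map(Key key, Value value, Context context)
          throws IOException, InterruptedException {
        // Each mapper scans exactly one tablet, so keys arrive here in
        // lexicographical order: for Tablet A, every rowA entry is seen
        // before any rowC entry.
        Mutation m = new Mutation(key.getRow());
        m.put(key.getColumnFamily(), key.getColumnQualifier(), value);
        context.write(OUTPUT_TABLE, m);
      }
    }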

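Finally, to illustrate Billie's note about overriding autosplitting with disableAutoAdjustRanges: if the fragment below were added inside the job setup sketched earlier, the job would run one mapper per explicit range instead of one per tablet. The specific range is made up for illustration:

    import java.util.Collections;

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
    import org.apache.accumulo.core.data.Range;

    // One explicit range covering rowA through rowE, with range
    // auto-adjustment disabled: a single mapper handles the whole range,
    // regardless of how many tablets it spans.
    AccumuloInputFormat.setRanges(job.getConfiguration(),
        Collections.singletonList(new Range("rowA", "rowE")));
    AccumuloInputFormat.disableAutoAdjustRanges(job.getConfiguration());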