Thanks for replying, JG. I have posted some questions inline.

On Mon, Oct 19, 2009 at 10:36 PM, Jonathan Gray <[email protected]> wrote:
> Are you currently being limited by network throughput? I wouldn't become
> obsessed with data locality until it becomes the bottleneck.
>

I was thinking that this method might be far more efficient (not sure, just
guessing) than the brute-force method, where we read the entire table2 in one
of the mappers of table1. I want to check the performance of both approaches.

> Even the naive implementation of this would not be entirely simple... but
> then what do you do if the regions on that node changed during the course of
> the map (splits, reassigns, etc)?
>

Since we can scan .META. to get the start and end keys of a particular region
and build scanners for them, I thought it would be easy. Any hint as to why it
can become complex?

> I would imagine you'll have other things to optimize well before network
> throughput becomes an issue. And if you do go down the route of this kind
> of (potential) hyper-optimization, you'll need to be aware of the hardware
> you're using and the performance impact of different approaches. If you
> only have a single disk, then concurrent scans of two different tables can
> cause disk contention, etc...
>
> Are you joining 2 tables by matching row key to row key? If so, then this
> sounds like 2 tables that should be 1 table with multiple families (that's
> really the value in multiple families... each family is really like a
> separate table, but they are easily joined together by row key).
>

I wanted to implement a join of 2 tables based on any column family (somewhat
similar to a database join).

> JG
>
>
> bharath v wrote:
>
>> Kevin: What if I want to implement a join of 2 tables? Is there an
>> alternative to TableInputFormat (TIF), because it reads a single table at a
>> time? I thought of a solution, but I am not sure whether it works fine.
>>
>> Suppose we want to join table1 and table2 and we use TIF on table1 and the
>> Map phase is as follows.
>>
>> Map:
>>
>> Suppose the TIF is reading region1 of table1. Then we can IN SOME WAY
>> get the region start and end keys corresponding to table2 on that
>> system (if any) where the map is being executed
>> and read the table2 contents in the Map. This is in some way preserving
>> DATA LOCALITY.
>>
>> Is this feasible? Any comments?
>>
>>
>>
>> On Fri, Oct 16, 2009 at 12:09 AM, Kevin Peterson <[email protected]> wrote:
>>
>>> On Thu, Oct 15, 2009 at 11:30 AM, Something Something <[email protected]> wrote:
>>>
>>>> 1) I don't think TableInputFormat is useful in this case. Looks like it's
>>>> used for scanning columns from a single HTable.
>>>> 2) TableMapReduceUtil - same problem. Seems like this works with just one
>>>> table.
>>>> 3) JV recommended NLineInputFormat, but my parameters are not in a file.
>>>> They come from multiple files and are in memory.
>>>>
>>>> I guess what I am looking for is something like... InMemoryInputFormat...
>>>> similar to FileInputFormat & DbInputFormat. There's no such class right
>>>> now.
>>>>
>>>> Worse comes to worst, I can write the parameters into a flat file, and use
>>>> FileInputFormat - but that will slow down this process considerably. Is
>>>> there no other way?
>>>
>>> So you need to pull input from multiple tables at once? Are you expecting
>>> to do a join on these tables? If you explain what the data looks like, we'd
>>> understand better. What are your tables, and what would you like to treat as
>>> a single input record?
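
For illustration only, the region lookup discussed in this thread (find the table2 regions hosted on the node where the map task runs, then scan just those key ranges) could be sketched roughly as below. This does not use the HBase client API: `RegionInfo`, `local_scan_ranges` and the host names are hypothetical stand-ins for rows read from .META., and the sketch assumes region assignments stay fixed while the task runs, which is exactly the assumption that splits and reassignments can break.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RegionInfo:
    """Hypothetical stand-in for one .META. row (not an HBase class)."""
    table: str
    start_key: bytes  # inclusive; b"" means "from the start of the table"
    end_key: bytes    # exclusive; b"" means "to the end of the table"
    host: str         # server currently hosting the region


def local_scan_ranges(meta: List[RegionInfo], table: str,
                      host: str) -> List[Tuple[bytes, bytes]]:
    """Return (start_key, end_key) ranges of `table` hosted on `host`.

    The result is a snapshot: if a region splits or moves after this call,
    the ranges may no longer be local (or may no longer exist as regions).
    """
    ranges = [(r.start_key, r.end_key)
              for r in meta
              if r.table == table and r.host == host]
    # Sort by start key so the ranges can be scanned in row-key order.
    return sorted(ranges, key=lambda kv: kv[0])


# Example: a map task on "node1" reading table1 wants table2's local regions.
meta_snapshot = [
    RegionInfo("table1", b"", b"", "node1"),
    RegionInfo("table2", b"", b"m", "node1"),
    RegionInfo("table2", b"m", b"", "node2"),
]
for start, end in local_scan_ranges(meta_snapshot, "table2", "node1"):
    print("would scan table2 from", start, "to", end)
```

Note that when the join is on a column family rather than the row key, the local table2 regions will generally not contain all the rows that match the table1 region being mapped, so a lookup like this only bounds what can be read locally; it does not complete the join by itself.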
