Hi Constantin,

The issues you're having sound like they're (probably) much more related to MapReduce than to Phoenix. To first determine what the real issue is, could you give a general overview of how your MR job is implemented (or, even better, give me a pointer to it on GitHub or something similar)?
- Gabriel

On Thu, Jan 15, 2015 at 2:19 PM, Ciureanu, Constantin (GfK) <[email protected]> wrote:
> Hello all,
>
> I finished the MR job - for now it has just failed a few times because the mappers hit some odd timeout (600 seconds), apparently not processing anything in the meantime.
> When I check the running mappers, only 3 of them are progressing (quite fast, though - why are only 3 working? I have 6 machines, so 24 tasks can run at the same time).
>
> Could this be because of some limit on the number of connections to Phoenix?
>
> Regards,
> Constantin
>
> -----Original Message-----
> From: Ciureanu, Constantin (GfK) [mailto:[email protected]]
> Sent: Wednesday, January 14, 2015 9:44 AM
> To: [email protected]
> Subject: RE: MapReduce bulk load into Phoenix table
>
> Hello James,
>
> Yes, as low as 1500 rows/sec - using Phoenix JDBC with batch inserts of 1000 records at a time, but there are at least 100 dynamic columns per row.
> I was expecting higher values, of course - but I will soon finish coding an MR job to load the same data using Hadoop.
> The code I read and adapted for my MR job is from your CsvBulkLoadTool. [After finishing it I will test it and then post new speed results.] It basically uses a Phoenix connection for a "dummy upsert", then takes the Key + List<KV> and rolls back the connection - that was my question yesterday, whether there is a better way.
> My new problem is that the CsvUpsertExecutor needs a list of fields, which I don't have since the columns are dynamic (and I don't use a CSV source anyway).
> So it would have been nice to have a "reusable building block of code" for this - I'm sure everyone needs fast and clean template code for loading data into a destination HBase (or Phoenix) table using Phoenix + MR.
> I can create the row key by concatenating my key fields - but I don't know (yet) how to obtain the salting byte(s).
>
> My current test cluster details:
> - 6x dual-core machines (on AWS)
> - more than 100 TB of disk space
> - the table is salted into 8 buckets and has 8 columns common to all rows
>
> Thank you for your answer and for the technical support on this mailing list,
> Constantin
>
> -----Original Message-----
> From: James Taylor [mailto:[email protected]]
> Sent: Tuesday, January 13, 2015 7:23 PM
> To: user
> Subject: Re: MapReduce bulk load into Phoenix table
>
> Hi Constantin,
> 1000-1500 rows per sec? Using our performance.py script, on my Mac laptop, I'm seeing 27,000 rows per sec (Phoenix 4.2.2 with HBase 0.98.9).
>
> If you want to realistically measure performance, I'd recommend doing so on a real cluster. If you'll really only have a single machine, then you're probably better off using something like MySQL. Using the MapReduce-based CSV loader on a single node is not going to speed anything up. For a cluster it can make a difference, though. See http://phoenix.apache.org/phoenix_mr.html
>
> FYI, Phoenix indexes are only maintained if you go through the Phoenix APIs.
>
> Thanks,
> James
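The plain JDBC path described above (batched upserts, committing every 1000 rows) looks roughly like the sketch below. This is only an illustration, not code from the thread: the connection URL, table, and column names are placeholders, and the key point is that with auto-commit off Phoenix buffers each executeUpdate() on the client, so it is the commit() that actually sends a batch of mutations to HBase.

// Minimal sketch of batched upserts over Phoenix JDBC; URL/table/columns are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchUpsertSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host")) {
            conn.setAutoCommit(false);
            String sql = "UPSERT INTO MY_TABLE (ID, EVENT_TS, COL1) VALUES (?, ?, ?)";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                for (long i = 0; i < 100000; i++) {
                    stmt.setString(1, "key-" + i);
                    stmt.setLong(2, System.currentTimeMillis());
                    stmt.setString(3, "value-" + i);
                    stmt.executeUpdate();        // buffered client-side only
                    if ((i + 1) % 1000 == 0) {
                        conn.commit();           // flushes ~1000 buffered rows to HBase
                    }
                }
                conn.commit();                   // flush the remainder
            }
        }
    }
}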
> On Tue, Jan 13, 2015 at 2:45 AM, Vaclav Loffelmann <[email protected]> wrote:
>> I think the easiest way to determine whether indexes are maintained when inserting directly into HBase is to test it. If they are maintained by region observer coprocessors, they should be. (I'll run tests as soon as I have some time.)
>>
>> I don't see any problem with different columns between rows. Make the view the same way you'd make the table definition. Null values are not stored in HBase, hence there's no overhead.
>>
>> I'm afraid there isn't any (publicly available) piece of code showing how to do that, but it is very straightforward.
>> If you use a composite primary key, concatenate the results of multiple PDataType.TYPE.toBytes() calls as the row key. For the values use the same logic. The data types are defined as enums in this class: org.apache.phoenix.schema.PDataType.
>>
>> Good luck,
>> Vaclav
>>
>> On 01/13/2015 10:58 AM, Ciureanu, Constantin (GfK) wrote:
>>> Thank you Vaclav,
>>>
>>> I just started writing some code today :) for an MR job that will load data into HBase + Phoenix. Previously I wrote an application to load data using Phoenix JDBC (slow), but I also have experience with HBase, so I can understand and write code to load data directly there.
>>>
>>> If I do that, I'm also worried about:
>>> - maintaining (any existing) Phoenix indexes - perhaps this still works if the (same) coprocessors trigger at insert time, but I can't tell how it works behind the scenes;
>>> - having a Phoenix view over the HBase table would "solve" the above problem (there would be no index whatsoever) but would create a lot of other problems (my table has a limited number of common columns and the rest differ too much from row to row - in total I have hundreds of possible columns).
>>>
>>> So - to make things faster for me - is there any good piece of code I can find on the internet about how to map my data types to Phoenix data types and use the results in a regular HBase bulk load?
>>>
>>> Regards, Constantin
>>>
>>> -----Original Message-----
>>> From: Vaclav Loffelmann [mailto:[email protected]]
>>> Sent: Tuesday, January 13, 2015 10:30 AM
>>> To: [email protected]
>>> Subject: Re: MapReduce bulk load into Phoenix table
>>>
>>> Hi, our daily usage is to import raw data directly into HBase, but mapped to Phoenix data types. For querying we use a Phoenix view on top of that HBase table.
>>>
>>> With that you should hit the bottleneck of HBase itself. It should be 10 to 30+ times faster than your current solution, depending on the hardware of course.
>>>
>>> I'd prefer this solution for stream writes.
>>>
>>> Vaclav
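A rough sketch of the approach Vaclav describes (encode the composite key and the values with Phoenix's type system, write directly through the HBase client, and query through a Phoenix view with matching column types) might look like the following. It assumes the PDataType enum referenced above (newer Phoenix releases moved the types to org.apache.phoenix.schema.types, e.g. PVarchar.INSTANCE), HBase 0.98-era client classes, and made-up table, column family, and column names; a salted target table would additionally need the leading salt byte.

// Rough sketch: Phoenix-encoded composite row key and values written via the HBase API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.phoenix.query.QueryConstants;
import org.apache.phoenix.schema.PDataType;

public class DirectHBaseWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (HTable table = new HTable(conf, "MY_TABLE")) {
            // Composite row key: Phoenix terminates variable-length key columns
            // (such as VARCHAR) with a zero byte when they are not the last column.
            byte[] rowKey = Bytes.add(
                    PDataType.VARCHAR.toBytes("customer-42"),
                    new byte[] { QueryConstants.SEPARATOR_BYTE },
                    PDataType.UNSIGNED_LONG.toBytes(20150113L));

            Put put = new Put(rowKey);
            // Non-PK columns, encoded with the same Phoenix types the view declares
            // ("0" is Phoenix's default column family).
            put.add(Bytes.toBytes("0"), Bytes.toBytes("VAL1"),
                    PDataType.VARCHAR.toBytes("some value"));
            put.add(Bytes.toBytes("0"), Bytes.toBytes("VAL2"),
                    PDataType.DECIMAL.toBytes(java.math.BigDecimal.valueOf(12.5)));
            table.put(put);
        }
    }
}

For querying, the matching Phoenix view would declare the same primary key columns and types (here CUSTOMER VARCHAR and EVENT_TS UNSIGNED_LONG) plus the "0".VAL1 and "0".VAL2 columns over the existing table.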
>>> On 01/13/2015 10:12 AM, Ciureanu, Constantin (GfK) wrote:
>>>> Hello all,
>>>>
>>>> (Due to the slow speed of Phoenix JDBC - on a single machine ~1000-1500 rows/sec) I am also reading up on loading data into Phoenix via MapReduce.
>>>>
>>>> So far I have understood that the Key + List<[Key,Value]> to be inserted into the HBase table is obtained via a "dummy" Phoenix connection - then those rows are written to HFiles (and after the MR job finishes, those HFiles are bulk loaded into HBase in the usual way).
>>>>
>>>> My question: is there any better / faster approach? I assume this cannot reach the maximum speed for loading data into a Phoenix / HBase table.
>>>>
>>>> Also, I would like to find better / newer sample code than this one:
>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
>>>>
>>>> Thank you, Constantin
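For anyone adapting CsvBulkLoadTool to non-CSV input with dynamic columns, the mapper side of the "dummy upsert" technique discussed in this thread is roughly the sketch below. It keeps a Phoenix connection with auto-commit off, issues an UPSERT per input record (dynamic columns can be declared inline in the UPSERT column list), drains the uncommitted KeyValues that Phoenix generated (for a salted table their row keys already carry the salt byte), writes them out for the HFile-producing reducer, and rolls back. The table, columns, connection URL, and record parsing are placeholders, and the exact Phoenix signatures may differ slightly between versions.

// Rough sketch of a mapper in the spirit of CsvToKeyValueMapper, adapted for non-CSV input.
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Pair;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.phoenix.util.PhoenixRuntime;

public class PhoenixKeyValueMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

    private Connection conn;
    private PreparedStatement upsert;
    private final ImmutableBytesWritable outputKey = new ImmutableBytesWritable();

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // "zk-host" is a placeholder; a real job would read the quorum from the config.
            conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
            conn.setAutoCommit(false);
            // Dynamic columns are declared inline with their type.
            upsert = conn.prepareStatement(
                    "UPSERT INTO MY_TABLE (ID, EVENT_TS, DYN_COL1 VARCHAR) VALUES (?, ?, ?)");
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // Parse the input record however your format requires (placeholder values here).
            upsert.setString(1, "some-id");
            upsert.setLong(2, System.currentTimeMillis());
            upsert.setString(3, value.toString());
            upsert.executeUpdate();      // buffered client-side, never committed

            // Drain the KeyValues Phoenix generated for the buffered upsert.
            Iterator<Pair<byte[], List<KeyValue>>> it =
                    PhoenixRuntime.getUncommittedDataIterator(conn);
            while (it.hasNext()) {
                for (KeyValue kv : it.next().getSecond()) {
                    outputKey.set(kv.getRowArray(), kv.getRowOffset(), kv.getRowLength());
                    context.write(outputKey, kv);
                }
            }
            conn.rollback();             // discard the buffered mutations; only the KVs were needed
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            if (conn != null) conn.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}

The driver would then configure HFileOutputFormat.configureIncrementalLoad(...) against the target HBase table and finish with LoadIncrementalHFiles, roughly as CsvBulkLoadTool's own job setup does.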
