Hi Constantin,

The issues you're having sound like they're (probably) much more related to MapReduce than to Phoenix. To first determine what the real issue is, could you give a general overview of how your MR job is implemented (or, even better, give me a pointer to it on GitHub or something similar)?
- Gabriel

On Thu, Jan 15, 2015 at 2:19 PM, Ciureanu, Constantin (GfK) <[email protected]> wrote:
> Hello all,
>
> I finished the MR job - for now it has just failed a few times because the mappers hit some odd timeout (600 seconds), apparently not processing anything in the meantime.
> When I check the running mappers, only 3 of them are progressing (quite fast, though - why are only 3 working? I have 6 machines, so 24 tasks can run at the same time).
>
> Could this be because of some limit on the number of connections to Phoenix?
>
> Regards,
> Constantin
>
> -----Original Message-----
> From: Ciureanu, Constantin (GfK) [mailto:[email protected]]
> Sent: Wednesday, January 14, 2015 9:44 AM
> To: [email protected]
> Subject: RE: MapReduce bulk load into Phoenix table
>
> Hello James,
>
> Yes, as low as 1500 rows/sec - using Phoenix JDBC with batch inserts of 1000 records at a time, but there are at least 100 dynamic columns per row.
> I was expecting higher values, of course - but I will soon finish coding an MR job to load the same data using Hadoop.
> The code I read and adapted for my MR job is from your CsvBulkLoadTool. [After finishing it I will test it and then post new speed results.] It basically uses a Phoenix connection for a "dummy upsert", then takes the Key + List<KV> and rolls back the connection - that was my question yesterday, whether there is a better way.
> My new problem is that the CsvUpsertExecutor needs a list of fields, which I don't have since the columns are dynamic (and I don't use a CSV source anyway).
> So it would have been nice to have a "reusable building block of code" for this - I'm sure everyone needs fast and clean template code for loading data into a destination HBase (or Phoenix) table using Phoenix + MR.
> I can create the row key by concatenating my key fields - but I don't know (yet) how to obtain the salting byte(s).
>
> My current test cluster details:
> - 6x dual-core machines (on AWS)
> - more than 100 TB of disk space
> - the table is salted into 8 buckets and has 8 columns common to all rows
>
> Thank you for your answer and for the technical support on this mailing list,
> Constantin
>
> -----Original Message-----
> From: James Taylor [mailto:[email protected]]
> Sent: Tuesday, January 13, 2015 7:23 PM
> To: user
> Subject: Re: MapReduce bulk load into Phoenix table
>
> Hi Constantin,
> 1000-1500 rows per sec? Using our performance.py script, on my Mac laptop, I'm seeing 27,000 rows per sec (Phoenix 4.2.2 with HBase 0.98.9).
>
> If you want to realistically measure performance, I'd recommend doing so on a real cluster. If you'll really only have a single machine, then you're probably better off using something like MySQL. Using the MapReduce-based CSV loader on a single node is not going to speed anything up. For a cluster it can make a difference, though. See http://phoenix.apache.org/phoenix_mr.html
>
> FYI, Phoenix indexes are only maintained if you go through the Phoenix APIs.
>
> Thanks,
> James
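The plain JDBC path described above (batched upserts, committing every 1000 rows) looks roughly like the sketch below. This is only an illustration, not code from the thread: the connection URL, table, and column names are placeholders, and the key point is that with auto-commit off Phoenix buffers each executeUpdate() on the client, so it is the commit() that actually sends a batch of mutations to HBase.

// Minimal sketch of batched upserts over Phoenix JDBC; URL/table/columns are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchUpsertSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host")) {
            conn.setAutoCommit(false);
            String sql = "UPSERT INTO MY_TABLE (ID, EVENT_TS, COL1) VALUES (?, ?, ?)";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                for (long i = 0; i < 100000; i++) {
                    stmt.setString(1, "key-" + i);
                    stmt.setLong(2, System.currentTimeMillis());
                    stmt.setString(3, "value-" + i);
                    stmt.executeUpdate();        // buffered client-side only
                    if ((i + 1) % 1000 == 0) {
                        conn.commit();           // flushes ~1000 buffered rows to HBase
                    }
                }
                conn.commit();                   // flush the remainder
            }
        }
    }
}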
> On Tue, Jan 13, 2015 at 2:45 AM, Vaclav Loffelmann <[email protected]> wrote:
>> I think the easiest way to determine whether indexes are maintained when inserting directly into HBase is to test it. If they are maintained by region observer coprocessors, they should be. (I'll run tests as soon as I have some time.)
>>
>> I don't see any problem with different columns between rows. Make the view the same way you'd make the table definition. Null values are not stored in HBase, hence there's no overhead.
>>
>> I'm afraid there isn't any (publicly available) piece of code showing how to do that, but it is very straightforward.
>> If you use a composite primary key, concatenate the results of multiple PDataType.TYPE.toBytes() calls as the row key. For the values use the same logic. The data types are defined as enums in this class: org.apache.phoenix.schema.PDataType.
>>
>> Good luck,
>> Vaclav
>>
>> On 01/13/2015 10:58 AM, Ciureanu, Constantin (GfK) wrote:
>>> Thank you Vaclav,
>>>
>>> I just started writing some code today :) for an MR job that will load data into HBase + Phoenix. Previously I wrote an application to load data using Phoenix JDBC (slow), but I also have experience with HBase, so I can understand and write code to load data directly there.
>>>
>>> If I do that, I'm also worried about:
>>> - maintaining (any existing) Phoenix indexes - perhaps this still works if the (same) coprocessors trigger at insert time, but I can't tell how it works behind the scenes;
>>> - having a Phoenix view over the HBase table would "solve" the above problem (there would be no index whatsoever) but would create a lot of other problems (my table has a limited number of common columns and the rest differ too much from row to row - in total I have hundreds of possible columns).
>>>
>>> So - to make things faster for me - is there any good piece of code I can find on the internet about how to map my data types to Phoenix data types and use the results in a regular HBase bulk load?
>>>
>>> Regards, Constantin
>>>
>>> -----Original Message-----
>>> From: Vaclav Loffelmann [mailto:[email protected]]
>>> Sent: Tuesday, January 13, 2015 10:30 AM
>>> To: [email protected]
>>> Subject: Re: MapReduce bulk load into Phoenix table
>>>
>>> Hi, our daily usage is to import raw data directly into HBase, but mapped to Phoenix data types. For querying we use a Phoenix view on top of that HBase table.
>>>
>>> With that you should hit the bottleneck of HBase itself. It should be 10 to 30+ times faster than your current solution, depending on the hardware of course.
>>>
>>> I'd prefer this solution for stream writes.
>>>
>>> Vaclav
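A rough sketch of the approach Vaclav describes (encode the composite key and the values with Phoenix's type system, write directly through the HBase client, and query through a Phoenix view with matching column types) might look like the following. It assumes the PDataType enum referenced above (newer Phoenix releases moved the types to org.apache.phoenix.schema.types, e.g. PVarchar.INSTANCE), HBase 0.98-era client classes, and made-up table, column family, and column names; a salted target table would additionally need the leading salt byte.

// Rough sketch: Phoenix-encoded composite row key and values written via the HBase API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.phoenix.query.QueryConstants;
import org.apache.phoenix.schema.PDataType;

public class DirectHBaseWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (HTable table = new HTable(conf, "MY_TABLE")) {
            // Composite row key: Phoenix terminates variable-length key columns
            // (such as VARCHAR) with a zero byte when they are not the last column.
            byte[] rowKey = Bytes.add(
                    PDataType.VARCHAR.toBytes("customer-42"),
                    new byte[] { QueryConstants.SEPARATOR_BYTE },
                    PDataType.UNSIGNED_LONG.toBytes(20150113L));

            Put put = new Put(rowKey);
            // Non-PK columns, encoded with the same Phoenix types the view declares
            // ("0" is Phoenix's default column family).
            put.add(Bytes.toBytes("0"), Bytes.toBytes("VAL1"),
                    PDataType.VARCHAR.toBytes("some value"));
            put.add(Bytes.toBytes("0"), Bytes.toBytes("VAL2"),
                    PDataType.DECIMAL.toBytes(java.math.BigDecimal.valueOf(12.5)));
            table.put(put);
        }
    }
}

For querying, the matching Phoenix view would declare the same primary key columns and types (here CUSTOMER VARCHAR and EVENT_TS UNSIGNED_LONG) plus the "0".VAL1 and "0".VAL2 columns over the existing table.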
>>> On 01/13/2015 10:12 AM, Ciureanu, Constantin (GfK) wrote:
>>>> Hello all,
>>>>
>>>> (Due to the slow speed of Phoenix JDBC - on a single machine ~1000-1500 rows/sec) I am also reading up on loading data into Phoenix via MapReduce.
>>>>
>>>> So far I have understood that the Key + List<[Key,Value]> to be inserted into the HBase table is obtained via a "dummy" Phoenix connection - then those rows are written to HFiles (and after the MR job finishes, those HFiles are bulk loaded into HBase in the usual way).
>>>>
>>>> My question: is there any better / faster approach? I assume this cannot reach the maximum speed for loading data into a Phoenix / HBase table.
>>>>
>>>> Also, I would like to find better / newer sample code than this one:
>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
>>>>
>>>> Thank you, Constantin
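For anyone adapting CsvBulkLoadTool to non-CSV input with dynamic columns, the mapper side of the "dummy upsert" technique discussed in this thread is roughly the sketch below. It keeps a Phoenix connection with auto-commit off, issues an UPSERT per input record (dynamic columns can be declared inline in the UPSERT column list), drains the uncommitted KeyValues that Phoenix generated (for a salted table their row keys already carry the salt byte), writes them out for the HFile-producing reducer, and rolls back. The table, columns, connection URL, and record parsing are placeholders, and the exact Phoenix signatures may differ slightly between versions.

// Rough sketch of a mapper in the spirit of CsvToKeyValueMapper, adapted for non-CSV input.
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Pair;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.phoenix.util.PhoenixRuntime;

public class PhoenixKeyValueMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

    private Connection conn;
    private PreparedStatement upsert;
    private final ImmutableBytesWritable outputKey = new ImmutableBytesWritable();

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // "zk-host" is a placeholder; a real job would read the quorum from the config.
            conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
            conn.setAutoCommit(false);
            // Dynamic columns are declared inline with their type.
            upsert = conn.prepareStatement(
                    "UPSERT INTO MY_TABLE (ID, EVENT_TS, DYN_COL1 VARCHAR) VALUES (?, ?, ?)");
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // Parse the input record however your format requires (placeholder values here).
            upsert.setString(1, "some-id");
            upsert.setLong(2, System.currentTimeMillis());
            upsert.setString(3, value.toString());
            upsert.executeUpdate();      // buffered client-side, never committed

            // Drain the KeyValues Phoenix generated for the buffered upsert.
            Iterator<Pair<byte[], List<KeyValue>>> it =
                    PhoenixRuntime.getUncommittedDataIterator(conn);
            while (it.hasNext()) {
                for (KeyValue kv : it.next().getSecond()) {
                    outputKey.set(kv.getRowArray(), kv.getRowOffset(), kv.getRowLength());
                    context.write(outputKey, kv);
                }
            }
            conn.rollback();             // discard the buffered mutations; only the KVs were needed
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            if (conn != null) conn.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}

The driver would then configure HFileOutputFormat.configureIncrementalLoad(...) against the target HBase table and finish with LoadIncrementalHFiles, roughly as CsvBulkLoadTool's own job setup does.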
