Hi Constantin,

The issues you're having sound like they're (probably) much more
related to MapReduce than to Phoenix. In order to first determine what
the real issue is, could you give a general overview of how your MR
job is implemented (or even better, give me a pointer to it on GitHub
or something similar)?

- Gabriel


On Thu, Jan 15, 2015 at 2:19 PM, Ciureanu, Constantin (GfK)
<[email protected]> wrote:
> Hello all,
>
> I finished the MR job - for now it has just failed a few times because the 
> mappers hit a timeout (600 seconds), apparently without processing anything in 
> the meantime.
> When I check the running mappers, only 3 of them are making progress (quite 
> fast, though - why are only 3 working? I have 6 machines, so 24 tasks can run 
> at the same time).
>
> Could this be caused by some limit on the number of connections to Phoenix?
>
> Regards,
>   Constantin
>
>
> -----Original Message-----
> From: Ciureanu, Constantin (GfK) [mailto:[email protected]]
> Sent: Wednesday, January 14, 2015 9:44 AM
> To: [email protected]
> Subject: RE: MapReduce bulk load into Phoenix table
>
> Hello James,
>
> Yes, as low as 1,500 rows/sec -> using Phoenix JDBC with batch upserts of 
> 1,000 records at a time, but there are at least 100 dynamic columns per row.
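>
> For context, the load loop has roughly this shape (a simplified sketch; the 
> connection URL, table name, dynamic column and record source are made up):
>
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.PreparedStatement;
> import java.util.Collections;
>
> // Simplified shape of the JDBC loader: dynamic columns are declared with
> // their type in the UPSERT column list, and a commit is issued every 1,000
> // upserts (Phoenix buffers the mutations client-side until commit).
> public class JdbcLoadSketch {
>     public static void main(String[] args) throws Exception {
>         try (Connection conn =
>                 DriverManager.getConnection("jdbc:phoenix:my-zk-quorum")) { // placeholder URL
>             conn.setAutoCommit(false);
>             String sql = "UPSERT INTO MY_TABLE (PK, DYN_COL1 VARCHAR) VALUES (?, ?)"; // illustrative
>             int count = 0;
>             try (PreparedStatement ps = conn.prepareStatement(sql)) {
>                 for (String[] record : readRecords()) {   // stand-in for my real source
>                     ps.setString(1, record[0]);
>                     ps.setString(2, record[1]);
>                     ps.execute();
>                     if (++count % 1000 == 0) {
>                         conn.commit();                    // flush a batch of 1,000 rows
>                     }
>                 }
>             }
>             conn.commit();                                // flush the remainder
>         }
>     }
>
>     private static Iterable<String[]> readRecords() {
>         return Collections.emptyList();                   // placeholder for the real input
>     }
> }
>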
> I was of course expecting higher throughput - but I will soon finish coding an 
> MR job to load the same data using Hadoop.
> The code I read and adapted for my MR job is from your CsvBulkLoadTool. [After 
> finishing it I will test it and post new speed results.] It basically uses a 
> Phoenix connection to do a "dummy upsert", takes the resulting Key + List<KV>, 
> and then rolls back the connection - that was my question yesterday, whether 
> there is a better way.
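> The core of what I adapted looks roughly like this (a trimmed-down sketch of 
> the pattern in CsvToKeyValueMapper; the JDBC URL and the UPSERT are 
> placeholders, since my real column list is built dynamically per record):
>
> import java.io.IOException;
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.PreparedStatement;
> import java.sql.SQLException;
> import java.util.Iterator;
> import java.util.List;
>
> import org.apache.hadoop.hbase.KeyValue;
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> import org.apache.hadoop.hbase.util.Pair;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.phoenix.util.PhoenixRuntime;
>
> public class DynamicColumnMapper
>         extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
>
>     private Connection conn;
>     private final ImmutableBytesWritable outputKey = new ImmutableBytesWritable();
>
>     @Override
>     protected void setup(Context context) throws IOException {
>         try {
>             // Mutations stay buffered client-side until commit/rollback.
>             conn = DriverManager.getConnection("jdbc:phoenix:my-zk-quorum"); // placeholder
>             conn.setAutoCommit(false);
>         } catch (SQLException e) {
>             throw new IOException(e);
>         }
>     }
>
>     @Override
>     protected void map(LongWritable key, Text line, Context context)
>             throws IOException, InterruptedException {
>         try {
>             // "Dummy" upsert: placeholder SQL, the real statement is built
>             // per record from my dynamic columns.
>             try (PreparedStatement ps = conn.prepareStatement(
>                     "UPSERT INTO MY_TABLE (PK, COL1) VALUES (?, ?)")) {
>                 ps.setString(1, "key-from-" + line);
>                 ps.setString(2, "value");
>                 ps.execute();
>             }
>             // Harvest the KeyValues Phoenix would have written ...
>             Iterator<Pair<byte[], List<KeyValue>>> it =
>                     PhoenixRuntime.getUncommittedDataIterator(conn);
>             while (it.hasNext()) {
>                 for (KeyValue kv : it.next().getSecond()) {
>                     outputKey.set(kv.getBuffer(), kv.getRowOffset(), kv.getRowLength());
>                     context.write(outputKey, kv);   // feeds the HFile output format
>                 }
>             }
>             // ... then discard the buffered mutations instead of committing.
>             conn.rollback();
>         } catch (SQLException e) {
>             throw new IOException(e);
>         }
>     }
> }
>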
> My new problem is that the CsvUpsertExecutor needs a list of fields, which I 
> don't have since my columns are dynamic (and I am not using a CSV source 
> anyway).
> So it would have been nice to have a "reusable building block of code" for 
> this - I'm sure everyone needs a fast and clean template for loading data into 
> a destination HBase (or Phoenix) table using Phoenix + MR.
> I can create the row key by concatenating my key fields - but I don't know 
> (yet) how to obtain the salting byte(s) correctly.
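> From browsing the Phoenix source, org.apache.phoenix.schema.SaltingUtil looks 
> like it might be what I'm missing - an untested sketch (the bucket count 
> matches my table, everything else is illustrative):
>
> import org.apache.hadoop.hbase.util.Bytes;
> import org.apache.phoenix.schema.SaltingUtil;
>
> public class SaltSketch {
>     private static final int NUM_BUCKETS = 8;   // table was created with SALT_BUCKETS = 8
>
>     /** Prepend Phoenix's salt byte to a hand-built (unsalted) row key. */
>     static byte[] saltKey(byte[] unsaltedKey) {
>         // The salt byte is a hash of the key bytes modulo the bucket count.
>         byte salt = SaltingUtil.getSaltingByte(
>                 unsaltedKey, 0, unsaltedKey.length, NUM_BUCKETS);
>         return Bytes.add(new byte[] { salt }, unsaltedKey);
>     }
> }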
>
> My current test cluster details:
> - 6x dualcore machines (on AWS)
> - more than 100 TB disk space
> - the table is salted into 8 buckets and has 8 columns common to all rows
>
> Thank you for your answer and technical support on this mailing list,
> Constantin
>
> -----Original Message-----
> From: James Taylor [mailto:[email protected]]
> Sent: Tuesday, January 13, 2015 7:23 PM
> To: user
> Subject: Re: MapReduce bulk load into Phoenix table
>
> Hi Constantin,
> 1000-1500 rows per sec? Using our performance.py script, on my Mac laptop, 
> I'm seeing 27,000 rows per sec (Phoenix 4.2.2 with HBase 0.98.9).
>
> If you want to realistically measure performance, I'd recommend doing so on a 
> real cluster. If you'll really only have a single machine, then you're 
> probably better off using something like MySQL. Using the map-reduce based 
> CSV loader on a single node is not going to speed anything up. For a cluster 
> it can make a difference, though. See 
> http://phoenix.apache.org/phoenix_mr.html
>
> FYI, Phoenix indexes are only maintained if you go through Phoenix APIs.
>
> Thanks,
> James
>
>
> On Tue, Jan 13, 2015 at 2:45 AM, Vaclav Loffelmann 
> <[email protected]> wrote:
>>
>> I think the easiest way to determine whether indexes are maintained
>> when inserting directly into HBase is to test it. If index maintenance is
>> done by region observer coprocessors, it should work. (I'll run some tests
>> as soon as I have time.)
>>
>> I don't see any problem with different columns between rows.
>> Define the view the same way you would define the table. Null values are
>> not stored in HBase, so there's no overhead.
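>> Roughly like this (names are illustrative; the column types just have to
>> match how the bytes were written into HBase):
>>
>> import java.sql.Connection;
>> import java.sql.DriverManager;
>> import java.sql.Statement;
>>
>> // Sketch: a Phoenix view over an existing HBase table, for querying only.
>> public class CreateViewSketch {
>>     public static void main(String[] args) throws Exception {
>>         try (Connection conn =
>>                      DriverManager.getConnection("jdbc:phoenix:my-zk-quorum"); // placeholder
>>              Statement stmt = conn.createStatement()) {
>>             stmt.execute(
>>                 "CREATE VIEW \"my_hbase_table\" ("
>>                 + " pk VARCHAR PRIMARY KEY,"            // maps to the HBase row key
>>                 + " \"cf\".\"col_a\" VARCHAR,"
>>                 + " \"cf\".\"col_b\" UNSIGNED_LONG)");  // unset columns cost nothing per row
>>         }
>>     }
>> }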
>>
>> I'm afraid there isn't any (publicly available) piece of code showing how
>> to do that, but it is very straightforward.
>> If you use a composite primary key, concatenate the results of
>> PDataType.TYPE.toBytes() to form the row key. Use the same logic for the
>> values. The data types are defined as enums in this class:
>> org.apache.phoenix.schema.PDataType.
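>> For example, something like this (a sketch against the 4.0 API; the field
>> names are illustrative and the separator handling is how I understand
>> Phoenix encodes variable-length key parts):
>>
>> import org.apache.hadoop.hbase.util.Bytes;
>> import org.apache.phoenix.query.QueryConstants;
>> import org.apache.phoenix.schema.PDataType;
>>
>> // Sketch: encode a two-part composite row key the way Phoenix does -
>> // each part through its PDataType, with a zero separator byte after the
>> // variable-length VARCHAR part.
>> public class RowKeySketch {
>>     static byte[] compositeKey(String country, long panelId) {
>>         byte[] countryBytes = PDataType.VARCHAR.toBytes(country);
>>         byte[] panelBytes = PDataType.UNSIGNED_LONG.toBytes(panelId);
>>         return Bytes.add(countryBytes,
>>                 new byte[] { QueryConstants.SEPARATOR_BYTE },
>>                 panelBytes);
>>     }
>> }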
>>
>> Good luck,
>> Vaclav;
>>
>> On 01/13/2015 10:58 AM, Ciureanu, Constantin (GfK) wrote:
>>> Thank you Vaclav,
>>>
>>> I just started writing some code today :) for an MR job that will
>>> load data into HBase + Phoenix. Previously I wrote an application
>>> to load data via Phoenix JDBC (slow), but I also have experience
>>> with HBase, so I can understand and write code to load data directly
>>> there.
>>>
>>> If I go that route, I'm also worried about:
>>> - maintaining (any existing) Phoenix indexes - perhaps this still works if
>>> the (same) coprocessors are triggered at insert time, but I can't tell how
>>> it works behind the scenes;
>>> - a Phoenix view around the HBase table would "solve" the above problem (so
>>> there is no index at all), but it would create a lot of other problems (my
>>> table has a limited number of common columns and the rest differ too much
>>> from row to row - in total I have hundreds of possible columns).
>>>
>>> So - to make things faster for me - is there any good piece of code on the
>>> internet showing how to map my data types to Phoenix data types and use the
>>> results in a regular HBase bulk load?
>>>
>>> Regards, Constantin
>>>
>>> -----Original Message-----
>>> From: Vaclav Loffelmann [mailto:[email protected]]
>>> Sent: Tuesday, January 13, 2015 10:30 AM
>>> To: [email protected]
>>> Subject: Re: MapReduce bulk load into Phoenix table
>>>
>>> Hi, our daily usage is to import raw data directly into HBase, but mapped
>>> to Phoenix data types. For querying we use a Phoenix view on top of that
>>> HBase table.
>>>
>>> That way you should hit the bottleneck of HBase itself. It should be 10 to
>>> 30+ times faster than your current solution, depending on hardware of
>>> course.
>>>
>>> I'd prefer this solution for stream writes.
>>>
>>> Vaclav
>>>
>>> On 01/13/2015 10:12 AM, Ciureanu, Constantin (GfK) wrote:
>>>> Hello all,
>>>
>>>> (Due to the slow speed of Phoenix JDBC - ~1,000-1,500 rows/sec on a single
>>>> machine) I am also reading up on how to load data into Phoenix via
>>>> MapReduce.
>>>
>>>> So far I understand that the Key + List<KeyValue> to be inserted into the
>>>> HBase table is obtained via a "dummy" Phoenix connection - those rows are
>>>> then written into HFiles, and after the MR job finishes the HFiles are
>>>> bulk-loaded into HBase as usual.
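>>>> If I understand it correctly, the driver side then has roughly this shape
>>>> (a sketch against the HBase 0.98-era API; the mapper, table name and paths
>>>> are placeholders):
>>>>
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.fs.Path;
>>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>>> import org.apache.hadoop.hbase.KeyValue;
>>>> import org.apache.hadoop.hbase.client.HTable;
>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>>>> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
>>>> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
>>>> import org.apache.hadoop.mapreduce.Job;
>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>>>
>>>> // Sketch of the driver: the mapper emits (row key, KeyValue),
>>>> // HFileOutputFormat writes region-aligned HFiles, and the finished
>>>> // HFiles are bulk-loaded afterwards.
>>>> public class BulkLoadDriverSketch {
>>>>     public static void main(String[] args) throws Exception {
>>>>         Configuration conf = HBaseConfiguration.create();
>>>>         Job job = Job.getInstance(conf, "phoenix-bulk-load");
>>>>         // job.setMapperClass(...);  // the Phoenix "dummy upsert" mapper goes here
>>>>         job.setMapOutputKeyClass(ImmutableBytesWritable.class);
>>>>         job.setMapOutputValueClass(KeyValue.class);
>>>>         FileInputFormat.addInputPath(job, new Path("/data/input"));      // placeholder
>>>>         Path hfileDir = new Path("/tmp/hfiles");                         // placeholder
>>>>         FileOutputFormat.setOutputPath(job, hfileDir);
>>>>         HTable table = new HTable(conf, "MY_TABLE");                     // placeholder
>>>>         HFileOutputFormat.configureIncrementalLoad(job, table);  // reducer + partitioner
>>>>         if (job.waitForCompletion(true)) {
>>>>             new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table); // move HFiles in
>>>>         }
>>>>     }
>>>> }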
>>>
>>>> My question: is there any better / faster approach? I assume this is not
>>>> the fastest possible way to load data into a Phoenix / HBase table.
>>>
>>>> Also I would like to find a better / newer code sample than this one:
>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
>>>
>>>> Thank you, Constantin
>>>
>>>
