Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-23 Thread la...@apache.org
 100% agreement.
A bit worried about "boiling the ocean" and risking not getting anything done.
Speaking of modules: I would *love* it if we had a simple HBase abstraction API 
and then a module for each version of HBase, rather than a different branch for 
each. Most differences are presumably in coprocessor APIs, which should be able 
to be "wrapped away" with some indirection layer.
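A minimal sketch of the indirection layer described above. All names here (RegionCompat, FixedRegionCompat, the idea of a phoenix-hbase-1.x/2.x module) are hypothetical illustrations, not existing Phoenix code:

```java
// Hypothetical sketch: a version-agnostic facade over the few coprocessor-side
// details Phoenix needs, so version-specific code lives in one module per
// HBase release instead of one branch per release.
public class CompatDemo {
    /** Version-agnostic view of region details (names invented). */
    interface RegionCompat {
        byte[] startKey();
        long maxStoreFileTimestamp();
    }

    /** One adapter like this would live in a per-HBase-version module. */
    static final class FixedRegionCompat implements RegionCompat {
        private final byte[] start;
        private final long ts;
        FixedRegionCompat(byte[] start, long ts) { this.start = start; this.ts = ts; }
        public byte[] startKey() { return start; }
        public long maxStoreFileTimestamp() { return ts; }
    }

    public static void main(String[] args) {
        // Core code programs only against RegionCompat, never an HBase version.
        RegionCompat region = new FixedRegionCompat(new byte[] {0x01}, 42L);
        System.out.println(region.maxStoreFileTimestamp());
    }
}
```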

-- Lars

On Monday, September 17, 2018, 8:52:58 AM PDT, Josh Elser 
 wrote:  
 
 Maybe an implementation detail, but I'm a fan of having a devoted Maven 
module for the "client-facing" API as opposed to an annotation-based 
approach. I find a separate module helps to catch problematic API design 
faster, and makes it crystal clear what users should (and should not) be 
relying upon.
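For contrast, a toy sketch of the annotation-based alternative mentioned above, modeled loosely on HBase's InterfaceAudience idea. The annotation below is invented for illustration and is not a Phoenix or HBase API:

```java
// Sketch of the annotation-based approach: tooling can *inspect* the marker,
// but nothing stops a user from touching unannotated internals. A separate
// module enforces the boundary at compile time, because internals simply are
// not on the client classpath.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class ApiAnnotationDemo {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface PublicApi {}   // hypothetical marker annotation

    @PublicApi
    static class StableThing {}   // intended for users

    static class InternalThing {} // implementation detail

    public static void main(String[] args) {
        System.out.println(StableThing.class.isAnnotationPresent(PublicApi.class));
        System.out.println(InternalThing.class.isAnnotationPresent(PublicApi.class));
    }
}
```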

On 9/17/18 1:00 AM, la...@apache.org wrote:
>  I think we can start by implementing a tighter integration with Spark 
> through DataSource V2. That would make it quickly apparent what parts of 
> Phoenix would need direct access.
> Some parts just need an interface audience declaration (like Phoenix's basic 
> type system) and our agreement that we will change those only according to 
> semantic versioning. Others (like the query plan) will need a bit more 
> thinking. Maybe that's the path to hook in Calcite - just making that part up 
> as I write this...
> Perhaps turning the HBase interface into an API might not be so difficult 
> either. That would perhaps be a new - strictly additional - client API.
> 
> A good Spark interface is in everybody's interest and I think is the best 
> avenue to figure out what's missing/needed.
> -- Lars
> 
>      On Wednesday, September 12, 2018, 12:47:21 PM PDT, Josh Elser 
> wrote:
>  
>  I like it, Lars. I like it very much.
> 
> Just the easy part of doing it... ;)
> 
> On 9/11/18 4:53 PM, la...@apache.org wrote:
>>   Sorry for coming a bit late to this. I've been thinking along these 
>> lines for a bit.
>> It seems Phoenix serves 4 distinct purposes:
>> 1. Query parsing and compiling.
>> 2. A type system.
>> 3. Query execution.
>> 4. An efficient HBase interface.
>> Each of these is useful by itself, but we do not expose them as stable 
>> interfaces. We have seen a lot of need to tie HBase into "higher level" 
>> services, such as Spark (and Presto, etc.).
>> I think we can get a long way if we separate at least #1 (SQL) from the 
>> rest: #2, #3, and #4 (Typed HBase Interface - THI).
>> Phoenix is used via SQL (#1); other tools such as Presto, Impala, Drill, 
>> Spark, etc., can interface efficiently with HBase via THI (#2, #3, and #4).
>> Thoughts?
>> -- Lars
>>        On Monday, August 27, 2018, 11:03:33 AM PDT, Josh Elser 
>> wrote:
>>    
>>    (bcc: dev@hbase, in case folks there have been waiting for me to send
>> this email to dev@phoenix)
>>
>> Hi,
>>
>> In case you missed it, there was an HBaseCon event held in Asia
>> recently. Stack took some great notes and shared them with the HBase
>> community. A few of them touched on Phoenix, directly or in a related
>> manner. I think they are good "criticisms" that are beneficial for us to
>> hear.
>>
>> 1. The phoenix-$version-client.jar size is prohibitively large
>>
>> In this day and age, I'm surprised that this is a big issue for people.
>> I know we have a lot of cruft, most of which comes from Hadoop. We have
>> gotten better here over recent releases, but I would guess that there is
>> more we can do.
>>
>> 2. Can Phoenix be the de-facto schema for SQL on HBase?
>>
>> We've long asserted "if you have to ask how Phoenix serializes data, you
>> shouldn't be doing it" (a nod that you have to write lots of code). What if
>> we turn that on its head? Could we extract our PDataType serialization,
>> composite row-key, column encoding, etc into a minimal API that folks
>> with their own itches can use?
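The serialization idea above can be illustrated with a toy, self-contained sketch. This is not Phoenix's actual wire format; it only demonstrates the key property such a minimal API would expose, namely that encoded bytes sort in the same order as the values (which Phoenix's PDataType encodings preserve so HBase row keys scan in value order):

```java
// Toy order-preserving INTEGER encoding: big-endian bytes with the sign bit
// flipped, so negative values sort before positive ones under HBase's
// unsigned byte-wise comparison.
import java.util.Arrays;

public class OrderPreservingIntDemo {
    static byte[] encodeInt(int v) {
        int flipped = v ^ 0x80000000; // flip sign bit
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),  (byte) flipped };
    }

    public static void main(String[] args) {
        byte[] a = encodeInt(-5), b = encodeInt(7);
        // Unsigned lexicographic comparison, as HBase compares row keys.
        int cmp = Arrays.compareUnsigned(a, b);
        System.out.println(cmp < 0); // -5 sorts before 7
    }
}
```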
>>
>> With the growing integrations into Phoenix, we could embrace them by
>> providing an API to make what they're doing easier. In the same vein, we
>> cement ourselves as a cornerstone of doing it "correctly".
>>
>> 3. Better recommendations to users to not attempt certain queries.
>>
>> We definitively know that there are certain types of queries that
>> Phoenix cannot support well (compared to optimal Phoenix use-cases).
>> Users very commonly fall into such pitfalls on their own and this leaves
>> a bad taste in their mouth (thinking that the product "stinks").
>>
>> Can we do a better job of telling the user when and why it happened?
>> What would such a user-interaction model look like? Can we supplement
>> the "why" with instructions of what to do differently (even if in the
>> abstract)?
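As a sketch of what such a user-facing guardrail could look like, here is a toy check. The heuristic and the messages are invented for illustration and are not Phoenix behavior:

```java
// Toy guardrail: detect an obviously unbounded query and explain *why* it is
// rejected, plus what to do differently. Real detection would use the query
// plan, not string matching.
public class QueryGuardrailDemo {
    static String check(String sql) {
        String s = sql.trim().toUpperCase();
        if (s.startsWith("SELECT") && !s.contains(" WHERE ")) {
            return "REJECTED: full-table scan; add a WHERE clause on the leading PK column";
        }
        return "OK";
    }

    public static void main(String[] args) {
        System.out.println(check("SELECT * FROM big_table"));
        System.out.println(check("SELECT * FROM big_table WHERE id = 1"));
    }
}
```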
>>
>> 4. Phoenix-Calcite
>>
>> This was mentioned as a "nice to have". From what I understand, there
>> was nothing explicitly wrong with the implementation or approach, just
>> that it was a massive undertaking to continue with little immediate
>> gain. Would this be a boon for us to try to continue in some form? Are
>> there steps we can take that would help push us along the 

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-23 Thread la...@apache.org
I will point out that the thin client can have advantages too:
1. The query engine pool can be sized independently of clients and region servers.
2. Appropriate machine/VM configurations can be picked that are optimal for 
query execution.
3. Clients and (region) servers can be upgraded entirely independently.
As Josh pointed out... As an HBase committer and PMC member, I am unaware of any 
"native SQL" aspirations for HBase.

   On Wednesday, September 19, 2018, 11:07:09 AM PDT, Josh Elser 
 wrote:  
 
 On 9/18/18 12:08 PM, Jaanai Zhang wrote:
>>
>> I don't understand what performance issues you think exist based solely on
>> the above. Those numbers appear to be precisely in line with my
>> expectations. Can you please describe what issues you think exist?
>>
> 
> 1. The thick client performs roughly 1-4x better than the thin client, and
> the thin client's performance degrades as concurrency increases. For some
> web-server applications, this is not enough.
> 2. An HA thin client.
> 3. An SQL audit function.
> 
> A lot of developers like using the thin client, which has a lower
> maintenance cost on the client side. Sorry, that's all that comes to me. :)

The thin-client is always doing more work to execute the same query as 
the thick-client (shipping results to/from PQS), so it shouldn't be 
surprising that the thin-client is slower. This is the trade-off to do 
less in the client and also provide a well-defined API for other languages 
to talk to PQS.
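For reference, the two connection styles under discussion, side by side. The URL formats follow the Phoenix documentation; the host names are placeholders, and no connection is actually opened here:

```java
// Thick vs thin client JDBC URLs (hosts are made up; nothing connects).
public class JdbcUrlDemo {
    public static void main(String[] args) {
        // Thick client: talks to ZooKeeper/HBase directly and runs query
        // logic (scans, merges, aggregation coordination) in-process.
        String thick = "jdbc:phoenix:zk1,zk2,zk3:2181";

        // Thin client: ships Avatica/protobuf requests over HTTP to PQS,
        // which does the heavy lifting on the server side.
        String thin = "jdbc:phoenix:thin:url=http://pqs-host:8765;serialization=PROTOBUF";

        // A real program would call DriverManager.getConnection(thick) or
        // DriverManager.getConnection(thin) against a running cluster.
        System.out.println(thick.startsWith("jdbc:phoenix:"));
        System.out.println(thin.contains("thin:url="));
    }
}
```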

Out of curiosity, did you increase the JVM heap for PQS or increase 
configuration property defaults for PQS to account for the increased 
concurrency?

>> Please be more specific. Asking for "more documentation" doesn't help us
>> actually turn this around into more documentation. What are the specific
>> pain points you have experienced? What topics do you want to know more
>> about? Be as specific as possible.
>>
> 
> About documents:
> 1. I think we could add documents about migration tools and migration cases,
> since many users migrate from an RDBMS (MySQL/PG/SQL Server) to Phoenix for
> non-transactional applications.
> 2. How to design PKs or indexes.
> 
> About pain points:
> Stability is a big problem. Most people use Phoenix as a common RDBMS and
> run queries casually; they don't know why the server crashes when a full
> table scan is executed. So defining the usage boundary of Phoenix is
> important: reject certain queries and report the reason to the user's client.

A migration document would be great. Something that can supplement the 
existing "Quick Start" document.

What kind of points would you want to have centralized about designing 
PK's or indexes?

>> Are you referring to the hbase-spark (and thus, Spark SQL) integration? Or
>> something that some company is building?
>>
> 
> Some companies are building on Spark SQL to access Phoenix to support
> OLAP and OLTP requirements. It produces a heavy load on the HBase cluster
> when Spark reads Phoenix tables. My co-workers want to read the HFiles of
> Phoenix tables directly for some offline business, but that depends on a
> more flexible Phoenix API.

Just be wary of calling it "HBase native SQL", as this implies that it is 
something that is a part of Apache HBase (which is not the case).

I doubt anyone in the Phoenix community would take offense to saying 
that a basic read/write SQL-esque language on top of HBase would be much 
simpler and faster than Phoenix is now. The value that Phoenix provides 
is a _robust_ SQL implementation and consistent secondary indexing 
support. Going beyond a "SQL skin" and implementing a database 
management system is where Phoenix excels above the rest.

> I also got some feedback that some features are important for users. For
> example, "alter table modify column" can avoid reloading data, which is an
> expensive operation for a massive table. I have uploaded patches to JIRA
> (PHOENIX-4815), but nobody has responded to me :(.

You should already know that we're all volunteers here, with our own 
responsibilities. You can ask for assistance/help in reviews, but, as 
always, be respectful of everyone's time. This goes for code reviews, as 
well as new documentation.

> Now I am devoting myself to developing chaos tests and PQS stability (this
> was developed on my company's branch; these patches will be contributed to
> the community after running stably). If you have any suggestions, please
> tell me what you're thinking. I would appreciate your reply.

Would be happy to see what you create.
  

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-19 Thread Josh Elser

On 9/18/18 12:08 PM, Jaanai Zhang wrote:


I don't understand what performance issues you think exist based solely on
the above. Those numbers appear to be precisely in line with my
expectations. Can you please describe what issues you think exist?



1. The thick client performs roughly 1-4x better than the thin client, and
the thin client's performance degrades as concurrency increases. For some
web-server applications, this is not enough.
2. An HA thin client.
3. An SQL audit function.

A lot of developers like using the thin client, which has a lower
maintenance cost on the client side. Sorry, that's all that comes to me. :)


The thin-client is always doing more work to execute the same query as 
the thick-client (shipping results to/from PQS), so it shouldn't be 
surprising that the thin-client is slower. This is the trade-off to do 
less in the client and also provide a well-defined API for other languages 
to talk to PQS.


Out of curiosity, did you increase the JVM heap for PQS or increase 
configuration property defaults for PQS to account for the increased 
concurrency?



Please be more specific. Asking for "more documentation" doesn't help us

actually turn this around into more documentation. What are the specific
pain points you have experienced? What topics do you want to know more
about? Be as specific as possible.



About documents:
1. I think we could add documents about migration tools and migration cases,
since many users migrate from an RDBMS (MySQL/PG/SQL Server) to Phoenix for
non-transactional applications.
2. How to design PKs or indexes.

About pain points:
Stability is a big problem. Most people use Phoenix as a common RDBMS and run
queries casually; they don't know why the server crashes when a full table
scan is executed. So defining the usage boundary of Phoenix is important:
reject certain queries and report the reason to the user's client.


A migration document would be great. Something that can supplement the 
existing "Quick Start" document.


What kind of points would you want to have centralized about designing 
PK's or indexes?



Are you referring to the hbase-spark (and thus, Spark SQL) integration? Or

something that some company is building?



Some companies are building on Spark SQL to access Phoenix to support
OLAP and OLTP requirements. It produces a heavy load on the HBase cluster
when Spark reads Phoenix tables. My co-workers want to read the HFiles of
Phoenix tables directly for some offline business, but that depends on a
more flexible Phoenix API.


Just be wary of calling it "HBase native SQL", as this implies that it is 
something that is a part of Apache HBase (which is not the case).


I doubt anyone in the Phoenix community would take offense to saying 
that a basic read/write SQL-esque language on top of HBase would be much 
simpler and faster than Phoenix is now. The value that Phoenix provides 
is a _robust_ SQL implementation and consistent secondary indexing 
support. Going beyond a "SQL skin" and implementing a database 
management system is where Phoenix excels above the rest.



I also got some feedback that some features are important for users. For
example, "alter table modify column" can avoid reloading data, which is an
expensive operation for a massive table. I have uploaded patches to JIRA
(PHOENIX-4815), but nobody has responded to me :(.


You should already know that we're all volunteers here, with our own 
responsibilities. You can ask for assistance/help in reviews, but, as 
always, be respectful of everyone's time. This goes for code reviews, as 
well as new documentation.



Now I am devoting myself to developing chaos tests and PQS stability (this
was developed on my company's branch; these patches will be contributed to
the community after running stably). If you have any suggestions, please
tell me what you're thinking. I would appreciate your reply.


Would be happy to see what you create.


Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-19 Thread Jaanai Zhang
>
> How about submitting a patch to HBase to modify
> https://hbase.apache.org/poweredbyhbase.html ? :)


It will be hard for folks to find if we add Phoenix's link to that page.
Maybe the home page of HBase is a good place... but that needs the approval
of the HBase community.


   Jaanai Zhang
   Best regards!



Jaanai Zhang wrote on Wednesday, September 19, 2018, at 12:08 AM:

> I don't understand what performance issues you think exist based solely on
>> the above. Those numbers appear to be precisely in line with my
>> expectations. Can you please describe what issues you think exist?
>>
>
> 1. The thick client performs roughly 1-4x better than the thin client, and
> the thin client's performance degrades as concurrency increases. For some
> web-server applications, this is not enough.
> 2. An HA thin client.
> 3. An SQL audit function.
>
> A lot of developers like using the thin client, which has a lower
> maintenance cost on the client side. Sorry, that's all that comes to me. :)
>
> Please be more specific. Asking for "more documentation" doesn't help us
>> actually turn this around into more documentation. What are the specific
>> pain points you have experienced? What topics do you want to know more
>> about? Be as specific as possible.
>>
>
> About documents:
> 1. I think we could add documents about migration tools and migration cases,
> since many users migrate from an RDBMS (MySQL/PG/SQL Server) to Phoenix for
> non-transactional applications.
> 2. How to design PKs or indexes.
>
> About pain points:
> Stability is a big problem. Most people use Phoenix as a common RDBMS and
> run queries casually; they don't know why the server crashes when a full
> table scan is executed. So defining the usage boundary of Phoenix is
> important: reject certain queries and report the reason to the user's client.
>
> Are you referring to the hbase-spark (and thus, Spark SQL) integration? Or
>> something that some company is building?
>>
>
> Some companies are building on Spark SQL to access Phoenix to support
> OLAP and OLTP requirements. It produces a heavy load on the HBase cluster
> when Spark reads Phoenix tables. My co-workers want to read the HFiles of
> Phoenix tables directly for some offline business, but that depends on a
> more flexible Phoenix API.
>
> I also got some feedback that some features are important for users. For
> example, "alter table modify column" can avoid reloading data, which is an
> expensive operation for a massive table. I have uploaded patches to JIRA
> (PHOENIX-4815), but nobody has responded to me :(.
>
> Now I am devoting myself to developing chaos tests and PQS stability (this
> was developed on my company's branch; these patches will be contributed to
> the community after running stably). If you have any suggestions, please
> tell me what you're thinking. I would appreciate your reply.
>
>
> 
>Jaanai Zhang
>Best regards!
>
>
>
Josh Elser wrote on Tuesday, September 18, 2018, at 12:03 AM:
>
>>
>>
>> On Mon, Sep 17, 2018 at 9:36 AM zhang yun  wrote:
>>
>>> Sorry for replying late. I attended HBaseCon Asia as a speaker and took
>>> some notes. I think Phoenix's pain points are as follows:
>>>
>>> 1. The thick client isn't as popular as the thin client. For some
>>> applications: 1. users need to spend a lot of time resolving dependencies;
>>> 2. users worry about stability, since some calculation operations are
>>> processed within the thick client; 3. some people hope to use clients in
>>> multiple programming languages, such as Go, .NET, Python, etc. Other
>>> benefits: 1. easy to add an SQL audit function; 2. recognize invalid SQL
>>> and report it to the user. As you said, this is definitely a big issue
>>> worth paying more attention to. However, the thick client has some
>>> problems; here is recent test data about performance:
>>>
>>>
>> I don't understand what performance issues you think exist based solely
>> on the above. Those numbers appear to be precisely in line with my
>> expectations. Can you please describe what issues you think exist?
>>
>>
>>> 2. Actually, Phoenix has a higher barrier for beginners than a common
>>> RDBMS: users need to learn HBase before using Phoenix, and most people
>>> don't know how to use it reasonably, so we need more detailed documents
>>> to make Phoenix easier to use.
>>>
>>
>> Please be more specific. Asking for "more documentation" doesn't help us
>> actually turn this around into more documentation. What are the specific
>> pain points you have experienced? What topics do you want to know more
>> about? Be as specific as possible.
>>
>>
>>> 3. HBase 3.0 has a plan for native SQL; does Phoenix have a plan? Also,
>>> many people don't know HBase has an SQL layer called Phoenix, so can we
>>> put the link on the HBase website?
>>>
>>
>> Uh, I have no idea what you're referring to here about "native SQL". I am
>> not aware of any such 

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-18 Thread Jaanai Zhang
>
> I don't understand what performance issues you think exist based solely on
> the above. Those numbers appear to be precisely in line with my
> expectations. Can you please describe what issues you think exist?
>

1. The thick client performs roughly 1-4x better than the thin client, and
the thin client's performance degrades as concurrency increases. For some
web-server applications, this is not enough.
2. An HA thin client.
3. An SQL audit function.

A lot of developers like using the thin client, which has a lower
maintenance cost on the client side. Sorry, that's all that comes to me. :)

Please be more specific. Asking for "more documentation" doesn't help us
> actually turn this around into more documentation. What are the specific
> pain points you have experienced? What topics do you want to know more
> about? Be as specific as possible.
>

About documents:
1. I think we could add documents about migration tools and migration cases,
since many users migrate from an RDBMS (MySQL/PG/SQL Server) to Phoenix for
non-transactional applications.
2. How to design PKs or indexes.

About pain points:
Stability is a big problem. Most people use Phoenix as a common RDBMS and run
queries casually; they don't know why the server crashes when a full table
scan is executed. So defining the usage boundary of Phoenix is important:
reject certain queries and report the reason to the user's client.

Are you referring to the hbase-spark (and thus, Spark SQL) integration? Or
> something that some company is building?
>

Some companies are building on Spark SQL to access Phoenix to support
OLAP and OLTP requirements. It produces a heavy load on the HBase cluster
when Spark reads Phoenix tables. My co-workers want to read the HFiles of
Phoenix tables directly for some offline business, but that depends on a
more flexible Phoenix API.

I also got some feedback that some features are important for users. For
example, "alter table modify column" can avoid reloading data, which is an
expensive operation for a massive table. I have uploaded patches to JIRA
(PHOENIX-4815), but nobody has responded to me :(.

Now I am devoting myself to developing chaos tests and PQS stability (this
was developed on my company's branch; these patches will be contributed to
the community after running stably). If you have any suggestions, please
tell me what you're thinking. I would appreciate your reply.



   Jaanai Zhang
   Best regards!



Josh Elser wrote on Tuesday, September 18, 2018, at 12:03 AM:

>
>
> On Mon, Sep 17, 2018 at 9:36 AM zhang yun  wrote:
>
>> Sorry for replying late. I attended HBaseCon Asia as a speaker and took
>> some notes. I think Phoenix's pain points are as follows:
>>
>> 1. The thick client isn't as popular as the thin client. For some
>> applications: 1. users need to spend a lot of time resolving dependencies;
>> 2. users worry about stability, since some calculation operations are
>> processed within the thick client; 3. some people hope to use clients in
>> multiple programming languages, such as Go, .NET, Python, etc. Other
>> benefits: 1. easy to add an SQL audit function; 2. recognize invalid SQL
>> and report it to the user. As you said, this is definitely a big issue
>> worth paying more attention to. However, the thick client has some
>> problems; here is recent test data about performance:
>>
>>
> I don't understand what performance issues you think exist based solely on
> the above. Those numbers appear to be precisely in line with my
> expectations. Can you please describe what issues you think exist?
>
>
>> 2. Actually, Phoenix has a higher barrier for beginners than a common
>> RDBMS: users need to learn HBase before using Phoenix, and most people
>> don't know how to use it reasonably, so we need more detailed documents
>> to make Phoenix easier to use.
>>
>
> Please be more specific. Asking for "more documentation" doesn't help us
> actually turn this around into more documentation. What are the specific
> pain points you have experienced? What topics do you want to know more
> about? Be as specific as possible.
>
>
>> 3. HBase 3.0 has a plan for native SQL; does Phoenix have a plan? Also,
>> many people don't know HBase has an SQL layer called Phoenix, so can we
>> put the link on the HBase website?
>>
>
> Uh, I have no idea what you're referring to here about "native SQL". I am
> not aware of any such effort that does this solely inside of HBase, nor
> does it seem in line with HBase's "do one thing well" mantra.
>
> Are you referring to the hbase-spark (and thus, Spark SQL) integration? Or
> something that some company is building?
>
> How about submitting a patch to HBase to modify
> https://hbase.apache.org/poweredbyhbase.html ? :)
>
>
>>
>> On 2018/08/27 18:03:30, Josh Elser  wrote:
>> > (bcc: dev@hbase, in case folks there have been waiting for me to send
>> >
>> > this email to dev@phoenix)>
>> >
>> > Hi,>
>> >
>> > In 

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-17 Thread Josh Elser
On Mon, Sep 17, 2018 at 9:36 AM zhang yun  wrote:

> Sorry for replying late. I attended HBaseCon Asia as a speaker and took
> some notes. I think Phoenix's pain points are as follows:
>
> 1. The thick client isn't as popular as the thin client. For some
> applications: 1. users need to spend a lot of time resolving dependencies;
> 2. users worry about stability, since some calculation operations are
> processed within the thick client; 3. some people hope to use clients in
> multiple programming languages, such as Go, .NET, Python, etc. Other
> benefits: 1. easy to add an SQL audit function; 2. recognize invalid SQL
> and report it to the user. As you said, this is definitely a big issue
> worth paying more attention to. However, the thick client has some
> problems; here is recent test data about performance:
>
>
I don't understand what performance issues you think exist based solely on
the above. Those numbers appear to be precisely in line with my
expectations. Can you please describe what issues you think exist?


> 2. Actually, Phoenix has a higher barrier for beginners than a common
> RDBMS: users need to learn HBase before using Phoenix, and most people
> don't know how to use it reasonably, so we need more detailed documents
> to make Phoenix easier to use.
>

Please be more specific. Asking for "more documentation" doesn't help us
actually turn this around into more documentation. What are the specific
pain points you have experienced? What topics do you want to know more
about? Be as specific as possible.


> 3. HBase 3.0 has a plan for native SQL; does Phoenix have a plan? Also,
> many people don't know HBase has an SQL layer called Phoenix, so can we
> put the link on the HBase website?
>

Uh, I have no idea what you're referring to here about "native SQL". I am
not aware of any such effort that does this solely inside of HBase, nor
does it seem in line with HBase's "do one thing well" mantra.

Are you referring to the hbase-spark (and thus, Spark SQL) integration? Or
something that some company is building?

How about submitting a patch to HBase to modify
https://hbase.apache.org/poweredbyhbase.html ? :)


>
> On 2018/08/27 18:03:30, Josh Elser  wrote:
> > (bcc: dev@hbase, in case folks there have been waiting for me to send
> > this email to dev@phoenix)
> >
> > Hi,
> >
> > In case you missed it, there was an HBaseCon event held in Asia
> > recently. Stack took some great notes and shared them with the HBase
> > community. A few of them touched on Phoenix, directly or in a related
> > manner. I think they are good "criticisms" that are beneficial for us to
> > hear.
> >
> > 1. The phoenix-$version-client.jar size is prohibitively large
> >
> > In this day and age, I'm surprised that this is a big issue for people.
> > I know we have a lot of cruft, most of which comes from Hadoop. We have
> > gotten better here over recent releases, but I would guess that there is
> > more we can do.
> >
> > 2. Can Phoenix be the de-facto schema for SQL on HBase?
> >
> > We've long asserted "if you have to ask how Phoenix serializes data, you
> > shouldn't be doing it" (a nod that you have to write lots of code). What if
> > we turn that on its head? Could we extract our PDataType serialization,
> > composite row-key, column encoding, etc. into a minimal API that folks
> > with their own itches can use?
> >
> > With the growing integrations into Phoenix, we could embrace them by
> > providing an API to make what they're doing easier. In the same vein, we
> > cement ourselves as a cornerstone of doing it "correctly".
> >
> > 3. Better recommendations to users to not attempt certain queries.
> >
> > We definitively know that there are certain types of queries that
> > Phoenix cannot support well (compared to optimal Phoenix use-cases).
> > Users very commonly fall into such pitfalls on their own and this leaves
> > a bad taste in their mouth (thinking that the product "stinks").
> >
> > Can we do a better job of telling the user when and why it happened?
> > What would such a user-interaction model look like? Can we supplement
> > the "why" with instructions of what to do differently (even if in the
> > abstract)?
> >
> > 4. Phoenix-Calcite
> >
> > This was mentioned as a "nice to have". From what I understand, there
> > was nothing explicitly wrong with the implementation or approach, just
> > that it was a massive undertaking to continue with little immediate
> > gain. Would this be a boon for us to try to continue in some form? Are
> > there steps we can take that would help push us along the right path?
> >
> > Anyways, I'd love to hear everyone's thoughts. While the concerns were
> > raised at HBaseCon Asia, the suggestions that accompany them here are
> > largely mine ;). Feel free to break them out into their own threads if
> > you think that would be better (or say that you disagree with me --
> > that's cool too)!
> >
> > - Josh
> >
>


Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-17 Thread Josh Elser
Maybe an implementation detail, but I'm a fan of having a devoted Maven 
module for the "client-facing" API as opposed to an annotation-based 
approach. I find a separate module helps to catch problematic API design 
faster, and makes it crystal clear what users should (and should not) be 
relying upon.


On 9/17/18 1:00 AM, la...@apache.org wrote:

  I think we can start by implementing a tighter integration with Spark through 
DataSource V2. That would make it quickly apparent what parts of Phoenix would 
need direct access.
Some parts just need an interface audience declaration (like Phoenix's basic 
type system) and our agreement that we will change those only according to 
semantic versioning. Others (like the query plan) will need a bit more 
thinking. Maybe that's the path to hook in Calcite - just making that part up 
as I write this...
Perhaps turning the HBase interface into an API might not be so difficult 
either. That would perhaps be a new - strictly additional - client API.

A good Spark interface is in everybody's interest and I think is the best 
avenue to figure out what's missing/needed.
-- Lars

 On Wednesday, September 12, 2018, 12:47:21 PM PDT, Josh Elser 
 wrote:
  
  I like it, Lars. I like it very much.


Just the easy part of doing it... ;)

On 9/11/18 4:53 PM, la...@apache.org wrote:

   Sorry for coming a bit late to this. I've been thinking along these lines 
for a bit.
It seems Phoenix serves 4 distinct purposes:
1. Query parsing and compiling.
2. A type system.
3. Query execution.
4. An efficient HBase interface.
Each of these is useful by itself, but we do not expose them as stable 
interfaces. We have seen a lot of need to tie HBase into "higher level" 
services, such as Spark (and Presto, etc.).
I think we can get a long way if we separate at least #1 (SQL) from the rest: 
#2, #3, and #4 (Typed HBase Interface - THI).
Phoenix is used via SQL (#1); other tools such as Presto, Impala, Drill, 
Spark, etc., can interface efficiently with HBase via THI (#2, #3, and #4).
Thoughts?
-- Lars
       On Monday, August 27, 2018, 11:03:33 AM PDT, Josh Elser 
 wrote:
   
   (bcc: dev@hbase, in case folks there have been waiting for me to send

this email to dev@phoenix)

Hi,

In case you missed it, there was an HBaseCon event held in Asia
recently. Stack took some great notes and shared them with the HBase
community. A few of them touched on Phoenix, directly or in a related
manner. I think they are good "criticisms" that are beneficial for us to
hear.

1. The phoenix-$version-client.jar size is prohibitively large

In this day and age, I'm surprised that this is a big issue for people.
I know we have a lot of cruft, most of which comes from Hadoop. We have
gotten better here over recent releases, but I would guess that there is
more we can do.

2. Can Phoenix be the de-facto schema for SQL on HBase?

We've long asserted "if you have to ask how Phoenix serializes data, you
shouldn't be doing it" (a nod that you have to write lots of code). What if
we turn that on its head? Could we extract our PDataType serialization,
composite row-key, column encoding, etc into a minimal API that folks
with their own itches can use?

With the growing integrations into Phoenix, we could embrace them by
providing an API to make what they're doing easier. In the same vein, we
cement ourselves as a cornerstone of doing it "correctly".
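One subtlety behind extracting the serialization layer: Phoenix's encodings are order-preserving, so encoded row keys compare correctly as raw unsigned bytes and HBase range scans still work. Here is a toy illustration of that property (this is not Phoenix's actual PDataType wire format, just the invariant any extracted API would have to keep):

```python
import struct

# Toy order-preserving encoders: encoded bytes must compare (as unsigned
# byte strings) in the same order as the original values, because HBase
# sorts and range-scans on raw row keys. Not Phoenix's real wire format.

def encode_int(v: int) -> bytes:
    # Bias the signed value into the unsigned range so that byte order
    # matches numeric order (negatives sort before positives).
    return struct.pack(">Q", v + (1 << 63))

def encode_str(s: str) -> bytes:
    # Variable-length component terminated by 0x00, so "a" < "ab" holds
    # and a composite key can't confuse column boundaries. Assumes the
    # string itself contains no NUL bytes.
    return s.encode("utf-8") + b"\x00"

def composite_key(*parts: bytes) -> bytes:
    # A composite row key is just the concatenation of encoded parts.
    return b"".join(parts)

# Values listed in logical sort order; their keys must byte-sort the same way.
rows = [(-5, "b"), (3, "a"), (3, "ab"), (100, "")]
keys = [composite_key(encode_int(i), encode_str(s)) for i, s in rows]
assert sorted(keys) == keys  # byte order matches value order
```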

3. Better recommendations to users to not attempt certain queries.

We definitively know that there are certain types of queries that
Phoenix cannot support well (compared to optimal Phoenix use-cases).
Users very commonly fall into such pitfalls on their own and this leaves
a bad taste in their mouth (thinking that the product "stinks").

Can we do a better job of telling the user when and why it happened?
What would such a user-interaction model look like? Can we supplement
the "why" with instructions of what to do differently (even if in the
abstract)?

4. Phoenix-Calcite

This was mentioned as a "nice to have". From what I understand, there
was nothing explicitly wrong with the implementation or approach, just
that it was a massive undertaking to continue with little immediate
gain. Would this be a boon for us to try to continue in some form? Are
there steps we can take that would help push us along the right path?

Anyways, I'd love to hear everyone's thoughts. While the concerns were
raised at HBaseCon Asia, the suggestions that accompany them here are
largely mine ;). Feel free to break them out into their own threads if
you think that would be better (or say that you disagree with me --
that's cool too)!

- Josh
 

   



Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-17 Thread zhang yun
Sorry for replying late. I attended HBaseCon Asia as a speaker and took some 
notes. I think Phoenix's pain points are as follows:

1. The thick client isn't as popular as the thin client. For some applications: 
(a) users need to spend a lot of time resolving the dependencies; (b) users 
worry about stability, since some computation is processed inside the thick 
client; (c) some people want clients in other languages, such as Go, .Net, and 
Python. The thin client has other benefits too: it is easy to add a SQL audit 
function, and to recognize invalid SQL and report it to the user. As you said, 
this is definitely a big issue that is worth more attention. However, the thick 
client has some problems; here is some recent performance test data: 
[image attachment not preserved in this archive]


2. Actually, Phoenix has a higher barrier to entry for beginners than a common 
RDBMS: users need to learn HBase before using Phoenix, and most people don't 
know how to use it well, so we need more detailed documentation to make Phoenix 
easier to use.

3. HBase 3.0 has a plan for native SQL; does Phoenix have a plan? Also, many 
people don't even know HBase has a SQL layer called Phoenix, so can we put a 
link on the HBase website?


On 2018/08/27 18:03:30, Josh Elser  wrote: 

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-16 Thread la...@apache.org
 I think we can start by implementing a tighter integration with Spark through 
DataSource V2. That would make it quickly apparent what parts of Phoenix would 
need direct access.
Some parts just need an interface audience declaration (like Phoenix's basic 
type system) and our agreement that we will change those only according to 
semantic versioning. Others (like the query plan) will need a bit more 
thinking. Maybe that's the path to hooking in Calcite - just making that part 
up as I write this...
Perhaps turning the HBase interface into an API might not be so difficult 
either. That would be a new - strictly additional - client API.

A good Spark interface is in everybody's interest and I think is the best 
avenue to figure out what's missing/needed.
-- Lars

On Wednesday, September 12, 2018, 12:47:21 PM PDT, Josh Elser 
 wrote:  
 

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-12 Thread Josh Elser

I like it, Lars. I like it very much.

Just the easy part of doing it... ;)

On 9/11/18 4:53 PM, la...@apache.org wrote:




Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-11 Thread la...@apache.org
 Sorry for coming a bit late to this. I've been thinking about some of this 
for a bit.
It seems Phoenix serves 4 distinct purposes:
1. Query parsing and compiling.
2. A type system.
3. Query execution.
4. An efficient HBase interface.
Each of these is useful by itself, but we do not expose them as stable 
interfaces. We have seen a lot of need to tie HBase into "higher level" 
services, such as Spark (and Presto, etc.).
I think we can get a long way if we separate at least #1 (SQL) from the rest: 
#2, #3, and #4 (a Typed HBase Interface - THI).
Phoenix is used via SQL (#1); other tools such as Presto, Impala, Drill, Spark, 
etc., can interface efficiently with HBase via the THI (#2, #3, and #4).
Thoughts?
-- Lars
On Monday, August 27, 2018, 11:03:33 AM PDT, Josh Elser  
wrote:  
 

Re: [DISCUSS] EXPLAIN'ing what we do well (was Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes)

2018-08-30 Thread Thomas D'Silva
I filed PHOENIX-4881 for a guardrail config property based on the bytes 
scanned.
We already have PHOENIX-1481 to improve the explain plan documentation.

On Tue, Aug 28, 2018 at 1:40 PM, James Taylor 
wrote:

> Thomas' idea is a good one. From the EXPLAIN plan ResultSet, you can
> directly get an estimate of the number of bytes that will be scanned. Take
> a look at this [1] documentation. We need to implement PHOENIX-4735 too (so
> that things are set up well out-of-the-box). We could have a kind of
> guardrail config property that would define the maximum bytes allowed
> to be read and fail a query that goes over this limit. That would cover 80%
> of the issues IMHO. Other guardrail config properties could cover other
> corner cases.
>
> [1] http://phoenix.apache.org/explainplan.html
>
> On Mon, Aug 27, 2018 at 3:01 PM Josh Elser  wrote:
>
> > On 8/27/18 5:03 PM, Thomas D'Silva wrote:
> > >> 3. Better recommendations to users to not attempt certain queries.
> > >>
> > >> We definitively know that there are certain types of queries that Phoenix
> > >> cannot support well (compared to optimal Phoenix use-cases). Users very
> > >> commonly fall into such pitfalls on their own and this leaves a bad taste
> > >> in their mouth (thinking that the product "stinks").
> > >>
> > >> Can we do a better job of telling the user when and why it happened? What
> > >> would such a user-interaction model look like? Can we supplement the "why"
> > >> with instructions of what to do differently (even if in the abstract)?
> > >>
> > > Providing relevant feedback before/after a query is run in general is very
> > > hard to do. If stats are enabled we have an estimate of how many rows/bytes
> > > will be scanned.
> > > We could have an optional feature that prevents users from running queries
> > > if the rows/bytes scanned are above a certain threshold. We should also
> > > enhance our explain plan documentation
> > > (http://phoenix.apache.org/explainplan.html) with examples of queries so
> > > users know what kinds of queries Phoenix handles well.
> >
> > Breaking this out..
> >
> > Totally agree -- this is by no means "easy". I struggle very often
> > trying to express just _why_ a query that someone is running in Phoenix
> > doesn't run as well as they think it should.
> >
> > Centralizing on the EXPLAIN plan is good. Making sure it's
> > consumable/thorough is probably the lowest hanging fruit. If we can give
> > concrete examples of the kinds of explain plans a user might see, I
> > think that might get use from users/admins.
> >
> > Throwing a random idea out there: with stats and the query plan, can we
> > give a thumbs-up/thumbs-down? If we can, is that useful?
> >
>


Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-08-29 Thread Nick Dimiduk
On Mon, Aug 27, 2018 at 2:03 PM, Thomas D'Silva 
wrote:

> >
> >
> > 2. Can Phoenix be the de-facto schema for SQL on HBase?
> >
> > We've long asserted "if you have to ask how Phoenix serializes data, you
> > shouldn't be doing it" (a nod that you have to write lots of code). What if
> > we turn that on its head? Could we extract our PDataType serialization,
> > composite row-key, column encoding, etc into a minimal API that folks
> > with their own itches can use?
> >
> > With the growing integrations into Phoenix, we could embrace them by
> > providing an API to make what they're doing easier. In the same vein, we
> > cement ourselves as a cornerstone of doing it "correctly".
> >
>
> +1 on standardizing the data type and storage format API so that it would
> be easier for other projects to use.
>

Adding my $0.02, since I've thought a good bit about this over the years.

The `DataType` [0] interface in HBase was built with precisely this idea in
mind -- sharing data encoding formats across HBase projects. Phoenix's
`PDataType` implements this interface. Exposing the encoders to 3rd
parties, then, is a matter of those 3rd parties using this interface and
consuming the phoenix-core jar. Maybe we want to break them out into their
own jar to minimize dependencies? That said, Phoenix's smarts about
compound rowkeys and packed column values are beyond simple column
encodings. These may not be as easily exposed to external tools...

I think, realistically, Phoenix would need to expose a number of
schema-related tools together in a package in order to provide "true
interoperability" with other tools. Pick a use case -- I'm fond of
"offline" use-cases, something like building a Phoenix-compatible table
from a MapReduce (or Spark, or Hive, or...) application on a cluster that
doesn't even have HBase available. Then plumb it out the other way, reading
an exported snapshot of a Phoenix table from the same "offline"
environment. It's a pretty extreme case that I think is worthwhile because it
enables a lot of flexibility for users, and would shake out a bunch of
these related issues. I suspect this requires going below the JDBC
interface, but I could be wrong...

-n

[0]:
https://hbase.apache.org/1.2/apidocs/org/apache/hadoop/hbase/types/DataType.html
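The shared-codec contract Nick describes could take roughly the following shape, sketched as a toy (the names here are hypothetical; the real HBase interface is `org.apache.hadoop.hbase.types.DataType`). The point is that a writer and an "offline" reader agree only on a small encode/decode interface, not on a cluster:

```python
import struct
from typing import Protocol

# Sketch of the shared-codec idea: if the codec is a small public
# interface, an offline MapReduce/Spark job can encode or decode
# storage-compatible cells without a cluster. Hypothetical names; this
# mirrors the spirit of HBase's DataType, not its actual signatures.

class Codec(Protocol):
    def encode(self, value) -> bytes: ...
    def decode(self, data: bytes): ...

class LongCodec:
    """Order-preserving 64-bit integer codec (toy version)."""
    def encode(self, value: int) -> bytes:
        # Bias into the unsigned range so byte order matches numeric order.
        return struct.pack(">Q", value + (1 << 63))
    def decode(self, data: bytes) -> int:
        return struct.unpack(">Q", data)[0] - (1 << 63)

# An "offline" writer and a separate reader agree only on the interface.
codec: Codec = LongCodec()
raw = codec.encode(-42)
assert codec.decode(raw) == -42
```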


Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-08-28 Thread Andrew Purtell
On Tue, Aug 28, 2018 at 2:01 PM James Taylor  wrote:

> Glad to hear this was discussed at HBaseCon. The most common request I've
> seen asked for is to be able to write Phoenix-compatible data from other,
> non-Phoenix services/projects, mainly because row-by-row updates (even when
> batched) can be a bottleneck. This is not feasible by using low level
> constructs because of all the features provided by Phoenix: secondary
> indexes, composite row keys, encoded columns, storage formats, salting,
> ascending/descending row keys, array support, etc. The most feasible way to
> accomplish writes outside of Phoenix is to use UPSERT VALUES followed by
> PhoenixRuntime#getUncommittedDataIterator to get the Cells that would be
> committed (followed by rolling back the uncommitted data). This maintains
> Phoenix's abstraction and minimizes any overhead (the cost of parsing is
> negligible). You can control the frequency of how often the schema is
> pulled over from the server through the UPDATE_CACHE_FREQUENCY declaration.
>
> I haven't seen much demand for bypassing Phoenix JDBC on the read side. If
> you don't want to use Phoenix to query, what's the point in using it?
>

You might have Phoenix clients and HBase clients sharing common data
sources, for whatever reason; we cannot assume what constraints or legacy
issues may present themselves in a given Phoenix or HBase user's
environment. Agree though as a question of prioritization maybe it doesn't
get done until a volunteer does it to scratch a real itch, but at that
point it could be useful to accept the contribution.


> As far as Calcite/Phoenix, it'd be great to see this work picked up. I
> don't think this solves the API problem, though. A good home for this
> adapter would be Apache Drill IMHO. They're up to a new enough version of
> Calcite (and off of their fork) so that this would be feasible and would
> provide immediate benefits on the query side.
>
> Thanks,
> James
>
> On Tue, Aug 28, 2018 at 1:38 PM Andrew Purtell 
> wrote:
>
> > On Mon, Aug 27, 2018 at 11:03 AM Josh Elser  wrote:
> >
> > > 2. Can Phoenix be the de-facto schema for SQL on HBase?
> > >
> > > We've long asserted "if you have to ask how Phoenix serializes data, you
> > > shouldn't be doing it" (a nod that you have to write lots of code). What if
> > > we turn that on its head? Could we extract our PDataType serialization,
> > > composite row-key, column encoding, etc into a minimal API that folks
> > > with their own itches can use?
> > >
> > > With the growing integrations into Phoenix, we could embrace them by
> > > providing an API to make what they're doing easier. In the same vein, we
> > > cement ourselves as a cornerstone of doing it "correctly"
> > >
> >
> > There have been discussions where I work where it seems this would be a
> > great idea. If data types, row key constructors, and other key and data
> > serialization concerns were a public API, these could be used by connectors
> > to Spark or other systems to generate and consume Phoenix-compatible data.
> > It improves the integration story all around.
> >
> > Another thought for refactoring I've heard is exposing an API for
> > generating query plans without needing the SQL parser. A public API for
> > programmatically building query plans could be used by connectors to Spark
> > or other systems when pushing down parts of a parallelized or federated
> > query to Phoenix data sources, avoiding unnecessary hacking of SQL language
> > generation, string mangling, or (re)parsing overheads. This kind of
> > describes Calcite's raison d'être. If Phoenix is not embedding Calcite as
> > query planner, as it does not currently, it is independently useful to have
> > a public API for programmatic query plan construction given the current
> > implementation regardless. If Phoenix were to embed Calcite as query
> > planner, you'd probably get a ton of re-use among internal and external
> > users of the Calcite APIs. I'd think whatever option you might choose would
> > be informed by the suitability (or not) of embedding Calcite as Phoenix's
> > query planner, and how soon that might be expected to be feature complete.
> > For what it's worth. Again this extends possibilities for integration.
> >
> >
> > > 3. Better recommendations to users to not attempt certain queries.
> > >
> > > We definitively know that there are certain types of queries that
> > > Phoenix cannot support well (compared to optimal Phoenix use-cases).
> > > Users very commonly fall into such pitfalls on their own and this leaves
> > > a bad taste in their mouth (thinking that the product "stinks").
> > >
> > > Can we do a better job of telling the user when and why it happened?
> > > What would such a user-interaction model look like? Can we supplement
> > > the "why" with instructions of what to do differently (even if in the
> > > abstract)?
> > >
> > > 4. Phoenix-Calcite
> > >
> > > This was mentioned as a "nice to have". From what I understand, 

Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-08-28 Thread James Taylor
Glad to hear this was discussed at HBaseCon. The most common request I've
seen asked for is to be able to write Phoenix-compatible data from other,
non-Phoenix services/projects, mainly because row-by-row updates (even when
batched) can be a bottleneck. This is not feasible by using low level
constructs because of all the features provided by Phoenix: secondary
indexes, composite row keys, encoded columns, storage formats, salting,
ascending/descending row keys, array support, etc. The most feasible way to
accomplish writes outside of Phoenix is to use UPSERT VALUES followed by
PhoenixRuntime#getUncommittedDataIterator to get the Cells that would be
committed (followed by rolling back the uncommitted data). This maintains
Phoenix's abstraction and minimizes any overhead (the cost of parsing is
negligible). You can control the frequency of how often the schema is
pulled over from the server through the UPDATE_CACHE_FREQUENCY declaration.
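The write-path pattern James describes (upsert, capture the would-be cells, then roll back) reduces to a simple control flow. Below is a toy model of just that flow; it is not the real API, which goes through UPSERT VALUES on a Phoenix JDBC Connection and PhoenixRuntime#getUncommittedDataIterator:

```python
# Toy model of the "extract cells without committing" pattern. This
# stand-in only mirrors the control flow (buffer -> drain -> rollback);
# real Phoenix would also produce index updates and encoded columns.

class ToyConnection:
    """Auto-commit off: mutations accumulate client-side until commit()."""
    def __init__(self):
        self.uncommitted = []   # cells the engine *would* write
        self.committed = []

    def upsert(self, rowkey, column, value):
        # Analogous to executing UPSERT VALUES without committing.
        self.uncommitted.append((rowkey, column, value))

    def get_uncommitted_cells(self):
        # Analogous to PhoenixRuntime#getUncommittedDataIterator:
        # expose the pending cells for an external bulk writer.
        return list(self.uncommitted)

    def rollback(self):
        self.uncommitted.clear()

conn = ToyConnection()
conn.upsert(b"r1", "COL_A", 42)
cells = conn.get_uncommitted_cells()   # hand these to a bulk-load writer
conn.rollback()                        # nothing ever hits the server
assert cells == [(b"r1", "COL_A", 42)] and conn.uncommitted == []
```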

I haven't seen much demand for bypassing Phoenix JDBC on the read side. If
you don't want to use Phoenix to query, what's the point in using it?

As far as Calcite/Phoenix, it'd be great to see this work picked up. I
don't think this solves the API problem, though. A good home for this
adapter would be Apache Drill IMHO. They're up to a new enough version of
Calcite (and off of their fork) so that this would be feasible and would
provide immediate benefits on the query side.

Thanks,
James

On Tue, Aug 28, 2018 at 1:38 PM Andrew Purtell  wrote:

> On Mon, Aug 27, 2018 at 11:03 AM Josh Elser  wrote:
>
> > 2. Can Phoenix be the de-facto schema for SQL on HBase?
> >
> > We've long asserted "if you have to ask how Phoenix serializes data, you
> > shouldn't be doing it" (a nod that you have to write lots of code). What if
> > we turn that on its head? Could we extract our PDataType serialization,
> > composite row-key, column encoding, etc into a minimal API that folks
> > with their own itches can use?
> >
> > With the growing integrations into Phoenix, we could embrace them by
> > providing an API to make what they're doing easier. In the same vein, we
> > cement ourselves as a cornerstone of doing it "correctly"
> >
>
> There have been discussions where I work where it seems this would be a
> great idea. If data types, row key constructors, and other key and data
> serialization concerns were a public API, these could be used by connectors
> to Spark or other systems to generate and consume Phoenix compatible data.
> It improves the integration story all around.
>
> Another thought for refactoring I've heard is exposing an API for
> generating query plans without needing the SQL parser. A public API for
> programmatically building query plans could be used by connectors to Spark or
> other systems when pushing down parts of a parallelized or federated query
> to Phoenix data sources, avoiding unnecessary hacking of SQL language
> generation, string mangling, or (re)parsing overheads. This kind of
> describes Calcite's raison d'être. If Phoenix is not embedding Calcite as
> query planner, as it does not currently, it is independently useful to have
> a public API for programmatic query plan construction given the current
> implementation regardless. If Phoenix were to embed Calcite as query
> planner, you'd probably get a ton of re-use among internal and external
> users of the Calcite APIs. I'd think whatever option you might choose would
> be informed by the suitability (or not) of embedding Calcite as Phoenix's
> query planner, and how soon that might be expected to be feature complete.
> For what it's worth. Again this extends possibilities for integration.
>
>
> > 3. Better recommendations to users to not attempt certain queries.
> >
> > We definitively know that there are certain types of queries that
> > Phoenix cannot support well (compared to optimal Phoenix use-cases).
> > Users very commonly fall into such pitfalls on their own and this leaves
> > a bad taste in their mouth (thinking that the product "stinks").
> >
> > Can we do a better job of telling the user when and why it happened?
> > What would such a user-interaction model look like? Can we supplement
> > the "why" with instructions of what to do differently (even if in the
> > abstract)?
> >
> > 4. Phoenix-Calcite
> >
> > This was mentioned as a "nice to have". From what I understand, there
> > was nothing explicitly wrong with the implementation or approach, just
> > that it was a massive undertaking to continue with little immediate
> > gain. Would this be a boon for us to try to continue in some form? Are
> > there steps we can take that would help push us along the right path?
> >
> > Anyways, I'd love to hear everyone's thoughts. While the concerns were
> > raised at HBaseCon Asia, the suggestions that accompany them here are
> > largely mine ;). Feel free to break them out into their own threads if
> > you think that would be better (or say that you disagree with me --
> > that's cool too)!
> >
> > - Josh
> >
>

Re: [DISCUSS] EXPLAIN'ing what we do well (was Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes)

2018-08-28 Thread James Taylor
Thomas' idea is a good one. From the EXPLAIN plan ResultSet, you can
directly get an estimate of the number of bytes that will be scanned. Take
a look at this [1] documentation. We need to implement PHOENIX-4735 too (so
that things are set up well out-of-the-box). We could have a kind of
guardrail config property that would define the maximum number of bytes
allowed to be read and fail any query that goes over this limit. That would
cover 80% of the issues, IMHO. Other guardrail config properties could cover
other corner cases.

[1] http://phoenix.apache.org/explainplan.html
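The estimate James describes can be read straight off the EXPLAIN result set, and the guardrail check itself is simple once you have it. A rough sketch in Java, assuming the EST_BYTES_READ column described in the explain plan documentation [1]; the limit value and any config wiring here are invented for illustration:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ScanGuardrail {

    /** Guardrail decision: true only when the estimate is known and over
     *  the limit. A null estimate (no stats collected) lets the query run. */
    public static boolean exceedsLimit(Long estimatedBytes, long maxBytes) {
        return estimatedBytes != null && estimatedBytes > maxBytes;
    }

    /** Read the byte estimate from "EXPLAIN <query>", per the explain plan
     *  docs [1]; returns null when stats are unavailable. */
    public static Long estimateBytes(Connection conn, String sql) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("EXPLAIN " + sql)) {
            rs.next();
            long est = rs.getLong("EST_BYTES_READ");
            return rs.wasNull() ? null : est;
        }
    }
}
```

A client could call estimateBytes before executing and refuse to run the query when exceedsLimit returns true; a server-side guardrail would presumably hook the same estimate inside the query compiler instead.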

On Mon, Aug 27, 2018 at 3:01 PM Josh Elser  wrote:

> On 8/27/18 5:03 PM, Thomas D'Silva wrote:
> >> 3. Better recommendations to users to not attempt certain queries.
> >>
> >> We definitively know that there are certain types of queries that
> >> Phoenix cannot support well (compared to optimal Phoenix use-cases).
> >> Users very commonly fall into such pitfalls on their own and this
> >> leaves a bad taste in their mouth (thinking that the product "stinks").
> >>
> >> Can we do a better job of telling the user when and why it happened?
> >> What would such a user-interaction model look like? Can we supplement
> >> the "why" with instructions of what to do differently (even if in the
> >> abstract)?
> >>
> > Providing relevant feedback before/after a query is run is in general
> > very hard to do. If stats are enabled we have an estimate of how many
> > rows/bytes will be scanned.
> > We could have an optional feature that prevents users from running
> > queries if the rows/bytes scanned are above a certain threshold. We
> > should also enhance our explain plan documentation
> > http://phoenix.apache.org/explainplan.html with examples of queries so
> > users know what kinds of queries Phoenix handles well.
>
> Breaking this out..
>
> Totally agree -- this is by no means "easy". I struggle very often
> trying to express just _why_ a query that someone is running in Phoenix
> doesn't run as well as they think it should.
>
> Centralizing on the EXPLAIN plan is good. Making sure it's
> consumable/thorough is probably the lowest-hanging fruit. If we can give
> concrete examples of the kinds of explain plans a user might see, I
> think that would be of use to users/admins.
>
> Throwing a random idea out there: with stats and the query plan, can we
> give a thumbs-up/thumbs-down? If we can, is that useful?
>


Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-08-28 Thread Andrew Purtell
On Mon, Aug 27, 2018 at 11:03 AM Josh Elser  wrote:

> 2. Can Phoenix be the de-facto schema for SQL on HBase?
>
> We've long asserted "if you have to ask how Phoenix serializes data, you
> shouldn't be doing it" (a nod that you have to write lots of code). What if
> we turn that on its head? Could we extract our PDataType serialization,
> composite row-key, column encoding, etc into a minimal API that folks
> with their own itches can use?
>
> With the growing integrations into Phoenix, we could embrace them by
> providing an API to make what they're doing easier. In the same vein, we
> cement ourselves as a cornerstone of doing it "correctly"
>

There have been discussions where I work where it seems this would be a
great idea. If data types, row key constructors, and other key and data
serialization concerns were a public API, these could be used by connectors
to Spark or other systems to generate and consume Phoenix compatible data.
It improves the integration story all around.
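To make the serialization concern concrete: Phoenix keeps fixed-width integers byte-wise sortable by writing them big-endian with the sign bit flipped, so HBase's unsigned lexicographic comparison of the encoded bytes matches signed numeric order. A minimal, self-contained illustration of that scheme follows; this is not Phoenix's actual PDataType code, just the idea a public serialization API would pin down:

```java
/** Sort-order-preserving int codec: big-endian with the sign bit flipped,
 *  the scheme Phoenix's integer types use so that HBase's unsigned byte
 *  comparison of row keys matches signed numeric order. Illustrative only. */
public class IntCodec {

    public static byte[] encode(int v) {
        int flipped = v ^ 0x80000000; // flip sign bit: negatives sort first
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),  (byte) flipped };
    }

    public static int decode(byte[] b) {
        int v = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
              | ((b[2] & 0xFF) << 8)  |  (b[3] & 0xFF);
        return v ^ 0x80000000; // undo the sign-bit flip
    }

    /** Byte-wise unsigned comparison, as HBase compares row keys. */
    public static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length && i < b.length; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }
}
```

A connector that can depend on an API like this (rather than reimplementing the byte layout from folklore) is exactly the "Phoenix compatible data" story above.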

Another thought for refactoring I've heard is exposing an API for
generating query plans without needing the SQL parser. A public API for
programmatically building query plans could be used by connectors to Spark
or other systems when pushing down parts of a parallelized or federated
query to Phoenix data sources, avoiding hacks like SQL text generation,
string mangling, and (re)parsing overheads. This kind of describes
Calcite's raison d'être. Since Phoenix does not currently embed Calcite as
its query planner, a public API for programmatic query plan construction
would be independently useful given the current implementation. If Phoenix
were to embed Calcite as its query planner, you'd probably get a ton of
re-use among internal and external users of the Calcite APIs. I'd think
whatever option you might choose would be informed by the suitability (or
not) of embedding Calcite as Phoenix's query planner, and how soon that
might be expected to be feature complete. For what it's worth. Again, this
extends possibilities for integration.
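No such public plan-construction API exists in Phoenix today, so the shape below is pure invention, but it gives a flavor of what a connector-facing builder might look like: describe the scan directly instead of generating and re-parsing SQL text. Every class and method name here is hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a programmatic plan-building API. A Spark or
 *  other connector could push down a table, a projection, and a key range
 *  without ever producing SQL text. All names invented for illustration. */
public class ScanPlanBuilder {
    private String table;
    private final List<String> columns = new ArrayList<>();
    private String startKey;
    private String stopKey;

    public ScanPlanBuilder table(String t) { this.table = t; return this; }

    public ScanPlanBuilder project(String... cols) {
        columns.addAll(Arrays.asList(cols));
        return this;
    }

    public ScanPlanBuilder keyRange(String start, String stop) {
        this.startKey = start;
        this.stopKey = stop;
        return this;
    }

    /** Stand-in for handing the built plan to the engine: render a summary. */
    public String describe() {
        return "SCAN " + table + " COLS " + columns
                + " RANGE [" + startKey + ", " + stopKey + ")";
    }
}
```

This is also roughly the niche Calcite's relational-algebra builders fill, which is why the embedding question bears on what (if anything) Phoenix should build itself.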


> 3. Better recommendations to users to not attempt certain queries.
>
> We definitively know that there are certain types of queries that
> Phoenix cannot support well (compared to optimal Phoenix use-cases).
> Users very commonly fall into such pitfalls on their own and this leaves
> a bad taste in their mouth (thinking that the product "stinks").
>
> Can we do a better job of telling the user when and why it happened?
> What would such a user-interaction model look like? Can we supplement
> the "why" with instructions of what to do differently (even if in the
> abstract)?
>
> 4. Phoenix-Calcite
>
> This was mentioned as a "nice to have". From what I understand, there
> > was nothing explicitly wrong with the implementation or approach, just
> that it was a massive undertaking to continue with little immediate
> gain. Would this be a boon for us to try to continue in some form? Are
> there steps we can take that would help push us along the right path?
>
> Anyways, I'd love to hear everyone's thoughts. While the concerns were
> raised at HBaseCon Asia, the suggestions that accompany them here are
> largely mine ;). Feel free to break them out into their own threads if
> you think that would be better (or say that you disagree with me --
> that's cool too)!
>
> - Josh
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk


[DISCUSS] EXPLAIN'ing what we do well (was Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes)

2018-08-27 Thread Josh Elser

On 8/27/18 5:03 PM, Thomas D'Silva wrote:

3. Better recommendations to users to not attempt certain queries.

We definitively know that there are certain types of queries that Phoenix
cannot support well (compared to optimal Phoenix use-cases). Users very
commonly fall into such pitfalls on their own and this leaves a bad taste
in their mouth (thinking that the product "stinks").

Can we do a better job of telling the user when and why it happened? What
would such a user-interaction model look like? Can we supplement the "why"
with instructions of what to do differently (even if in the abstract)?


Providing relevant feedback before/after a query is run in general is very
hard to do. If stats are enabled we have an estimate of how many rows/bytes
will be scanned.
We could have an optional feature that prevents users from running queries
if the rows/bytes scanned are above a certain threshold. We should also
enhance our explain plan documentation
http://phoenix.apache.org/explainplan.html with examples
of queries so users know what kinds of queries Phoenix handles well.


Breaking this out..

Totally agree -- this is by no means "easy". I struggle very often 
trying to express just _why_ a query that someone is running in Phoenix 
doesn't run as well as they think it should.


Centralizing on the EXPLAIN plan is good. Making sure it's 
consumable/thorough is probably the lowest-hanging fruit. If we can give 
concrete examples of the kinds of explain plans a user might see, I 
think that would be of use to users/admins.


Throwing a random idea out there: with stats and the query plan, can we 
give a thumbs-up/thumbs-down? If we can, is that useful?


Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-08-27 Thread Thomas D'Silva
>
>
> 2. Can Phoenix be the de-facto schema for SQL on HBase?
>
> We've long asserted "if you have to ask how Phoenix serializes data, you
> shouldn't be doing it" (a nod that you have to write lots of code). What if we
> turn that on its head? Could we extract our PDataType serialization,
> composite row-key, column encoding, etc into a minimal API that folks with
> their own itches can use?
>
> With the growing integrations into Phoenix, we could embrace them by
> providing an API to make what they're doing easier. In the same vein, we
> cement ourselves as a cornerstone of doing it "correctly".
>

+1 on standardizing the data type and storage format API so that it would
be easier for other projects to use.


> 3. Better recommendations to users to not attempt certain queries.
>
> We definitively know that there are certain types of queries that Phoenix
> cannot support well (compared to optimal Phoenix use-cases). Users very
> commonly fall into such pitfalls on their own and this leaves a bad taste
> in their mouth (thinking that the product "stinks").
>
> Can we do a better job of telling the user when and why it happened? What
> would such a user-interaction model look like? Can we supplement the "why"
> with instructions of what to do differently (even if in the abstract)?
>

Providing relevant feedback before/after a query is run in general is very
hard to do. If stats are enabled we have an estimate of how many rows/bytes
will be scanned.
We could have an optional feature that prevents users from running queries
if the rows/bytes scanned are above a certain threshold. We should also
enhance our explain plan documentation
http://phoenix.apache.org/explainplan.html with examples
of queries so users know what kinds of queries Phoenix handles well.


> 4. Phoenix-Calcite
>
> This was mentioned as a "nice to have". From what I understand, there was
> nothing explicitly wrong with the implementation or approach, just that it
> was a massive undertaking to continue with little immediate gain. Would
> this be a boon for us to try to continue in some form? Are there steps we
> can take that would help push us along the right path?
>

Maybe Maryanne, Rajeshbabu or Ankit can comment on the feasibility of
proceeding with the Calcite integration.
It would be good to standardize our query plan APIs so that we can generate
a query plan from a Spark Catalyst plan, for example.


[DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-08-27 Thread Josh Elser
(bcc: dev@hbase, in case folks there have been waiting for me to send 
this email to dev@phoenix)


Hi,

In case you missed it, there was an HBaseCon event held in Asia 
recently. Stack took some great notes and shared them with the HBase 
community. A few of them touched on Phoenix, directly or in a related 
manner. I think they are good "criticisms" that are beneficial for us to 
hear.


1. The phoenix-$version-client.jar size is prohibitively large

In this day and age, I'm surprised that this is a big issue for people. 
I know we have a lot of cruft, most of which comes from Hadoop. We have 
gotten better here over recent releases, but I would guess that there is 
more we can do.


2. Can Phoenix be the de-facto schema for SQL on HBase?

We've long asserted "if you have to ask how Phoenix serializes data, you 
shouldn't be doing it" (a nod that you have to write lots of code). What if 
we turn that on its head? Could we extract our PDataType serialization, 
composite row-key, column encoding, etc into a minimal API that folks 
with their own itches can use?


With the growing integrations into Phoenix, we could embrace them by 
providing an API to make what they're doing easier. In the same vein, we 
cement ourselves as a cornerstone of doing it "correctly".


3. Better recommendations to users to not attempt certain queries.

We definitively know that there are certain types of queries that 
Phoenix cannot support well (compared to optimal Phoenix use-cases). 
Users very commonly fall into such pitfalls on their own and this leaves 
a bad taste in their mouth (thinking that the product "stinks").


Can we do a better job of telling the user when and why it happened? 
What would such a user-interaction model look like? Can we supplement 
the "why" with instructions of what to do differently (even if in the 
abstract)?


4. Phoenix-Calcite

This was mentioned as a "nice to have". From what I understand, there 
was nothing explicitly wrong with the implementation or approach, just 
that it was a massive undertaking to continue with little immediate 
gain. Would this be a boon for us to try to continue in some form? Are 
there steps we can take that would help push us along the right path?


Anyways, I'd love to hear everyone's thoughts. While the concerns were 
raised at HBaseCon Asia, the suggestions that accompany them here are 
largely mine ;). Feel free to break them out into their own threads if 
you think that would be better (or say that you disagree with me -- 
that's cool too)!


- Josh