[jira] [Assigned] (PHOENIX-4903) HashCache recreated on client for every RegionServer it is sent to

2018-09-17 Thread Josh Elser (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser reassigned PHOENIX-4903:
---

Assignee: Marcell Ortutay

> HashCache recreated on client for every RegionServer it is sent to
> --
>
> Key: PHOENIX-4903
> URL: https://issues.apache.org/jira/browse/PHOENIX-4903
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> To distribute the hash cache to region servers, the master node makes an 
> `AddServerCacheRequest` RPC to each region server. If there are N region 
> servers, it makes N of these RPCs. For each region server, it generates a 
> separate serialized RPC message and sends it out. This happens concurrently, 
> and the result is that it uses O(N) memory on the master.
> As an example, if the `AddServerCacheRequest` RPC message is 100MB and you 
> have a cluster of 100 nodes, it would use 10GB of memory on the master, 
> potentially resulting in an "OutOfMemory" exception.
> It would be better if the master could use O(1) memory for the RPC.
> I observed this behavior in Phoenix 4.14.1.
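>
> As a rough, hypothetical sketch of the direction proposed here (the class and 
> method names below are made up, not Phoenix's actual code): serialize the 
> cache payload once and hand the same immutable bytes to every 
> per-RegionServer RPC, so client-side memory stays O(1) in the number of 
> servers.
> {code:java}
> import java.util.List;
> import java.util.concurrent.Callable;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> // Hypothetical sketch only: the hash-cache payload is serialized once and the
> // same immutable byte[] is reused for every per-RegionServer RPC, instead of
> // re-serializing the request N times.
> public class SharedCachePayloadSketch {
>
>     /** Hypothetical stand-in for the per-server addServerCache RPC. */
>     interface RegionServerStub {
>         void addServerCache(byte[] cacheId, byte[] payload) throws Exception;
>     }
>
>     static void distributeCache(byte[] cacheId, byte[] serializedHashCache,
>                                 List<RegionServerStub> servers) throws Exception {
>         ExecutorService pool =
>                 Executors.newFixedThreadPool(Math.max(1, Math.min(servers.size(), 8)));
>         try {
>             for (RegionServerStub server : servers) {
>                 // Every task shares the same serialized payload; nothing is copied per server.
>                 pool.submit((Callable<Void>) () -> {
>                     server.addServerCache(cacheId, serializedHashCache);
>                     return null;
>                 });
>             }
>         } finally {
>             pool.shutdown();
>         }
>     }
> }
> {code}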



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PHOENIX-4903) HashCache recreated on client for every RegionServer it is sent to

2018-09-17 Thread Josh Elser (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated PHOENIX-4903:

Summary: HashCache recreated on client for every RegionServer it is sent to 
 (was: Hash cache RPC uses O(N) memory on master)

> HashCache recreated on client for every RegionServer it is sent to
> --
>
> Key: PHOENIX-4903
> URL: https://issues.apache.org/jira/browse/PHOENIX-4903
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Priority: Major
>
> To distribute the hash cache to region servers, the master node makes an 
> `AddServerCacheRequest` RPC to each region server. If there are N region 
> servers, it makes N of these RPCs. For each region server, it generates a 
> separate serialized RPC message and sends it out. This happens concurrently, 
> and the result is that it uses O(N) memory on the master.
> As an example, if the `AddServerCacheRequest` RPC message is 100MB and you 
> have a cluster of 100 nodes, it would use 10GB of memory on the master, 
> potentially resulting in an "OutOfMemory" exception.
> It would be better if the master could use O(1) memory for the RPC.
> I observed this behavior in Phoenix 4.14.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PHOENIX-4907) IndexScrutinyTool should use empty catalog instead of null

2018-09-17 Thread Geoffrey Jacoby (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geoffrey Jacoby updated PHOENIX-4907:
-
Attachment: PHOENIX-4907.patch

> IndexScrutinyTool should use empty catalog instead of null
> --
>
> Key: PHOENIX-4907
> URL: https://issues.apache.org/jira/browse/PHOENIX-4907
> Project: Phoenix
>  Issue Type: Improvement
>Affects Versions: 5.0.0, 4.15.0
>Reporter: Geoffrey Jacoby
>Assignee: Geoffrey Jacoby
>Priority: Major
> Attachments: PHOENIX-4907.patch
>
>
> Before executing, the index scrutiny tool does a sanity check to make sure 
> that the given data table and index are valid and related to each other. This 
> check uses the JDBC metadata API, and passes in null for the catalog name. 
> Unfortunately, a null entry for catalog causes Phoenix to omit tenant_id from 
> the query against System.Catalog, causing a table scan, which can be lengthy 
> or time out if the server has too many views. 
> It should pass in the empty string for catalog, which will make Phoenix 
> filter on "WHERE tenant_id IS NULL" and avoid the table scan. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PHOENIX-4519) Index rebuild MR jobs not created for "alter index rebuild async" rebuilds

2018-09-17 Thread Geoffrey Jacoby (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geoffrey Jacoby updated PHOENIX-4519:
-
Attachment: PHOENIX-4519-v2.patch

> Index rebuild MR jobs not created for "alter index rebuild async" rebuilds
> --
>
> Key: PHOENIX-4519
> URL: https://issues.apache.org/jira/browse/PHOENIX-4519
> Project: Phoenix
>  Issue Type: Bug
>Reporter: Vincent Poon
>Assignee: Geoffrey Jacoby
>Priority: Major
> Attachments: PHOENIX-4519-v2.patch, PHOENIX-4519.patch
>
>
> It seems we have two ASYNC flags for index rebuilds:
> ASYNC_CREATED_DATE - when an index is created async
> ASYNC_REBUILD_TIMESTAMP - created by "alter index ... rebuild async"
> The PhoenixMRJobSubmitter only submits MR jobs for the former.  We should 
> also submit jobs for the latter.
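>
> For illustration only (this is not the actual PhoenixMRJobSubmitter query; the 
> connection URL is a placeholder and the exact predicates are assumptions), a 
> candidate lookup that honors both flags might look roughly like:
> {code:java}
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.ResultSet;
> import java.sql.Statement;
>
> // Illustrative sketch, not the real submitter code: find index tables flagged
> // for an async build by either mechanism, assuming both flags are nullable
> // columns on SYSTEM.CATALOG as the description above suggests.
> public class AsyncIndexCandidatesSketch {
>     public static void main(String[] args) throws Exception {
>         String sql =
>             "SELECT TABLE_SCHEM, DATA_TABLE_NAME, TABLE_NAME FROM SYSTEM.CATALOG "
>           + "WHERE ASYNC_CREATED_DATE IS NOT NULL "        // CREATE INDEX ... ASYNC
>           + "   OR ASYNC_REBUILD_TIMESTAMP IS NOT NULL";   // ALTER INDEX ... REBUILD ASYNC
>         try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
>              Statement stmt = conn.createStatement();
>              ResultSet rs = stmt.executeQuery(sql)) {
>             while (rs.next()) {
>                 System.out.println(rs.getString("TABLE_SCHEM") + "." + rs.getString("TABLE_NAME"));
>             }
>         }
>     }
> }
> {code}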



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PHOENIX-4907) IndexScrutinyTool should use empty catalog instead of null

2018-09-17 Thread Geoffrey Jacoby (JIRA)
Geoffrey Jacoby created PHOENIX-4907:


 Summary: IndexScrutinyTool should use empty catalog instead of null
 Key: PHOENIX-4907
 URL: https://issues.apache.org/jira/browse/PHOENIX-4907
 Project: Phoenix
  Issue Type: Improvement
Affects Versions: 5.0.0, 4.15.0
Reporter: Geoffrey Jacoby
Assignee: Geoffrey Jacoby


Before executing, the index scrutiny tool does a sanity check to make sure that 
the given data table and index are valid and related to each other. This check 
uses the JDBC metadata API, and passes in null for the catalog name. 

Unfortunately, a null entry for catalog causes Phoenix to omit tenant_id from 
the query against System.Catalog, causing a table scan, which can be lengthy or 
time out if the server has too many views. 

It should pass in the empty string for catalog, which will make Phoenix filter 
on "WHERE tenant_id IS NULL" and avoid the table scan. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PHOENIX-4849) UPSERT SELECT fails with stale region boundary exception after a split

2018-09-17 Thread Lars Hofhansl (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl reassigned PHOENIX-4849:
--

Assignee: Thomas D'Silva  (was: Lars Hofhansl)

> UPSERT SELECT fails with stale region boundary exception after a split
> --
>
> Key: PHOENIX-4849
> URL: https://issues.apache.org/jira/browse/PHOENIX-4849
> Project: Phoenix
>  Issue Type: Bug
>Reporter: Akshita Malhotra
>Assignee: Thomas D'Silva
>Priority: Critical
> Attachments: PHOENIX-4849-complete-1.4.txt, PHOENIX-4849-fix.txt, 
> PHOENIX-4849-v2.patch, PHOENIX-4849-v3.patch, PHOENIX-4849-v4.patch, 
> PHOENIX-4849.patch, SerialIterators.diff, SplitIT.patch
>
>
> UPSERT SELECT throws a StaleRegionBoundaryCacheException immediately after a 
> split. On the other hand, an upsert followed by a select, for example, works 
> absolutely fine.
> org.apache.phoenix.schema.StaleRegionBoundaryCacheException: ERROR 1108 
> (XCL08): Cache of region boundaries are out of date.
> at 
> org.apache.phoenix.exception.SQLExceptionCode$14.newException(SQLExceptionCode.java:365)
>  at 
> org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
>  at 
> org.apache.phoenix.util.ServerUtil.parseRemoteException(ServerUtil.java:183)
>  at 
> org.apache.phoenix.util.ServerUtil.parseServerExceptionOrNull(ServerUtil.java:167)
>  at 
> org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:134)
>  at 
> org.apache.phoenix.iterate.ScanningResultIterator.next(ScanningResultIterator.java:153)
>  at 
> org.apache.phoenix.iterate.TableResultIterator.next(TableResultIterator.java:228)
>  at 
> org.apache.phoenix.iterate.LookAheadResultIterator$1.advance(LookAheadResultIterator.java:47)
>  at 
> org.apache.phoenix.iterate.LookAheadResultIterator.init(LookAheadResultIterator.java:59)
>  at 
> org.apache.phoenix.iterate.LookAheadResultIterator.peek(LookAheadResultIterator.java:73)
>  at 
> org.apache.phoenix.iterate.SerialIterators$SerialIterator.nextIterator(SerialIterators.java:187)
>  at 
> org.apache.phoenix.iterate.SerialIterators$SerialIterator.currentIterator(SerialIterators.java:160)
>  at 
> org.apache.phoenix.iterate.SerialIterators$SerialIterator.peek(SerialIterators.java:218)
>  at 
> org.apache.phoenix.iterate.ConcatResultIterator.currentIterator(ConcatResultIterator.java:100)
>  at 
> org.apache.phoenix.iterate.ConcatResultIterator.next(ConcatResultIterator.java:117)
>  at 
> org.apache.phoenix.iterate.DelegateResultIterator.next(DelegateResultIterator.java:44)
>  at 
> org.apache.phoenix.iterate.LimitingResultIterator.next(LimitingResultIterator.java:47)
>  at org.apache.phoenix.jdbc.PhoenixResultSet.next(PhoenixResultSet.java:805)
>  at 
> org.apache.phoenix.compile.UpsertCompiler.upsertSelect(UpsertCompiler.java:219)
>  at 
> org.apache.phoenix.compile.UpsertCompiler$ClientUpsertSelectMutationPlan.execute(UpsertCompiler.java:1292)
>  at org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:408)
>  at org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:391)
>  at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>  at 
> org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:390)
>  at 
> org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:378)
>  at 
> org.apache.phoenix.jdbc.PhoenixPreparedStatement.execute(PhoenixPreparedStatement.java:173)
>  at 
> org.apache.phoenix.jdbc.PhoenixPreparedStatement.execute(PhoenixPreparedStatement.java:183)
>  at 
> org.apache.phoenix.end2end.UpsertSelectAfterSplitTest.upsertSelectData1(UpsertSelectAfterSplitTest.java:109)
>  at 
> org.apache.phoenix.end2end.UpsertSelectAfterSplitTest.testUpsertSelect(UpsertSelectAfterSplitTest.java:59)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>  at 

[jira] [Updated] (PHOENIX-4594) Perform binary search on guideposts during query compilation

2018-09-17 Thread Bin Shi (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bin Shi updated PHOENIX-4594:
-
Attachment: PHOENIX-4594_0917.patch

> Perform binary search on guideposts during query compilation
> 
>
> Key: PHOENIX-4594
> URL: https://issues.apache.org/jira/browse/PHOENIX-4594
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: James Taylor
>Assignee: Bin Shi
>Priority: Major
> Attachments: PHOENIX-4594-0913.patch, PHOENIX-4594_0917.patch
>
>
> If there are many guideposts, performance will suffer during query 
> compilation because we do a linear search of the guideposts to find the 
> intersection with the scan ranges. Instead, in 
> BaseResultIterators.getParallelScans() we should populate an array of 
> guideposts and perform a binary search. 
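>
> A self-contained sketch of the idea (these are not Phoenix's actual data 
> structures, just the lookup they would enable): keep the guideposts as a 
> sorted list of row-key byte arrays and locate the first guidepost at or after 
> a scan's start key with a binary search instead of a linear walk.
> {code:java}
> import java.util.Arrays;
> import java.util.Collections;
> import java.util.Comparator;
> import java.util.List;
>
> // Sketch only: guideposts held as a sorted list of row-key byte[]s; the first
> // guidepost >= the scan start key is found in O(log n) with a binary search.
> public class GuidepostBinarySearchSketch {
>
>     // Unsigned lexicographic byte[] comparison, matching HBase row-key ordering.
>     static final Comparator<byte[]> KEY_ORDER = (a, b) -> {
>         int len = Math.min(a.length, b.length);
>         for (int i = 0; i < len; i++) {
>             int cmp = (a[i] & 0xff) - (b[i] & 0xff);
>             if (cmp != 0) return cmp;
>         }
>         return a.length - b.length;
>     };
>
>     /** Returns the index of the first guidepost whose key is >= startKey. */
>     static int firstGuidepostAtOrAfter(List<byte[]> sortedGuideposts, byte[] startKey) {
>         int idx = Collections.binarySearch(sortedGuideposts, startKey, KEY_ORDER);
>         // binarySearch returns (-(insertionPoint) - 1) when the key is absent.
>         return idx >= 0 ? idx : -(idx + 1);
>     }
>
>     public static void main(String[] args) {
>         List<byte[]> guideposts = Arrays.asList(
>                 new byte[] { 0x01 }, new byte[] { 0x05 }, new byte[] { 0x09 });
>         System.out.println(firstGuidepostAtOrAfter(guideposts, new byte[] { 0x04 })); // prints 1
>     }
> }
> {code}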



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-17 Thread Josh Elser
On Mon, Sep 17, 2018 at 9:36 AM zhang yun  wrote:

> Sorry for replying late. I attended HBaseCon Asia as a speaker and took
> some notes. I think Phoenix's pains are the following:
>
> 1. The thick client isn't as popular as the thin client. For some
> applications: 1. Users need to spend a lot of time resolving the
> dependencies. 2. Users worry about the stability of having some calculation
> operators processed within the thick client. 3. Some people want clients in
> multiple programming languages, such as Go, .Net and Python, etc. Other
> benefits: 1. Easy to add an SQL audit function. 2. Recognise invalid SQL and
> report it to the user... As you said, this is definitely a big issue which
> is worth paying more attention to. However, the thick client has some
> problems; here is recent test data about performance:
>
>
I don't understand what performance issues you think exist based solely on
the above. Those numbers appear to be precisely in line with my
expectations. Can you please describe what issues you think exist?


> 2. Actually, Phoenix has a higher barrier to entry for beginners than a
> common RDBMS; users need to learn HBase before using Phoenix. Most people
> don't know how to use it reasonably, so we need more detailed documentation
> to make Phoenix easier to use.
>

Please be more specific. Asking for "more documentation" doesn't help us
actually turn this around into more documentation. What are the specific
pain points you have experienced? What topics do you want to know more
about? Be as specific as possible.


> 3. HBase 3.0 has a plan for native SQL; does Phoenix have a plan? Also, many
> people don't know that HBase has a SQL layer called Phoenix, so can we put a
> link on the HBase website?
>

Uh, I have no idea what you're referring to here about "native SQL". I am
not aware of any such effort that does this solely inside of HBase, nor
does it seem in line with HBase's "do one thing well" mantra.

Are you referring to the hbase-spark (and thus, Spark SQL) integration? Or
something that some company is building?

How about submitting a patch to HBase to modify
https://hbase.apache.org/poweredbyhbase.html? :)


>
> On 2018/08/27 18:03:30, Josh Elser  wrote:
> > (bcc: dev@hbase, in case folks there have been waiting for me to send >
> > this email to dev@phoenix)>
> >
> > Hi,>
> >
> > In case you missed it, there was an HBaseCon event held in Asia >
> > recently. Stack took some great notes and shared them with the HBase >
> > community. A few of them touched on Phoenix, directly or in a related >
> > manner. I think they are good "criticisms" that are beneficial for us to
> >
> > hear.>
> >
> > 1. The phoenix-$version-client.jar size is prohibitively large>
> >
> > In this day and age, I'm surprised that this is a big issue for people.
> >
> > I know we have a lot of cruft, most of which comes from Hadoop. We have >
> > gotten better here over recent releases, but I would guess that there is
> >
> > more we can do.>
> >
> > 2. Can Phoenix be the de-facto schema for SQL on HBase?>
> >
> > We've long asserted "if you have to ask how Phoenix serializes data, you
> >
> > shouldn't be doing it" (a nod that you have to write lots of code). What if
> >
> > we turn that on its head? Could we extract our PDataType serialization,
> >
> > composite row-key, column encoding, etc into a minimal API that folks >
> > with their own itches can use?>
> >
> > With the growing integrations into Phoenix, we could embrace them by >
> > providing an API to make what they're doing easier. In the same vein, we
> >
> > cement ourselves as a cornerstone of doing it "correctly".>
> >
> > 3. Better recommendations to users to not attempt certain queries.>
> >
> > We definitively know that there are certain types of queries that >
> > Phoenix cannot support well (compared to optimal Phoenix use-cases). >
> > Users very commonly fall into such pitfalls on their own and this leaves
> >
> > a bad taste in their mouth (thinking that the product "stinks").>
> >
> > Can we do a better job of telling the user when and why it happened? >
> > What would such a user-interaction model look like? Can we supplement >
> > the "why" with instructions of what to do differently (even if in the >
> > abstract)?>
> >
> > 4. Phoenix-Calcite>
> >
> > This was mentioned as a "nice to have". From what I understand, there >
> > was nothing explicitly wrong with the implementation or approach, just >
> > that it was a massive undertaking to continue with little immediate >
> > gain. Would this be a boon for us to try to continue in some form? Are >
> > there steps we can take that would help push us along the right path?>
> >
> > Anyways, I'd love to hear everyone's thoughts. While the concerns were >
> > raised at HBaseCon Asia, the suggestions that accompany them here are >
> > largely mine ;). Feel free to break them out into their own threads if >
> > you think that would be better (or say that you disagree with me -- >
> > that's cool too)!>
> >
> > - Josh>
> >
>


Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-17 Thread Josh Elser
Maybe an implementation detail, but I'm a fan of having a devoted Maven 
module for the "client-facing" API as opposed to an annotation-based 
approach. I find a separate module helps to catch problematic API design 
faster, and makes it crystal clear what users should (and should not) be 
relying upon.
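
To make the contrast concrete, the annotation-based alternative (as HBase does 
with the Apache Yetus audience annotations) marks individual classes rather 
than isolating them in a dedicated module; roughly:

{code:java}
import org.apache.yetus.audience.InterfaceAudience;
import org.apache.yetus.audience.InterfaceStability;

// Sketch of the annotation-based approach: audience/stability markers on each
// class rather than a dedicated client-api Maven module. The interface itself
// is hypothetical, just to show where the annotations would go.
@InterfaceAudience.Public
@InterfaceStability.Evolving
public interface RowKeySerializer {
    byte[] toRowKey(Object... primaryKeyValues);
}
{code}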


On 9/17/18 1:00 AM, la...@apache.org wrote:

  I think we can start by implementing a tighter integration with Spark through 
DataSource V2. That would make it quickly apparent what parts of Phoenix would 
need direct access.
Some parts just need an interface audience declaration (like Phoenix's basic 
type system) and our agreement that we will change those only according to 
semantic versioning. Other parts (like the query plan) will need a bit more 
thinking. Maybe that's the path to hook Calcite - just making that part up as I 
write this...
Perhaps turning the HBase interface into an API might not be so difficult 
either. That would perhaps be a new client - strictly additional - client API.

A good Spark interface is in everybody's interest and I think is the best 
avenue to figure out what's missing/needed.
-- Lars

 On Wednesday, September 12, 2018, 12:47:21 PM PDT, Josh Elser 
 wrote:
  
  I like it, Lars. I like it very much.


Just the easy part of doing it... ;)

On 9/11/18 4:53 PM, la...@apache.org wrote:

   Sorry for coming a bit late to this. I've been thinking along some of these 
lines for a bit.
It seems Phoenix serves 4 distinct purposes:
1. Query parsing and compiling
2. A type system
3. Query execution
4. An efficient HBase interface
Each of these is useful by itself, but we do not expose these as stable 
interfaces. We have seen a lot of need to tie HBase into "higher level" 
services, such as Spark (and Presto, etc).
I think we can get a long way if we separate at least #1 (SQL) from the rest 
#2, #3, and #4 (Typed HBase Interface - THI).
Phoenix is used via SQL (#1), other tools such as Presto, Impala, Drill, Spark, 
etc, can interface efficiently with HBase via THI (#2, #3, and #4).
Thoughts?
-- Lars
       On Monday, August 27, 2018, 11:03:33 AM PDT, Josh Elser 
 wrote:
   
   (bcc: dev@hbase, in case folks there have been waiting for me to send

this email to dev@phoenix)

Hi,

In case you missed it, there was an HBaseCon event held in Asia
recently. Stack took some great notes and shared them with the HBase
community. A few of them touched on Phoenix, directly or in a related
manner. I think they are good "criticisms" that are beneficial for us to
hear.

1. The phoenix-$version-client.jar size is prohibitively large

In this day and age, I'm surprised that this is a big issue for people.
I know we have a lot of cruft, most of which comes from Hadoop. We have
gotten better here over recent releases, but I would guess that there is
more we can do.

2. Can Phoenix be the de-facto schema for SQL on HBase?

We've long asserted "if you have to ask how Phoenix serializes data, you
shouldn't be doing it" (a nod that you have to write lots of code). What if
we turn that on its head? Could we extract our PDataType serialization,
composite row-key, column encoding, etc into a minimal API that folks
with their own itches can use?

With the growing integrations into Phoenix, we could embrace them by
providing an API to make what they're doing easier. In the same vein, we
cement ourselves as a cornerstone of doing it "correctly".
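
As a rough illustration of what a minimal extracted serialization API could 
build on, Phoenix's existing PDataType classes already expose value 
serialization (assuming phoenix-core is on the classpath; the naive key 
concatenation below is just a placeholder, not Phoenix's actual row-key 
layout):

{code:java}
import org.apache.phoenix.schema.types.PInteger;
import org.apache.phoenix.schema.types.PVarchar;

// Illustration only: serialize individual values with Phoenix's PDataType
// implementations. Real composite row keys also handle separators, sort order,
// and salting, which this naive concatenation ignores.
public class PhoenixTypeSketch {
    public static void main(String[] args) {
        byte[] name = PVarchar.INSTANCE.toBytes("hbasecon");
        byte[] year = PInteger.INSTANCE.toBytes(2018);

        byte[] naiveKey = new byte[name.length + year.length];
        System.arraycopy(name, 0, naiveKey, 0, name.length);
        System.arraycopy(year, 0, naiveKey, name.length, year.length);
        System.out.println("naive key length = " + naiveKey.length);
    }
}
{code}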

3. Better recommendations to users to not attempt certain queries.

We definitively know that there are certain types of queries that
Phoenix cannot support well (compared to optimal Phoenix use-cases).
Users very commonly fall into such pitfalls on their own and this leaves
a bad taste in their mouth (thinking that the product "stinks").

Can we do a better job of telling the user when and why it happened?
What would such a user-interaction model look like? Can we supplement
the "why" with instructions of what to do differently (even if in the
abstract)?

4. Phoenix-Calcite

This was mentioned as a "nice to have". From what I understand, there
was nothing explicitly wrong with the implementation or approach, just
that it was a massive undertaking to continue with little immediate
gain. Would this be a boon for us to try to continue in some form? Are
there steps we can take that would help push us along the right path?

Anyways, I'd love to hear everyone's thoughts. While the concerns were
raised at HBaseCon Asia, the suggestions that accompany them here are
largely mine ;). Feel free to break them out into their own threads if
you think that would be better (or say that you disagree with me --
that's cool too)!

- Josh
 

   



Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-17 Thread zhang yun
Sorry for replying late. I attended HBaseCon Asia as a speaker and took some 
notes. I think Phoenix's pains are the following:

1. The thick client isn't as popular as the thin client. For some applications: 
1. Users need to spend a lot of time resolving the dependencies. 2. Users worry 
about the stability of having some calculation operators processed within the 
thick client. 3. Some people want clients in multiple programming languages, 
such as Go, .Net and Python, etc. Other benefits: 1. Easy to add an SQL audit 
function. 2. Recognise invalid SQL and report it to the user... As you said, 
this is definitely a big issue which is worth paying more attention to. 
However, the thick client has some problems; here is recent test data about 
performance:


2. Actually, Phoenix has a higher barrier to entry for beginners than a common 
RDBMS; users need to learn HBase before using Phoenix. Most people don't know 
how to use it reasonably, so we need more detailed documentation to make 
Phoenix easier to use.

3. HBase 3.0 has a plan for native SQL; does Phoenix have a plan? Also, many 
people don't know that HBase has a SQL layer called Phoenix, so can we put a 
link on the HBase website?


On 2018/08/27 18:03:30, Josh Elser  wrote: 
> (bcc: dev@hbase, in case folks there have been waiting for me to send > 
> this email to dev@phoenix)> 
> 
> Hi,> 
> 
> In case you missed it, there was an HBaseCon event held in Asia > 
> recently. Stack took some great notes and shared them with the HBase > 
> community. A few of them touched on Phoenix, directly or in a related > 
> manner. I think they are good "criticisms" that are beneficial for us to > 
> hear.> 
> 
> 1. The phoenix-$version-client.jar size is prohibitively large> 
> 
> In this day and age, I'm surprised that this is a big issue for people. > 
> I know we have a lot of cruft, most of which comes from Hadoop. We have > 
> gotten better here over recent releases, but I would guess that there is > 
> more we can do.> 
> 
> 2. Can Phoenix be the de-facto schema for SQL on HBase?> 
> 
> We've long asserted "if you have to ask how Phoenix serializes data, you > 
> shouldn't be doing it" (a nod that you have to write lots of code). What if > 
> we turn that on its head? Could we extract our PDataType serialization, > 
> composite row-key, column encoding, etc into a minimal API that folks > 
> with their own itches can use?> 
> 
> With the growing integrations into Phoenix, we could embrace them by > 
> providing an API to make what they're doing easier. In the same vein, we > 
> cement ourselves as a cornerstone of doing it "correctly".> 
> 
> 3. Better recommendations to users to not attempt certain queries.> 
> 
> We definitively know that there are certain types of queries that > 
> Phoenix cannot support well (compared to optimal Phoenix use-cases). > 
> Users very commonly fall into such pitfalls on their own and this leaves > 
> a bad taste in their mouth (thinking that the product "stinks").> 
> 
> Can we do a better job of telling the user when and why it happened? > 
> What would such a user-interaction model look like? Can we supplement > 
> the "why" with instructions of what to do differently (even if in the > 
> abstract)?> 
> 
> 4. Phoenix-Calcite> 
> 
> This was mentioned as a "nice to have". From what I understand, there > 
> was nothing explicitly wrong with the implementation or approach, just > 
> that it was a massive undertaking to continue with little immediate > 
> gain. Would this be a boon for us to try to continue in some form? Are > 
> there steps we can take that would help push us along the right path?> 
> 
> Anyways, I'd love to hear everyone's thoughts. While the concerns were > 
> raised at HBaseCon Asia, the suggestions that accompany them here are > 
> largely mine ;). Feel free to break them out into their own threads if > 
> you think that would be better (or say that you disagree with me -- > 
> that's cool too)!> 
> 
> - Josh> 
> 

[jira] [Updated] (PHOENIX-4906) Abnormal query result due to Phoenix plan error

2018-09-17 Thread JeongMin Ju (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JeongMin Ju updated PHOENIX-4906:
-
Description: 
For a salted table, when a query targets the entire data set, a different plan 
is created depending on the form of the query, and as a result erroneous data 
is returned.

{code:java}
// Actually, the schema of the table I used is different, but please ignore it.
create table if not exists test.test_table (
  rk1 varchar not null,
  rk2 varchar not null,
  column1 varchar,
  constraint pk primary key (rk1, rk2)
)
...
SALT_BUCKETS=16...
;
{code}
 

I created a table with 16 salting regions and then wrote a lot of data.
 HBase automatically split the regions, and I merged regions to balance data 
between the region servers.

Then, when running the queries, you can see that a different plan is created 
depending on the WHERE clause.
 * query1
 select count\(*) from test.test_table;
{code:java}
+---+-++
| PLAN | EST_BYTES_READ | EST_ROWS_READ |
+---+-++
| CLIENT 1851-CHUNK 5005959292 ROWS 1944546675532 BYTES PARALLEL 11-WAY FULL SCAN OVER TEST:TEST_TABLE | 1944546675532 | 5005959292 |
| SERVER FILTER BY FIRST KEY ONLY | 1944546675532 | 5005959292 |
| SERVER AGGREGATE INTO SINGLE ROW | 1944546675532 | 5005959292 |
+---+-++

{code}

 * query2
 select count\(*) from test.test_table where rk2 = 'aa';
{code}
+---+-++
| PLAN | EST_BYTES_READ | EST_ROWS_READ |
+---+-++
| CLIENT 1846-CHUNK 4992196444 ROWS 1939177965768 BYTES PARALLEL 11-WAY RANGE SCAN OVER TEST:TEST_TABLE [0] - [15] | 1939177965768 | 4992196444 |
| SERVER FILTER BY FIRST KEY ONLY AND RK2 = 'aa' | 1939177965768 | 4992196444 |
| SERVER AGGREGATE INTO SINGLE ROW | 1939177965768 | 4992196444 |
+---+-++
{code}

Since rk2, used in the WHERE clause of query2, is the second column of the PK, 
it should be a full scan like query1.
However, as you can see, query2 is planned as a range scan, and it also 
generates five fewer chunks than query1.
I added logging and printed the start key and end key of each Scan object 
generated by the plan.
And I found the 5 chunks missing from query2.

All five missing chunks were found in regions where the original region 
boundaries were not preserved through the merge operation.
!initial_salting_region.png!

After merging regions
!merged-region.png!

The code that caused the problem is this part.

 When a select query is executed, the 
[org.apache.phoenix.iterate.BaseResultIterators#getParallelScans|https://github.com/apache/phoenix/blob/v4.11.0-HBase-1.2/phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java#L743-L744]
 method creates Scan objects based on the guideposts in the statistics table. 
When a guidepost range contains a region boundary, it is split into two Scan 
objects. The code used here is 
[org.apache.phoenix.compile.ScanRanges#intersectScan|https://github.com/apache/phoenix/blob/v4.11.0-HBase-1.2/phoenix-core/src/main/java/org/apache/phoenix/compile/ScanRanges.java#L299-L303].

!ScanRanges_intersectScan.png!

For a salted table, the code compares keys using only the remainder after 
stripping the salt (prefix) bytes.
I cannot be sure whether this code is buggy or intended.
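
A self-contained illustration of the effect described above (this is not 
Phoenix code): once a merged region spans more than one salt bucket, comparing 
keys only by the part after the salt byte can order a guidepost and a region 
boundary differently than a comparison on the full key would.

{code:java}
// Illustration only, not Phoenix code: comparing from offset 0 (full key) versus
// offset 1 (salt byte stripped) flips the ordering of this guidepost and region
// boundary, which is the kind of mismatch that can drop chunks.
public class SaltedKeyCompareSketch {
    static int compare(byte[] a, byte[] b, int offset) {
        int len = Math.min(a.length, b.length);
        for (int i = offset; i < len; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] guidepost   = { 0x02, 'k', '9' };  // salt bucket 2, row "k9"
        byte[] regionStart = { 0x03, 'k', '1' };  // boundary inside a merged region, bucket 3
        System.out.println("full-key compare:      " + compare(guidepost, regionStart, 0)); // negative
        System.out.println("salt-stripped compare: " + compare(guidepost, regionStart, 1)); // positive
    }
}
{code}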

In this case I merged the regions directly, but the same situation is likely to 
occur through HBase's Normalizer function.

I hope other users do not merge regions manually or set the table property 
NORMALIZATION_ENABLED to true in their production cluster. If so, check to see 
if the initial 

[jira] [Updated] (PHOENIX-4906) Abnormal query result due to Phoenix plan error

2018-09-17 Thread JeongMin Ju (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JeongMin Ju updated PHOENIX-4906:
-
Attachment: merged-region.png

> Abnormal query result due to Phoenix plan error
> ---
>
> Key: PHOENIX-4906
> URL: https://issues.apache.org/jira/browse/PHOENIX-4906
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.11.0, 4.14.0
>Reporter: JeongMin Ju
>Priority: Critical
> Attachments: ScanRanges_intersectScan.png, 
> initial_salting_region.png, merged-region.png
>
>
> For a salted table, when a query targets the entire data set, a different 
> plan is created depending on the form of the query, and as a result 
> erroneous data is returned.
> {code:java}
> // Actually, the schema of the table I used is different, but please ignore it.
> create table if not exists test.test_table (
>   rk1 varchar not null,
>   rk2 varchar not null,
>   column1 varchar,
>   constraint pk primary key (rk1, rk2)
> )
> ...
> SALT_BUCKETS=16...
> ;
> {code}
>  
> I created a table with 16 salting regions and then wrote a lot of data.
>  HBase automatically split the regions, and I merged regions to balance data 
> between the region servers.
> Then, when running the queries, you can see that a different plan is created 
> depending on the WHERE clause.
>  * query1
>  select count\(*) from test.test_table;
> {code:java}
> +---+-++
> | PLAN | EST_BYTES_READ | EST_ROWS_READ |
> +---+-++
> | CLIENT 1851-CHUNK 5005959292 ROWS 1944546675532 BYTES PARALLEL 11-WAY FULL SCAN OVER TEST:TEST_TABLE | 1944546675532 | 5005959292 |
> | SERVER FILTER BY FIRST KEY ONLY | 1944546675532 | 5005959292 |
> | SERVER AGGREGATE INTO SINGLE ROW | 1944546675532 | 5005959292 |
> +---+-++
> {code}
>  * query2
>  select count\(*) from test.test_table where rk2 = 'aa';
> {code}
> +---+-++
> | PLAN | EST_BYTES_READ | EST_ROWS_READ |
> +---+-++
> | CLIENT 1846-CHUNK 4992196444 ROWS 1939177965768 BYTES PARALLEL 11-WAY RANGE SCAN OVER TEST:TEST_TABLE [0] - [15] | 1939177965768 | 4992196444 |
> | SERVER FILTER BY FIRST KEY ONLY AND RK2 = 'aa' | 1939177965768 | 4992196444 |
> | SERVER AGGREGATE INTO SINGLE ROW | 1939177965768 | 4992196444 |
> +---+-++
> {code}
> Since rk2, used in the WHERE clause of query2, is the second column of the 
> PK, it should be a full scan like query1.
> However, as you can see, query2 is planned as a range scan, and it also 
> generates five fewer chunks than query1.
> I added logging and printed the start key and end key of each Scan object 
> generated by the plan.
> And I found the 5 chunks missing from query2.
> All five missing chunks were found in regions where the original region 
> boundaries were not preserved through the merge operation.
> The code that caused the problem is this part.
>  When a select query is executed, the 
> [org.apache.phoenix.iterate.BaseResultIterators#getParallelScans|https://github.com/apache/phoenix/blob/v4.11.0-HBase-1.2/phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java#L743-L744]
>  method creates Scan objects based on the guideposts in the statistics 
> table. When a guidepost range contains a region boundary, it is split into 
> two Scan objects. The code used here is 
> 

[jira] [Updated] (PHOENIX-4906) Abnormal query result due to Phoenix plan error

2018-09-17 Thread JeongMin Ju (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JeongMin Ju updated PHOENIX-4906:
-
Attachment: initial_salting_region.png

> Abnormal query result due to Phoenix plan error
> ---
>
> Key: PHOENIX-4906
> URL: https://issues.apache.org/jira/browse/PHOENIX-4906
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.11.0, 4.14.0
>Reporter: JeongMin Ju
>Priority: Critical
> Attachments: ScanRanges_intersectScan.png, 
> initial_salting_region.png, merged-region.png
>
>
> For a salted table, when a query targets the entire data set, a different 
> plan is created depending on the form of the query, and as a result 
> erroneous data is returned.
> {code:java}
> // Actually, the schema of the table I used is different, but please ignore it.
> create table if not exists test.test_table (
>   rk1 varchar not null,
>   rk2 varchar not null,
>   column1 varchar,
>   constraint pk primary key (rk1, rk2)
> )
> ...
> SALT_BUCKETS=16...
> ;
> {code}
>  
> I created a table with 16 salting regions and then wrote a lot of data.
>  HBase automatically split the regions, and I merged regions to balance data 
> between the region servers.
> Then, when running the queries, you can see that a different plan is created 
> depending on the WHERE clause.
>  * query1
>  select count\(*) from test.test_table;
> {code:java}
> +---+-++
> | PLAN | EST_BYTES_READ | EST_ROWS_READ |
> +---+-++
> | CLIENT 1851-CHUNK 5005959292 ROWS 1944546675532 BYTES PARALLEL 11-WAY FULL SCAN OVER TEST:TEST_TABLE | 1944546675532 | 5005959292 |
> | SERVER FILTER BY FIRST KEY ONLY | 1944546675532 | 5005959292 |
> | SERVER AGGREGATE INTO SINGLE ROW | 1944546675532 | 5005959292 |
> +---+-++
> {code}
>  * query2
>  select count\(*) from test.test_table where rk2 = 'aa';
> {code}
> +---+-++
> | PLAN | EST_BYTES_READ | EST_ROWS_READ |
> +---+-++
> | CLIENT 1846-CHUNK 4992196444 ROWS 1939177965768 BYTES PARALLEL 11-WAY RANGE SCAN OVER TEST:TEST_TABLE [0] - [15] | 1939177965768 | 4992196444 |
> | SERVER FILTER BY FIRST KEY ONLY AND RK2 = 'aa' | 1939177965768 | 4992196444 |
> | SERVER AGGREGATE INTO SINGLE ROW | 1939177965768 | 4992196444 |
> +---+-++
> {code}
> Since rk2, used in the WHERE clause of query2, is the second column of the 
> PK, it should be a full scan like query1.
> However, as you can see, query2 is planned as a range scan, and it also 
> generates five fewer chunks than query1.
> I added logging and printed the start key and end key of each Scan object 
> generated by the plan.
> And I found the 5 chunks missing from query2.
> All five missing chunks were found in regions where the original region 
> boundaries were not preserved through the merge operation.
> The code that caused the problem is this part.
>  When a select query is executed, the 
> [org.apache.phoenix.iterate.BaseResultIterators#getParallelScans|https://github.com/apache/phoenix/blob/v4.11.0-HBase-1.2/phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java#L743-L744]
>  method creates Scan objects based on the guideposts in the statistics 
> table. When a guidepost range contains a region boundary, it is split into 
> two Scan objects. The code used here is 
> 

[jira] [Updated] (PHOENIX-4906) Abnormal query result due to Phoenix plan error

2018-09-17 Thread JeongMin Ju (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JeongMin Ju updated PHOENIX-4906:
-
Attachment: ScanRanges_intersectScan.png

> Abnormal query result due to Phoenix plan error
> ---
>
> Key: PHOENIX-4906
> URL: https://issues.apache.org/jira/browse/PHOENIX-4906
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.11.0, 4.14.0
>Reporter: JeongMin Ju
>Priority: Critical
> Attachments: ScanRanges_intersectScan.png
>
>
> For a salted table, when a query targets the entire data set, a different 
> plan is created depending on the form of the query, and as a result 
> erroneous data is returned.
> {code:java}
> // Actually, the schema of the table I used is different, but please ignore it.
> create table if not exists test.test_table (
>   rk1 varchar not null,
>   rk2 varchar not null,
>   column1 varchar,
>   constraint pk primary key (rk1, rk2)
> )
> ...
> SALT_BUCKETS=16...
> ;
> {code}
>  
> I created a table with 16 salting regions and then wrote a lot of data.
>  HBase automatically split the regions, and I merged regions to balance data 
> between the region servers.
> Then, when running the queries, you can see that a different plan is created 
> depending on the WHERE clause.
>  * query1
>  select count\(*) from test.test_table;
> {code:java}
> +---+-++
> | PLAN | EST_BYTES_READ | EST_ROWS_READ |
> +---+-++
> | CLIENT 1851-CHUNK 5005959292 ROWS 1944546675532 BYTES PARALLEL 11-WAY FULL SCAN OVER TEST:TEST_TABLE | 1944546675532 | 5005959292 |
> | SERVER FILTER BY FIRST KEY ONLY | 1944546675532 | 5005959292 |
> | SERVER AGGREGATE INTO SINGLE ROW | 1944546675532 | 5005959292 |
> +---+-++
> {code}
>  * query2
>  select count\(*) from test.test_table where rk2 = 'aa';
> {code}
> +---+-++
> | PLAN | EST_BYTES_READ | EST_ROWS_READ |
> +---+-++
> | CLIENT 1846-CHUNK 4992196444 ROWS 1939177965768 BYTES PARALLEL 11-WAY RANGE SCAN OVER TEST:TEST_TABLE [0] - [15] | 1939177965768 | 4992196444 |
> | SERVER FILTER BY FIRST KEY ONLY AND RK2 = 'aa' | 1939177965768 | 4992196444 |
> | SERVER AGGREGATE INTO SINGLE ROW | 1939177965768 | 4992196444 |
> +---+-++
> {code}
> Since rk2, used in the WHERE clause of query2, is the second column of the 
> PK, it should be a full scan like query1.
> However, as you can see, query2 is planned as a range scan, and it also 
> generates five fewer chunks than query1.
> I added logging and printed the start key and end key of each Scan object 
> generated by the plan.
> And I found the 5 chunks missing from query2.
> All five missing chunks were found in regions where the original region 
> boundaries were not preserved through the merge operation.
> The code that caused the problem is this part.
>  When a select query is executed, the 
> [org.apache.phoenix.iterate.BaseResultIterators#getParallelScans|https://github.com/apache/phoenix/blob/v4.11.0-HBase-1.2/phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java#L743-L744]
>  method creates Scan objects based on the guideposts in the statistics 
> table. When a guidepost range contains a region boundary, it is split into 
> two Scan objects. The code used here is 
> 

[jira] [Created] (PHOENIX-4906) Abnormal query result due to Phoenix plan error

2018-09-17 Thread JeongMin Ju (JIRA)
JeongMin Ju created PHOENIX-4906:


 Summary: Abnormal query result due to Phoenix plan error
 Key: PHOENIX-4906
 URL: https://issues.apache.org/jira/browse/PHOENIX-4906
 Project: Phoenix
  Issue Type: Bug
Affects Versions: 4.14.0, 4.11.0
Reporter: JeongMin Ju


For a salted table, when a query targets the entire data set, a different plan 
is created depending on the form of the query, and as a result erroneous data 
is returned.

{code:java}
// Actually, the schema of the table I used is different, but please ignore it.
create table if not exists test.test_table (
  rk1 varchar not null,
  rk2 varchar not null,
  column1 varchar,
  constraint pk primary key (rk1, rk2)
)
...
SALT_BUCKETS=16...
;
{code}
 

I created a table with 16 salting regions and then wrote a lot of data.
 HBase automatically split the regions, and I merged regions to balance data 
between the region servers.

Then, when running the queries, you can see that a different plan is created 
depending on the WHERE clause.
 * query1
 select count\(*) from test.test_table;
{code:java}
+---+-++
| PLAN | EST_BYTES_READ | EST_ROWS_READ |
+---+-++
| CLIENT 1851-CHUNK 5005959292 ROWS 1944546675532 BYTES PARALLEL 11-WAY FULL SCAN OVER TEST:TEST_TABLE | 1944546675532 | 5005959292 |
| SERVER FILTER BY FIRST KEY ONLY | 1944546675532 | 5005959292 |
| SERVER AGGREGATE INTO SINGLE ROW | 1944546675532 | 5005959292 |
+---+-++

{code}

 * query2
 select count\(*) from test.test_table where rk2 = 'aa';
{code}
+---+-++
| PLAN | EST_BYTES_READ | EST_ROWS_READ |
+---+-++
| CLIENT 1846-CHUNK 4992196444 ROWS 1939177965768 BYTES PARALLEL 11-WAY RANGE SCAN OVER TEST:TEST_TABLE [0] - [15] | 1939177965768 | 4992196444 |
| SERVER FILTER BY FIRST KEY ONLY AND RK2 = 'aa' | 1939177965768 | 4992196444 |
| SERVER AGGREGATE INTO SINGLE ROW | 1939177965768 | 4992196444 |
+---+-++
{code}

Since rk2, used in the WHERE clause of query2, is the second column of the PK, 
it should be a full scan like query1.
However, as you can see, query2 is planned as a range scan, and it also 
generates five fewer chunks than query1.
I added logging and printed the start key and end key of each Scan object 
generated by the plan.
And I found the 5 chunks missing from query2.

All five missing chunks were found in regions where the original region 
boundaries were not preserved through the merge operation.
The code that caused the problem is this part.

 When a select query is executed, the 
[org.apache.phoenix.iterate.BaseResultIterators#getParallelScans|https://github.com/apache/phoenix/blob/v4.11.0-HBase-1.2/phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java#L743-L744]
 method creates Scan objects based on the guideposts in the statistics table. 
When a guidepost range contains a region boundary, it is split into two Scan 
objects. The code used here is 
[org.apache.phoenix.compile.ScanRanges#intersectScan|https://github.com/apache/phoenix/blob/v4.11.0-HBase-1.2/phoenix-core/src/main/java/org/apache/phoenix/compile/ScanRanges.java#L299-L303].

For a salted table, the code compares keys using only the remainder after 
stripping the salt (prefix) bytes.
I cannot be sure whether this code is buggy or intended.

In this case I merged the regions directly, but the same situation is likely to 
occur through HBase's Normalizer function.

I hope other users do not merge regions manually or set the table property 
NORMALIZATION_ENABLED to true in their