[jira] [Updated] (PHOENIX-5776) Phoenix pherf unit tests failing

2020-03-12 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam updated PHOENIX-5776:
--
Attachment: PHOENIX-5776.master.v1.patch

> Phoenix pherf unit tests failing
> 
>
> Key: PHOENIX-5776
> URL: https://issues.apache.org/jira/browse/PHOENIX-5776
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 5.1.0, 4.16.0
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
> Attachments: PHOENIX-5776.master.v1.patch
>
>
> Phoenix Pherf unit tests ResourceTest and PherfTest are failing with "could 
> not load resources /datamodel/query_more.sql"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5753) Fix erroneous query result when RVC is clipped with desc column

2020-03-12 Thread chenglei (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenglei updated PHOENIX-5753:
--
Attachment: PHOENIX-5753_v3-4.x-HBase-1.4.patch

> Fix erroneous query result when RVC is clipped with desc column
> ---
>
> Key: PHOENIX-5753
> URL: https://issues.apache.org/jira/browse/PHOENIX-5753
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 5.0.0, 4.15.0
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
>  Labels: DESC
> Fix For: 4.16.0
>
> Attachments: PHOENIX-5753_v2-4.x-HBase-1.4.patch, 
> PHOENIX-5753_v3-4.x-HBase-1.4.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Given the following table and data:
> {code:java}
>    CREATE TABLE test
>    (
>        pk1 INTEGER NOT NULL,
>        pk2 INTEGER NOT NULL,
>        pk3 INTEGER NOT NULL,
>        pk4 INTEGER NOT NULL,
>        v INTEGER, CONSTRAINT PK PRIMARY KEY(pk1,pk2,pk3 desc,pk4)
>    )
> {code}
> Notice that pk3 is DESC.
> {code:java}
>    UPSERT INTO test (pk1, pk2, pk3, pk4, v) VALUES (1,3,4,10,1)
> {code}
> If we execute the following SQL:
> {code:java}
>    select * from test
>    where (pk1 >=1 and pk1<=2) and (pk2>=3 and pk2<=4) and (pk3,pk4) < (5,7)
> {code}
> the returned result is empty, although the row (1,3,4,10,1) inserted above should
> obviously be returned.
> I think this problem was introduced by PHOENIX-3383 and PHOENIX-4841: because
> {{pk3}} is {{DESC}}, when we clip {{(pk3,pk4) < (5,7)}} at line 260 of
> {{WhereOptimizer.pushKeyExpressionsToScan}} (shown below), {{(pk3,pk4) < (5,7)}} is
> clipped to {{pk3 <= 5}} and {{pk4 < 7}}.
> {code:java}
> 257                List leftRanges = clipLeft(schema, slot.getPKPosition()
> 258                        + slotOffset - clipLeftSpan, clipLeftSpan, keyRanges, ptr);
> 259                keyRanges =
> 260                        clipRight(schema, slot.getPKPosition() + slotOffset - 1, keyRanges,
> 261                            leftRanges, ptr);
> 262                if (prevSortOrder == SortOrder.DESC) {
> 263                    leftRanges = invertKeyRanges(leftRanges);
> 264                }
> 265                slotSpanArray[cnf.size()] = clipLeftSpan-1;
> 266                cnf.add(leftRanges);
> 267                clipLeftSpan = 0;
> 268                prevSortOrder = sortOrder;
> 269                // since we have to clip the portion with the same sort order, we can no longer
> 270                // extract the nodes from the where clause
> 271                // for eg. for the schema A VARCHAR DESC, B VARCHAR ASC and query
> 272                //   WHERE (A,B) < ('a','b')
> 273                // the range (* - a\xFFb) is converted to (~a-*)(*-b)
> 274                // so we still need to filter on A,B
> 275                stopExtracting = true;
> 276            }
> {code}
> Eventually, after {{WhereOptimizer.pushKeyExpressionsToScan}} completes, the resulting
> {{ScanRanges.ranges}} is [[[1 - 2]], [[3 - 4]], [[~5 - *)], [(* - 7)]],
> {{ScanRanges.useSkipScanFilter}} is {{true}}, and the {{SkipScanFilter}} is also
> [[[1 - 2]], [[3 - 4]], [[~5 - *)], [(* - 7)]], so the above inserted row
> (1,3,4,10,1) cannot be retrieved.
> But {{(pk3,pk4) < (5,7)}} is not semantically equal to {{pk3 <= 5}} and {{pk4 < 7}};
> we can only derive {{pk3 <= 5}}, not {{pk4 < 7}}. So when we clip {{(pk3,pk4) < (5,7)}}
> to {{pk3 <= 5}}, we should simply skip the remaining columns of this RVC.
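> A minimal sketch in plain Java (an illustration, not Phoenix code) showing why the
> clipped form drops the inserted row even though the original RVC predicate accepts it:
> {code:java}
> public class RvcClipExample {
>     public static void main(String[] args) {
>         int pk3 = 4, pk4 = 10;   // pk3/pk4 values of the row inserted above
>         // Original RVC predicate (pk3,pk4) < (5,7): true, because pk3 < 5 already decides it.
>         boolean rvc = pk3 < 5 || (pk3 == 5 && pk4 < 7);
>         // Clipped form pk3 <= 5 AND pk4 < 7: false, because pk4 = 10 fails pk4 < 7.
>         boolean clipped = pk3 <= 5 && pk4 < 7;
>         System.out.println(rvc + " vs " + clipped);   // prints "true vs false"
>     }
> }
> {code}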



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5753) Fix erroneous query result when RVC is clipped with desc column

2020-03-12 Thread chenglei (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenglei updated PHOENIX-5753:
--
Attachment: (was: PHOENIX-5753_v2-4.x-HBase-1.4.patch)

> Fix erroneous query result when RVC is clipped with desc column
> ---
>
> Key: PHOENIX-5753
> URL: https://issues.apache.org/jira/browse/PHOENIX-5753
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 5.0.0, 4.15.0
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
>  Labels: DESC
> Fix For: 4.16.0
>
> Attachments: PHOENIX-5753_v2-4.x-HBase-1.4.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Given the following table and data:
> {code:java}
>    CREATE TABLE test
>    (
>        pk1 INTEGER NOT NULL,
>        pk2 INTEGER NOT NULL,
>        pk3 INTEGER NOT NULL,
>        pk4 INTEGER NOT NULL,
>        v INTEGER, CONSTRAINT PK PRIMARY KEY(pk1,pk2,pk3 desc,pk4)
>    )
> {code}
> Notice that pk3 is DESC.
> {code:java}
>    UPSERT INTO test (pk1, pk2, pk3, pk4, v) VALUES (1,3,4,10,1)
> {code}
> If we execute the following SQL:
> {code:java}
>    select * from test
>    where (pk1 >=1 and pk1<=2) and (pk2>=3 and pk2<=4) and (pk3,pk4) < (5,7)
> {code}
> the returned result is empty, although the row (1,3,4,10,1) inserted above should
> obviously be returned.
> I think this problem was introduced by PHOENIX-3383 and PHOENIX-4841: because
> {{pk3}} is {{DESC}}, when we clip {{(pk3,pk4) < (5,7)}} at line 260 of
> {{WhereOptimizer.pushKeyExpressionsToScan}} (shown below), {{(pk3,pk4) < (5,7)}} is
> clipped to {{pk3 <= 5}} and {{pk4 < 7}}.
> {code:java}
> 257                List leftRanges = clipLeft(schema, slot.getPKPosition()
> 258                        + slotOffset - clipLeftSpan, clipLeftSpan, keyRanges, ptr);
> 259                keyRanges =
> 260                        clipRight(schema, slot.getPKPosition() + slotOffset - 1, keyRanges,
> 261                            leftRanges, ptr);
> 262                if (prevSortOrder == SortOrder.DESC) {
> 263                    leftRanges = invertKeyRanges(leftRanges);
> 264                }
> 265                slotSpanArray[cnf.size()] = clipLeftSpan-1;
> 266                cnf.add(leftRanges);
> 267                clipLeftSpan = 0;
> 268                prevSortOrder = sortOrder;
> 269                // since we have to clip the portion with the same sort order, we can no longer
> 270                // extract the nodes from the where clause
> 271                // for eg. for the schema A VARCHAR DESC, B VARCHAR ASC and query
> 272                //   WHERE (A,B) < ('a','b')
> 273                // the range (* - a\xFFb) is converted to (~a-*)(*-b)
> 274                // so we still need to filter on A,B
> 275                stopExtracting = true;
> 276            }
> {code}
> Eventually, after {{WhereOptimizer.pushKeyExpressionsToScan}} completes, the resulting
> {{ScanRanges.ranges}} is [[[1 - 2]], [[3 - 4]], [[~5 - *)], [(* - 7)]],
> {{ScanRanges.useSkipScanFilter}} is {{true}}, and the {{SkipScanFilter}} is also
> [[[1 - 2]], [[3 - 4]], [[~5 - *)], [(* - 7)]], so the above inserted row
> (1,3,4,10,1) cannot be retrieved.
> But {{(pk3,pk4) < (5,7)}} is not semantically equal to {{pk3 <= 5}} and {{pk4 < 7}};
> we can only derive {{pk3 <= 5}}, not {{pk4 < 7}}. So when we clip {{(pk3,pk4) < (5,7)}}
> to {{pk3 <= 5}}, we should simply skip the remaining columns of this RVC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5753) Fix erroneous query result when RVC is clipped with desc column

2020-03-12 Thread chenglei (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenglei updated PHOENIX-5753:
--
Attachment: PHOENIX-5753_v2-4.x-HBase-1.4.patch

> Fix erroneous query result when RVC is clipped with desc column
> ---
>
> Key: PHOENIX-5753
> URL: https://issues.apache.org/jira/browse/PHOENIX-5753
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 5.0.0, 4.15.0
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
>  Labels: DESC
> Fix For: 4.16.0
>
> Attachments: PHOENIX-5753_v2-4.x-HBase-1.4.patch, 
> PHOENIX-5753_v2-4.x-HBase-1.4.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Given the following table and data:
> {code:java}
>    CREATE TABLE test
>    (
>        pk1 INTEGER NOT NULL,
>        pk2 INTEGER NOT NULL,
>        pk3 INTEGER NOT NULL,
>        pk4 INTEGER NOT NULL,
>        v INTEGER, CONSTRAINT PK PRIMARY KEY(pk1,pk2,pk3 desc,pk4)
>    )
> {code}
> Notice that pk3 is DESC.
> {code:java}
>    UPSERT INTO test (pk1, pk2, pk3, pk4, v) VALUES (1,3,4,10,1)
> {code}
> If we execute the following SQL:
> {code:java}
>    select * from test
>    where (pk1 >=1 and pk1<=2) and (pk2>=3 and pk2<=4) and (pk3,pk4) < (5,7)
> {code}
> the returned result is empty, although the row (1,3,4,10,1) inserted above should
> obviously be returned.
> I think this problem was introduced by PHOENIX-3383 and PHOENIX-4841: because
> {{pk3}} is {{DESC}}, when we clip {{(pk3,pk4) < (5,7)}} at line 260 of
> {{WhereOptimizer.pushKeyExpressionsToScan}} (shown below), {{(pk3,pk4) < (5,7)}} is
> clipped to {{pk3 <= 5}} and {{pk4 < 7}}.
> {code:java}
> 257                List leftRanges = clipLeft(schema, slot.getPKPosition()
> 258                        + slotOffset - clipLeftSpan, clipLeftSpan, keyRanges, ptr);
> 259                keyRanges =
> 260                        clipRight(schema, slot.getPKPosition() + slotOffset - 1, keyRanges,
> 261                            leftRanges, ptr);
> 262                if (prevSortOrder == SortOrder.DESC) {
> 263                    leftRanges = invertKeyRanges(leftRanges);
> 264                }
> 265                slotSpanArray[cnf.size()] = clipLeftSpan-1;
> 266                cnf.add(leftRanges);
> 267                clipLeftSpan = 0;
> 268                prevSortOrder = sortOrder;
> 269                // since we have to clip the portion with the same sort order, we can no longer
> 270                // extract the nodes from the where clause
> 271                // for eg. for the schema A VARCHAR DESC, B VARCHAR ASC and query
> 272                //   WHERE (A,B) < ('a','b')
> 273                // the range (* - a\xFFb) is converted to (~a-*)(*-b)
> 274                // so we still need to filter on A,B
> 275                stopExtracting = true;
> 276            }
> {code}
> Eventually, after {{WhereOptimizer.pushKeyExpressionsToScan}} completes, the resulting
> {{ScanRanges.ranges}} is [[[1 - 2]], [[3 - 4]], [[~5 - *)], [(* - 7)]],
> {{ScanRanges.useSkipScanFilter}} is {{true}}, and the {{SkipScanFilter}} is also
> [[[1 - 2]], [[3 - 4]], [[~5 - *)], [(* - 7)]], so the above inserted row
> (1,3,4,10,1) cannot be retrieved.
> But {{(pk3,pk4) < (5,7)}} is not semantically equal to {{pk3 <= 5}} and {{pk4 < 7}};
> we can only derive {{pk3 <= 5}}, not {{pk4 < 7}}. So when we clip {{(pk3,pk4) < (5,7)}}
> to {{pk3 <= 5}}, we should simply skip the remaining columns of this RVC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PHOENIX-5776) Phoenix pherf unit tests failing

2020-03-12 Thread Sandeep Guggilam (Jira)
Sandeep Guggilam created PHOENIX-5776:
-

 Summary: Phoenix pherf unit tests failing
 Key: PHOENIX-5776
 URL: https://issues.apache.org/jira/browse/PHOENIX-5776
 Project: Phoenix
  Issue Type: Bug
Affects Versions: 5.1.0, 4.16.0
Reporter: Sandeep Guggilam
Assignee: Sandeep Guggilam


Phoenix Pherf unit tests ResourceTest and PherfTest are failing with "could not 
load resources /datamodel/query_more.sql"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5775) Make PreCommit build run all Phoenix tests

2020-03-12 Thread Chinmay Kulkarni (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chinmay Kulkarni updated PHOENIX-5775:
--
Labels: phoenix-hardening  (was: )

> Make PreCommit build run all Phoenix tests
> --
>
> Key: PHOENIX-5775
> URL: https://issues.apache.org/jira/browse/PHOENIX-5775
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.15.0, 5.1.0
>Reporter: Chinmay Kulkarni
>Priority: Major
>  Labels: phoenix-hardening
>
> Currently, it looks like the Hadoop QA PreCommit build only runs tests in 
> phoenix-core. This skips the tests in phoenix-pherf and other modules. We 
> should ideally run all tests in the project. 
> For example, 
> https://builds.apache.org/job/PreCommit-PHOENIX-Build/3546//testReport/ (for 
> PHOENIX-5607 on the master branch) shows that the build only ran phoenix-core 
> tests; however, phoenix-pherf tests like PherfTest.java and ResourceTest.java 
> fail, and this was not captured. These tests may have been failing for a long 
> time (not necessarily related to the PHOENIX-5607 changes).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PHOENIX-5775) Make PreCommit build run all Phoenix tests

2020-03-12 Thread Chinmay Kulkarni (Jira)
Chinmay Kulkarni created PHOENIX-5775:
-

 Summary: Make PreCommit build run all Phoenix tests
 Key: PHOENIX-5775
 URL: https://issues.apache.org/jira/browse/PHOENIX-5775
 Project: Phoenix
  Issue Type: Bug
Affects Versions: 4.15.0, 5.1.0
Reporter: Chinmay Kulkarni


Currently, it looks like the Hadoop QA PreCommit build only runs tests in 
phoenix-core. This skips the tests in phoenix-pherf and other modules. We 
should ideally run all tests in the project. 

For example, https://builds.apache.org/job/PreCommit-PHOENIX-Build/3546//testReport/ 
(for PHOENIX-5607 on the master branch) shows that the build only ran phoenix-core 
tests; however, phoenix-pherf tests like PherfTest.java and ResourceTest.java fail, 
and this was not captured. These tests may have been failing for a long time (not 
necessarily related to the PHOENIX-5607 changes).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5774) Phoenix Mapreduce job over hbase snapshots is extremely inefficient.

2020-03-12 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated PHOENIX-5774:
--
Summary: Phoenix Mapreduce job over hbase snapshots is extremely 
inefficient.  (was: Phoenix Mapreduce job over hbase Snapshots is extremely 
inefficient.)

> Phoenix Mapreduce job over hbase snapshots is extremely inefficient.
> 
>
> Key: PHOENIX-5774
> URL: https://issues.apache.org/jira/browse/PHOENIX-5774
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.13.1
>Reporter: Rushabh Shah
>Priority: Major
>
> Internally we have a tenant estimation framework which calculates the number of 
> rows each tenant occupies in the cluster. Basically, the framework launches a 
> MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
> and we count over this tenant_id in the reducer phase.
> Earlier we used to run this query against the live table, but we found that the 
> meta table was getting hammered for the whole time this job was running, so we 
> decided to run the MR job on hbase snapshots instead of the live table, taking 
> advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744
> When we were querying the live table, the MR job for one of the biggest tables 
> in the sandbox cluster took around 2.5 hours.
> After we started using hbase snapshots, the MR job for the same table took 135 
> hours. We have a maximum concurrent running mapper limit of 15 to avoid 
> hammering the meta table when querying live tables. We didn't remove that 
> restriction after we moved to hbase snapshots, so ideally it shouldn't take 135 
> hours to complete if that restriction were not in place.
> Some statistics about that table:
> Size: -578 GB- 2.70 TB, Num Regions in that table: -161- 670
> The average map time was 3 minutes 11 seconds when querying the live table.
> The average map time was 5 hours 33 minutes when querying hbase snapshots.
> The issue is that we don't consider snapshots while generating splits. So during 
> the map phase, each map task has to go through all the regions in the snapshot to 
> determine which regions contain the start and end key assigned to that task. After 
> determining those regions, it has to open each region and scan all the hfiles in 
> that region. In one such map task, the start and end key from the split were 
> distributed among 289 regions (from the snapshot, not the live table). Reading from 
> each region took an average of 90 seconds, so for 289 regions it took approximately 
> 7 hours.
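> A back-of-the-envelope check of the figures above (illustration only, not Phoenix code):
> {code:java}
> public class SnapshotScanCostEstimate {
>     public static void main(String[] args) {
>         int regionsPerTask = 289;    // snapshot regions overlapping one task's key range
>         int secondsPerRegion = 90;   // average observed time to read one region
>         double hours = regionsPerTask * secondsPerRegion / 3600.0;
>         System.out.println(hours);   // ~7.2 hours per map task, matching the ~7 hours observed
>     }
> }
> {code}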



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5698) Phoenix Query with RVC IN list expression generates wrong scan with non-pk ordered pks

2020-03-12 Thread Xinyi Yan (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyi Yan updated PHOENIX-5698:
---
Attachment: (was: PHOENIX-5698-master.v2.patch)

> Phoenix Query with RVC IN list expression generates wrong scan with non-pk 
> ordered pks
> --
>
> Key: PHOENIX-5698
> URL: https://issues.apache.org/jira/browse/PHOENIX-5698
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.15.0, 4.14.3
>Reporter: Daniel Wong
>Assignee: Xinyi Yan
>Priority: Major
>  Labels: DESC
> Attachments: PHOENIX-5698-4.14-HBase-1.3.patch, 
> PHOENIX-5698-4.x-HBase-1.3.patch, PHOENIX-5698-master.v2.patch, 
> PHOENIX-5698.patch
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> In the code below we'd ideally expect a SINGLE ROW DELETE plan client side. 
> However, this generates an incorrect scan with range ['tenant1
> 0CY005xx01Sv6o'). If the order of the RVCs is changed to row key order, 
> Phoenix correctly generates a SINGLE ROW SCAN. As we provide the full PK, 
> we expect either a tightly bounded range scan or a client side delete. 
> Instead we get a range scan on the composite leading edge 
> TENANT_ID,KEY_PREFIX,ID1.
>  
> {code:java}
> @Test
>  public void testInListExpressionWithDescAgain() throws Exception {
>  String fullTableName = generateUniqueName();
>  String fullViewName = generateUniqueName();
>  String tenantView = generateUniqueName();
>  // create base table and global view using global connection
>  try (Connection conn = DriverManager.getConnection(getUrl()))
> { conn.setAutoCommit(true); Statement stmt = conn.createStatement(); 
> stmt.execute("CREATE TABLE " + fullTableName + "(\n" + " TENANT_ID CHAR(15) 
> NOT NULL,\n" + " KEY_PREFIX CHAR(3) NOT NULL,\n" + " CONSTRAINT PK PRIMARY 
> KEY (\n" + " TENANT_ID," + " KEY_PREFIX" + ")) MULTI_TENANT=TRUE"); 
> stmt.execute("CREATE VIEW " + fullViewName + "(\n" + " ID1 VARCHAR NOT 
> NULL,\n" + " ID2 VARCHAR NOT NULL,\n" + " EVENT_DATE DATE NOT NULL,\n" + " 
> CONSTRAINT PKVIEW PRIMARY KEY\n" + " (\n" + " ID1, ID2 DESC, EVENT_DATE 
> DESC\n" + ")) AS SELECT * FROM " + fullTableName + " WHERE KEY_PREFIX = 
> '0CY'"); }
> // create and use a tenant specific view to write data
>  try (Connection viewConn = DriverManager.getConnection(TENANT_SPECIFIC_URL1) 
> ) {
>  viewConn.setAutoCommit(true); //need autocommit for serverside deletion
>  Statement stmt = viewConn.createStatement();
>  stmt.execute("CREATE VIEW IF NOT EXISTS " + tenantView + " AS SELECT * FROM 
> " + fullViewName );
>  viewConn.createStatement().execute("UPSERT INTO " + tenantView + "(ID1, ID2, 
> EVENT_DATE) VALUES ('005xx01Sv6o', '300', 153245823)");
>  viewConn.createStatement().execute("UPSERT INTO " + tenantView + "(ID1, ID2, 
> EVENT_DATE) VALUES ('005xx01Sv6o', '400', 153245824)");
>  viewConn.createStatement().execute("UPSERT INTO " + tenantView + "(ID1, ID2, 
> EVENT_DATE) VALUES ('005xx01Sv6o', '500', 153245825)");
>  viewConn.commit();
> ResultSet rs = stmt.executeQuery("SELECT ID1, ID2, EVENT_DATE FROM " + 
> tenantView );
>  printResultSet(rs);
> System.out.println("Delete Start");
> rs = stmt.executeQuery("EXPLAIN DELETE FROM " + tenantView + " WHERE (ID1, 
> EVENT_DATE, ID2) IN (('005xx01Sv6o', 153245824, 
> '400'),('005xx01Sv6o', 153245823, '300'))");
>  printResultSet(rs); // THIS SHOULD BE A SINGLE ROW SCAN
> stmt.execute("DELETE FROM " + tenantView + " WHERE (ID1, EVENT_DATE, ID2) IN 
> (('005xx01Sv6o', 153245824, '400'),('005xx01Sv6o', 
> 153245823, '300'))");
>  viewConn.commit();
>  System.out.println("Delete End");
> rs = stmt.executeQuery("SELECT ID1, ID2, EVENT_DATE FROM " + tenantView );
>  printResultSet(rs);
> }
>  }
> private void printResultSet(ResultSet rs) throws SQLException {
>  StringBuilder builder = new StringBuilder();
>  while(rs.next()) {
>  for(int i = 0; i < rs.getMetaData().getColumnCount(); i++) {
>  Object col = rs.getObject(i + 1);
>  if(col == null)
> { builder.append("null"); }
> else {
>  if(col instanceof Date)
> { DateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); 
> builder.append(df.format(col)); }
> else {
>  builder.append(col.toString());
>  }
>  }
>  builder.append(",");
>  }
>  builder.append("\n");
>  }
>  System.out.println(builder.toString());
>  }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5698) Phoenix Query with RVC IN list expression generates wrong scan with non-pk ordered pks

2020-03-12 Thread Xinyi Yan (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyi Yan updated PHOENIX-5698:
---
Attachment: PHOENIX-5698-master.v2.patch

> Phoenix Query with RVC IN list expression generates wrong scan with non-pk 
> ordered pks
> --
>
> Key: PHOENIX-5698
> URL: https://issues.apache.org/jira/browse/PHOENIX-5698
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.15.0, 4.14.3
>Reporter: Daniel Wong
>Assignee: Xinyi Yan
>Priority: Major
>  Labels: DESC
> Attachments: PHOENIX-5698-4.14-HBase-1.3.patch, 
> PHOENIX-5698-4.x-HBase-1.3.patch, PHOENIX-5698-master.v2.patch, 
> PHOENIX-5698.patch
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> In the code below we'd ideally expect a SINGLE ROW DELETE plan client side. 
> However, this generates an incorrect scan with range ['tenant1
> 0CY005xx01Sv6o'). If the order of the RVCs is changed to row key order, 
> Phoenix correctly generates a SINGLE ROW SCAN. As we provide the full PK, 
> we expect either a tightly bounded range scan or a client side delete. 
> Instead we get a range scan on the composite leading edge 
> TENANT_ID,KEY_PREFIX,ID1.
>  
> {code:java}
> @Test
>  public void testInListExpressionWithDescAgain() throws Exception {
>  String fullTableName = generateUniqueName();
>  String fullViewName = generateUniqueName();
>  String tenantView = generateUniqueName();
>  // create base table and global view using global connection
>  try (Connection conn = DriverManager.getConnection(getUrl()))
> { conn.setAutoCommit(true); Statement stmt = conn.createStatement(); 
> stmt.execute("CREATE TABLE " + fullTableName + "(\n" + " TENANT_ID CHAR(15) 
> NOT NULL,\n" + " KEY_PREFIX CHAR(3) NOT NULL,\n" + " CONSTRAINT PK PRIMARY 
> KEY (\n" + " TENANT_ID," + " KEY_PREFIX" + ")) MULTI_TENANT=TRUE"); 
> stmt.execute("CREATE VIEW " + fullViewName + "(\n" + " ID1 VARCHAR NOT 
> NULL,\n" + " ID2 VARCHAR NOT NULL,\n" + " EVENT_DATE DATE NOT NULL,\n" + " 
> CONSTRAINT PKVIEW PRIMARY KEY\n" + " (\n" + " ID1, ID2 DESC, EVENT_DATE 
> DESC\n" + ")) AS SELECT * FROM " + fullTableName + " WHERE KEY_PREFIX = 
> '0CY'"); }
> // create and use a tenant specific view to write data
>  try (Connection viewConn = DriverManager.getConnection(TENANT_SPECIFIC_URL1) 
> ) {
>  viewConn.setAutoCommit(true); //need autocommit for serverside deletion
>  Statement stmt = viewConn.createStatement();
>  stmt.execute("CREATE VIEW IF NOT EXISTS " + tenantView + " AS SELECT * FROM 
> " + fullViewName );
>  viewConn.createStatement().execute("UPSERT INTO " + tenantView + "(ID1, ID2, 
> EVENT_DATE) VALUES ('005xx01Sv6o', '300', 153245823)");
>  viewConn.createStatement().execute("UPSERT INTO " + tenantView + "(ID1, ID2, 
> EVENT_DATE) VALUES ('005xx01Sv6o', '400', 153245824)");
>  viewConn.createStatement().execute("UPSERT INTO " + tenantView + "(ID1, ID2, 
> EVENT_DATE) VALUES ('005xx01Sv6o', '500', 153245825)");
>  viewConn.commit();
> ResultSet rs = stmt.executeQuery("SELECT ID1, ID2, EVENT_DATE FROM " + 
> tenantView );
>  printResultSet(rs);
> System.out.println("Delete Start");
> rs = stmt.executeQuery("EXPLAIN DELETE FROM " + tenantView + " WHERE (ID1, 
> EVENT_DATE, ID2) IN (('005xx01Sv6o', 153245824, 
> '400'),('005xx01Sv6o', 153245823, '300'))");
>  printResultSet(rs); // THIS SHOULD BE A SINGLE ROW SCAN
> stmt.execute("DELETE FROM " + tenantView + " WHERE (ID1, EVENT_DATE, ID2) IN 
> (('005xx01Sv6o', 153245824, '400'),('005xx01Sv6o', 
> 153245823, '300'))");
>  viewConn.commit();
>  System.out.println("Delete End");
> rs = stmt.executeQuery("SELECT ID1, ID2, EVENT_DATE FROM " + tenantView );
>  printResultSet(rs);
> }
>  }
> private void printResultSet(ResultSet rs) throws SQLException {
>  StringBuilder builder = new StringBuilder();
>  while(rs.next()) {
>  for(int i = 0; i < rs.getMetaData().getColumnCount(); i++) {
>  Object col = rs.getObject(i + 1);
>  if(col == null)
> { builder.append("null"); }
> else {
>  if(col instanceof Date)
> { DateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); 
> builder.append(df.format(col)); }
> else {
>  builder.append(col.toString());
>  }
>  }
>  builder.append(",");
>  }
>  builder.append("\n");
>  }
>  System.out.println(builder.toString());
>  }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5774) Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.

2020-03-12 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated PHOENIX-5774:
--
Description: 
Internally we have a tenant estimation framework which calculates the number of 
rows each tenant occupies in the cluster. Basically, the framework launches a 
MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
and we count over this tenant_id in the reducer phase.
Earlier we used to run this query against the live table, but we found that the 
meta table was getting hammered for the whole time this job was running, so we 
decided to run the MR job on hbase snapshots instead of the live table, taking 
advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744

When we were querying the live table, the MR job for one of the biggest tables 
in the sandbox cluster took around 2.5 hours.
After we started using hbase snapshots, the MR job for the same table took 135 
hours. We have a maximum concurrent running mapper limit of 15 to avoid 
hammering the meta table when querying live tables. We didn't remove that 
restriction after we moved to hbase snapshots, so ideally it shouldn't take 135 
hours to complete if that restriction were not in place.

Some statistics about that table:
Size: -578 GB- 2.70 TB, Num Regions in that table: -161- 670

The average map time was 3 minutes 11 seconds when querying the live table.
The average map time was 5 hours 33 minutes when querying hbase snapshots.

The issue is that we don't consider snapshots while generating splits. So during 
the map phase, each map task has to go through all the regions in the snapshot to 
determine which regions contain the start and end key assigned to that task. After 
determining those regions, it has to open each region and scan all the hfiles in 
that region. In one such map task, the start and end key from the split were 
distributed among 289 regions (from the snapshot, not the live table). Reading from 
each region took an average of 90 seconds, so for 289 regions it took approximately 
7 hours.

  was:
Internally we have a tenant estimation framework which calculates the number of 
rows each tenant occupies in the cluster. Basically, the framework launches a 
MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
and we count over this tenant_id in the reducer phase.
Earlier we used to run this query against the live table, but we found that the 
meta table was getting hammered for the whole time this job was running, so we 
decided to run the MR job on hbase snapshots instead of the live table, taking 
advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744

When we were querying the live table, the MR job for one of the biggest tables 
in the sandbox cluster took around 2.5 hours.
After we started using hbase snapshots, the MR job for the same table took 135 
hours. We have a maximum concurrent running mapper limit of 15 to avoid 
hammering the meta table when querying live tables. We didn't remove that 
restriction after we moved to hbase snapshots, so ideally it shouldn't take 135 
hours to complete if that restriction were not in place.

Some statistics about that table:
Size: -578 GB- 2.71 TB, Num Regions in that table: -161- 671

The average map time was 3 minutes 11 seconds when querying the live table.
The average map time was 5 hours 33 minutes when querying hbase snapshots.

The issue is that we don't consider snapshots while generating splits. So during 
the map phase, each map task has to go through all the regions in the snapshot to 
determine which regions contain the start and end key assigned to that task. After 
determining those regions, it has to open each region and scan all the hfiles in 
that region. In one such map task, the start and end key from the split were 
distributed among 289 regions (from the snapshot, not the live table). Reading from 
each region took an average of 90 seconds, so for 289 regions it took approximately 
7 hours.


> Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.
> 
>
> Key: PHOENIX-5774
> URL: https://issues.apache.org/jira/browse/PHOENIX-5774
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.13.1
>Reporter: Rushabh Shah
>Priority: Major
>
> Internally we have a tenant estimation framework which calculates the number of 
> rows each tenant occupies in the cluster. Basically, the framework launches a 
> MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
> and we count over this tenant_id in the reducer phase.
> Earlier we used to run this query against the live table, but we found that the 
> meta table was getting hammered for the whole time this job was running, so we 
> decided to run the MR job on hbase snapshots instead of the live table, taking 
> advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744
> 

[jira] [Updated] (PHOENIX-5774) Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.

2020-03-12 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated PHOENIX-5774:
--
Description: 
Internally we have a tenant estimation framework which calculates the number of 
rows each tenant occupies in the cluster. Basically, the framework launches a 
MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
and we count over this tenant_id in the reducer phase.
Earlier we used to run this query against the live table, but we found that the 
meta table was getting hammered for the whole time this job was running, so we 
decided to run the MR job on hbase snapshots instead of the live table, taking 
advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744

When we were querying the live table, the MR job for one of the biggest tables 
in the sandbox cluster took around 2.5 hours.
After we started using hbase snapshots, the MR job for the same table took 135 
hours. We have a maximum concurrent running mapper limit of 15 to avoid 
hammering the meta table when querying live tables. We didn't remove that 
restriction after we moved to hbase snapshots, so ideally it shouldn't take 135 
hours to complete if that restriction were not in place.

Some statistics about that table:
Size: -578 GB- 2.70 TB, Num Regions in that table: -161- 670

The average map time was 3 minutes 11 seconds when querying the live table.
The average map time was 5 hours 33 minutes when querying hbase snapshots.

The issue is that we don't consider snapshots while generating splits. So during 
the map phase, each map task has to go through all the regions in the snapshot to 
determine which regions contain the start and end key assigned to that task. After 
determining those regions, it has to open each region and scan all the hfiles in 
that region. In one such map task, the start and end key from the split were 
distributed among 289 regions (from the snapshot, not the live table). Reading from 
each region took an average of 90 seconds, so for 289 regions it took approximately 
7 hours.

  was:
Internally we have a tenant estimation framework which calculates the number of 
rows each tenant occupies in the cluster. Basically, the framework launches a 
MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
and we count over this tenant_id in the reducer phase.
Earlier we used to run this query against the live table, but we found that the 
meta table was getting hammered for the whole time this job was running, so we 
decided to run the MR job on hbase snapshots instead of the live table, taking 
advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744

When we were querying the live table, the MR job for one of the biggest tables 
in the sandbox cluster took around 2.5 hours.
After we started using hbase snapshots, the MR job for the same table took 135 
hours. We have a maximum concurrent running mapper limit of 15 to avoid 
hammering the meta table when querying live tables. We didn't remove that 
restriction after we moved to hbase snapshots, so ideally it shouldn't take 135 
hours to complete if that restriction were not in place.

Some statistics about that table:
Size: -578 GB- 2.71 TB, Num Regions in that table: -161- 670

The average map time was 3 minutes 11 seconds when querying the live table.
The average map time was 5 hours 33 minutes when querying hbase snapshots.

The issue is that we don't consider snapshots while generating splits. So during 
the map phase, each map task has to go through all the regions in the snapshot to 
determine which regions contain the start and end key assigned to that task. After 
determining those regions, it has to open each region and scan all the hfiles in 
that region. In one such map task, the start and end key from the split were 
distributed among 289 regions (from the snapshot, not the live table). Reading from 
each region took an average of 90 seconds, so for 289 regions it took approximately 
7 hours.


> Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.
> 
>
> Key: PHOENIX-5774
> URL: https://issues.apache.org/jira/browse/PHOENIX-5774
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.13.1
>Reporter: Rushabh Shah
>Priority: Major
>
> Internally we have a tenant estimation framework which calculates the number of 
> rows each tenant occupies in the cluster. Basically, the framework launches a 
> MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
> and we count over this tenant_id in the reducer phase.
> Earlier we used to run this query against the live table, but we found that the 
> meta table was getting hammered for the whole time this job was running, so we 
> decided to run the MR job on hbase snapshots instead of the live table, taking 
> advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744
> 

[jira] [Updated] (PHOENIX-5774) Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.

2020-03-12 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated PHOENIX-5774:
--
Description: 
Internally we have a tenant estimation framework which calculates the number of 
rows each tenant occupies in the cluster. Basically, the framework launches a 
MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
and we count over this tenant_id in the reducer phase.
Earlier we used to run this query against the live table, but we found that the 
meta table was getting hammered for the whole time this job was running, so we 
decided to run the MR job on hbase snapshots instead of the live table, taking 
advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744

When we were querying the live table, the MR job for one of the biggest tables 
in the sandbox cluster took around 2.5 hours.
After we started using hbase snapshots, the MR job for the same table took 135 
hours. We have a maximum concurrent running mapper limit of 15 to avoid 
hammering the meta table when querying live tables. We didn't remove that 
restriction after we moved to hbase snapshots, so ideally it shouldn't take 135 
hours to complete if that restriction were not in place.

Some statistics about that table:
Size: -578 GB- 2.71 TB, Num Regions in that table: -161- 671

The average map time was 3 minutes 11 seconds when querying the live table.
The average map time was 5 hours 33 minutes when querying hbase snapshots.

The issue is that we don't consider snapshots while generating splits. So during 
the map phase, each map task has to go through all the regions in the snapshot to 
determine which regions contain the start and end key assigned to that task. After 
determining those regions, it has to open each region and scan all the hfiles in 
that region. In one such map task, the start and end key from the split were 
distributed among 289 regions (from the snapshot, not the live table). Reading from 
each region took an average of 90 seconds, so for 289 regions it took approximately 
7 hours.

  was:
Internally we have a tenant estimation framework which calculates the number of 
rows each tenant occupies in the cluster. Basically, the framework launches a 
MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
and we count over this tenant_id in the reducer phase.
Earlier we used to run this query against the live table, but we found that the 
meta table was getting hammered for the whole time this job was running, so we 
decided to run the MR job on hbase snapshots instead of the live table, taking 
advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744

When we were querying the live table, the MR job for one of the biggest tables 
in the sandbox cluster took around 2.5 hours.
After we started using hbase snapshots, the MR job for the same table took 135 
hours. We have a maximum concurrent running mapper limit of 15 to avoid 
hammering the meta table when querying live tables. We didn't remove that 
restriction after we moved to hbase snapshots, so ideally it shouldn't take 135 
hours to complete if that restriction were not in place.

Some statistics about that table:
Size: 578 GB, Num Regions in that table: 161

The average map time was 3 minutes 11 seconds when querying the live table.
The average map time was 5 hours 33 minutes when querying hbase snapshots.

The issue is that we don't consider snapshots while generating splits. So during 
the map phase, each map task has to go through all the regions in the snapshot to 
determine which regions contain the start and end key assigned to that task. After 
determining those regions, it has to open each region and scan all the hfiles in 
that region. In one such map task, the start and end key from the split were 
distributed among 289 regions (from the snapshot, not the live table). Reading from 
each region took an average of 90 seconds, so for 289 regions it took approximately 
7 hours.


> Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.
> 
>
> Key: PHOENIX-5774
> URL: https://issues.apache.org/jira/browse/PHOENIX-5774
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.13.1
>Reporter: Rushabh Shah
>Priority: Major
>
> Internally we have a tenant estimation framework which calculates the number of 
> rows each tenant occupies in the cluster. Basically, the framework launches a 
> MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
> and we count over this tenant_id in the reducer phase.
> Earlier we used to run this query against the live table, but we found that the 
> meta table was getting hammered for the whole time this job was running, so we 
> decided to run the MR job on hbase snapshots instead of the live table, taking 
> advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744
> When we were querying 

[jira] [Updated] (PHOENIX-5774) Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.

2020-03-12 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated PHOENIX-5774:
--
Description: 
Internally we have a tenant estimation framework which calculates the number of 
rows each tenant occupies in the cluster. Basically, the framework launches a 
MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
and we count over this tenant_id in the reducer phase.
Earlier we used to run this query against the live table, but we found that the 
meta table was getting hammered for the whole time this job was running, so we 
decided to run the MR job on hbase snapshots instead of the live table, taking 
advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744

When we were querying the live table, the MR job for one of the biggest tables 
in the sandbox cluster took around 2.5 hours.
After we started using hbase snapshots, the MR job for the same table took 135 
hours. We have a maximum concurrent running mapper limit of 15 to avoid 
hammering the meta table when querying live tables. We didn't remove that 
restriction after we moved to hbase snapshots, so ideally it shouldn't take 135 
hours to complete if that restriction were not in place.

Some statistics about that table:
Size: 578 GB, Num Regions in that table: 161

The average map time was 3 minutes 11 seconds when querying the live table.
The average map time was 5 hours 33 minutes when querying hbase snapshots.

The issue is that we don't consider snapshots while generating splits. So during 
the map phase, each map task has to go through all the regions in the snapshot to 
determine which regions contain the start and end key assigned to that task. After 
determining those regions, it has to open each region and scan all the hfiles in 
that region. In one such map task, the start and end key from the split were 
distributed among 289 regions (from the snapshot, not the live table). Reading from 
each region took an average of 90 seconds, so for 289 regions it took approximately 
7 hours.

  was:
Internally we have a tenant estimation framework which calculates the number of 
rows each tenant occupies in the cluster. Basically, the framework launches a 
MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
and we count over this tenant_id in the reducer phase.
Earlier we used to run this query against the live table, but we found that the 
meta table was getting hammered for the whole time this job was running, so we 
decided to run the MR job on hbase snapshots instead of the live table, taking 
advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744

When we were querying the live table, the MR job for one of the biggest tables 
in the sandbox cluster took around 2.5 hours.
After we started using hbase snapshots, the MR job for the same table took 135 
hours. We have a maximum concurrent running mapper limit of 15 to avoid 
hammering the meta table when querying live tables. We didn't remove that 
restriction after we moved to hbase snapshots, so ideally it shouldn't take 135 
hours to complete if that restriction were not in place.

Some statistics about that table:
Size: 578 GB, Num Regions in that table: 161

The average map time was 3 minutes 11 seconds when querying the live table.
The average map time was 5 hours 33 minutes when querying hbase snapshots.

The issue is that we don't consider snapshots while generating splits. So during 
the map phase, each map task has to go through all the regions in the snapshot to 
determine which regions contain the start and end key assigned to that task. After 
determining those regions, it has to open each region and scan all the hfiles in 
that region. In one such map task, the start and end key from the split were 
distributed among 289 regions. Reading from each region took an average of 90 
seconds, so for 289 regions it took approximately 7 hours.


> Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.
> 
>
> Key: PHOENIX-5774
> URL: https://issues.apache.org/jira/browse/PHOENIX-5774
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.13.1
>Reporter: Rushabh Shah
>Priority: Major
>
> Internally we have a tenant estimation framework which calculates the number of 
> rows each tenant occupies in the cluster. Basically, the framework launches a 
> MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
> and we count over this tenant_id in the reducer phase.
> Earlier we used to run this query against the live table, but we found that the 
> meta table was getting hammered for the whole time this job was running, so we 
> decided to run the MR job on hbase snapshots instead of the live table, taking 
> advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744
> When we were querying live table, the MR job for one of the biggest 

[jira] [Created] (PHOENIX-5774) Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.

2020-03-12 Thread Rushabh Shah (Jira)
Rushabh Shah created PHOENIX-5774:
-

 Summary: Phoenix Mapreduce job over hbase Snapshots is extremely 
inefficient.
 Key: PHOENIX-5774
 URL: https://issues.apache.org/jira/browse/PHOENIX-5774
 Project: Phoenix
  Issue Type: Bug
Affects Versions: 4.13.1
Reporter: Rushabh Shah


Internally we have a tenant estimation framework which calculates the number of 
rows each tenant occupies in the cluster. Basically, the framework launches a 
MapReduce (MR) job per table that runs the following query: "Select tenant_id from " 
and we count over this tenant_id in the reducer phase.
Earlier we used to run this query against the live table, but we found that the 
meta table was getting hammered for the whole time this job was running, so we 
decided to run the MR job on hbase snapshots instead of the live table, taking 
advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744

When we were querying the live table, the MR job for one of the biggest tables 
in the sandbox cluster took around 2.5 hours.
After we started using hbase snapshots, the MR job for the same table took 135 
hours. We have a maximum concurrent running mapper limit of 15 to avoid 
hammering the meta table when querying live tables. We didn't remove that 
restriction after we moved to hbase snapshots, so ideally it shouldn't take 135 
hours to complete if that restriction were not in place.

Some statistics about that table:
Size: 578 GB, Num Regions in that table: 161

The average map time was 3 minutes 11 seconds when querying the live table.
The average map time was 5 hours 33 minutes when querying hbase snapshots.

The issue is that we don't consider snapshots while generating splits. So during 
the map phase, each map task has to go through all the regions in the snapshot to 
determine which regions contain the start and end key assigned to that task. After 
determining those regions, it has to open each region and scan all the hfiles in 
that region. In one such map task, the start and end key from the split were 
distributed among 289 regions. Reading from each region took an average of 90 
seconds, so for 289 regions it took approximately 7 hours.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PHOENIX-5773) Index tool output tables should support multiple simultaneous rebuilds

2020-03-12 Thread Kadir OZDEMIR (Jira)
Kadir OZDEMIR created PHOENIX-5773:
--

 Summary: Index tool output tables should support multiple 
simultaneous rebuilds
 Key: PHOENIX-5773
 URL: https://issues.apache.org/jira/browse/PHOENIX-5773
 Project: Phoenix
  Issue Type: Sub-task
Affects Versions: 4.14.3, 5.0.0
Reporter: Kadir OZDEMIR
Assignee: Kadir OZDEMIR


The row key for the index tool output tables is derived from the data table. For 
example, the row key for the PHOENIX_INDEX_TOOL table is the scan max time that 
the index tool uses plus the data table row key. If the data table has multiple 
indexes and these indexes are rebuilt at the same time, then this can corrupt 
the PHOENIX_INDEX_TOOL table. 
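A minimal sketch of the collision (illustrative only; the key layout follows the 
description above, but the class, names and values are hypothetical, not the actual 
IndexTool code):
{code:java}
import org.apache.hadoop.hbase.util.Bytes;

public class IndexToolOutputKeyCollision {
    public static void main(String[] args) {
        // Two concurrent rebuilds of different indexes on the same data table
        // share the same scan max time, so they derive identical output row keys.
        long scanMaxTime = 1584000000000L;                  // shared by both rebuilds
        byte[] dataRowKey = Bytes.toBytes("tenant1|row42"); // same data table row
        byte[] keyForIndexA = Bytes.add(Bytes.toBytes(scanMaxTime), dataRowKey);
        byte[] keyForIndexB = Bytes.add(Bytes.toBytes(scanMaxTime), dataRowKey);
        // The keys are identical, so one rebuild's output row overwrites the other's.
        System.out.println(Bytes.equals(keyForIndexA, keyForIndexB));   // prints "true"
    }
}
{code}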



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5317) Upserting rows into child views with pk fails when the base view has an index on it.

2020-03-12 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam updated PHOENIX-5317:
--
Attachment: PHOENIX-5317.master.v1.patch

> Upserting rows into child views with pk fails when the base view has an index 
> on it.
> 
>
> Key: PHOENIX-5317
> URL: https://issues.apache.org/jira/browse/PHOENIX-5317
> Project: Phoenix
>  Issue Type: Sub-task
>Affects Versions: 4.13.0, 4.14.1
>Reporter: Jacob Isaac
>Assignee: Sandeep Guggilam
>Priority: Major
> Attachments: PHOENIX-5137-TestFailure.txt, 
> PHOENIX-5317.4.x-HBase-1.3.v1.patch, PHOENIX-5317.4.x-HBase-1.3.v2.patch, 
> PHOENIX-5317.4.x-HBase-1.3.v3.patch, PHOENIX-5317.master.v1.patch
>
>
> Steps to reproduce - 
> 1. Create the base table, base/global view and index using a non-tenant connection.
> CREATE TABLE IF NOT EXISTS TEST.BASETABLE (
> TENANT_ID CHAR(15) NOT NULL, 
> KEY_PREFIX CHAR(3) NOT NULL, 
> CREATED_DATE DATE,
> CREATED_BY CHAR(15),
> SYSTEM_MODSTAMP DATE,
> CONSTRAINT PK PRIMARY KEY (
> TENANT_ID, 
> KEY_PREFIX 
> )
> ) VERSIONS=1, MULTI_TENANT=true, IMMUTABLE_ROWS=TRUE, REPLICATION_SCOPE=1;
> CREATE VIEW IF NOT EXISTS TEST.MY_GLOBAL_VIEW  (
> TEXT1 VARCHAR NOT NULL,
> INT1 BIGINT NOT NULL,
> DOUBLE1 DECIMAL(12, 3),
> IS_BOOLEAN BOOLEAN,
> RELATIONSHIP_ID CHAR(15),
> TEXT_READ_ONLY VARCHAR,
> DATE_TIME1 DATE,
> JSON1 VARCHAR,
> IP_START_ADDRESS VARCHAR,
> CONSTRAINT PKVIEW PRIMARY KEY
> (
> TEXT1, INT1
> )
> )
> AS SELECT * FROM TEST.BASETABLE WHERE KEY_PREFIX = '0CY';
> CREATE INDEX IF NOT EXISTS TEST_MY_GLOBAL_VIEW_SEC_INDEX
> ON TEST.MY_GLOBAL_VIEW (TEXT1, INT1)
> INCLUDE (CREATED_BY, RELATIONSHIP_ID, JSON1, DOUBLE1, IS_BOOLEAN, 
> IP_START_ADDRESS, CREATED_DATE, SYSTEM_MODSTAMP, TEXT_READ_ONLY);
> 2. Create a child view using a tenant-owned connection
> CREATE VIEW IF NOT EXISTS TEST."z01" (COL1 VARCHAR, COL2 VARCHAR, COL3 
> VARCHAR, COL4 VARCHAR CONSTRAINT PK PRIMARY KEY (COL1, COL2, COL3, COL4)) AS 
> SELECT * FROM TEST.MY_GLOBAL_VIEW; 
> 3. Upsert into child view
> UPSERT INTO TEST."z01" (DATE_TIME1, INT1, TEXT1, COL1, COL2, COL3, COL4) 
> VALUES (TO_DATE('2017-10-16 22:00:00', 'yyyy-MM-dd HH:mm:ss'), 10, 'z', 
> '8', 'z', 'z', 'z');
> The following exception is thrown -
> 0: jdbc:phoenix:localhost> UPSERT INTO TEST."z01" (DATE_TIME1, INT1, TEXT1, 
> COL1, COL2, COL3, COL4) VALUES (TO_DATE('2017-10-16 22:00:00', 'yyyy-MM-dd 
> HH:mm:ss'), 10, 'z', '8', 'z', 'z', 'z');
> java.lang.IllegalArgumentException
>   at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:76)
>   at 
> com.google.common.collect.Lists.computeArrayListCapacity(Lists.java:105)
>   at 
> com.google.common.collect.Lists.newArrayListWithExpectedSize(Lists.java:195)
>   at 
> org.apache.phoenix.index.IndexMaintainer.(IndexMaintainer.java:424)
>   at 
> org.apache.phoenix.index.IndexMaintainer.create(IndexMaintainer.java:143)
>   at 
> org.apache.phoenix.schema.PTableImpl.getIndexMaintainer(PTableImpl.java:1176)
>   at 
> org.apache.phoenix.util.IndexUtil.generateIndexData(IndexUtil.java:303)
>   at 
> org.apache.phoenix.execute.MutationState$1.next(MutationState.java:519)
>   at 
> org.apache.phoenix.execute.MutationState$1.next(MutationState.java:501)
>   at org.apache.phoenix.execute.MutationState.send(MutationState.java:941)
>   at 
> org.apache.phoenix.execute.MutationState.send(MutationState.java:1387)
>   at 
> org.apache.phoenix.execute.MutationState.commit(MutationState.java:1228)
>   at 
> org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:666)
>   at 
> org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:662)
>   at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>   at 
> org.apache.phoenix.jdbc.PhoenixConnection.commit(PhoenixConnection.java:662)
>   at 
> org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:399)
>   at 
> org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:379)
>   at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>   at 
> org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:378)
>   at 
> org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:366)
>   at 
> org.apache.phoenix.jdbc.PhoenixStatement.execute(PhoenixStatement.java:1775)
>   at sqlline.Commands.execute(Commands.java:822)
>   at sqlline.Commands.sql(Commands.java:732)
>   at sqlline.SqlLine.dispatch(SqlLine.java:807)
>   at sqlline.SqlLine.begin(SqlLine.java:681)
>   at sqlline.SqlLine.start(SqlLine.java:398)
>   at 

[jira] [Updated] (PHOENIX-5731) Loading bulkload hfiles should not be blocked if the upsert select is happening for a different table.

2020-03-12 Thread Rajeshbabu Chintaguntla (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajeshbabu Chintaguntla updated PHOENIX-5731:
-
Attachment: PHOENIX-5731_addendum.patch

> Loading bulkload hfiles should not be blocked if the upsert select is happening 
> for a different table.
> 
>
> Key: PHOENIX-5731
> URL: https://issues.apache.org/jira/browse/PHOENIX-5731
> Project: Phoenix
>  Issue Type: Bug
>Reporter: Rajeshbabu Chintaguntla
>Assignee: Rajeshbabu Chintaguntla
>Priority: Major
> Fix For: 5.1.0, 4.16.0
>
> Attachments: PHOENIX-5731.patch, PHOENIX-5731_addendum.patch
>
>
> Currently we do not allow loading hfiles after a bulkload, to avoid deadlock 
> cases when an upsert select is happening on the same table; but when the upsert 
> is to a different table we should not block loading the incremental hfiles.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5607) Client-server backward compatibility tests

2020-03-12 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam updated PHOENIX-5607:
--
Attachment: (was: PHOENIX-5607.4.x.v1.patch)

> Client-server backward compatibility tests 
> ---
>
> Key: PHOENIX-5607
> URL: https://issues.apache.org/jira/browse/PHOENIX-5607
> Project: Phoenix
>  Issue Type: Test
>Affects Versions: 4.15.0
>Reporter: Lars Hofhansl
>Assignee: Sandeep Guggilam
>Priority: Blocker
>  Labels: phoenix-hardening
> Fix For: 5.1.0, 4.16.0
>
> Attachments: PHOENIX-5607.4.x-HBase-1.3.v1.patch, 
> PHOENIX-5607.4.x-HBase-1.3.v2.patch, PHOENIX-5607.4.x-HBase-1.3.v3.patch, 
> PHOENIX-5607.4.x-HBase-1.3.v4.patch, PHOENIX-5607.4.x-HBase-1.3.v5.patch, 
> PHOENIX-5607.4.x-HBase-1.4.v1.patch, PHOENIX-5607.4.x-HBase-1.5.v1.patch, 
> PHOENIX-5607.4.x.v1.patch, PHOENIX-5607.4.x.v1.patch, 
> PHOENIX-5607.master.v1.patch
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Filing this as a blocker for 4.16.0.
> As we've seen with the various failed attempts to release 4.15.0, Phoenix's 
> backwards compatibility story is weak and lacks tests - in fact there are no 
> tests.
> We should not allow 4.16.0 to ship without improving that and without tests.
> [~ckulkarni], [~gjacoby], FYI, this is what we discussed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5607) Client-server backward compatibility tests

2020-03-12 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam updated PHOENIX-5607:
--
Attachment: PHOENIX-5607.4.x.v1.patch

> Client-server backward compatibility tests 
> ---
>
> Key: PHOENIX-5607
> URL: https://issues.apache.org/jira/browse/PHOENIX-5607
> Project: Phoenix
>  Issue Type: Test
>Affects Versions: 4.15.0
>Reporter: Lars Hofhansl
>Assignee: Sandeep Guggilam
>Priority: Blocker
>  Labels: phoenix-hardening
> Fix For: 5.1.0, 4.16.0
>
> Attachments: PHOENIX-5607.4.x-HBase-1.3.v1.patch, 
> PHOENIX-5607.4.x-HBase-1.3.v2.patch, PHOENIX-5607.4.x-HBase-1.3.v3.patch, 
> PHOENIX-5607.4.x-HBase-1.3.v4.patch, PHOENIX-5607.4.x-HBase-1.3.v5.patch, 
> PHOENIX-5607.4.x-HBase-1.4.v1.patch, PHOENIX-5607.4.x-HBase-1.5.v1.patch, 
> PHOENIX-5607.4.x.v1.patch, PHOENIX-5607.4.x.v1.patch, 
> PHOENIX-5607.master.v1.patch
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Filing this as a blocker for 4.16.0.
> As we've seen with the various failed attempts to release 4.15.0, Phoenix's 
> backwards compatibility story is weak and lacks tests - in fact there are no 
> tests.
> We should not allow 4.16.0 to ship without improving that and without tests.
> [~ckulkarni], [~gjacoby], FYI, this is what we discussed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (PHOENIX-5731) Loading bulkload hfiles should not be blocked if the upsert select is happening for a different table.

2020-03-12 Thread Rajeshbabu Chintaguntla (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajeshbabu Chintaguntla reopened PHOENIX-5731:
--

This patch has an issue: maxBatchSize and maxBatchSizeBytes are not used when 
the upsert select is done between different tables, which is problematic because 
all the data is accumulated instead of being committed in batches. Hence 
reopening.
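For contrast, a generic client-side JDBC sketch of committing an upsert select in 
batches rather than accumulating everything (the table names and batch size are 
hypothetical; this is not the server-side code path the comment refers to):
{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class BatchedUpsertSelect {
    public static void main(String[] args) throws Exception {
        int maxBatchSize = 1000;   // stand-in for the configured mutation batch size
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            conn.setAutoCommit(false);
            try (Statement read = conn.createStatement();
                 ResultSet rs = read.executeQuery("SELECT ID, V FROM SRC_TABLE");
                 PreparedStatement write =
                         conn.prepareStatement("UPSERT INTO DST_TABLE (ID, V) VALUES (?, ?)")) {
                int pending = 0;
                while (rs.next()) {
                    write.setLong(1, rs.getLong(1));
                    write.setString(2, rs.getString(2));
                    write.executeUpdate();
                    if (++pending >= maxBatchSize) {   // flush periodically instead of accumulating all rows
                        conn.commit();
                        pending = 0;
                    }
                }
                conn.commit();   // flush the final partial batch
            }
        }
    }
}
{code}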

> Loading bulkload hfiles should not be blocked if the upsert select is happening 
> for a different table.
> 
>
> Key: PHOENIX-5731
> URL: https://issues.apache.org/jira/browse/PHOENIX-5731
> Project: Phoenix
>  Issue Type: Bug
>Reporter: Rajeshbabu Chintaguntla
>Assignee: Rajeshbabu Chintaguntla
>Priority: Major
> Fix For: 5.1.0, 4.16.0
>
> Attachments: PHOENIX-5731.patch
>
>
> Currently we do not allow loading hfiles after a bulkload, to avoid deadlock 
> cases when an upsert select is happening on the same table; but when the upsert 
> is to a different table we should not block loading the incremental hfiles.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5772) Streamline the kerberos logic in thin client java code

2020-03-12 Thread Istvan Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Toth updated PHOENIX-5772:
-
Summary: Streamline the kerberos logic in thin client java code  (was: 
Remove the kerberos logic from thin client java code)

> Streamline the kerberos logic in thin client java code
> --
>
> Key: PHOENIX-5772
> URL: https://issues.apache.org/jira/browse/PHOENIX-5772
> Project: Phoenix
>  Issue Type: Task
>Reporter: Istvan Toth
>Assignee: Istvan Toth
>Priority: Major
>
> The thin client has logic for sqlline-thin that automatically enables SPNEGO 
> based on hbase and hadoop settings and on whether or not the user is logged 
> into kerberos.
> According to my analysis, this is the only feature that this code provides. 
> When using the JAR as a JDBC driver, the kerberos code does not apply.
> However, this feature depends on hadoop, which carries the heavy cost of 
> bloating the thin client size and polluting its classpath.
> I propose removing the feature from the Java code, and potentially 
> implementing it in the sqlline-thin startup script.
> This would cut the JAR size by ~80%, and solve a lot of classpath problems 
> for the users of the JDBC driver.
> This is based on a discussion with [~elserj] during the review of PHOENIX-5761



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5772) Remove the kerberos logic from thin client java code

2020-03-12 Thread Istvan Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Toth updated PHOENIX-5772:
-
Description: 
The thin client has logic for sqlline-thin that automatically enables SPNEGO 
based on hbase and hadoop settings and on whether or not the user is logged 
into kerberos.

According to my analysis, this is the only feature that this code provides. 

When using the JAR as a JDBC driver, the kerberos code does not apply.

However, this feature depends on hadoop, which carries the heavy cost of 
bloating the thin client size and polluting its classpath.

I propose removing the feature from the Java code, and potentially 
implementing it in the sqlline-thin startup script.

This would cut the JAR size by ~80%, and solve a lot of classpath problems for 
the users of the JDBC driver.

This is based on a discussion with [~elserj] during the review of PHOENIX-5761

  was:
The thin client has logic for sqlline-thin that automatically enables SPNEGO 
based on hbase and hadoop settings and on whether or not the user is logged 
into kerberos.

According to my analysis, this is the only feature that this code provides. 

When using the JAR as a JDBC driver, the kerberos code does not apply.

However, this feature depends on both hadoop and hbase, which carries the heavy 
cost of bloating the thin client size and polluting its classpath.

I propose removing the feature from the Java code, and potentially 
implementing it in the sqlline-thin startup script.

This would cut the JAR size by ~80%, and solve a lot of classpath problems for 
the users of the JDBC driver.

This is based on a discussion with [~elserj] during the review of PHOENIX-5761


> Remove the kerberos logic from thin client java code
> 
>
> Key: PHOENIX-5772
> URL: https://issues.apache.org/jira/browse/PHOENIX-5772
> Project: Phoenix
>  Issue Type: Task
>Reporter: Istvan Toth
>Assignee: Istvan Toth
>Priority: Major
>
> The thin client has logic for sqlline-thin that automatically enables SPNEGO 
> based on hbase and hadoop settings and on whether or not the user is logged 
> into kerberos.
> According to my analysis, this is the only feature that this code provides. 
> When using the JAR as a JDBC driver, the kerberos code does not apply.
> However, this feature depends on hadoop, which carries the heavy cost of 
> bloating the thin client size and polluting its classpath.
> I propose removing the feature from the Java code, and potentially 
> implementing it in the sqlline-thin startup script.
> This would cut the JAR size by ~80%, and solve a lot of classpath problems 
> for the users of the JDBC driver.
> This is based on a discussion with [~elserj] during the review of PHOENIX-5761



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PHOENIX-5772) Remove the kerberos logic from thin client java code

2020-03-12 Thread Istvan Toth (Jira)
Istvan Toth created PHOENIX-5772:


 Summary: Remove the kerberos logic from thin client java code
 Key: PHOENIX-5772
 URL: https://issues.apache.org/jira/browse/PHOENIX-5772
 Project: Phoenix
  Issue Type: Task
Reporter: Istvan Toth
Assignee: Istvan Toth


The thin client has logic for sqlline-thin that automatically enables SPNEGO 
based on hbase and hadoop settings and on whether or not the user is logged 
into kerberos.

According to my analysis, this is the only feature that this code provides. 

When using the JAR as a JDBC driver, the kerberos code does not apply.

However, this feature depends on both hadoop and hbase, which carries the heavy 
cost of bloating the thin client size and polluting its classpath.

I propose removing the feature from the Java code, and potentially 
implementing it in the sqlline-thin startup script.

This would cut the JAR size by ~80%, and solve a lot of classpath problems for 
the users of the JDBC driver.

This is based on a discussion with [~elserj] during the review of PHOENIX-5761



--
This message was sent by Atlassian Jira
(v8.3.4#803005)