Re: Union PreparedStatement ParameterMetaData Parameter value unbound Issue

2019-09-04 Thread James Taylor
Yes, JIRA please.

On Wed, Sep 4, 2019 at 1:24 PM lewjackman  wrote:

> Does this look like an issue for which I should write a Jira?
>
>
>
> --
> Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/
>


Re: On duplicate key update

2019-08-27 Thread James Taylor
The lock will happen every time an "on duplicate key" clause is executed.
Conceptually it's like a checkAndPut, but it's batched. If two threads
attempt to write the same non-existent row, one of them will get there
first and get the lock, while the other would wait behind it (and
subsequently trigger the "on duplicate key" logic).
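
For example, a minimal sketch (table and column names are made up) of an atomic
upsert that either creates the row or increments it under the row lock:

  CREATE TABLE IF NOT EXISTS TOTALS (ID BIGINT NOT NULL PRIMARY KEY, COUNTER BIGINT);

  -- If the row doesn't exist it is created with COUNTER = 1; otherwise the row is
  -- locked, the current value is read, and COUNTER is incremented.
  UPSERT INTO TOTALS (ID, COUNTER) VALUES (1, 1)
      ON DUPLICATE KEY UPDATE COUNTER = COUNTER + 1;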

On Tue, Aug 27, 2019 at 9:05 AM Jon Strayer  wrote:

> Just to make sure I understand, anytime an atomic upsert is executed
> Phoenix translates that into a checkAndPut.  Is that correct?
>
> From the Atomic Upsert documentation:
> "On the server-side, when the commit is processed, the row being updated
> will be locked while the current column values are read and the ON
> DUPLICATE KEY clause is executed."
>
> What I'm not clear on is when the lock happens.  Does it happen every time
> an upsert with an "on duplicate key" clause is executed or only after the
> duplicate key is detected?
>
> Also, if there are two threads attempting to write to the same
> non-existing row, is there a race condition where neither thread executes
> it "on duplicate key" clause?
>
> On 8/26/19, 7:28 PM, "Josh Elser"  wrote:
>
> Out of the box, Phoenix will provide the same semantics that HBase
> does
> for concurrent updates to a (data) table.
>
>
> https://hbase.apache.org/acid-semantics.html
>
> If you're also asking about how index tables remain in sync, the
> answer
> is a bit more complicated (and has changed in recent versions).
>
> On 8/26/19 2:51 PM, Jon Strayer wrote:
> > How does atomic update work with multiple clients?  Assuming that
> there
> > is no matching record to begin with the access won’t be locked.  It
> > seems like two threads could write conflicting data since they both
> see
> > no existing record (NER).  Is that correct? Or is there something
> that
> > will serialize the writes so that only one of them sees the NER
> state?
> >
>
>
>


Re: Buckets VS regions

2019-08-19 Thread James Taylor
It’ll start with 12 regions, but those regions may split as they’re written
to.

On Mon, Aug 19, 2019 at 4:34 PM jesse  wrote:

> I have a table is  SALT_BUCKETS = 12, but it has 14 regions, is this
> right?
>
> Thanks
>
>
>


Re: Phoenix 4 to 5 Upgrade Path

2019-06-15 Thread James Taylor
The data format is not different between 4.0 and 5.0. The metadata can
change, though. We make a best effort for a seamless, automatic upgrade,
but you should test your specific scenario yourself to be confident that
there are no problems. The larger the version difference, the greater the
risk.

On Fri, Jun 14, 2019 at 7:05 AM Josh Elser  wrote:

> IIRC, the guidance is that 4.14 should work back to 4.12. We're talking
> about minor releases in this case, not bugfix. Every bugfix release
> should be compatible within the current minor release. This is one of
> the tenets of https://semver.org/
>
> You should not have issues going to 5.0.0, but I don't know of anyone
> who has explicitly tested this. You should proceed with caution.
>
> On 6/14/19 2:07 AM, Vova Galchenko wrote:
> > Hey Josh,
> >
> > thanks for getting back to me. The current Phoenix version we're on is
> > 4.14.1. Do I understand correctly that the Phoenix community intends to
> > provide data compatibility at least two versions back? Does this
> > intention apply across major version boundaries? More specifically, does
> > it imply that data produced by 4.14.1 is intended to be compatible with
> > 4.14.2 and 5.0.0?
> >
> > Thank you.
> >
> > vova
> >
> > On Wed, Jun 12, 2019 at 1:18 PM Josh Elser  > > wrote:
> >
> > What version of Phoenix 4 are you coming from?
> >
> > Of note, if you're lagging far behind, you'll get bit by the column
> > encoding turning on by default in 4.10 [1]
> >
> > In general, before we update the system catalog table, we take a
> > snapshot of it, so you can roll back (although this would be
> > manual). In
> > terms of testing as a community, we only do testing back two
> releases.
> > After that, your mileage may vary.
> >
> > - Josh
> >
> > [1] https://phoenix.apache.org/columnencoding.html
> >
> > On 6/11/19 9:02 PM, Vova Galchenko wrote:
> >  > Hello Phoenix Users List!
> >  >
> >  > We at Box are thinking about the upgrade story from Phoenix 4 to
> > 5. As
> >  > part of that, we'd like to understand if these Phoenix versions
> > write
> >  > data in formats compatible with each other. In other words,
> > suppose we
> >  > have an HBase 1.4 cluster used by Phoenix 4. Can I shut
> > everything down,
> >  > upgrade HBase to 2.0 and Phoenix to 5, start everything up again,
> > and
> >  > expect to preserve the integrity of the data stored in the
> original
> >  > cluster? If not, is there guidance for what the upgrade process
> > might
> >  > look like?
> >  >
> >  > Thanks,
> >  >
> >  > vova
> >
>


Re: Large differences in query execution time for similar queries

2019-04-17 Thread James Taylor
Hi Hieu,
You could try adding the /*+ SERIAL */ hint to see if that has any impact.
Also, have you tried not salting the table? The SALT_BUCKETS value of 128
is pretty high.

For the other issue, do you have a lot of deleted cells? You might try
running a major compaction. You might try adding a secondary index
on "doubleCol" if that's a common query.

Thanks,
James

On Thu, Apr 11, 2019 at 5:44 PM Hieu Nguyen  wrote:

> Hi,
>
> I am using Phoenix 4.14-cdh5.11, with sqlline-thin as the client.  I am
> seeing strange patterns around SELECT query execution time:
> 1. Increasing the LIMIT past a certain "threshold" results in
> significantly slower execution time.
> 2. Adding just one column (BIGINT) to the SELECT results in significantly
> slower execution time.
>
> This is our schema (names are changed for readability):
> CREATE TABLE "metadata" (
>   "pk"   VARCHAR PRIMARY KEY
> )
> SALT_BUCKETS = 128,
> COLUMN_ENCODED_BYTES = 0,
> BLOOMFILTER = 'ROWCOL',
> COMPRESSION = 'GZ';
>
> CREATE VIEW "extended" (
> "doubleCol" DOUBLE,
> "intCol" BIGINT,
> "intCol2" BIGINT,
> "intCol3" BIGINT,
> "stringCol" VARCHAR,
> "stringCol2" VARCHAR,
> "stringCol3" VARCHAR,
> "stringCol4" VARCHAR,
> "stringCol5" VARCHAR,
> "stringCol6" VARCHAR,
> "stringCol7" VARCHAR,
> ) AS SELECT * FROM "metadata"
>
> We have other views created that also select from "metadata" that define
> their own columns.  Overall, there are 1 million rows in this table, and
> 20k rows match the condition "doubleCol" > 100.
>
> Base query:
> SELECT
> "pk","doubleCol","intCol","intCol2","stringCol","stringCol2","intCol3"
> FROM "templatealldatattype-7d55c5a6-efe3-419d-9bce-9fea7c14f8bc"
> WHERE "doubleCol" > 100
> LIMIT 10000
> -> 1.976 seconds
>
> Decreasing LIMIT to 9500 (only 5% decrease in number of rows):
> SELECT
> "pk","doubleCol","intCol","intCol2","stringCol","stringCol2","intCol3"
> FROM "templatealldatattype-7d55c5a6-efe3-419d-9bce-9fea7c14f8bc"
> WHERE "doubleCol" > 100
> LIMIT 9500
> -> 0.409 seconds
>
> Removing "intCol3" from SELECT, keeping LIMIT at 1:
> SELECT "pk","doubleCol","intCol","intCol2","stringCol","stringCol2"
> FROM "templatealldatattype-7d55c5a6-efe3-419d-9bce-9fea7c14f8bc"
> WHERE "doubleCol" > 100
> LIMIT 10000
> -> 0.339 seconds
>
> I ran each of these queries a few times in a row.  There was small
> variation in execution time, but the 2nd and 3rd queries never were slower
> than the 1st query.
>
> The EXPLAIN plan did not change, except the ROW LIMIT value when
> explaining the 2nd query (9500 instead of 10000).
>
> +----------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> | PLAN                                                                                               | EST_BYTES_READ  | EST_ROWS_READ  | EST_INFO_TS    |
> +----------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> | CLIENT 128-CHUNK 382226 ROWS 314572800 BYTES PARALLEL 128-WAY ROUND ROBIN FULL SCAN OVER metadata  | 314572800       | 382226         | 1554973434637  |
> | SERVER FILTER BY "doubleCol" > 100.0                                                               | 314572800       | 382226         | 1554973434637  |
> | SERVER 10000 ROW LIMIT                                                                             | 314572800       | 382226         | 1554973434637  |
> | CLIENT 10000 ROW LIMIT                                                                             | 314572800       | 382226         | 1554973434637  |
> +----------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
>
> I tried adding the SEEK_TO_COLUMN and NO_SEEK_TO_COLUMN hints as suggested
> in a similar thread (
> https://lists.apache.org/thread.html/4ef8384ecd31f30fdaf5837e3abc613142426d899e916c7aae4a46d4@%3Cuser.phoenix.apache.org%3E),
> but they had no effect.
>
> Any pointers to how we can investigate the 4-5x slowdown when increasing
> LIMIT by only ~5% or when selecting just one more BIGINT column?  Could we
> have exceeded some threshold in the result size that caused the query to
> perform a lot slower for seemingly small changes in the query?
>
> Thanks,
> -Hieu
>


Re: Access Hbase Cell Timestamp using Phoenix UDF

2019-03-05 Thread James Taylor
You should be able to declare your parameter as VARBINARY and you can then
use any type.
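
For example, a sketch using the function from your message (class name and jar
path are the ones you posted):

  CREATE FUNCTION TS_UDF(VARBINARY) RETURNS BIGINT
      AS 'com.mypackage.TsFunction'
      USING JAR 'hdfs:/localhost:8080/hbase/lib/myjar.jar';

The idea is that any column type can then be passed in, since every Phoenix type
can be coerced to VARBINARY.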

On Tue, Mar 5, 2019 at 5:56 PM OMKAR NALAWADE 
wrote:

> Hello,
>
> I am trying to access cell level timestamp of a given column in a table
> using a phoenix UDF.
> *For example:* I have created a UDF named TS_UDF(dataType)
> When we register the function using
> CREATE FUNCTION TS_UDF(BOOLEAN) RETURNS BIGINT as '
> com.mypackage.TsFunction.' using jar '
> hdfs:/localhost:8080/hbase/lib/myjar.jar';
>
> In the above statement, I wish to pass a generic datatype to the TS_UDF
> function instead of Boolean datatype so that it can be used for the columns
> with any datatype.
>
> Is there a way we can create a function and pass a generic datatype in
> Phoenix?
>
> I am using HDP 2.6.4 stack with HBase version 1.1.2 and Phoenix version
> 4.7.0
>
> Regards,
>
> Omkar
>


Re: Phoenix JDBC Connection Warmup

2019-02-02 Thread James Taylor
Have you tried setting UPDATE_CACHE_FREQUENCY on your tables?
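
For example (table name and interval are placeholders; the value is in milliseconds):

  ALTER TABLE MY_TABLE SET UPDATE_CACHE_FREQUENCY = 900000; -- re-resolve metadata at most every 15 minutes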

On Fri, Feb 1, 2019 at 6:28 PM Jaanai Zhang  wrote:

>  we experimented with issuing the same query repeatedly, and we observed a
>> slow down not only on the first query
>
> I am not sure what the reasons are, perhaps you can enable TRACE log to
> find what leads to slow,  I guess that some meta information is reloaded
> under highly write workload.
>
> 
>Jaanai Zhang
>Best regards!
>
>
>
> William Shen  于2019年2月1日周五 上午2:09写道:
>
>> Thanks Jaanai. Do you know if that is expected only on the first query
>> against a table? For us, we experimented with issuing the same query
>> repeatedly, and we observed a slow down not only on the first query. Does
>> it make sense to preemptively load table metadata on start up to warm up
>> the system to reduce latency during the actual query time (if it is
>> possible to do so)?
>>
>> On Wed, Jan 30, 2019 at 10:54 PM Jaanai Zhang 
>> wrote:
>>
>>> It is expected when firstly query tables after establishing the
>>> connection. Something likes loads some meta information into local cache
>>> that need take some time,  mainly including two aspects: 1. access
>>> SYSTEM.CATALOG table to get schema information of the table  2. access the
>>> meta table of HBase to get regions information of the table
>>>
>>> 
>>>Jaanai Zhang
>>>Best regards!
>>>
>>>
>>>
>>> William Shen  于2019年1月31日周四 下午1:37写道:
>>>
 Hi there,

 I have a component that makes Phoenix queries via the Phoenix JDBC
 Connection. I noticed that consistently, the Phoenix Client takes longer to
 execute a PreparedStatement and it takes longer to read through the
 ResultSet for a period of time (~15m) after a restart of the component. It
 seems like there is a warmup period for the JDBC connection. Is this to be
 expected?

 Thanks!

>>>


Re: How to decode composite rowkey back to original primary keys

2019-01-14 Thread James Taylor
You can do something similar to PhoenixRuntime.decodeColumnValues() to do
that. It'd be something like this:

public static Object[] decodeRowKey(Connection conn, String fullTableName,
        byte[] rowKey) throws SQLException {
    // Look up the table metadata and walk the row key schema field by field.
    PTable table = PhoenixRuntime.getTable(conn, fullTableName);
    RowKeySchema schema = table.getRowKeySchema();
    ImmutableBytesWritable ptr = new ImmutableBytesWritable(rowKey);
    int maxOffset = ptr.getOffset() + ptr.getLength(); // end of the row key bytes
    schema.iterator(ptr);
    int i = 0;
    List<Object> values = new ArrayList<Object>();
    while (schema.next(ptr, i, maxOffset) != null) {
        // Convert each serialized PK field back to its Java object form.
        values.add(schema.getField(i).getDataType().toObject(ptr));
        i++;
    }
    return values.toArray();
}
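
A hypothetical call, for illustration (the connection, table name, and the HBase
Result are placeholders):

  byte[] rowKey = result.getRow();
  Object[] pkValues = decodeRowKey(conn, "MY_SCHEMA.MY_TABLE", rowKey);
  // pkValues[0], pkValues[1], ... are the original PK column values, in key order.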




On Mon, Jan 14, 2019 at 12:47 PM Shawn Li  wrote:

> Hi,
>
> Phoenix encodes composite key to hbase rowkey. We want to check if there
> any documentation or example to show how to manually decode Hbase rowkey
> back to original values for those primary keys. Or is there any phoenix
> source code we can directly use to do this?
>
> Thanks,
> Shawn
>


Re: Query All Dynamic Columns

2018-12-26 Thread James Taylor
Persisting dynamic column names+types in Phoenix is exactly what views are
for.
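
For example, a minimal sketch (table, view, and column names are hypothetical): each
set of "dynamic" columns becomes a view over the base table, so the column names and
types are persisted in SYSTEM.CATALOG:

  CREATE TABLE EVENT_BASE (ID VARCHAR PRIMARY KEY, CREATED DATE);

  -- The view adds typed columns on top of the base table:
  CREATE VIEW EVENT_CLICKS (CLICKS BIGINT, REFERRER VARCHAR)
      AS SELECT * FROM EVENT_BASE;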


On Wed, Dec 26, 2018 at 12:05 PM Vincent Poon 
wrote:

> A lot of work is currently going into handling large numbers of views -
> splittable syscat, view management, etc... but agree that it's not ideal.
>
> There's currently no built-in way to do what you want AFAIK, but you can
> manage the columns yourself in a separate table:
> - store them all in a single column value, and read that value before
> doing your query.  HBase checkAndMutate for locking.
> or
> - store each column as separate rows.  Then you can do things like filter
> by column name efficiently.
> You could 'soft delete' by removing the entries.
>
> Would be a nice improvement to have an option to persist dynamic column
> names+types in Phoenix.
>
> On Fri, Dec 21, 2018 at 12:18 PM Clay Baenziger (BLOOMBERG/ 731 LEX) <
> cbaenzi...@bloomberg.net> wrote:
>
>> Hello,
>>
>> A user of mine brought up a question around dynamic columns in Phoenix
>> today. The quantity of columns should become asymptotic to a few tends of
>> thousands of columns as their data fills in.
>>
>> The user want to query all columns in a table and they are today thinking
>> of using views to do this -- but it is ugly management. They have an
>> unbounded number of views -- which will pollute the global catalog and fail
>> relatively quickly.
>>
>> Has anyone thought about the potentially wasteful[1] approach of scanning
>> all rows in a query to determine columns and then re-running the query for
>> the rows once we know what columns the SQL result will contain. Maybe
>> something cleaner like persisting the set of columns in the statistics
>> table and a SELECT * may return columns with nothing but nulls. Or, even
>> better is there an overall better way to model such a wide schema in
>> Phoenix?
>>
>> -Clay
>>
>> [1]: Perhaps some heuristics could allow for not needing to do 2n reads
>> in all cases?
>>
>


Re: High-availability for transactions

2018-11-09 Thread James Taylor
Hi Curtis,
The omid2 branch is up to date with 4.x-hbase-1.3 and we’ll merge it once
the Omid release is complete (vote will start on Mon). Pedro is the main
one maintaining the CDH branches, so this work should make it into a 4.15
CDH release. If you have some spare cycles, I’m sure he could use some help.
Thanks,
James

On Wed, Nov 7, 2018 at 7:40 AM Curtis Howard 
wrote:

> Hi James,
>
> Great - thanks very much for confirming Omid will be available on 4.x
> branches - that includes the cdh/Cloudera branches as well?
>
> Would you have any suggestions on the best approach to merging 'omid2'
> branch work with say, 4.14-cdh5.14?
>
> Curtis
>
>
> On Tue, Nov 6, 2018 at 3:28 PM James Taylor 
> wrote:
>
>> The Omid integration will be available on all active 4.x branches (1.2,
>> 1.3, and 1.4) as well as on 5.x. The Omid integration has the same unit
>> test coverage as Tephra. If you want to give it a spin, let us know. You'd
>> just need to pull the phoenix-integration branch for Omid and the omid2
>> branch for Phoenix and install them locally.
>>
>> Thanks,
>> James
>>
>> On Mon, Nov 5, 2018 at 5:57 PM Curtis Howard <
>> curtis.james.how...@gmail.com> wrote:
>>
>>> Hi Omid,
>>>
>>> Thanks for the quick reply.
>>>
>>> When the Omid/Phoenix integration is complete, will it be released only
>>> with Phoenix 5.x, or will it also be integrated and tested with 4.14.x
>>> branches?
>>> Can you comment at all on what level of testing the integration will
>>> have had, once released?  (will it be considered an alpha, or beta release,
>>> initially?)
>>>
>>> Curtis
>>>
>>>


Re: High-availability for transactions

2018-11-06 Thread James Taylor
The Omid integration will be available on all active 4.x branches (1.2,
1.3, and 1.4) as well as on 5.x. The Omid integration has the same unit
test coverage as Tephra. If you want to give it a spin, let us know. You'd
just need to pull the phoenix-integration branch for Omid and the omid2
branch for Phoenix and install them locally.

Thanks,
James

On Mon, Nov 5, 2018 at 5:57 PM Curtis Howard 
wrote:

> Hi Omid,
>
> Thanks for the quick reply.
>
> When the Omid/Phoenix integration is complete, will it be released only
> with Phoenix 5.x, or will it also be integrated and tested with 4.14.x
> branches?
> Can you comment at all on what level of testing the integration will have
> had, once released?  (will it be considered an alpha, or beta release,
> initially?)
>
> Curtis
>
>


Re: Null array elements with joins

2018-08-13 Thread James Taylor
I commented on the JIRA you filed here: PHOENIX-4791. Best to keep
discussion there.
Thanks,
James

On Mon, Aug 13, 2018 at 11:08 AM, Gerald Sangudi 
wrote:

> Hello all,
>
> Any suggestions or pointers on the issue below?
>
> Projecting array elements works when not using joins, and does not work
> when we use hash joins. Is there an issue with the ProjectionCompiler for
> joins? I have not been able to isolate the specific cause, and would
> appreciate any pointers or suggestions.
>
> Thanks,
> Gerald
>
> On Tue, Jun 19, 2018 at 10:02 AM, Tulasi Paradarami <
> tulasi.krishn...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm running few tests against Phoenix array and running into this bug
>> where array elements return null values when a join is involved. Is this a
>> known issue/limitation of arrays?
>>
>> create table array_test_1 (id integer not null primary key, arr
>> tinyint[5]);
>> upsert into array_test_1 values (1001, array[0, 0, 0, 0, 0]);
>> upsert into array_test_1 values (1002, array[0, 0, 0, 0, 1]);
>> upsert into array_test_1 values (1003, array[0, 0, 0, 1, 1]);
>> upsert into array_test_1 values (1004, array[0, 0, 1, 1, 1]);
>> upsert into array_test_1 values (1005, array[1, 1, 1, 1, 1]);
>>
>> create table test_table_1 (id integer not null primary key, val varchar);
>> upsert into test_table_1 values (1001, 'abc');
>> upsert into test_table_1 values (1002, 'def');
>> upsert into test_table_1 values (1003, 'ghi');
>>
>> 0: jdbc:phoenix:localhost> select t1.id, t2.val, t1.arr[1], t1.arr[2],
>> t1.arr[3] from array_test_1 as t1 join test_table_1 as t2 on t1.id =
>> t2.id;
>> +--------+----------+------------------------+------------------------+------------------------+
>> | T1.ID  | T2.VAL   | ARRAY_ELEM(T1.ARR, 1)  | ARRAY_ELEM(T1.ARR, 2)  | ARRAY_ELEM(T1.ARR, 3)  |
>> +--------+----------+------------------------+------------------------+------------------------+
>> | 1001   | abc      | null                   | null                   | null                   |
>> | 1002   | def      | null                   | null                   | null                   |
>> | 1003   | ghi      | null                   | null                   | null                   |
>> +--------+----------+------------------------+------------------------+------------------------+
>> 3 rows selected (0.056 seconds)
>>
>> However, directly selecting array elements from the array returns data
>> correctly.
>> 0: jdbc:phoenix:localhost> select t1.id, t1.arr[1], t1.arr[2], t1.arr[3]
>> from array_test_1 as t1;
>> +-------+---------------------+---------------------+---------------------+
>> |  ID   | ARRAY_ELEM(ARR, 1)  | ARRAY_ELEM(ARR, 2)  | ARRAY_ELEM(ARR, 3)  |
>> +-------+---------------------+---------------------+---------------------+
>> | 1001  | 0                   | 0                   | 0                   |
>> | 1002  | 0                   | 0                   | 0                   |
>> | 1003  | 0                   | 0                   | 0                   |
>> | 1004  | 0                   | 0                   | 1                   |
>> | 1005  | 1                   | 1                   | 1                   |
>> +-------+---------------------+---------------------+---------------------+
>> 5 rows selected (0.044 seconds)
>>
>>
>>
>


Re: Spark-Phoenix Plugin

2018-08-06 Thread James Taylor
For the UPSERTs on a PreparedStatement that are done by Phoenix for writing
in the Spark adapter, note that these are *not* doing RPCs to the HBase
server to write data (i.e. they are never committed). Instead the UPSERTs
are used to ensure that the correct serialization is performed given the
Phoenix schema. We use a PhoenixRuntime API to get the list of KeyValues from the
uncommitted data and then perform a rollback. Using this technique,
features like salting, column encoding, row timestamp, etc. will continue
to work with the Spark integration.
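
A rough sketch of that pattern (Phoenix 4.x; connection URL and table are
placeholders), using PhoenixRuntime.getUncommittedDataIterator():

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.util.Iterator;
  import java.util.List;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.util.Pair;
  import org.apache.phoenix.util.PhoenixRuntime;

  public class SerializeViaUpsert {
      public static void main(String[] args) throws Exception {
          try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zkhost")) { // placeholder quorum
              conn.setAutoCommit(false);
              try (PreparedStatement stmt =
                       conn.prepareStatement("UPSERT INTO MY_TABLE (ID, VAL) VALUES (?, ?)")) { // hypothetical table
                  stmt.setLong(1, 1L);
                  stmt.setString(2, "example");
                  stmt.executeUpdate(); // buffered client-side only; nothing is sent to HBase yet
              }
              // Pull the serialized cells Phoenix produced for the uncommitted rows.
              Iterator<Pair<byte[], List<KeyValue>>> it = PhoenixRuntime.getUncommittedDataIterator(conn);
              while (it.hasNext()) {
                  Pair<byte[], List<KeyValue>> tableData = it.next();
                  // tableData.getFirst() is the physical table name; getSecond() the KeyValues to write.
              }
              conn.rollback(); // discard the buffered mutations; they were only used for serialization
          }
      }
  }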

Thanks,
James

On Mon, Aug 6, 2018 at 7:44 AM, Jaanai Zhang  wrote:

> you can get better performance if directly read/write HBase. you also use
> spark-phoenix, this is an example, reading data from CSV file and writing
> into Phoenix table:
>
> def main(args: Array[String]): Unit = {
>
>   val sc = new SparkContext("local", "phoenix-test")
>   val path = "/tmp/data"
>   val hbaseConnectionString = "host1,host2,host3"
>   val customSchema = StructType(Array(
> StructField("O_ORDERKEY", StringType, true),
> StructField("O_CUSTKEY", StringType, true),
> StructField("O_ORDERSTATUS", StringType, true),
> StructField("O_TOTALPRICE", StringType, true),
> StructField("O_ORDERDATE", StringType, true),
> StructField("O_ORDERPRIORITY", StringType, true),
> StructField("O_CLERK", StringType, true),
> StructField("O_SHIPPRIORITY", StringType, true),
> StructField("O_COMMENT", StringType, true)))
>
>   //import com.databricks.spark.csv._
>   val sqlContext = new SQLContext(sc)
>
>   val df = sqlContext.read
> .format("com.databricks.spark.csv")
> .option("delimiter", "|")
> .option("header", "false")
> .schema(customSchema)
> .load(path)
>
>   val start = System.currentTimeMillis()
>   df.write.format("org.apache.phoenix.spark")
> .mode("overwrite")
> .option("table", "DATAX")
> .option("zkUrl", hbaseConnectionString)
> .save()
>
>   val end = System.currentTimeMillis()
>   print("taken time:" + ((end - start) / 1000) + "s")
> }
>
>
>
>
> 
>Yun Zhang
>Best regards!
>
>
> 2018-08-06 20:10 GMT+08:00 Brandon Geise :
>
>> Thanks for the reply Yun.
>>
>>
>>
>> I’m not quite clear how this would exactly help on the upsert side?  Are
>> you suggesting deriving the type from Phoenix then doing the
>> encoding/decoding and writing/reading directly from HBase?
>>
>>
>>
>> Thanks,
>>
>> Brandon
>>
>>
>>
>> *From: *Jaanai Zhang 
>> *Reply-To: *
>> *Date: *Sunday, August 5, 2018 at 9:34 PM
>> *To: *
>> *Subject: *Re: Spark-Phoenix Plugin
>>
>>
>>
>> You can get data type from Phoenix meta, then encode/decode data to
>> write/read data. I think this way is effective, FYI :)
>>
>>
>>
>>
>> 
>>
>>Yun Zhang
>>
>>Best regards!
>>
>>
>>
>>
>>
>> 2018-08-04 21:43 GMT+08:00 Brandon Geise :
>>
>> Good morning,
>>
>>
>>
>> I’m looking at using a combination of Hbase, Phoenix and Spark for a
>> project and read that using the Spark-Phoenix plugin directly is more
>> efficient than JDBC, however it wasn’t entirely clear from examples when
>> writing a dataframe if an upsert is performed and how much fine-grained
>> options there are for executing the upsert.  Any information someone can
>> share would be greatly appreciated!
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Brandon
>>
>>
>>
>
>


Re: Apache Phoenix with Google Big Table

2018-08-02 Thread James Taylor
Since Google Bigtable doesn't support coprocessors, you'd need a version of
Phoenix that doesn't rely on coprocessors (some server-side hooks that
HBase provides). It'd be non trivial and you'd lose some functionality and
performance in doing this, but theoretically it's possible.

Here are the kinds of changes you'd need to make:
- modify aggregation so that it's done on the client instead of the server
(this would be slower)
- modify order by so that it's done on the client side (this wouldn't scale
well)
- don't use mutable, global secondary indexing (they're maintained on the
server side)
- don't use sequences (or replace them with a more limited/restricted
implementation on top of HBase increment)
- don't use local indexes (you might be able to cobble something together
that'd work on the client, but there'd be some tricky edge cases)
- don't use transactions (though they could potentially be made to work,
there'd be some effort here)

You'd also want to use Phoenix 5.0 since it's using HBase 2.0 APIs which I
believe are the APIs support by Bigtable.

Another couple alternative would be:
- convince Google to support coprocessors, provide kind of similar
server-side hooks, or support the server side code of Phoenix
- use Big Query instead of Phoenix which has some Bigtable integration

Thanks,
James


On Thu, Aug 2, 2018 at 1:56 PM, Vijay Vangapandu <
vijayvangapa...@eharmony.com> wrote:

> Hi Guys,
>
> We are using Hortneworks Hbase cluster with apache phoenix as query server.
> We recently started migrating few services to GCP ( Google cloud).
>
> To move our on-prem Hbase to google cloud we found bigtable as suitable
> replacement as it supports HBase client, but to use Hbase client we need to
> change application code.
>
> Is there anyway we can work with apache phoenix query server with Google
> Bigtable?
>
> Thanks for your response in advance.
>
> --Vijay.
>
>


Re: order by primary key desc return wrong results

2018-07-31 Thread James Taylor
Please file a JIRA.

On Mon, Jul 30, 2018 at 10:12 PM, jie chen  wrote:

> phoenix-4.14-hbase-1.2
>
> 0: jdbc:phoenix:localhost>  create table test(id bigint not null primary
>>> key, a bigint);
>>
>> No rows affected (1.242 seconds)
>>
>> 0: jdbc:phoenix:localhost> upsert into test values(1,11);
>>
>> 1 row affected (0.01 seconds)
>>
>> 0: jdbc:phoenix:localhost> upsert into test values(2,22);
>>
>> 1 row affected (0.007 seconds)
>>
>> 0: jdbc:phoenix:localhost> upsert into test values(3,33);
>>
>> 1 row affected (0.005 seconds)
>>
>> 0: jdbc:phoenix:localhost> select * from test;
>>
>> +-+-+
>>
>> | ID  |  A  |
>>
>> +-+-+
>>
>> | 1   | 11  |
>>
>> | 2   | 22  |
>>
>> | 3   | 33  |
>>
>> +-+-+
>>
>> 3 rows selected (0.015 seconds)
>>
>> 0: jdbc:phoenix:localhost> select * from test order by id desc limit 2
>>> offset 0;
>>
>> +-+-+
>>
>> | ID  |  A  |
>>
>> +-+-+
>>
>> | 3   | 33  |
>>
>> | 2   | 22  |
>>
>> +-+-+
>>
>> 2 rows selected (0.018 seconds)
>>
>> 0: jdbc:phoenix:localhost> select * from test where id in (select id from
>>> test ) order by id desc limit 2 offset 0;
>>
>> +-+-+
>>
>> | ID  |  A  |
>>
>> +-+-+
>>
>> | 2   | 22  |
>>
>> | 1   | 11  |
>>
>> +-+-+
>>
>> wrong results. should be
>
>>
>>>
>>> +-+-+
>>
>> | ID  |  A  |
>>
>> +-+-+
>>
>> | 3   | 33  |
>>
>> | 2   | 22  |
>>
>> +-+-+
>>
>>  there may be some errors. ScanUtil.setupReverseScan code.
>
>


Re: Statements caching

2018-07-27 Thread James Taylor
There's no statement caching available in Phoenix. That would be a good
contribution, though.
Thanks,
James

On Thu, Jul 26, 2018 at 10:45 AM, Batyrshin Alexander <0x62...@gmail.com>
wrote:

>  Hi all,
> Im wondering how to enable statement caching in Phoenix JDBC Driver.
> Is there anything like "cachePrepStmts"?
>


Re: Upsert is EXTREMELY slow

2018-07-13 Thread James Taylor
Phoenix won’t be slower to update secondary indexes than a use-case-specific
implementation would be. Both have to do the writes to a second table to keep it in sync.

On Fri, Jul 13, 2018 at 8:39 AM Josh Elser  wrote:

> Also, they're relying on Phoenix to do secondary index updates for them.
>
> Obviously, you can do this faster than Phoenix can if you know the exact
> use-case.
>
> On 7/12/18 6:31 PM, Pedro Boado wrote:
> > A tip for performance is reusing the same preparedStatement , just
> > clearParameters() , set values and executeUpdate() over and over again.
> > Don't close the statement or connections after each upsert. Also, I
> > haven't seen any noticeable benefit on using jdbc batches as Phoenix
> > controls batching by when commit() is called.
> >
> > Keep an eye on not calling commit after every executeUpdate (that's a
> > real performance killer) . Batch commits in every ~1k upserts .
> >
> > Also that attempt of asynchronous code is probably another performance
> > killer. Are you creating a new Runnable per database write and opening
> > and closing dB connections per write? Just spawn a few threads (5 to 10,
> > if client cpu is not maxed keep increasing it) and send upserts in a for
> > loop reusing preparedStatement and connections.
> >
> > With a cluster that size I would expect seeing tens of thousands of
> > writes per second.
> >
> > Finally have you checked that all RS receive same traffic ?
> >
> > On Thu, 12 Jul 2018, 23:10 Pedro Boado,  > > wrote:
> >
> > I believe it's related to your client code - In our use case we do
> > easily 15k writes/sec in a cluster lower specced than yours.
> >
> > Check that your jdbc connection has autocommit off so Phoenix can
> > batch writes and that table has a reasonable UPDATE_CACHE_FREQUENCY
> > ( more than 6 ).
> >
> >
> > On Thu, 12 Jul 2018, 21:54 alchemist,  > > wrote:
> >
> > Thanks a lot for your help.
> > Our test is inserting new rows individually. For our use case,
> > we are
> > benchmarking that we could be able to get 10,000 new rows in a
> > minute, using
> > a cluster of writers if needed.
> > When executing the inserts with Phoenix API (UPSERT) we have
> > been able to
> > get up to 6,000 new rows per minute.
> >
> > We changed our test to perform the inserts individually using
> > the HBase API
> > (Put) rather than Phoenix API (UPSERT) and got an improvement of
> > more than
> > 10x. (up to 60,000 rows per minute).
> >
> > What would explain this difference? I assume that in both cases
> > HBase must
> > grab the locks individually in the same way.
> >
> >
> >
> > --
> > Sent from:
> http://apache-phoenix-user-list.1124778.n5.nabble.com/
> >
>
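
A minimal sketch of the write loop Pedro describes above (connection URL and table
are placeholders; autocommit off, one PreparedStatement reused, commit every ~1,000
upserts):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;

  public class UpsertLoop {
      public static void main(String[] args) throws Exception {
          try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zkhost")) { // placeholder quorum
              conn.setAutoCommit(false); // let Phoenix batch the writes
              try (PreparedStatement ps =
                       conn.prepareStatement("UPSERT INTO MY_TABLE (ID, VAL) VALUES (?, ?)")) { // hypothetical table
                  for (long i = 0; i < 1_000_000; i++) {
                      ps.setLong(1, i);
                      ps.setString(2, "value-" + i);
                      ps.executeUpdate();
                      if (i % 1_000 == 0) {
                          conn.commit(); // batch commits, not one per row
                      }
                  }
              }
              conn.commit(); // flush the tail of the batch
          }
      }
  }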


Re: Hash aggregation

2018-06-14 Thread James Taylor
Hi Gerald,
No further suggestions than my comments on the JIRA. Maybe a good next step
would be a patch?
Thanks,
James

On Tue, Jun 12, 2018 at 8:15 PM, Gerald Sangudi 
wrote:

> Hi Maryann and James,
>
> Any further guidance on PHOENIX-4751
> ?
>
> Thanks,
> Gerald
>
> On Wed, May 23, 2018 at 11:00 AM, Gerald Sangudi 
> wrote:
>
>> Hi Maryann,
>>
>> I filed PHOENIX-4751 
>> .
>>
>> Is this likely to be reviewed soon (say next few weeks), or should I look
>> at the Phoenix source to estimate the scope / impact?
>>
>> Thanks,
>> Gerald
>>
>> On Tue, May 22, 2018 at 11:12 AM, Maryann Xue 
>> wrote:
>>
>>> Since the performance running a group-by aggregation on client side is
>>> most likely bad, it’s usually not desired. The original implementation was
>>> for functionality completeness only so it chose the easiest way, which
>>> reused some existing classes. In some cases, though, the client group-by
>>> can still be tolerable if there aren’t many distinct keys. So yes, please
>>> open a JIRA for implementing hash aggregation on client side. Thank you!
>>>
>>>
>>> Thanks,
>>> Maryann
>>>
>>> On Tue, May 22, 2018 at 10:50 AM Gerald Sangudi 
>>> wrote:
>>>
 Hello,

 Any guidance or thoughts on the thread below?

 Thanks,
 Gerald


 On Fri, May 18, 2018 at 11:39 AM, Gerald Sangudi 
 wrote:

> Maryann,
>
> Can Phoenix provide hash aggregation on the client side? Are there
> design / implementation reasons not to, or should I file a ticket for 
> this?
>
> Thanks,
> Gerald
>
> On Fri, May 18, 2018 at 11:29 AM, Maryann Xue 
> wrote:
>
>> Hi Gerald,
>>
>> Phoenix does have hash aggregation. The reason why sort-based
>> aggregation is used in your query plan is that the aggregation happens on
>> the client side. And that is because sort-merge join is used (as hinted)
>> which is a client driven join, and after that join stage all operations 
>> can
>> only be on the client-side.
>>
>>
>> Thanks,
>> Marynn
>>
>> On Fri, May 18, 2018 at 10:57 AM, Gerald Sangudi <
>> gsang...@23andme.com> wrote:
>>
>>> Hello,
>>>
>>> Does Phoenix provide hash aggregation? If not, is it on the roadmap,
>>> or should I file a ticket? We have aggregation queries that do not 
>>> require
>>> sorted results.
>>>
>>> For example, this EXPLAIN plan shows a CLIENT SORT.
>>>
>>> CREATE TABLE unsalted (
>>>     keyA BIGINT NOT NULL,
>>>     keyB BIGINT NOT NULL,
>>>     val SMALLINT,
>>>     CONSTRAINT pk PRIMARY KEY (keyA, keyB));
>>>
>>> EXPLAIN
>>> SELECT /*+ USE_SORT_MERGE_JOIN */ t1.val v1, t2.val v2, COUNT(*) c
>>> FROM unsalted t1 JOIN unsalted t2 ON (t1.keyA = t2.keyA)
>>> GROUP BY t1.val, t2.val;
>>>
>>> +------------------------------------------------------------+-----------------+----------------+
>>> | PLAN                                                       | EST_BYTES_READ  | EST_ROWS_READ  |
>>> +------------------------------------------------------------+-----------------+----------------+
>>> | SORT-MERGE-JOIN (INNER) TABLES                             | null            | null           |
>>> |     CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER UNSALTED  | null            | null           |
>>> | AND                                                        | null            | null           |
>>> |     CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER UNSALTED  | null            | null           |
>>> | CLIENT SORTED BY [TO_DECIMAL(T1.VAL), T2.VAL]              | null            | null           |
>>> | CLIENT AGGREGATE INTO DISTINCT ROWS BY [T1.VAL, T2.VAL]    | null            | null           |
>>> +------------------------------------------------------------+-----------------+----------------+
>>> Thanks,
>>> Gerald
>>>
>>
>>
>

>>
>


Re: Atomic UPSERT on indexed tables

2018-06-11 Thread James Taylor
It's possible that local indexes could be allowed for atomic upserts, but
global indexes are problematic (in that under load your cluster would
probably die). The reason is that there'd be a cross RS call made for each
row being atomically upserted. If the call hangs due to the RS hosting the
data being down, it ties up the handler thread which can lead to a
cascading effect when writes fail.

Not sure what your timeframe is, but support for Omid transactions is
getting close to being committed. It's on the omid2 feature branch. They
play well with HBase replication, so perhaps that's an option.

On Fri, Jun 8, 2018 at 12:01 PM, Miles Spielberg  wrote:

> This represents a serious shortcoming for our use case. We require some
> level of same-row atomic update capability on tables we'd also like to have
> global indices on. Our understanding is that running under Tephra doesn't
> play well with HBase replication, which we've been intending to use for
> disaster recovery.
>
> Our workload is read-dominated, so we think we're willing to accept the
> row-locking and write time performance penalties. Is this something that
> could be put under an option? Are there other significant shortcomings to
> combining ON DUPLICATE KEY with global indices?
>
> PHOENIX-3925 mentions "service protection," but I'm not clear on what that's
> referring to.
>
> Miles Spielberg
> Staff Software Engineer
>
>
> O. 650.485.1102
> 900 Jefferson Ave
> 
> Redwood City, CA 94063
> 
>
> On Mon, Jun 4, 2018 at 7:14 PM, Sergey Soldatov 
> wrote:
>
>> Yes, the documentation doesn't reflect the recent changes. Please see
>> https://issues.apache.org/jira/browse/PHOENIX-3925
>>
>> Thanks,
>> Sergey
>>
>> On Fri, Jun 1, 2018 at 5:39 PM, Miles Spielberg  wrote:
>>
>>> From https://phoenix.apache.org/atomic_upsert.html:
>>>
>>> Although global indexes on columns being atomically updated are
 supported, it’s not recommended as a potentially a separate RPC across the
 wire would be made while the row is under lock to maintain the secondary
 index.
>>>
>>>
>>> But the parser on 4.13.1 doesn't seem to agree:
>>>
>>> 0: jdbc:phoenix:thin:url=http://192.168.99.10> CREATE TABLE T1 (A
>>> VARCHAR PRIMARY KEY, B VARCHAR);
>>>
>>> No rows affected (1.299 seconds)
>>>
>>> 0: jdbc:phoenix:thin:url=http://192.168.99.10> CREATE INDEX T1_B ON
>>> T1(B);
>>>
>>> No rows affected (6.285 seconds)
>>>
>>> 0: jdbc:phoenix:thin:url=http://192.168.99.10> UPSERT INTO T1(A,B)
>>> VALUES('hello', 'world') ON DUPLICATE KEY IGNORE;
>>>
>>> Error: Error -1 (0) : Error while executing SQL "UPSERT INTO T1(A,B)
>>> VALUES('hello', 'world') ON DUPLICATE KEY IGNORE": Remote driver error:
>>> RuntimeException: java.sql.SQLException: ERROR 1224 (42Z24): The ON
>>> DUPLICATE KEY clause may not be used when a table has a global index.
>>> tableName=T1 -> SQLException: ERROR 1224 (42Z24): The ON DUPLICATE KEY
>>> clause may not be used when a table has a global index. tableName=T1
>>> (state=0,code=-1)
>>>
>>> 0: jdbc:phoenix:thin:url=http://192.168.99.10>
>>>
>>> Am I doing something wrong here? Is the documentation inaccurate?
>>>
>>>
>>> Miles Spielberg
>>> Staff Software Engineer
>>>
>>>
>>> O. 650.485.1102
>>> 900 Jefferson Ave
>>> 
>>> Redwood City, CA 94063
>>> 
>>>
>>
>>
>


[ANNOUNCE] Apache Phoenix 4.14 released

2018-06-11 Thread James Taylor
The Apache Phoenix team is pleased to announce the immediate availability
of the 4.14.0 release. Apache Phoenix enables SQL-based OLTP and
operational analytics for Apache Hadoop using Apache HBase as its backing
store and providing integration with other projects in the Apache ecosystem
such as Spark, Hive, Pig, Flume, and MapReduce.

Highlights of the release include:

* Support for HBase 1.4
* Support for CDH 5.11.2, 5.12.2, 5.13.2, and 5.14.2
* Support for GRANT and REVOKE commands
* Secondary index improvements

For more details, visit our blog here [1] and download source and binaries
here [2].

Thanks,
James (on behalf of the Apache Phoenix team)

[1] https://blogs.apache.org/phoenix/entry/announcing-phoenix-4-14-released
[2] http://phoenix.apache.org/download.html


Re: Problem with query with: limit, offset and order by

2018-05-25 Thread James Taylor
OFFSET will not scale well with large values as there is no way to
implement it in HBase other than scanning from the beginning and skipping
that many rows. I'd suggest using row value constructors instead. You can
read more about that here: https://phoenix.apache.org/paged.html
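
For example, a sketch of keyset paging with a row value constructor (table and column
names are placeholders; col1, col2 are assumed to be the leading primary key columns,
and the bind values are the last row returned by the previous page):

  SELECT col1, col2, other_col
  FROM my_table
  WHERE (col1, col2) > (?, ?)   -- last (col1, col2) of the previous page
  ORDER BY col1, col2
  LIMIT 1000;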

Thanks,
James

On Fri, May 25, 2018 at 6:36 AM, ychernya...@gmail.com <
ychernya...@gmail.com> wrote:

> Hi everyone.
>
>    I faced a problem executing a query. We have a table with 150 rows; if
> you execute a query with a huge offset, ORDER BY, and LIMIT, Phoenix
> will crash with the following error:
>
>
> Example query:
>
> Select * from table order by col1 , col2 limit 1000 offset 15677558;
>
> Caused by:
>
> Error: org.apache.phoenix.exception.PhoenixIOException:
> org.apache.hadoop.hbase.DoNotRetryIOException: TABLE,,1524572088462.
> 899bce582428250714db99a6b679e435.: Requested memory of 4201719544 bytes
> is larger than global pool of 272655974 bytes.
> at org.apache.phoenix.util.ServerUtil.createIOException(
> ServerUtil.java:96)
> at org.apache.phoenix.util.ServerUtil.throwIOException(
> ServerUtil.java:62)
> at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$
> RegionScannerHolder.overrideDelegate(BaseScannerRegionObserver.java:264)
> at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$
> RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:293)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.
> scan(RSRpcServices.java:2808)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.
> scan(RSRpcServices.java:3045)
> at org.apache.hadoop.hbase.protobuf.generated.
> ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2352)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(
> RpcExecutor.java:297)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(
> RpcExecutor.java:277)
> Caused by: org.apache.phoenix.memory.InsufficientMemoryException:
> Requested memory of 4201719544 bytes is larger than global pool of
> 272655974 bytes.
> at org.apache.phoenix.memory.GlobalMemoryManager.allocateBytes(
> GlobalMemoryManager.java:66)
> at org.apache.phoenix.memory.GlobalMemoryManager.allocate(
> GlobalMemoryManager.java:89)
> at org.apache.phoenix.memory.GlobalMemoryManager.allocate(
> GlobalMemoryManager.java:95)
> at org.apache.phoenix.iterate.NonAggregateRegionScannerFacto
> ry.getTopNScanner(NonAggregateRegionScannerFactory.java:315)
> at org.apache.phoenix.iterate.NonAggregateRegionScannerFacto
> ry.getRegionScanner(NonAggregateRegionScannerFactory.java:163)
> at org.apache.phoenix.coprocessor.ScanRegionObserver.
> doPostScannerOpen(ScanRegionObserver.java:72)
> at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$
> RegionScannerHolder.overrideDelegate(BaseScannerRegionObserver.java:245)
> ... 8 more (state=08000,code=101)
>
>
> But if you remove "order by" or "limit " problem will gone.
>
>
> Versions:
> Phoenix:  4.13.1
> Hbase: 1.3
> Hadoop : 2.6.4
>


Re: Cannot access from jdbc

2018-05-23 Thread James Taylor
Hi Nicolas,
I don't think it's fair to ask the Apache Phoenix community to investigate
a vendor specific issue from a more than two year old forked repo. I have
no idea which of the 912 JIRAs since Phoenix 4.7.0 made it into that fork.

Thanks,
James

On Wed, May 23, 2018 at 9:26 AM, Nicolas Paris  wrote:

> Hi James.
>
> I tried to make the update myself but without success.
> This is actually a custom 4.7 version (here is the commit log:
> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_release-notes/content/patch_phoenix.html).
> The commit log looks more recent than two years old.
>
>
> 2018-05-23 18:16 GMT+02:00 James Taylor :
>
>> The 4.7 release is more than two years old. That's seven releases back
>> from the current release we're voting on 4.14. I'd recommend working with
>> your vendor and urging them to upgrade to a newer, supportable version.
>>
>> Thanks,
>> James
>>
>> On Wed, May 23, 2018 at 9:10 AM, Nicolas Paris 
>> wrote:
>>
>>> hi team
>>>
>>> I am having this error while trying to connect from a java program to a 
>>> working phoenix 4.7 (hdp build) instance:
>>>
>>>
>>>  Wed May 23 16:40:56 CEST 2018, null, java.net.SocketTimeoutException: 
>>> callTimeout=6, callDuration=69037:
>>>  Call to /: failed on local exception: java.io.IOException: 
>>> Connection reset by peer row 'SYSTEM:CATALOG,,'
>>>  on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
>>> hostname=,,1526661975419, seqNum=0
>>>
>>> The hbase is securized with kerberos, and the whole is managed by zookeeper.
>>>
>>> Any insight on this behavior ?
>>>
>>> Thanks a lot,
>>>
>>>
>>>
>>>
>>>
>>
>


Re: Cannot access from jdbc

2018-05-23 Thread James Taylor
The 4.7 release is more than two years old. That's seven releases back from
the current release we're voting on 4.14. I'd recommend working with your
vendor and urging them to upgrade to a newer, supportable version.

Thanks,
James

On Wed, May 23, 2018 at 9:10 AM, Nicolas Paris  wrote:

> hi team
>
> I am having this error while trying to connect from a java program to a 
> working phoenix 4.7 (hdp build) instance:
>
>
>  Wed May 23 16:40:56 CEST 2018, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=69037:
>  Call to /: failed on local exception: java.io.IOException: 
> Connection reset by peer row 'SYSTEM:CATALOG,,'
>  on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=,,1526661975419, seqNum=0
>
> The hbase is securized with kerberos, and the whole is managed by zookeeper.
>
> Any insight on this behavior ?
>
> Thanks a lot,
>
>
>
>
>


Re: SORT_MERGE_JOIN on non-leading key: server-side sorting

2018-05-17 Thread James Taylor
Hi Gerald,
The fix for PHOENIX-4508 will appear in the 4.14.0 release which we're
working on now. We should have a second RC up shortly that you can use to
verify. The fix isn't in 4.13 since it was checked in after the release.

Thanks,
James

On Thu, May 17, 2018 at 4:44 PM, Maryann Xue  wrote:

> Hey, this is a simple experiment I did based on an existing test case.
> Look for "query3". The test verifies that there is no "SERVER SORT" in the
> entire query plan (both salted and unsalted tables have been covered):
>
>
> @Test
> public void testBug4508() throws Exception {
> Properties props = PropertiesUtil.deepCopy(TEST_PROPERTIES);
> Connection conn = DriverManager.getConnection(getUrl(), props);
> props = PropertiesUtil.deepCopy(TEST_PROPERTIES);
> props.setProperty("TenantId", "010");
> Connection conn010 = DriverManager.getConnection(getUrl(), props);
> try {
> // Salted tables
> String peopleTable = generateUniqueName();
> String myTable = generateUniqueName();
> conn.createStatement().execute("CREATE TABLE " + peopleTable + " (\n" 
> +
> "PERSON_ID VARCHAR NOT NULL,\n" +
> "NAME VARCHAR\n" +
> "CONSTRAINT PK_TEST_PEOPLE PRIMARY KEY (PERSON_ID)) 
> SALT_BUCKETS = 3");
> conn.createStatement().execute("CREATE TABLE " + myTable + " (\n" +
> "LOCALID VARCHAR NOT NULL,\n" +
> "DSID VARCHAR(255) NOT NULL, \n" +
> "EID CHAR(40),\n" +
> "HAS_CANDIDATES BOOLEAN\n" +
> "CONSTRAINT PK_MYTABLE PRIMARY KEY (LOCALID, DSID)) 
> SALT_BUCKETS = 3");
> verifyQueryPlanAndResultForBug4508(conn, peopleTable, myTable);
>
> // Salted multi-tenant tables
> String peopleTable2 = generateUniqueName();
> String myTable2 = generateUniqueName();
> conn.createStatement().execute("CREATE TABLE " + peopleTable2 + " 
> (\n" +
> "TENANT_ID VARCHAR NOT NULL,\n" +
> "PERSON_ID VARCHAR NOT NULL,\n" +
> "NAME VARCHAR\n" +
> "CONSTRAINT PK_TEST_PEOPLE PRIMARY KEY (TENANT_ID, 
> PERSON_ID))\n" +
> "SALT_BUCKETS = 3, MULTI_TENANT=true");
> conn.createStatement().execute("CREATE TABLE " + myTable2 + " (\n" +
> "TENANT_ID VARCHAR NOT NULL,\n" +
> "LOCALID VARCHAR NOT NULL,\n" +
> "DSID VARCHAR(255) NOT NULL, \n" +
> "EID CHAR(40),\n" +
> "HAS_CANDIDATES BOOLEAN\n" +
> "CONSTRAINT PK_MYTABLE PRIMARY KEY (TENANT_ID, LOCALID, 
> DSID))\n" +
> "SALT_BUCKETS = 3, MULTI_TENANT=true");
> verifyQueryPlanAndResultForBug4508(conn010, peopleTable2, myTable2);
> } finally {
> conn.close();
> conn010.close();
> }
> }
>
>
> private static void verifyQueryPlanAndResultForBug4508(
> Connection conn, String peopleTable, String myTable) throws Exception {
> PreparedStatement peopleTableUpsertStmt = conn.prepareStatement(
> "UPSERT INTO " + peopleTable + " VALUES(?, ?)");
> peopleTableUpsertStmt.setString(1, "X001");
> peopleTableUpsertStmt.setString(2, "Marcus");
> peopleTableUpsertStmt.execute();
> peopleTableUpsertStmt.setString(1, "X002");
> peopleTableUpsertStmt.setString(2, "Jenny");
> peopleTableUpsertStmt.execute();
> peopleTableUpsertStmt.setString(1, "X003");
> peopleTableUpsertStmt.setString(2, "Seymour");
> peopleTableUpsertStmt.execute();
> conn.commit();
>
> PreparedStatement myTableUpsertStmt = conn.prepareStatement(
> "UPSERT INTO " + myTable + " VALUES(?, ?, ?, ?)");
> myTableUpsertStmt.setString(1, "X001");
> myTableUpsertStmt.setString(2, "GROUP");
> myTableUpsertStmt.setString(3, null);
> myTableUpsertStmt.setBoolean(4, false);
> myTableUpsertStmt.execute();
> myTableUpsertStmt.setString(1, "X001");
> myTableUpsertStmt.setString(2, "PEOPLE");
> myTableUpsertStmt.setString(3, null);
> myTableUpsertStmt.setBoolean(4, false);
> myTableUpsertStmt.execute();
> myTableUpsertStmt.setString(1, "X003");
> myTableUpsertStmt.setString(2, "PEOPLE");
> myTableUpsertStmt.setString(3, null);
> myTableUpsertStmt.setBoolean(4, false);
> myTableUpsertStmt.execute();
> myTableUpsertStmt.setString(1, "X002");
> myTableUpsertStmt.setString(2, "PEOPLE");
> myTableUpsertStmt.setString(3, "Z990");
> myTableUpsertStmt.setBoolean(4, false);
> myTableUpsertStmt.execute();
> conn.commit();
>
> String query1 = "SELECT /*+ USE_SORT_MERGE_JOIN*/ COUNT(*)\n" +
> "FROM " + peopleTable + " ds JOIN " + myTable + " l\n" +
> "ON ds.PERSON_ID = l.LOCALID\n" +
> "WHERE l.EID IS NULL AND l.DSID = 'PEOPLE' AND l.HAS_CANDIDATES = FALSE";
> String query2 = "SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*)\n" +
> "FROM (SELECT LOCALID FROM " + myTable + "\n" +
> "WHERE EID IS NULL AND DSID = 'PEOPLE' AND HAS_CANDIDATES = FALSE) l\n" +
> "JOIN " + peopleTable + " ds ON ds.PERSON_ID = l.LOCALID";
> String query3 = "SELECT /*+ USE_SORT

Re: Binary fields and compression

2018-05-13 Thread James Taylor
You can have a property only apply to a single column family by prefixing
it with the family name:

CREATE TABLE DOCUMENTS (HOST VARCHAR NOT NULL PRIMARY KEY, A.CONTENT
VARBINARY, B.TEXT VARCHAR, B.LABEL VARCHAR, B.DATE_CREATE TIMESTAMP)
B.COMPRESSION='GZ'

On Sun, May 13, 2018 at 3:50 AM Nicolas Paris  wrote:

> Hi,
>
> My use case is storing pdf files from one side, and it's content as text
> from the other. Two kind of queries would either get the text, or the pdf
> from the primary key. Rarely both.
>
> Then I guess the option is to create two columns family, one containing a
> VARBINARY column to store the binary pdf, and the other column family
> containing the text and other metadata in other columns.
>
> Since text can be compressed, I guess that's would be a good idea to
> compress the table.
>
> CREATE TABLE DOCUMENTS (HOST VARCHAR NOT NULL PRIMARY KEY, A.CONTENT
> VARBINARY, B.TEXT VARCHAR, B.LABEL VARCHAR, B.DATE_CREATE TIMESTAMP)
> COMPRESSION='GZ'
>
> The problem is the compression here is for both column family. As a
> result, I will end up compressing pdf, that is lost of efforts.
>
> What do you suggest ?
>
> Thanks
>


Re: UPSERT null vlaues

2018-04-28 Thread James Taylor
The contract with immutable tables is that a row will never be partially
updated (including setting a column to null). Phoenix can’t enforce this
and retain the perf gain it gives you, so it’s on an honor system instead.
Attempts to set a value to null are stripped out rather than potentially
placing a delete marker that in theory shouldn’t be needed.

So short answer is to not declare your table as immutable if you need to
set column values to null.
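
For example (a hypothetical table; IMMUTABLE_ROWS defaults to false, so simply omit it,
or set it back to false, if you need to null out columns):

  CREATE TABLE TEMP.MY_TABLE (ROWKEY BIGINT NOT NULL PRIMARY KEY, COL1 VARCHAR, COL2 VARCHAR);

  -- On a mutable table this places a delete marker for COL2 as expected:
  UPSERT INTO TEMP.MY_TABLE (ROWKEY, COL1, COL2) VALUES (100, 'ABC', NULL);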

Thanks,
James

On Sat, Apr 28, 2018 at 3:24 AM Stepan Migunov <
stepan.migu...@firstlinesoftware.com> wrote:

> Thank you James, it was “immutable”. I didn't know that it affects.
>
>
>
> *From:* James Taylor [mailto:jamestay...@apache.org]
> *Sent:* Friday, April 27, 2018 5:37 PM
> *To:* user@phoenix.apache.org
> *Subject:* Re: UPSERT null vlaues
>
>
>
> Hi Stepan,
>
> Please post your complete DDL and indicate the version of Phoenix and
> HBase you’re using.  Your example should work as expected barring
> declaration of the table as immutable or COL2 being part of the primary key.
>
>
>
> Thanks,
>
> James
>
>
>
> On Fri, Apr 27, 2018 at 6:13 AM Stepan Migunov <
> stepan.migu...@firstlinesoftware.com> wrote:
>
> Hi,
> Could you please clarify, how I can set a value to NULL?
>
> After upsert into temp.table (ROWKEY, COL1, COL2) values (100, "ABC",
> null); the value of COL2 still has a previous value (COL1 has "ABC" as
> expected).
>
> Or there is only one way - to set  STORE_NULLS = true?
>
> Thanks,
> Stepan.
>
>


Re: UPSERT null vlaues

2018-04-27 Thread James Taylor
Hi Stepan,
Please post your complete DDL and indicate the version of Phoenix and HBase
you’re using.  Your example should work as expected barring declaration of
the table as immutable or COL2 being part of the primary key.

Thanks,
James

On Fri, Apr 27, 2018 at 6:13 AM Stepan Migunov <
stepan.migu...@firstlinesoftware.com> wrote:

> Hi,
> Could you please clarify, how I can set a value to NULL?
>
> After upsert into temp.table (ROWKEY, COL1, COL2) values (100, "ABC",
> null); the value of COL2 still has a previous value (COL1 has "ABC" as
> expected).
>
> Or there is only one way - to set  STORE_NULLS = true?
>
> Thanks,
> Stepan.
>


Re: Split and distribute regions of SYSTEM.STATS table

2018-04-23 Thread James Taylor
For stats, what do you have your guidepost width set to? Do you only have a
single physical table? We've found that a value of 300MB still provides a
good enough granularity to get good bytes scanned estimates. We're
currently using an HBase API to update stats atomically. We could easily
make it splittable per column family per table if that's helpful. We could
also use the regular batch API and not have it be atomic (since stats are
just an estimate anyway). Please file a JIRA.
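
For reference, a sketch of the settings involved (values are examples): the cluster-wide
default is phoenix.stats.guidepost.width in hbase-site.xml, and, assuming a Phoenix
version with the per-table property, it can be overridden per table:

  ALTER TABLE MY_TABLE SET GUIDE_POSTS_WIDTH = 314572800; -- ~300 MB per guidepost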

In theory, it would be fine to set IN_MEMORY to true, but there are a
couple of practical issues:
- You'll need the fix for PHOENIX-4579 or the property would be set back to
false when another client connects
- We already have client-side caching for stats and catalog and server-side
caching for catalog, so not sure how beneficial this would be.

Thanks,
James

On Sun, Apr 22, 2018 at 3:56 AM, Batyrshin Alexander <0x62...@gmail.com>
wrote:

> If all stats for a given table have to be on the same region, there is no
> benefit in splitting.
>
> Another question: is it ok to set 'IN_MEMORY' => 'true' for CF of SYSTEM.*
> tables?
>
>
> On 20 Apr 2018, at 23:39, James Taylor  wrote:
>
> Thanks for bringing this to our attention. There's a bug here in that the
> SYSTEM.STATS table has a custom split policy that prevents splitting from
> occurring (PHOENIX-4700). We'll get a fix out in 4.14, but in the meantime
> it's safe to split the table, as long as all stats for a given table are on
> the same region.
>
> James
>
> On Fri, Apr 20, 2018 at 1:37 PM, James Taylor 
> wrote:
>
>> Thanks for bringing this to our attention. There's a bug here in that the
>> SYSTEM.STATS
>>
>> On Wed, Apr 18, 2018 at 9:59 AM, Batyrshin Alexander <0x62...@gmail.com>
>> wrote:
>>
>>>  Hello,
>>> I've discovered that SYSTEM.STATS has only 1 region with size 3.25 GB.
>>> Is it ok to split it and distribute over different region servers?
>>
>>
>>
>
>


Re: Split and distribute regions of SYSTEM.STATS table

2018-04-20 Thread James Taylor
Thanks for bringing this to our attention. There's a bug here in that the
SYSTEM.STATS table has a custom split policy that prevents splitting from
occurring (PHOENIX-4700). We'll get a fix out in 4.14, but in the meantime
it's safe to split the table, as long as all stats for a given table are on
the same region.

James

On Fri, Apr 20, 2018 at 1:37 PM, James Taylor 
wrote:

> Thanks for bringing this to our attention. There's a bug here in that the
> SYSTEM.STATS
>
> On Wed, Apr 18, 2018 at 9:59 AM, Batyrshin Alexander <0x62...@gmail.com>
> wrote:
>
>>  Hello,
>> I've discovered that SYSTEM.STATS has only 1 region with size 3.25 GB. Is
>> it ok to split it and distribute over different region servers?
>
>
>


Re: Split and distribute regions of SYSTEM.STATS table

2018-04-20 Thread James Taylor
Thanks for bringing this to our attention. There's a bug here in that the
SYSTEM.STATS

On Wed, Apr 18, 2018 at 9:59 AM, Batyrshin Alexander <0x62...@gmail.com>
wrote:

>  Hello,
> I've discovered that SYSTEM.STATS has only 1 region with size 3.25 GB. Is
> it ok to split it and distribute over different region servers?


Re: hint to use a global index is not working - need to find out why

2018-04-20 Thread James Taylor
Ron - Salting is only recommended when your primary key is monotonically
increasing. It's mainly used to prevent write hotspotting. Also, I think
Ron forgot to mention, but I was working with him a bit earlier on this,
and I couldn't repro the issue either (in current 4.x or in 4.7 release).
Here's the unit test I put together which hints a non covered global index:

@Test
public void testIndexHintWithNonCoveredColumnSelected() throws
Exception {
String schemaName = "";
String dataTableName = "T_FOO";
String indexTableName = "I_FOO";
String dataTableFullName = SchemaUtil.getTableName(schemaName,
dataTableName);
String indexTableFullName = SchemaUtil.getTableName(schemaName,
indexTableName);
try (Connection conn = DriverManager.getConnection(getUrl())) {
conn.createStatement().execute("CREATE TABLE " +
dataTableFullName + "(k INTEGER PRIMARY KEY, v1 VARCHAR, v2 VARCHAR)");
conn.createStatement().execute("CREATE INDEX " + indexTableName
+ " ON " + dataTableName + "(v1)");
PhoenixStatement stmt =
conn.createStatement().unwrap(PhoenixStatement.class);
QueryPlan plan;
stmt.executeQuery("SELECT v1, v2 " +
" FROM " + dataTableFullName +
" WHERE v1='a'");
plan = stmt.getQueryPlan();
assertEquals("Expected (" + dataTableFullName + ") but was " +
plan.getSourceRefs(), 1, plan.getSourceRefs().size());

stmt.executeQuery("SELECT /*+ INDEX(" + dataTableFullName + " "
+ indexTableName + ") */ v1, v2" +
" FROM " + dataTableFullName +
" WHERE v1='a'");
plan = stmt.getQueryPlan();
assertEquals("Expected (" + dataTableFullName + "," +
indexTableFullName + ") but was " +  plan.getSourceRefs(), 2,
plan.getSourceRefs().size());
}
}



On Fri, Apr 20, 2018 at 1:11 PM, Taylor, Ronald (Ronald) <
ronald.tay...@cchmc.org> wrote:

> Hello Sergey,
>
>
>
> Per your request, here are the commands that I used to create the table
> and its indexes.  Hopefully you can find something in here that provides a
> guide as to what we are doing wrong.
>
>
>
> BTW – as I said, we are novices with Phoenix here. One thing we are doing
> is playing around a bit with salting numbers. We believed that the data in
> our test table was small enough to fit on one region server ( < 10 GB), so
> we used a high salt number (24) to try to force HBase to use more than one
> region server, to parallelize over more than one node. Did we get that
> concept right?
>
>
>
> Ron
>
>
>
> %
>
>
> CREATE TABLE variantjoin_RT_salted24 (
> chrom VARCHAR,
> genomic_range VARCHAR,
> reference VARCHAR,
> alternate VARCHAR,
> annotations VARCHAR,
> consequence VARCHAR,
> chrom_int INTEGER,
> onekg_maf DOUBLE,
> coding VARCHAR,
> esp_aa_maf DOUBLE,
> esp_ea_maf DOUBLE,
> exac_maf DOUBLE,
> filter VARCHAR,
> gene VARCHAR,
> impact VARCHAR,
> polyphen VARCHAR,
> sift VARCHAR,
> viva_maf DOUBLE,
> variant_id INTEGER PRIMARY KEY,
> genomic_range_start INTEGER,
> genomic_range_end INTEGER
> ) SALT_BUCKETS = 24, IMMUTABLE_ROWS=false;
>
>
>
>
>
> 0: jdbc:phoenix:> !describe variantjoin_RTsalted24
> +------------+--------------+-------------------------+----------------+
> | TABLE_CAT  | TABLE_SCHEM  |       TABLE_NAME        |  COLUMN_NAME   |
> +------------+--------------+-------------------------+----------------+
> |            |              | VARIANTJOIN_RTSALTED24  | CHROM          |
> |            |              | VARIANTJOIN_RTSALTED24  | GENOMIC_RANGE  |
> |            |              | VARIANTJOIN_RTSALTED24  | REFERENCE      |
> |            |              | VARIANTJOIN_RTSALTED24  | ALTERNATE      |
> |            |              | VARIANTJOIN_RTSALTED24  | ANNOTATIONS    |
> |            |              | VARIANTJOIN_RTSALTED24  | CONSEQUENCE    |
> |            |              | VARIANTJOIN_RTSALTED24  | CHROM_INT      |
> |            |              | VARIANTJOIN_RTSALTED24  | ONEKG_MAF      |
> |            |              | VARIANTJOIN_RTSALTED24  | CODING         |
> |            |              | VARIANTJOIN_RTSALTED24  | ESP_AA_MAF     |
> |            |              | VARIANTJOIN_RTSALTED24  | ESP_EA_MAF     |
> |            |              | VARIANTJOIN_RTSALTED24  | EXAC_MAF       |
> |            |              | VARIANTJOIN_RTSALTED24  | FILTER         |
> |            |              | VARIANTJOIN_RTSALTED24  | GENE           |
> |            |              | VARIANTJOIN_RTSALTED24  | IMPACT         |
> |            |              | VARIANTJOIN_RTSALTED24  | POLYPHEN       |
> |            |              | VARIANTJOIN_RTSALTED24  |
>

Re: using an array field as an index - PHOENIX-1544

2018-04-20 Thread James Taylor
Hi Ron,
Best place to ask questions about a JIRA is on the JIRA itself. If you see
no activity on it for a long time, it often means that no one in the
community has had time to fix it. This probably indicates that it's not as
high priority as other JIRAs that have been fixed. This, of course, is
dependent on the use cases of each contributor to the project. If it's an
important one for your use case, I'd recommend contributing a patch.
Thanks,
James

On Fri, Apr 20, 2018 at 8:29 AM, Taylor, Ronald (Ronald) <
ronald.tay...@cchmc.org> wrote:

> Hello James,
>
>
>
> I did a search using PHOENIX-1544 and could not find any updates to your
> June 2015 post on the Phoenix list, so I wanted to ask: what is the current
> status for indexing array fields over immutable (or even mutable) tables?
> We could certainly use such.
>
>
>
> Ron
>
>
>
> On 2015/06/21 23:10:04, James Taylor  wrote:
>
> > Hey Leon,>
>
> > I filed PHOENIX-1544 a while back for indexing arrays over immutable>
>
> > tables. If you're interested in contributing a patch, that'd be great.>
>
> > I'm happy to help you along the way.>
>
> > Thanks,>
>
> > James>
>
> >
>
> > On Sun, Jun 21, 2015 at 12:48 AM, Leon Prouger 
> wrote:>
>
> > > Hey James, thank you for replying.>
>
> > > Yes you're right, this option is pretty useless is our case. We've
> been>
>
> > > thinking to create a separate table which will model the array with
> one to>
>
> > > many relation, then index it and perform join with the main table for
> every>
>
> > > query. Like:>
>
> > >>
>
> > > Main table: PK(id), data columns>
>
> > > Array table: PK(id, array cell), index of the array cell>
>
> > >>
>
> > > But I wonder is the join gonna be faster then full scan of the main
> table.>
>
> > >>
>
> > > Is there any plans for implementing an array index? Maybe it can be
> done for>
>
> > > immutable tables only.>
>
> > >>
>
> > >>
>
> > > On Wed, Jun 17, 2015 at 10:05 PM James Taylor >
>
> > > wrote:>
>
> > >>>
>
> > >> Hey Leon,>
>
> > >> You can have an array in an index, but it has to be at the end of
> the>
>
> > >> PK constraint which is not very useful and likely not what you want
> ->
>
> > >> it'd essentially be equivalent of having the array at the end of
> your>
>
> > >> primary key constraint.>
>
> > >>>
>
> > >> The other alternative I can think of that may be more useful is to
> use>
>
> > >> functional indexing[1] on specific array elements. You'd need to
> know>
>
> > >> the position of the element that you're indexing and querying
> against>
>
> > >> in advance, though.>
>
> > >>>
>
> > >> [1] http://phoenix.apache.org/secondary_indexing.html#
> Functional_Index>
>
> > >>>
>
> > >> On Wed, Jun 17, 2015 at 4:43 AM, Leon Prouger 
> wrote:>
>
> > >> > Hey folks,>
>
> > >> >>
>
> > >> > Maybe I'm asking too much but I couldn't find a straight answer. Is
> this>
>
> > >> > possible to index an array type with Phoenix?>
>
> > >> >>
>
> > >> > If I can't does anybody tried any alternatives? Like keeping
> another>
>
> > >> > table>
>
> > >> > for the array many to one relation>
>
> >
>
>
>
> Ronald C. Taylor, Ph.D.
> Divisions of Immunobiology and Biomedical Informatics
>
> Cincinnati Children's Hospital Medical Center
>
> Office phone: 513-803-4880
>
> Cell phone: 509-783-7308
>
> Email: ronald.tay...@cchmc.org
>
>
>
>
>


Re: hbase cell storage different bewteen bulk load and direct api

2018-04-19 Thread James Taylor
I believe we still rely on that empty key value, even for compact storage
formats (though theoretically it could likely be made so we don't - JIRA,
please?) A quick test would confirm:
- upsert a row with no last_name or first_name
- select * from T where last_name IS NULL
If the row isn't returned, then we need that empty key value.
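
A minimal sketch of that test against the EXAMPLE table from this thread (the
pk value 3 is arbitrary):

UPSERT INTO example (my_pk) VALUES (3);
-- only the row key is populated; m.first_name and m.last_name are left unset
SELECT * FROM example WHERE last_name IS NULL;
-- if the empty key value were not written, this row would not come back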

Thanks,
James

On Thu, Apr 19, 2018 at 1:58 PM, Sergey Soldatov 
wrote:

> Heh. That looks like a bug actually. This is a 'dummy' KV (
> https://phoenix.apache.org/faq.html#Why_empty_key_value), but I have some
> doubts that we need it for compacted rows.
>
> Thanks,
> Sergey
>
> On Thu, Apr 19, 2018 at 11:30 PM, Lew Jackman  wrote:
>
>> I have not tried the master yet branch yet, however on Phoenix 4.13 this
>> storage discrepancy in hbase is still present with the extra
>> column=M:\x00\x00\x00\x00 cells in hbase when using psql or sqlline.
>>
>> Does anyone have an understanding of the meaning of the column qualifier
>> \x00\x00\x00\x00 ?
>>
>>
>> -- Original Message --
>> From: "Lew Jackman" 
>> To: user@phoenix.apache.org
>> Cc: user@phoenix.apache.org
>> Subject: Re: hbase cell storage different bewteen bulk load and direct api
>> Date: Thu, 19 Apr 2018 13:59:16 GMT
>>
>> The upsert statement appears the same as the psql results - i.e. extra
>> cells. I will try the master branch next. Thanks for the tip.
>>
>> -- Original Message --
>> From: Sergey Soldatov 
>> To: user@phoenix.apache.org
>> Subject: Re: hbase cell storage different bewteen bulk load and direct api
>> Date: Thu, 19 Apr 2018 12:26:25 +0600
>>
>> Hi Lew,
>> no, the 1st one looks incorrect. You may file a bug on that (I believe
>> that the second case is correct, but you may also check with uploading data
>> using regular upserts). Also, you may check whether the master branch has
>> this issue.
>>
>> Thanks,
>> Sergey
>>
>> On Thu, Apr 19, 2018 at 10:19 AM, Lew Jackman 
>> wrote:
>>
>>> Under Phoenix 4.11 we are seeing some storage discrepancies in hbase
>>> between a load via psql and a bulk load.
>>>
>>> To illustrate in a simple case we have modified the example table from
>>> the load reference https://phoenix.apache.org/bulk_dataload.html
>>>
>>> CREATE TABLE example (
>>>     my_pk bigint not null,
>>>     m.first_name varchar(50),
>>>     m.last_name varchar(50)
>>>     CONSTRAINT pk PRIMARY KEY (my_pk))
>>>     IMMUTABLE_ROWS=true,
>>>     IMMUTABLE_STORAGE_SCHEME = SINGLE_CELL_ARRAY_WITH_OFFSETS,
>>>     COLUMN_ENCODED_BYTES = 1;
>>>
>>> Hbase Rows when Loading via PSQL
>>>
>>> x80x00x00x00x00x0009    column=M:x00x00x00x00, timestamp=1524109827690, value=x
>>> x80x00x00x00x00x0009    column=M:1, timestamp=1524109827690, value=xJohnDoex00\\\\x00x00x01x00x05x00x00x00x08x00x00x00x03x02
>>> x80x00x00x00x00x01x092  column=M:x00x00x00x00, timestamp=1524109827690, value=x
>>> x80x00x00x00x00x01x092  column=M:1, timestamp=1524109827690, value=xMaryPoppinsx00\\\\x00x00x01x00x05\\\\x00x00x00x0Cx00x00x00x03x02
>>>
>>> Hbase Rows when Loading via MapReduce using CsvBulkLoadTool
>>>
>>> x80x00x00x00x00x0009    column=M:1, timestamp=1524110486638, value=xJohnDoex00\\\\x00x00x01x00x05x00x00x00x08x00x00x00x03x02
>>> x80x00x00x00x00x01x092  column=M:1, timestamp=1524110486638, value=xMaryPoppinsx00\\\\x00x00x01x00x05\\\\x00x00x00x0Cx00x00x00x03x02
>>>
>>>
>>> So, the psql load produces 4 cells for the two rows, whereas the bulk load
>>> is missing two cells since it lacks the cells with column
>>> qualifier x00x00x00x00.
>>>
>>> Is this behavior correct?
>>>
>>> Thanks much for any insight.
>>>
>>>
>>>

Re: hotspot in System.catalog table

2018-04-12 Thread James Taylor
It’s client side, but that’ll only impact newly created tables. You’ll need
to use the ALTER TABLE command for existing tables.
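
For instance (a sketch; MY_TABLE is a placeholder and 60000 ms is just one
reasonable value):

-- existing table: change the property in place
ALTER TABLE my_table SET UPDATE_CACHE_FREQUENCY = 60000;

-- newly created tables pick up phoenix.default.update.cache.frequency from the
-- client-side hbase-site.xml, or can declare the property explicitly:
CREATE TABLE my_new_table (k BIGINT PRIMARY KEY, v VARCHAR)
    UPDATE_CACHE_FREQUENCY = 60000;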

On Thu, Apr 12, 2018 at 11:30 PM anil gupta  wrote:

> I have set phoenix.default.update.cache.frequency=6 in hbase-site.xml
> via ambari(we barely alter schema). Is this a client or server side
> property?
>
> On Thu, Apr 12, 2018 at 11:14 PM, anil gupta 
> wrote:
>
>> I c. As per documentation[1], even for commits of upsert system.catalog
>> is called. IMO, ALWAYS seems to be really aggressive. Is there any reason
>> UPDATE_CACHE_FREQUENCY is set to ALWAYS by default? Do we plan to change
>> the default value to 5 or 10 sec? Thanks for your help.
>>
>>
>> PS: we were running into a lot of Phoenix scalability issues due to this.
>>
>> [1] https://phoenix.apache.org/language/index.html#options
>>
>> On Thu, Apr 12, 2018 at 11:06 PM, James Taylor 
>> wrote:
>>
>>> No, that won’t make a difference.
>>>
>>> On Thu, Apr 12, 2018 at 10:51 PM anil gupta 
>>> wrote:
>>>
>>>> Thanks for quick reply, James. We will look into UPDATE_CACHE_FREQUENCY
>>>> property. If we just replace PS with Statement, will it fix the
>>>> problem(AFAIK, Statement is not compiled)?
>>>>
>>>> On Thu, Apr 12, 2018 at 10:43 PM, James Taylor 
>>>> wrote:
>>>>
>>>>> Try setting the UPDATE_CACHE_FREQUENCY table property (and configuring
>>>>> the phoenix.default.update.cache.frequency system-wide property). That'll
>>>>> prevent pinging the region hosting SYSTEM.CATALOG every time a query is
>>>>> compiled. We've found value of even 5 seconds makes a big difference. For
>>>>> more on that, see here[1] and here[2].
>>>>>
>>>>> In the future, we'll let the SYSTEM.CATALOG table span multiple
>>>>> regions - keep an eye on PHOENIX-3534.
>>>>>
>>>>> Thanks,
>>>>> James
>>>>>
>>>>> [1] https://phoenix.apache.org/#Altering
>>>>> [2] https://phoenix.apache.org/language/index.html#options
>>>>>
>>>>> On Thu, Apr 12, 2018 at 10:32 PM, anil gupta 
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> System.catalog table seems to be single region table(correct?). We
>>>>>> are currently facing a problem of hotspot on System.catalog table.
>>>>>> One of our app does around 4-5k select queries/sec. And, It is
>>>>>> creating a new preparedstatement everytime. I suspect that while
>>>>>> instantiating a new preparedstatement(contrary to Statement),
>>>>>> system.catalog table is queried first. Hence, it is resulting into
>>>>>> hotspotting. Is my analysis correct?
>>>>>>
>>>>>> (I have already suggested my colleagues to try using Statement
>>>>>> instead of PS if they have to create a new one everytime.)
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Anil Gupta
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Anil Gupta
>>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>>
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>


Re: hotspot in System.catalog table

2018-04-12 Thread James Taylor
No, that won’t make a difference.

On Thu, Apr 12, 2018 at 10:51 PM anil gupta  wrote:

> Thanks for quick reply, James. We will look into UPDATE_CACHE_FREQUENCY
> property. If we just replace PS with Statement, will it fix the
> problem(AFAIK, Statement is not compiled)?
>
> On Thu, Apr 12, 2018 at 10:43 PM, James Taylor 
> wrote:
>
>> Try setting the UPDATE_CACHE_FREQUENCY table property (and configuring
>> the phoenix.default.update.cache.frequency system-wide property). That'll
>> prevent pinging the region hosting SYSTEM.CATALOG every time a query is
>> compiled. We've found value of even 5 seconds makes a big difference. For
>> more on that, see here[1] and here[2].
>>
>> In the future, we'll let the SYSTEM.CATALOG table span multiple regions -
>> keep an eye on PHOENIX-3534.
>>
>> Thanks,
>> James
>>
>> [1] https://phoenix.apache.org/#Altering
>> [2] https://phoenix.apache.org/language/index.html#options
>>
>> On Thu, Apr 12, 2018 at 10:32 PM, anil gupta 
>> wrote:
>>
>>> Hi All,
>>>
>>> System.catalog table seems to be single region table(correct?). We are
>>> currently facing a problem of hotspot on System.catalog table.
>>> One of our app does around 4-5k select queries/sec. And, It is creating
>>> a new preparedstatement everytime. I suspect that while instantiating a new
>>> preparedstatement(contrary to Statement), system.catalog table is queried
>>> first. Hence, it is resulting into hotspotting. Is my analysis correct?
>>>
>>> (I have already suggested my colleagues to try using Statement instead
>>> of PS if they have to create a new one everytime.)
>>>
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>


Re: hotspot in System.catalog table

2018-04-12 Thread James Taylor
Try setting the UPDATE_CACHE_FREQUENCY table property (and configuring
the phoenix.default.update.cache.frequency system-wide property). That'll
prevent pinging the region hosting SYSTEM.CATALOG every time a query is
compiled. We've found value of even 5 seconds makes a big difference. For
more on that, see here[1] and here[2].

In the future, we'll let the SYSTEM.CATALOG table span multiple regions -
keep an eye on PHOENIX-3534.

Thanks,
James

[1] https://phoenix.apache.org/#Altering
[2] https://phoenix.apache.org/language/index.html#options

On Thu, Apr 12, 2018 at 10:32 PM, anil gupta  wrote:

> Hi All,
>
> System.catalog table seems to be single region table(correct?). We are
> currently facing a problem of hotspot on System.catalog table.
> One of our app does around 4-5k select queries/sec. And, It is creating a
> new preparedstatement everytime. I suspect that while instantiating a new
> preparedstatement(contrary to Statement), system.catalog table is queried
> first. Hence, it is resulting into hotspotting. Is my analysis correct?
>
> (I have already suggested my colleagues to try using Statement instead of
> PS if they have to create a new one everytime.)
>
> --
> Thanks & Regards,
> Anil Gupta
>


Re: "Unable to discover transaction service" prompt out when tried to create table with transaction = true

2018-03-28 Thread James Taylor
I suspect the issue may be due to the version of Guava in the hbase/lib
directory being too old. Try replacing it with a newer one - I think Tephra
needs Guava 14 or above.

On Wed, Mar 28, 2018 at 2:38 PM ivany...@gmail.com 
wrote:

> Hi, I was trying to enable the transaction function of Phoenix on my
> hbase.  but I keep getting the "Unable to discover transaction service"
> problem.
> My setting:
> 1. Single node running on a docker.
> 2. HBase configured as pseudo distributed mode(Hbase manages zookeeper)
> 3. HBase version 1.2.6, Phoenix: 4.13.1, sqlline 1.2.0
> 4. I put below setting to both hbase-site.xml in folder /phoenix/bin/ and
> /hbase/conf/:
> <property>
>   <name>phoenix.transactions.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>data.tx.snapshot.dir</name>
>   <value>/tmp/tephra/snapshots</value>
> </property>
> <property>
>   <name>data.tx.timeout</name>
>   <value>60</value>
> </property>
>
> Steps:
> 1. Set up the two hbase-site.xml files as above.
> 2. Start HBase.
> 3. Start Tephra with the command: tephra start (I also put phoenix/bin on the
> classpath).
> 4. Tephra runs normally with no exceptions.
> 5. sqlline.py myServerName:2181
> 6. After getting into sqlline mode, basic Phoenix commands work fine, and I
> can also create tables without transactions enabled.
> 7. When I run the transaction-enabling DDL below, the error appears:
> CREATE TABLE my_table (k BIGINT PRIMARY KEY, v VARCHAR) TRANSACTIONAL=true;
>
> It seems the transaction manager cannot be found when I trigger a
> transaction-related action.
> I did some research and found that the transaction manager registers itself
> in ZooKeeper, and any transactional action should automatically find the
> manager and run the transaction through it.
> I am wondering whether the HBase-managed ZooKeeper works for Phoenix
> transactions, or what mistake I made that led to the problem.
>
>
>
>


Re: Slow query help

2018-03-16 Thread James Taylor
Hi Flavio,
You'll need to add a secondary index to SOMEFIELD (or SOMEFIELD + VALID) to
speed that up. You can write it more simply as SELECT COUNT(DISTINCT
SOMEFIELD) FROM TEST.MYTABLE WHERE VALID AND SOMEFIELD IS NOT NULL.

Otherwise, you'll end up doing a full table scan (and use a fair amount of
memory depending on the cardinality of SOMEFIELD). The above with a
secondary index would skip to the distinct values instead.
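
A sketch of the DDL (the index names are made up):

-- index on SOMEFIELD alone, or also covering the filter column as suggested above
CREATE INDEX idx_somefield ON TEST.MYTABLE (SOMEFIELD);
CREATE INDEX idx_somefield_valid ON TEST.MYTABLE (SOMEFIELD, VALID);

-- the simplified query; with the index, the distinct SOMEFIELD values can be
-- skipped to instead of scanning every row
SELECT COUNT(DISTINCT SOMEFIELD) FROM TEST.MYTABLE
WHERE VALID AND SOMEFIELD IS NOT NULL;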

Thanks,
James

On Fri, Mar 16, 2018 at 8:30 AM, Flavio Pompermaier 
wrote:

> Hi to all,
> I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2):
>
> SELECT COUNT(*) FROM (
>   SELECT DISTINCT(SOMEFIELD)
>   FROM TEST.MYTABLE
>   WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL
> )
>
> Unfortunately the query times out (timeout is 10 min). Any suggestion
> about how to tune my installation?
>
> Best,
> Flavi
>


Re: Direct HBase vs. Phoenix query performance

2018-03-15 Thread James Taylor
Hi Marcell,
Yes, that's correct - the cache we build for the RHS is only kept around
while the join query is being executed. It'd be interesting to explore
keeping the cache around longer for cases like yours (and probably not too
difficult). We'd need to keep a map that maps the RHS query to its hash
join cache identifier and if found skip the running of the query. Would you
mind filing a JIRA and we can explore further?
Thanks,
James

On Wed, Mar 14, 2018 at 3:40 PM, Marcell Ortutay 
wrote:

> A quick update--I did some inspection of the Phoenix codebase, and it
> looks like my understanding of the coprocessor cache was incorrect. I
> thought it was meant to be used across queries, eg. that the RHS of the
> join would be saved for subsequent queries. In fact this is not the case,
> the coprocessor cache is meant to live only for the duration of the query.
> This explains the performance difference--Phoenix is re-running a long
> subquery for each join, whereas my direct to HBase script saves those
> results across queries.
>
> On Tue, Mar 13, 2018 at 4:56 PM, Marcell Ortutay 
> wrote:
>
>> Hi James,
>>
>> Thanks for the tips. Our row keys are (I think) reasonably optimized.
>> I've made a gist which is an anonymized version of the query, and it
>> indicates which conditions are / are not part of the PK. It is here:
>> https://gist.github.com/ortutay23andme/12f03767db13343ee797c328a4d78c9c
>>
>> I don't (yet) have an anonymized version of the raw HBase Go script
>> available, but after comparing the performance of the two, I've figured out
>> the root cause. The query does a subquery to produce the LHS of one of the
>> hash joins. This can be seen on L5 of the gist above. This subquery is
>> quite long (~1sec) to execute and scans a few million rows. It is shared
>> across all queries so in the raw HBase script I cached / re-used it for all
>> queries. This has a (very large) performance benefit, in particular under
>> high load.
>>
>> My understanding of Phoenix is that it is supposed to do the same thing.
>> It seems like the hash join code has some mechanic for caching data for
>> hash joining using the HBase coprocessor system. I would expect this cache
>> to kick in, and only execute the large subquery once. Looking at the
>> performance of the query (30sec timeouts after ~2qps), this doesn't seem to
>> be happening.
>>
>> I'm wondering if my understanding of the Phoenix join cache is right. Is
>> it correct to expect that it would cache the results of a subquery used in
>> a join? If so, what are possible reasons why it would *not* do so? Any
>> guidance on metrics / optimizations to look at would be appreciated.
>>
>> Thanks,
>> Marcell
>>
>> On Thu, Mar 8, 2018 at 2:59 PM, James Taylor 
>> wrote:
>>
>>> Hi Marcell,
>>> It'd be helpful to see the table DDL and the query too along with an
>>> idea of how many regions might be involved in the query. If a query is a
>>> commonly run query, usually you'll design the row key around optimizing it.
>>> If you have other, simpler queries that have determined your row key, then
>>> another alternative is to add one or more secondary indexes. Another common
>>> technique is to denormalize your data in ways that precompute the join to
>>> avoid having to do it at run time.
>>>
>>> With joins, make sure to order your tables from post filtered largest
>>> (on LHS) to smallest (on RHS). Also, if you're joining on the PK of both
>>> tables, you should use the USE_SORT_MERGE_JOIN hint. Another common tuning
>>> exercise is around determining the best parallelization to use (i.e.
>>> guidepost width) or even disabling parallelization for more than an entire
>>> region's worth of data.
>>>
>>> It'd also be interesting to see the raw HBase code for a query of this
>>> complexity.
>>>
>>> Thanks,
>>> James
>>>
>>> On Thu, Mar 8, 2018 at 1:03 PM, Marcell Ortutay 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am using Phoenix at my company for a large query that is meant to be
>>>> run in real time as part of our application. The query involves several
>>>> aggregations, anti-joins, and an inner query. Here is the (anonymized)
>>>> query plan: https://gist.github.com/ortutay23andme/1da620472cc469e
>>>> d2d8a6fdd0cc7eb01
>>>>
>>>> The query performance on this is not great, it takes about 5sec to
>>>> execute the query, and moreover it p

Re: Direct HBase vs. Phoenix query performance

2018-03-08 Thread James Taylor
Hi Marcell,
It'd be helpful to see the table DDL and the query too along with an idea
of how many regions might be involved in the query. If a query is a
commonly run query, usually you'll design the row key around optimizing it.
If you have other, simpler queries that have determined your row key, then
another alternative is to add one or more secondary indexes. Another common
technique is to denormalize your data in ways that precompute the join to
avoid having to do it at run time.

With joins, make sure to order your tables from post filtered largest (on
LHS) to smallest (on RHS). Also, if you're joining on the PK of both
tables, you should use the USE_SORT_MERGE_JOIN hint. Another common tuning
exercise is around determining the best parallelization to use (i.e.
guidepost width) or even disabling parallelization for more than an entire
region's worth of data.
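
Purely as an illustration (the table and column names here are placeholders,
not Marcell's actual schema):

-- larger post-filter table on the left, smaller on the right; if both sides are
-- already ordered by the join key (e.g. it's the PK of both), hint the sort merge join
SELECT /*+ USE_SORT_MERGE_JOIN */ b.id, s.name
FROM big_table b
JOIN small_table s ON s.id = b.id
WHERE b.created_date > TO_DATE('2018-01-01');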

It'd also be interesting to see the raw HBase code for a query of this
complexity.

Thanks,
James

On Thu, Mar 8, 2018 at 1:03 PM, Marcell Ortutay 
wrote:

> Hi,
>
> I am using Phoenix at my company for a large query that is meant to be run
> in real time as part of our application. The query involves several
> aggregations, anti-joins, and an inner query. Here is the (anonymized)
> query plan: https://gist.github.com/ortutay23andme/
> 1da620472cc469ed2d8a6fdd0cc7eb01
>
> The query performance on this is not great, it takes about 5sec to execute
> the query, and moreover it performs badly under load. If we run ~4qps of
> this query Phoenix starts to timeout and slow down a lot (queries take
> >30sec).
>
> For comparison, I wrote a simple Go script that runs a similar query
> talking directly to HBase. The performance on it is substantially better.
> It executes in ~1.5sec, and can handle loads of ~50-100qps on the same
> cluster.
>
> I'm wondering if anyone has ideas on what might be causing this difference
> in performance? Are there configs / optimizations we can do in Phoenix to
> bring the performance closer to direct HBase queries?
>
> I can provide context on the table sizes etc. if needed.
>
> Thanks,
> Marcell
>
>


Re: Runtime DDL supported?

2018-03-08 Thread James Taylor
Thanks for digging that up, Miles. I've added a comment on the JIRA on how
to go about implementing it here:
https://issues.apache.org/jira/browse/PHOENIX-3547?focusedCommentId=16391739&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16391739

That would be a good first contribution if you're up for it.

Thanks,
James

On Wed, Mar 7, 2018 at 5:09 PM, Miles Spielberg  wrote:

> We found https://issues.apache.org/jira/browse/PHOENIX-3547, which seems
> to be precisely our problem. We would want at least the option to use a
> bigint rather than the int in the JIRA to accommodate massive growth. While
> we intend to have many tenants, we don't intend to use the Phoenix
> "tenant_id" to differentiate them, and instead manage them at our
> application layer, so separate counters per Phoenix tenant would not help
> in our situation.
>
> Miles Spielberg
> Staff Software Engineer
>
>
> O. 650.485.1102
> 900 Jefferson Ave
> Redwood City, CA 94063
>
> On Wed, Feb 28, 2018 at 10:27 PM, James Taylor 
> wrote:
>
>> Please file a JIRA as it’d be feasible to change this limitation. The
>> easiest way would be to have a separate counter for each tenant. Another
>> way to reduce the number of indexes on tenant specific views would be to
>> factor out common columns to global views and create indexes there.
>>
>> On Tue, Feb 27, 2018 at 3:40 PM Miles Spielberg  wrote:
>>
>>> As we discussed, indexes across views are stored in a single HBase table
>>> associated with the original table (_IDX_). That's grand for
>>> limiting the number of HBase tables created, but I just realized that the
>>> actual index data within is differentiated by the 16-bit "viewIndexId",
>>> which limits us to 64K indexes across all views for a given table. That's
>>> concerning for our use case, especially if its a cumulative autoincrement
>>> across all CREATE INDEX and DROP INDEX operations over the lifetime of the
>>> base table.
>>>
>>> Is there any workaround for this? A quick grep across the source
>>> indicates that the length of viewIndexId is currently hard-coded.
>>>
>>> At least, this limitation should probably be added to the list of
>>> caveats and warnings at https://phoenix.apache.org/views.html.
>>>
>>> Miles Spielberg
>>> Staff Software Engineer
>>>
>>>
>>> O. 650.485.1102
>>> 900 Jefferson Ave
>>> Redwood City, CA 94063
>>>
>>> On Thu, Feb 22, 2018 at 7:42 AM, James Taylor 
>>> wrote:
>>>
>>>> Another option would be to use dynamic columns[1] when querying across
>>>> views. You’d have to disable column encoding [2] in this case.
>>>>
>>>> [1] http://phoenix.apache.org/dynamic_columns.html
>>>> [2] http://phoenix.apache.org/columnencoding.html
>>>>
>>>> On Wed, Feb 21, 2018 at 11:57 PM Miles Spielberg  wrote:
>>>>
>>>>> I believe each query in a UNION needs to have the same result tuple
>>>>> format, which would work in this toy example, but in the general case each
>>>>> view would have a different schema. We could make the result tuples 
>>>>> conform
>>>>> with each other by selecting NULL literals for every column except those 
>>>>> in
>>>>> a view. It would get quite verbose though. Assuming f1,f2,f3 all have
>>>>> incompatible types, were you suggesting something like this?
>>>>>
>>>>> Select f1, null, null from v1 where PK=?
>>>>> Union all
>>>>> Select null, f2, null from v2 where PK=?
>>>>> Union all
>>>>> Select null, null, f3 from v3 where PK=?
>>>>>
>>>>> We might just run separate parallel queries against each view and
>>>>> merge the results client side. I would guess this should

Re: Runtime DDL supported?

2018-02-28 Thread James Taylor
Please file a JIRA as it’d be feasible to change this limitation. The
easiest way would be to have a separate counter for each tenant. Another
way to reduce the number of indexes on tenant specific views would be to
factor out common columns to global views and create indexes there.

On Tue, Feb 27, 2018 at 3:40 PM Miles Spielberg  wrote:

> As we discussed, indexes across views are stored in a single HBase table
> associated with the original table (_IDX_). That's grand for
> limiting the number of HBase tables created, but I just realized that the
> actual index data within is differentiated by the 16-bit "viewIndexId",
> which limits us to 64K indexes across all views for a given table. That's
> concerning for our use case, especially if its a cumulative autoincrement
> across all CREATE INDEX and DROP INDEX operations over the lifetime of the
> base table.
>
> Is there any workaround for this? A quick grep across the source indicates
> that the length of viewIndexId is currently hard-coded.
>
> At least, this limitation should probably be added to the list of caveats
> and warnings at https://phoenix.apache.org/views.html.
>
> Miles Spielberg
> Staff Software Engineer
>
>
> O. 650.485.1102
> 900 Jefferson Ave
> Redwood City, CA 94063
>
> On Thu, Feb 22, 2018 at 7:42 AM, James Taylor 
> wrote:
>
>> Another option would be to use dynamic columns[1] when querying across
>> views. You’d have to disable column encoding [2] in this case.
>>
>> [1] http://phoenix.apache.org/dynamic_columns.html
>> [2] http://phoenix.apache.org/columnencoding.html
>>
>> On Wed, Feb 21, 2018 at 11:57 PM Miles Spielberg  wrote:
>>
>>> I believe each query in a UNION needs to have the same result tuple
>>> format, which would work in this toy example, but in the general case each
>>> view would have a different schema. We could make the result tuples conform
>>> with each other by selecting NULL literals for every column except those in
>>> a view. It would get quite verbose though. Assuming f1,f2,f3 all have
>>> incompatible types, were you suggesting something like this?
>>>
>>> Select f1, null, null from v1 where PK=?
>>> Union all
>>> Select null, f2, null from v2 where PK=?
>>> Union all
>>> Select null, null, f3 from v3 where PK=?
>>>
>>> We might just run separate parallel queries against each view and merge
>>> the results client side. I would guess this should perform well since the
>>> block cache can be leveraged for queries after the first.
>>>
>>> We could also use the HBase API to run a point row get. We'd have to
>>> reimplement decoding for Phoenix's column values, which is not ideal but
>>> quite doable.
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 21, 2018, at 9:09 PM, James Taylor 
>>> wrote:
>>>
>>> Have you tried a UNION ALL query on (f1, f2, f3) instead? It seems
>>> you’re on a good track with multiple views over a single (or handful) of
>>> physical table(s).
>>>
>>> On Wed, Feb 21, 2018 at 6:45 PM Miles Spielberg  wrote:
>>>
>>>> I've done some experimentation with views, with a schema resembling
>>>> this:
>>>>
>>>> create table t1(
>>>>>
>>>>> pk bigint not null primary key
>>>>>
>>>>> );
>>>>>
>>>>>
>>>>>> create view v1(
>>>>>
>>>>> f1 varchar
>>>>>
>>>>> ) AS SELECT * FROM t1;
>>>>>
>>>>> create INDEX v1_f1 ON v1(f1);
>>>>>
>>>>>
>>>>>> create view v2(
>>>>>
>>>>> f2 varchar
>>>>>
>>>>> ) AS SELECT * FROM t1;
>>>>>
>>>>> create INDEX v2_f2 ON v2(f2);
>>>>>
>>>>>
>>>>>> create view v3(
>>>>>
>>>>> f3 varchar
>>>>>
>>>>> ) AS SELECT * FROM t1;
>>>>>
>>>>> create INDEX v3_f3 ON v3(f3);
>>>>>
>>>>>
>>>> Most of the time w

Re: Secondary index question

2018-02-27 Thread James Taylor
Please file a JIRA and include the Phoenix and HBase version. Sounds like
you’ve found a bug.

On Tue, Feb 27, 2018 at 9:21 AM Jonathan Leech  wrote:

> I’ve done what you’re looking for by selecting the pk from the index in a
> nested query and filtering the other column separately.
>
> > On Feb 27, 2018, at 6:39 AM, Alexey Karpov  wrote:
> >
> > Thanks for quick answer, but my case is a slightly different. I've seen
> these
> > links and already use local index. All cases, described in faq,
> index_usage
> > and any other, I've found in this user list, are about SELECT clause. In
> > WHERE clause there is always field from the index.
> >
> > In my case in WHERE clause I have one field from the index and one not
> from
> > the index, combined with AND operator:
> > SELECT * from test WHERE name = 'a' *AND description = 'b'*
> > name - from the index
> > description - not from the index
> >
> > Without filter on description (only on name) Phoenix uses index, as
> expected
> > for local index. But with additional filter Phoenix decides to do a full
> > scan. And my question is: Is there any way to make Phoenix use index in
> such
> > types of queries, without include all fields in index ?
> >
> > Hint does not help:
> > SELECT /*+ INDEX(test ix_test_name) */ name FROM test WHERE name = 'a'
> AND
> > description= 'b'
> > ColumnNotFoundException: ERROR 504 (42703): Undefined column.
> > columnName=DESCRIPTION.
> >
> > Thanks,
> > Alexey.
> >
> >
> >
> > --
> > Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/
>
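
A sketch of the nested-query workaround Jonathan describes, using the TEST
table and IX_TEST_NAME index from this thread:

SELECT *
FROM test
WHERE id IN (SELECT id FROM test WHERE name = 'a')  -- inner query can be served by the index
  AND description = 'b';                            -- non-indexed column filtered afterwards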


Re: Secondary index question

2018-02-26 Thread James Taylor
See https://phoenix.apache.org/secondary_indexing.html#Index_Usage. We get
this question a fair amount. We have an FAQ, here [1], but it's not a very
complete answer (as it doesn't mention hinting or local indexes), so it'd
be good if it was updated.

Thanks,
James

[1]
https://phoenix.apache.org/faq.html#Why_isnt_my_secondary_index_being_used

On Mon, Feb 26, 2018 at 7:43 AM, Josh Elser  wrote:

> IIRC, Phoenix will only choose to use an index when all columns are
> covered (either the index is on the columns or the columns are explicitly
> configured to be covered in the DDL).
>
>
>
> On 2/26/18 6:45 AM, Alexey Karpov wrote:
>
>> Hi.
>>
>> Let’s say I have a table CREATE TABLE test (id integer NOT NULL PRIMARY
>> KEY,
>> name VARCHAR, description VARCHAR)
>> with local index CREATE LOCAL INDEX ix_test_name ON test(name)
>>
>> For the query SELECT * FROM test WHERE name = 'a'
>> it’s all right, Phoenix uses index.
>>
>> But for the query SELECT * FROM test WHERE name = 'a' AND description =
>> 'b'
>> It makes full scan over the table. Is there any way to make Phoenix use
>> index in such queries(when one field is in index and another is not),
>> without including another field into the index ?
>>
>> I use HDP 2.6.2 and Phoenix 4.7.
>>
>> Best regards,
>> Alexey
>>
>>
>>
>> --
>> Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/
>>
>>


Re: Incorrect number of rows affected from DELETE query

2018-02-22 Thread James Taylor
Phoenix returns the number of delete markers that were placed if it’s a
point delete, not the actual number of rows deleted. Otherwise you’d need
to do a read before the delete (which would be costly). It’s possible that
this could be made configurable - please file a JIRA. You could work around
this by adding a nullable column that’s never populated to your primary key
or by adding another ANDed expression to your WHERE clause for a non pk
column which you know is always true.
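
For example, against the table from this thread (a sketch; CREATED_TS stands in
for any non-PK column that is known to always be populated):

DELETE FROM test.mytable
WHERE accountid = '1' AND subid = '1'
  AND created_ts IS NOT NULL;

The idea is that the extra non-PK condition keeps this from being a pure point
delete, so the row is read first and the reported count reflects rows actually
found.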

On Thu, Feb 22, 2018 at 4:05 PM Sergey Soldatov 
wrote:

> Hi Jins,
> If you provide steps to reproduce it would be much easier to understand
> where the problem is. If nothing was deleted the report should be 'No
> rows affected'.
>
> Thanks,
> Sergey
>
> On Mon, Feb 19, 2018 at 4:30 PM, Jins George 
> wrote:
>
>> Hi,
>>
>> I am facing an issue in which the number of rows affected by a DELETE
>> query returns an incorrect value.   The record I am trying to delete does
>> not exists in the table, as evident from the first query but on deletion,
>> it reports 1 row is affected.  Is this a known issue?
>>
>> I have tried this in Phoenix 4.7 & Phoenix 4.13 and both behaves the same
>> way.
>>
>>
>> 0: jdbc:phoenix:localhost> select accountId, subid  from test.mytable
>> where accountid = '1' and subid = '1';
>> +++
>> | ACCOUNTID  | SUBID  |
>> +++
>> +++
>> *No rows selected (0.017 seconds)*
>> 0: jdbc:phoenix:localhost> delete from test.mytable where accountid = '1'
>> and subid = '1';
>> *1 row affected (0.005 seconds)*
>> 0: jdbc:phoenix:localhost>
>>
>>
>> Thanks,
>> Jins George
>>
>
>


Re: Runtime DDL supported?

2018-02-22 Thread James Taylor
Another option would be to use dynamic columns[1] when querying across
views. You’d have to disable column encoding [2] in this case.

[1] http://phoenix.apache.org/dynamic_columns.html
[2] http://phoenix.apache.org/columnencoding.html
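
A sketch of what that can look like with the t1/v1/v2/v3 example from this
thread (assuming t1 was created with COLUMN_ENCODED_BYTES = 0 so the view
columns are addressable by name):

SELECT pk, f1, f2, f3
FROM t1 (f1 VARCHAR, f2 VARCHAR, f3 VARCHAR)  -- dynamic columns declared at query time
WHERE pk = 12345;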

On Wed, Feb 21, 2018 at 11:57 PM Miles Spielberg  wrote:

> I believe each query in a UNION needs to have the same result tuple
> format, which would work in this toy example, but in the general case each
> view would have a different schema. We could make the result tuples conform
> with each other by selecting NULL literals for every column except those in
> a view. It would get quite verbose though. Assuming f1,f2,f3 all have
> incompatible types, were you suggesting something like this?
>
> Select f1, null, null from v1 where PK=?
> Union all
> Select null, f2, null from v2 where PK=?
> Union all
> Select null, null, f3 from v3 where PK=?
>
> We might just run separate parallel queries against each view and merge
> the results client side. I would guess this should perform well since the
> block cache can be leveraged for queries after the first.
>
> We could also use the HBase API to run a point row get. We'd have to
> reimplement decoding for Phoenix's column values, which is not ideal but
> quite doable.
>
> Sent from my iPhone
>
> On Feb 21, 2018, at 9:09 PM, James Taylor  wrote:
>
> Have you tried a UNION ALL query on (f1, f2, f3) instead? It seems you’re
> on a good track with multiple views over a single (or handful) of physical
> table(s).
>
> On Wed, Feb 21, 2018 at 6:45 PM Miles Spielberg  wrote:
>
>> I've done some experimentation with views, with a schema resembling this:
>>
>> create table t1(
>>>
>>> pk bigint not null primary key
>>>
>>> );
>>>
>>>
>>>> create view v1(
>>>
>>> f1 varchar
>>>
>>> ) AS SELECT * FROM t1;
>>>
>>> create INDEX v1_f1 ON v1(f1);
>>>
>>>
>>>> create view v2(
>>>
>>> f2 varchar
>>>
>>> ) AS SELECT * FROM t1;
>>>
>>> create INDEX v2_f2 ON v2(f2);
>>>
>>>
>>>> create view v3(
>>>
>>> f3 varchar
>>>
>>> ) AS SELECT * FROM t1;
>>>
>>> create INDEX v3_f3 ON v3(f3);
>>>
>>>
>> Most of the time we'll be accessing data via the indexed views, but we'd
>> also like to be able to query all columns (f1, f2, f3) for a given pk. At
>> the HBase level, this should be doable as a point get on t1. The SQL-y
>> way to express this would probably be with JOINs, but the EXPLAIN plan is
>> not encouraging.
>>
>> > explain SELECT * from t1 left join v1 on v1.pk=t1.pk left join v2 on
>>> v2.pk=t1.pk left  join v3 on v3.pk=t1.pk where t1.pk=12345;
>>> | CLIENT 1-CHUNK 1 ROWS 281 BYTES PARALLEL 1-WAY ROUND ROBIN POINT
>>> LOOKUP ON 1 KEY OVER T1
>>> | PARALLEL LEFT-JOIN TABLE 0
>>> | CLIENT 1-CHUNK PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER T1
>>> | PARALLEL LEFT-JOIN TABLE 1
>>> | CLIENT 1-CHUNK PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER T1
>>> | PARALLEL LEFT-JOIN TABLE 2
>>> | CLIENT 1-CHUNK PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER T1
>>>
>>
>> This is pushing me back towards a design of having a single table, except
>> for the issue of proliferating physical HBase tables for the indexes. Would
>> you advise having a single table + a single view on it containing all
>> columns, to coerce Phoenix to consolidate the indexes into a single
>> physical table? Are there other alternatives we should be considering?
>>
>> Miles Spielberg
>> Staff Software Engineer
>>
>>
>> O. 650.485.1102
>> 900 Jefferson Ave
>> Redwood City, CA 94063
>>
>> On Fri, Feb 16, 2018 at 3:27 PM, James Taylor 
>> wrote:
>>
>>> All indexes on views are stored in a single physical table, so you'll be
>>> ok in that regard.
>>>
>>> If you could file bugs for any local index issues, we'd really
>>> appreciate it. We've been steadily improving local indexes
>>> (see PHOENIX-3941 for some recent perf improvements - applicabl

Re: Large tables in phoenix, issues with relational queries

2018-02-21 Thread James Taylor
Hi Aman,
Will all of your 210 relational tables only have a few millions rows? If
so, have you tried just using something like MySQL? What led you toward a
distributed solution?

When going from a single node RDBMS system to Phoenix, you typically
wouldn't use the schemas directly, but there'd be some amount of
denormalization. Have you seen our Tuning Guide [1]? You'll likely want to
determine the best row key design and minimum number of secondary indexes
to satisfy your most common questions.

More specifically with joins [2], you have to be careful, as the Phoenix
optimizer won't attempt to do join ordering or figure out the best join
strategy (note there's ongoing work to improve this with PHOENIX-1556).
Instead, you'll need to make sure to list your tables from
largest to smallest (the size after filtering). Also, Phoenix has two join
strategies - hash join and sort merge join. By default, Phoenix will
perform a hash join, but you can use the /*+ USE_SORT_MERGE_JOIN */ hint to
force a sort merge join. The sort merge join will be better if the tables
are already ordered by their join key. If your reporting framework use case is
doing many joins, you'd likely want to add secondary indexes that ensure
that one or both sides are ordered according to how you're joining the tables.
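
For instance (purely illustrative names, since the actual schema isn't shown
here), if reports usually join orders to customers on customer_id:

-- keeps one side ordered by the join key; the customers PK orders the other side
CREATE INDEX idx_orders_customer ON orders (customer_id) INCLUDE (order_total, order_date);

With both sides ordered by the join key, the USE_SORT_MERGE_JOIN hint mentioned
above can then avoid re-sorting either side.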

Sorry for only providing very general information, but without more
specifics, it's difficult to provide more specific guidance.

Thanks,
James

[1] http://phoenix.apache.org/tuning_guide.html
[2] http://phoenix.apache.org/joins.html

On Mon, Feb 19, 2018 at 12:11 AM, Aman Kumar Jha  wrote:

> Phoenix Team,
>
>
>
> We are using Apache Phoenix for a reporting framework that we are
> building, and are facing a lot of challenges with it (mainly
> performance challenges). We are severely constrained on Apache Phoenix
> knowledge and would love your help finding someone who can help us get off
> the ground here.
>
>
>
> Our use case: about 210 relational tables (a few million rows in many of
> these tables) are present inside our DB, and our reporting framework sits on
> top of them. Due to the many relational tables, the reports mostly result
> in large queries with multiple joins (mostly left outer). We think this is
> the root cause of most of our problems. A lot of internet searches get us
> the basics easily, but we are not finding anything deeper that would let us
> tune this further.
>
>
>
> At this point, we are really thinking, if Phoenix is the correct choice of
>  technology for the above use case.
>
>
>
> As mentioned earlier, we need help with finding someone who can help us
> move ahead.
>
>
>
> Thanks a lot for your time.
>
>
>
> Regards,
>
> Aman Kumar Jha
>
>
>


Re: Runtime DDL supported?

2018-02-21 Thread James Taylor
Have you tried a UNION ALL query on (f1, f2, f3) instead? It seems you’re
on a good track with multiple views over a single (or handful) of physical
table(s).

On Wed, Feb 21, 2018 at 6:45 PM Miles Spielberg  wrote:

> I've done some experimentation with views, with a schema resembling this:
>
> create table t1(
>>
>> pk bigint not null primary key
>>
>> );
>>
>>
>>> create view v1(
>>
>> f1 varchar
>>
>> ) AS SELECT * FROM t1;
>>
>> create INDEX v1_f1 ON v1(f1);
>>
>>
>>> create view v2(
>>
>> f2 varchar
>>
>> ) AS SELECT * FROM t1;
>>
>> create INDEX v2_f2 ON v2(f2);
>>
>>
>>> create view v3(
>>
>> f3 varchar
>>
>> ) AS SELECT * FROM t1;
>>
>> create INDEX v3_f3 ON v3(f3);
>>
>>
> Most of the time we'll be accessing data via the indexed views, but we'd
> also like to be able to query all columns (f1, f2, f3) for a given pk. At
> the HBase level, this should be doable as a point get on t1. The SQL-y
> way to express this would probably be with JOINs, but the EXPLAIN plan is
> not encouraging.
>
> > explain SELECT * from t1 left join v1 on v1.pk=t1.pk left join v2 on
>> v2.pk=t1.pk left  join v3 on v3.pk=t1.pk where t1.pk=12345;
>> | CLIENT 1-CHUNK 1 ROWS 281 BYTES PARALLEL 1-WAY ROUND ROBIN POINT LOOKUP
>> ON 1 KEY OVER T1
>> | PARALLEL LEFT-JOIN TABLE 0
>> | CLIENT 1-CHUNK PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER T1
>> | PARALLEL LEFT-JOIN TABLE 1
>> | CLIENT 1-CHUNK PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER T1
>> | PARALLEL LEFT-JOIN TABLE 2
>> | CLIENT 1-CHUNK PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER T1
>>
>
> This is pushing me back towards a design of having a single table, except
> for the issue of proliferating physical HBase tables for the indexes. Would
> you advise having a single table + a single view on it containing all
> columns, to coerce Phoenix to consolidate the indexes into a single
> physical table? Are there other alternatives we should be considering?
>
> Miles Spielberg
> Staff Software Engineer
>
>
> O. 650.485.1102
> 900 Jefferson Ave
> Redwood City, CA 94063
>
> On Fri, Feb 16, 2018 at 3:27 PM, James Taylor 
> wrote:
>
>> All indexes on views are stored in a single physical table, so you'll be
>> ok in that regard.
>>
>> If you could file bugs for any local index issues, we'd really appreciate
>> it. We've been steadily improving local indexes (see PHOENIX-3941 for some
>> recent perf improvements - applicable for multi-tenant tables in particular
>> - these will appear in our 4.14 release). Handling non covered columns is
>> pretty isolated, so we should be able to fix bugs you find. Plus, there's a
>> workaround - you can cover your indexes until any issues are fixed.
>>
>> Global, mutable indexes have had many improvements over the last several
>> releases too, but there's more operational overhead if/when a data table
>> gets out of sync with it's index table (plus some amount of configurable
>> eventual consistency or index disablement). With local indexes (and HBase
>> 1.3), this isn't possible.
>>
>> Thanks,
>> James
>>
>> On Fri, Feb 16, 2018 at 3:10 PM, Miles Spielberg  wrote:
>>
>>> Hi James,
>>>
>>> Thanks for the tips around reducing the number of physical tables while
>>> still maintaining the appearance of multiple tables via view definitions.
>>> In our use case we don't anticipate having much if any immutable data, so
>>> unfortunately I don't expect to be able to take advantage of Phoenix's
>>> optimizations there.
>>>
>>> We're expecting many indexes, mostly likely with several per logical
>>> per-tenant table. Given that global indexes are implemented as physical
>>> HBase tables, will the view-oriented optimizations help very much? We've
>>> done some experiments with local indexes on 4.13.2 and found bugs,
>>> particularly with the rewrite optimization to read non-covered columns from
>>> the main table, so we're not confident in using local indexes to optimize
>>> quer

Re: Runtime DDL supported?

2018-02-16 Thread James Taylor
All indexes on views are stored in a single physical table, so you'll be ok
in that regard.

If you could file bugs for any local index issues, we'd really appreciate
it. We've been steadily improving local indexes (see PHOENIX-3941 for some
recent perf improvements - applicable for multi-tenant tables in particular
- these will appear in our 4.14 release). Handling non covered columns is
pretty isolated, so we should be able to fix bugs you find. Plus, there's a
workaround - you can cover your indexes until any issues are fixed.

Global, mutable indexes have had many improvements over the last several
releases too, but there's more operational overhead if/when a data table
gets out of sync with it's index table (plus some amount of configurable
eventual consistency or index disablement). With local indexes (and HBase
1.3), this isn't possible.

Thanks,
James

On Fri, Feb 16, 2018 at 3:10 PM, Miles Spielberg  wrote:

> Hi James,
>
> Thanks for the tips around reducing the number of physical tables while
> still maintaining the appearance of multiple tables via view definitions.
> In our use case we don't anticipate having much if any immutable data, so
> unfortunately I don't expect to be able to take advantage of Phoenix's
> optimizations there.
>
> We're expecting many indexes, mostly likely with several per logical
> per-tenant table. Given that global indexes are implemented as physical
> HBase tables, will the view-oriented optimizations help very much? We've
> done some experiments with local indexes on 4.13.2 and found bugs,
> particularly with the rewrite optimization to read non-covered columns from
> the main table, so we're not confident in using local indexes to optimize
> queries. (I've looked through the 5.0-alpha release notes and couldn't find
> anything related to this issue, so if desired I'll collect info for a
> separate bug report.)
>
> Miles Spielberg
> Staff Software Engineer
>
>
> O. 650.485.1102
> 900 Jefferson Ave
> Redwood City, CA 94063
>
> On Fri, Feb 16, 2018 at 2:49 PM, James Taylor 
> wrote:
>
>> Hi Miles,
>> You'll be fine if you use views [1] and multi-tenancy [2] to limit the
>> number of physical HBase tables. Make sure you read about the limitations
>> of views too [3].
>>
>> Here's the way I've seen this modeled successfully:
>> - create one schema per use case. This will let you leverage some nice
>> features in HBase for quotas and throttling. If you'll have a single use
>> case, you don't have to worry about it. Read about namespaces here [4] and
>> make sure to enable them before you start creating tables.
>> - define an immutable, multi-tenant base table that has TENANT_ID +
>> TYPE_ID primary key. There are optimizations Phoenix does over immutable
>> tables that you'll want to leverage (assuming you have use cases that fit
>> into this category). This Phoenix table will be backed by a physical
>> HBase table, but you won't execute Phoenix DML against it. Think of it as a
>> kind of "abstract" type. Instead, you'll create updatable views over it.
>> - define a regular/mutable, multi-tenant base table that has TENANT_ID +
>> TYPE_ID primary key. Same deal as above, but this would be the base table
>> for any tables in which the rows change in place.
>> - define global views per "logical" table (against either your immutable
>> base table or mutable base table depending on the functionality needed)
>> with each view having a WHERE TYPE_ID='your type identifier' clause which
>> adds specific columns to the primary key. This view will be updatable (i.e.
>> you can execute DML against it). The columns you add to your PK will depend
>> on your most common query patterns.
>> - optionally define indexes on these global views.
>> - each tenant can further extend or just use the global views.
>>
>> FYI, lots of good performance/tuning tips can be found here[5].
>>
>> Thanks,
>> James
>>
>>
>> [1] https://phoenix.apache.org/views.html
>> [2] https://phoenix.apache.org/multi-tenancy.html
>> [3] https://phoenix.apache.org/views.html#Limitations
>> [4] https://phoenix.apache.org/namspace_mapping.html
>> [5] https://phoenix.apache.org/tuning_guide.html
>>
>> On Fri, Feb 16, 2018 at 11:47 AM, Miles Spielberg  wrote:
>>
>>> We're looking at employing Phoenix in a multi-tenant use case where
>>> tenants can create their own tables and indexes, running into totals of
>>> tens-of-thousands of each. Is this a supported scenario, or are we headed
>>> for trouble?
>>>
>>
>>
>


Re: Runtime DDL supported?

2018-02-16 Thread James Taylor
Hi Miles,
You'll be fine if you use views [1] and multi-tenancy [2] to limit the
number of physical HBase tables. Make sure you read about the limitations
of views too [3].

Here's the way I've seen this modeled successfully:
- create one schema per use case. This will let you leverage some nice
features in HBase for quotas and throttling. If you'll have a single use
case, you don't have to worry about it. Read about namespaces here [4] and
make sure to enable them before you start creating tables.
- define an immutable, multi-tenant base table that has TENANT_ID + TYPE_ID
primary key. There are optimizations Phoenix does over immutable tables
that you'll want to leverage (assuming you have use cases that fit into
this category). This Phoenix table will be backed by a physical HBase
table, but you won't execute Phoenix DML against it. Think of it as a kind
of "abstract" type. Instead, you'll create updatable views over it.
- define a regular/mutable, multi-tenant base table that has TENANT_ID +
TYPE_ID primary key. Same deal as above, but this would be the base table
for any tables in which the rows change in place.
- define global views per "logical" table (against either your immutable
base table or mutable base table depending on the functionality needed)
with each view having a WHERE TYPE_ID='your type identifier' clause which
adds specific columns to the primary key. This view will be updatable (i.e.
you can execute DML against it). The columns you add to your PK will depend
on your most common query patterns.
- optionally define indexes on these global views.
- each tenant can further extend or just use the global views.
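
A condensed DDL sketch of the above (schema, table, and column names are made
up for illustration; a parallel mutable base table would simply omit
IMMUTABLE_ROWS):

CREATE SCHEMA IF NOT EXISTS my_use_case;

-- "abstract" immutable base table; DML goes against the views, not this table
CREATE TABLE my_use_case.base_immutable (
    tenant_id VARCHAR NOT NULL,
    type_id VARCHAR NOT NULL,
    entity_id VARCHAR NOT NULL,
    CONSTRAINT pk PRIMARY KEY (tenant_id, type_id, entity_id)
) MULTI_TENANT = true, IMMUTABLE_ROWS = true;

-- global updatable view per "logical" table, pinning TYPE_ID and adding columns
CREATE VIEW my_use_case.events (created_at DATE, payload VARCHAR)
    AS SELECT * FROM my_use_case.base_immutable WHERE type_id = 'EVENT';

-- optional index on the global view
CREATE INDEX events_by_created_at ON my_use_case.events (created_at);

-- a tenant-specific connection (TenantId set in the JDBC url) can then extend it
CREATE VIEW acme_events AS SELECT * FROM my_use_case.events;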

FYI, lots of good performance/tuning tips can be found here[5].

Thanks,
James


[1] https://phoenix.apache.org/views.html
[2] https://phoenix.apache.org/multi-tenancy.html
[3] https://phoenix.apache.org/views.html#Limitations
[4] https://phoenix.apache.org/namspace_mapping.html
[5] https://phoenix.apache.org/tuning_guide.html

On Fri, Feb 16, 2018 at 11:47 AM, Miles Spielberg  wrote:

> We're looking at employing Phoenix in a multi-tenant use case where
> tenants can create their own tables and indexes, running into totals of
> tens-of-thousands of each. Is this a supported scenario, or are we headed
> for trouble?
>


Re: Creating View Table Using the Date & Time

2018-02-13 Thread James Taylor
No, you’ll need to create a Phoenix table and use Phoenix APIs to write
your data.

On Tue, Feb 13, 2018 at 9:52 PM Vaghawan Ojha  wrote:

> Thank you James, my keys are something like
> this: 2018-02-01-BM50558-1517454912.0-5-1517548497.261604 . the first few
> chars are the date. and these dates are stored in a seperate columns as
> BDATE as well. Do you think I could implement the rowtimestamp in the BDATE
> column?
>
> Thanks
> Vaghawan
>
> On Wed, Feb 14, 2018 at 7:47 AM, James Taylor 
> wrote:
>
>> If the date time column is part of your pk, then you’d be able to use the
>> ROW_TIMESTAMP feature.
>>
>> On Tue, Feb 13, 2018 at 5:04 PM Vaghawan Ojha 
>> wrote:
>>
>>> Yes, the datetime column is part of my primary key, but primary key also
>>> consists other strings.
>>>
>>> Thanks
>>> Vaghawan
>>>
>>> On Tue, Feb 13, 2018 at 11:05 PM, James Taylor 
>>> wrote:
>>>
>>>> The standard way of doing this is to add a TTL for your table [1]. You
>>>> can do this through the ALTER TABLE call [2]. Is the date/time column part
>>>> of your primary key? If so, you can improve performance by declaring this
>>>> column as a ROW_TIMESTAMP [3].
>>>>
>>>> A view is not going to help you - it's not materialized.
>>>>
>>>> Thanks,
>>>> James
>>>>
>>>> [1] http://hbase.apache.org/0.94/book/ttl.html
>>>> [2] https://phoenix.apache.org/language/index.html#alter
>>>> [3] https://phoenix.apache.org/rowtimestamp.html
>>>>
>>>> On Tue, Feb 13, 2018 at 2:42 AM, Vaghawan Ojha 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm using phoenix 4.12 with hbase 1.2.0, I've a table with few
>>>>> millions of rows, but I don't need much of the old data, Let's say the
>>>>> frequent data I need is data from 2 month back.
>>>>>
>>>>> the query becomes slow when I read the table using timestamp. So the query
>>>>> would be like: where date>some date and date<some date.
>>>>> I was thinking of creating a view table where I could put the latest
>>>>> two months data, but the data there should consist only the latest two
>>>>> months. The parent table is updated daily with the new data, so in my case
>>>>> whenever new day's data comes in, the last one day's data should be 
>>>>> removed
>>>>> from the view, making sure that the view consists two month's data. (e.g 
>>>>> it
>>>>> would always hold last 60 days data)
>>>>> I don't know if that is possible using create view. If it is how can I
>>>>> do it?
>>>>>
>>>>> Any suggestion would be appreciated.
>>>>>
>>>>> Thanks
>>>>> Vaghawan
>>>>>
>>>>
>>>>
>>>
>


Re: Creating View Table Using the Date & Time

2018-02-13 Thread James Taylor
If the date time column is part of your pk, then you’d be able to use the
ROW_TIMESTAMP feature.

On Tue, Feb 13, 2018 at 5:04 PM Vaghawan Ojha  wrote:

> Yes, the datetime column is part of my primary key, but primary key also
> consists other strings.
>
> Thanks
> Vaghawan
>
> On Tue, Feb 13, 2018 at 11:05 PM, James Taylor 
> wrote:
>
>> The standard way of doing this is to add a TTL for your table [1]. You
>> can do this through the ALTER TABLE call [2]. Is the date/time column part
>> of your primary key? If so, you can improve performance by declaring this
>> column as a ROW_TIMESTAMP [3].
>>
>> A view is not going to help you - it's not materialized.
>>
>> Thanks,
>> James
>>
>> [1] http://hbase.apache.org/0.94/book/ttl.html
>> [2] https://phoenix.apache.org/language/index.html#alter
>> [3] https://phoenix.apache.org/rowtimestamp.html
>>
>> On Tue, Feb 13, 2018 at 2:42 AM, Vaghawan Ojha 
>> wrote:
>>
>>> Hi,
>>>
>>> I'm using phoenix 4.12 with hbase 1.2.0, I've a table with few millions
>>> of rows, but I don't need much of the old data, Let's say the frequent data
>>> I need is data from 2 month back.
>>>
>>> the query becomes slow when I read the table using timestamp. So the query
>>> would be like: where date>some date and date<some date.
>>> I was thinking of creating a view table where I could put the latest two
>>> months data, but the data there should consist only the latest two months.
>>> The parent table is updated daily with the new data, so in my case whenever
>>> new day's data comes in, the last one day's data should be removed from the
>>> view, making sure that the view consists two month's data. (e.g it would
>>> always hold last 60 days data)
>>> I don't know if that is possible using create view. If it is how can I
>>> do it?
>>>
>>> Any suggestion would be appreciated.
>>>
>>> Thanks
>>> Vaghawan
>>>
>>
>>
>


Re: Creating View Table Using the Date & Time

2018-02-13 Thread James Taylor
The standard way of doing this is to add a TTL for your table [1]. You can
do this through the ALTER TABLE call [2]. Is the date/time column part of
your primary key? If so, you can improve performance by declaring this
column as a ROW_TIMESTAMP [3].
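
As a rough, untested sketch (table and column names are illustrative; the TTL value is 60 days expressed in seconds):

-- Age out cells older than 60 days at compaction time
ALTER TABLE my_table SET TTL=5184000;

-- If the table were (re)created with the date column leading the primary key,
-- it could be declared as a ROW_TIMESTAMP to speed up time-range scans
CREATE TABLE my_table_v2 (
    bdate     DATE NOT NULL,
    entry_id  VARCHAR NOT NULL,
    amount    DECIMAL
    CONSTRAINT pk PRIMARY KEY (bdate ROW_TIMESTAMP, entry_id)
);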

A view is not going to help you - it's not materialized.

Thanks,
James

[1] http://hbase.apache.org/0.94/book/ttl.html
[2] https://phoenix.apache.org/language/index.html#alter
[3] https://phoenix.apache.org/rowtimestamp.html

On Tue, Feb 13, 2018 at 2:42 AM, Vaghawan Ojha 
wrote:

> Hi,
>
> I'm using phoenix 4.12 with hbase 1.2.0, I've a table with few millions of
> rows, but I don't need much of the old data, Let's say the frequent data I
> need is data from 2 month back.
>
> the query becomes slow when I read the table using timestamp. So the query
> would be like: where date>some date and date<some date.
> I was thinking of creating a view table where I could put the latest two
> months data, but the data there should consist only the latest two months.
> The parent table is updated daily with the new data, so in my case whenever
> new day's data comes in, the last one day's data should be removed from the
> view, making sure that the view consists two month's data. (e.g it would
> always hold last 60 days data)
> I don't know if that is possible using create view. If it is how can I do
> it?
>
> Any suggestion would be appreciated.
>
> Thanks
> Vaghawan
>


Re: Drop column timeout

2018-02-13 Thread James Taylor
Hi Jacobo,
Please file a JIRA for asynchronous drop column functionality. There's a
few ways that could be implemented. We could execute the call that issues
the delete markers on the server-side in a separate thread (similar to what
we do with UPDATE STATISTICS), or we could support a map-reduce job that
executes the drop column. The former is easier, but wouldn't be resilient
against the cluster being bounced. It'd also put more load on the cluster
(but would execute faster). The latter would be a bit more work and with
the current framework, the MR job would need to be started manually.

You could remove the column manually from the SYSTEM.CATALOG, but you'd
want to bounce the region server hosting this table so that the cache gets
cleared (or call MetaDataService.clearTableFromCache() for the table).

Thanks,
James

On Tue, Feb 13, 2018 at 8:59 AM, Jacobo Coll  wrote:

> Hi Flavio,
>
> I was trying to find a different solution here. This doesn't seem like a
> long term solution, as I expect the table to increase, and these new
> timeouts may not be enough in the future. Also, I don't feel comfortable
> increasing the timeouts that much.
>
>
> - Is there any way of removing a column asynchronously, like when creating
>> indexes?
>> - If not, can I just remove that column from the metadata (as if it was a
>> VIEW instead of a TABLE), and then remove the data using the HBase API?
>>
>
>
> Thanks
>
> 2018-02-13 16:18 GMT+00:00 Flavio Pompermaier :
>
>> I also had similar troubles and I fixed them changing the following
>> params (both on server and client side and restarting hbase):
>>
>> hbase.rpc.timeout (to 60)
>> phoenix.query.timeoutMs (to 60)
>> hbase.client.scanner.timeout.period (from 1 m to 10m)
>> hbase.regionserver.lease.period (from 1 m to 10m)
>>
>> I hope this could help you!
>>
>>
>> On Tue, Feb 13, 2018 at 5:09 PM, Jacobo Coll  wrote:
>>
>>> Hi all,
>>>
>>> I have a table in phoenix with 100M rows and ~3000 columns. I am trying
>>> to remove some columns, but after some seconds, it fails with a timeout
>>> exception:
>>>
>>>
>>> 0: jdbc:phoenix:> ALTER TABLE "ns"."table" DROP COLUMN IF EXISTS
>>> "myColumn";
>>> Error: org.apache.phoenix.exception.PhoenixIOException: Failed to get
>>> result within timeout, timeout=6ms (state=08000,code=101)
>>> org.apache.phoenix.exception.PhoenixIOException:
>>> org.apache.phoenix.exception.PhoenixIOException: Failed to get result
>>> within timeout, timeout=6ms
>>> at org.apache.phoenix.util.ServerUtil.parseServerException(Serv
>>> erUtil.java:111)
>>> at org.apache.phoenix.iterate.BaseResultIterators.getIterators(
>>> BaseResultIterators.java:771)
>>> at org.apache.phoenix.iterate.BaseResultIterators.getIterators(
>>> BaseResultIterators.java:714)
>>> at org.apache.phoenix.iterate.ConcatResultIterator.getIterators
>>> (ConcatResultIterator.java:50)
>>> at org.apache.phoenix.iterate.ConcatResultIterator.currentItera
>>> tor(ConcatResultIterator.java:97)
>>> at org.apache.phoenix.iterate.ConcatResultIterator.next(ConcatR
>>> esultIterator.java:117)
>>> at org.apache.phoenix.iterate.BaseGroupedAggregatingResultItera
>>> tor.next(BaseGroupedAggregatingResultIterator.java:64)
>>> at org.apache.phoenix.iterate.UngroupedAggregatingResultIterato
>>> r.next(UngroupedAggregatingResultIterator.java:39)
>>> at org.apache.phoenix.compile.PostDDLCompiler$2.execute(PostDDL
>>> Compiler.java:285)
>>> at org.apache.phoenix.query.ConnectionQueryServicesImpl.updateD
>>> ata(ConnectionQueryServicesImpl.java:2823)
>>> at org.apache.phoenix.schema.MetaDataClient.dropColumn(MetaData
>>> Client.java:3209)
>>> at org.apache.phoenix.jdbc.PhoenixStatement$ExecutableDropColum
>>> nStatement$1.execute(PhoenixStatement.java:1127)
>>> at org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixState
>>> ment.java:343)
>>> at org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixState
>>> ment.java:331)
>>> at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>>> at org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(Pho
>>> enixStatement.java:329)
>>> at org.apache.phoenix.jdbc.PhoenixStatement.execute(PhoenixStat
>>> ement.java:1440)
>>> at sqlline.Commands.execute(Commands.java:822)
>>> at sqlline.Commands.sql(Commands.java:732)
>>> at sqlline.SqlLine.dispatch(SqlLine.java:808)
>>> at sqlline.SqlLine.begin(SqlLine.java:681)
>>> at sqlline.SqlLine.start(SqlLine.java:398)
>>> at sqlline.SqlLine.main(SqlLine.java:292)
>>> Caused by: java.util.concurrent.ExecutionException:
>>> org.apache.phoenix.exception.PhoenixIOException: Failed to get result
>>> within timeout, timeout=6ms
>>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>> at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>>> at org.apache.phoenix.iterate.BaseResul

Re: Index table in SYSTEM.CATALOG without DATA_TABLE_NAME and INDEX_TYPE

2018-02-06 Thread James Taylor
Hi William,
The system catalog table changes as new features are implemented. The API
that you can count on being stable is JDBC and in particular for metadata,
our DatabaseMetaData implementation. To understand how the system catalog
changes from release to release you'd need to keep an eye on any JIRA that
may impact it.
Thanks,
James

On Tue, Feb 6, 2018 at 9:38 AM, William Shen 
wrote:

> Looks like it is just how Phoenix 4.10 behaves different when creating an
> index...? Is there a place where we document the difference in how table
> metadata is handled for each release?
>
> Thanks in advance for your help!
>
> 0: jdbc:phoenix:labs-darth-journalnode-lv-101> create index test_users on 
> "func11".users("cstId");
> 69 rows affected (7.345 seconds)
>
>
>
> 0: jdbc:phoenix:labs-darth-journalnode-lv-101> select * from system.catalog 
> where TABLE_TYPE = 'i' AND column_family = 'TEST_USERS' OR TABLE_NAME = 
> 'TEST_USERS';
> ++--+-+--+++-+--+---+---+--+--+-+-+---+
> | TENANT_ID  | TABLE_SCHEM  | TABLE_NAME  | COLUMN_NAME  | COLUMN_FAMILY  | 
> TABLE_SEQ_NUM  | TABLE_TYPE  | PK_NAME  | COLUMN_COUNT  | SALT_BUCKETS  | 
> DATA_TABLE_NAME  | INDEX_STATE  | IMMUTABLE_ROWS  | VIEW_STATEMENT  | 
> DEFAULT_COLUMN_FAMILY |
> ++--+-+--+++-+--+---+---+--+--+-+-+---+
> || func11   | TEST_USERS  |  || 0 
>  | i   |  | 3 | 2 | USERS 
>| a| false   | |   
> |
> || func11   | TEST_USERS  | :cstId   || 
> null   | |  | null  | null  | 
> USERS|  | | | 
>   |
> || func11   | TEST_USERS  | :id  || 
> null   | |  | null  | null  | 
> USERS|  | | | 
>   |
> || func11   | USERS   |  | TEST_USERS | 
> 15 | i   |  | null  | null  | 
>  |  | | | 
>   |
> ++--+-+--+++-+--+---+---+--+--+-+-+---+
> 4 rows selected (0.288 seconds)
>
>
> On Fri, Feb 2, 2018 at 1:23 PM William Shen 
> wrote:
>
>> Hi everyone,
>>
>> I am investigating a strange looking entry in our SYSTEM.CATALOG table.
>> The row is an index table (TABLE_TYPE = i) but it does not contain any
>> other index information (no DATA_TABLE_NAME and INDEX_TYPE, etc.).
>>
>> Has anyone encountered similar situation, or is there any other way to
>> investigate how the entry was created?
>>
>> By the way, is there any documentation available on the SYSTEM.CATALOG,
>> that I can check to make I am understanding the information in this table
>> correctly?
>>
>> Thanks!
>>
>> We are running Phoenix 4.10 (upgraded previous from 4.8 and from 4.6)
>> Here is a few more details on this strange "index":
>>
>> SELECT * from system.catalog where table_schem = 'prod' and
>> data_table_name is null and TABLE_TYPE = 'i';
>>
>> TENANT_ID
>>
>> TABLE_SCHEMprod
>>
>> TABLE_NAME RULES
>>
>> COLUMN_NAME
>>
>> COLUMN_FAMILY  IDX_ID_RULES
>>
>> TABLE_SEQ_NUM  0
>>
>> TABLE_TYPE i
>>
>> PK_NAME
>>
>> COLUMN_COUNT   null
>>
>> SALT_BUCKETS   null
>>
>> DATA_TABLE_NAME
>>
>> INDEX_STATE
>>
>> IMMUTABLE_ROWS
>>
>> VIEW_STATEMENT
>>
>> DEFAULT_COLUMN_FAMILY
>>
>> DISABLE_WAL
>>
>> MULTI_TENANT
>>
>> VIEW_TYPE  null
>>
>> VIEW_INDEX_ID  null
>>
>> DATA_TYPE  null
>>
>> COLUMN_SIZEnull
>>
>> DECIMAL_DIGITS null
>>
>> NULLABLE   null
>>
>> ORDINAL_POSITION   null
>>
>> SORT_ORDER null
>>
>> ARRAY_SIZE null
>>
>> VIEW_CONSTANT
>>
>> IS_VIEW_REFERENCED
>>
>> KEY_SEQnull
>>
>> LINK_TYPE  1
>>
>> TYPE_NAME
>>
>> REMARKS
>>
>> SELF_REFERENCING_COL_NAME
>>
>> REF_GENERATION
>>
>> BUFFER_LENGTH  null
>>
>> NUM_PREC_RADIX null
>>
>> COLUMN_DEF
>>
>> SQL_DATA_TYPE  null
>>
>> SQL_DATETIME_SUB 

Re: ROW_TIMESTAMP

2018-02-02 Thread James Taylor
Yes, I think having your own LAST_UPDATED column would be the best option
currently.
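
A minimal sketch of that approach (names are illustrative): keep the business key as the sole primary key and have the client set the change-time column on every write.

CREATE TABLE test2 (
    a            INTEGER NOT NULL PRIMARY KEY,
    b            INTEGER,
    last_update  DATE
);

-- Each upsert overwrites the row in place; the client binds last_update
-- to the current time on every write
UPSERT INTO test2 (a, b, last_update) VALUES (?, ?, ?);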

On Fri, Feb 2, 2018 at 1:18 PM, Alberto Bengoa 
wrote:

> Hello James,
>
> Thanks for replying.
>
> It really seems that PHOENIX-4552 potentially fits to my purpose. I'll
> track
> this JIRA to get updates about it.
>
> BTW, considering nowadays, there's no option except to update some date
> type
> field on client side every upsert?
>
> Thank you so much.
>
> Alberto
>
>
> James Taylor wrote
> > Hi Alberto,
> > Sounds like you need PHOENIX-4552. If you agree, let's continue the
> > discussion over there.
> > Thanks,
> > James
> >
> > On Fri, Feb 2, 2018 at 9:05 AM, Alberto Bengoa <
>
> > alberto@.com
>
> > >
> > wrote:
> >
> >> Hello Folks,
> >>
> >> I'm working on a project where we need to identify when a row was
> changed
> >> (updated fields). I was wondering if ROW_TIMESTAMP would help me to
> reach
> >> this goal.
> >>
> >> I created the test table below, and inserted some data:
> >>
> >> create table test(
> >>   a integer not null,
> >>   b integer,
> >>   last_update date not null
> >>   CONSTRAINT PK PRIMARY KEY (a, last_update row_timestamp)
> >> );
> >>
> >> upsert into test (a, b) values (1, 1);
> >> upsert into test (a, b) values (2, 2);
> >> upsert into test (a, b) values (3, 4);
> >>
> >> 0: jdbc:phoenix:> select * from test;
> >> +++--+
> >> | A  | B  |   LAST_UPDATE|
> >> +++--+
> >> | 1  | 1  | 2018-02-02 16:33:52.345  |
> >> | 2  | 2  | 2018-02-02 16:33:56.714  |
> >> | 3  | 4  | 2018-02-02 16:34:00.281  |
> >> +++--+
> >> 3 rows selected (0.041 seconds)
> >>
> >> So, I've tried to update B value where A = 3;
> >>
> >> 0: jdbc:phoenix:> upsert into test (a, b) values (3, 3);
> >>
> >> Then, I have one "new" row, not an updated row as I need:
> >>
> >> 0: jdbc:phoenix:> select * from test;
> >> +++--+
> >> | A  | B  |   LAST_UPDATE|
> >> +++--+
> >> | 1  | 1  | 2018-02-02 16:33:52.345  |
> >> | 2  | 2  | 2018-02-02 16:33:56.714  |
> >> | 3  | 4  | 2018-02-02 16:34:00.281  |
> >> | 3  | 3  | 2018-02-02 16:36:31.890  |
> >> +++--+
> >> 4 rows selected (0.052 seconds)
> >>
> >> I understand that LAST_UPDATE column is part of the PRIMARY KEY and,
> from
> >> this perspective, it in fact should be a NEW row. But, on the other
> >> hand,
> >> this doesn't fit my case, because actually I'll have a new row after each
> >> "update" (and I have lots of updates).
> >>
> >> Is there any alternative to this on the Phoenix side? I was not expecting
> >> to have to call a now() function from client side all the time to update
> >> a
> >> last_update field.
> >>
> >> Maybe another kind of CONSTRAINT that would be used?
> >>
> >> Phoenix version 4.7 here.
> >>
> >> Thanks in advance!
> >>
> >> Cheers,
> >> Alberto
> >>
> >>
>
>
>
>
>
> --
> Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/
>


Re: Apache Phoenix integration

2018-02-02 Thread James Taylor
Good idea - a joint meetup would be great.

On Fri, Feb 2, 2018 at 11:15 AM, Saurabh Mahapatra <
saurabhmahapatr...@gmail.com> wrote:

> This is absolutely great news! Thanks for sharing, James!
>
> We should have this as a presentation in one of our weekly Drill hangout
> sessions.
>
> It’s about time for us to do a meetup. A joint meetup perhaps?
>
> Saurabh
>
> Sent from my iPhone
>
>
>
> > On Feb 2, 2018, at 11:13 AM, James Taylor 
> wrote:
> >
> > There's also a much deeper integration between Phoenix + Drill (code
> named
> > Drillix) underway that should be possible to complete now that Apache
> Drill
> > is off of its fork and on a later version of Apache Calcite. I'm hoping
> > that the output of this will be a Phoenix adapter in Drill. See
> > presentation here[1] and WIP code here[2].
> >
> > Thanks,
> > James
> >
> > [1] http://phoenix.apache.org/presentations/Drillix.pdf
> > [2] https://github.com/jacques-n/drill/tree/phoenix_plugin
> >
> >> On Fri, Feb 2, 2018 at 10:21 AM, Kunal Khatua  wrote:
> >>
> >> That's great, Flavio!
> >>
> >> You can create a Google doc for review and share it on the user list.
> >>
> >> @Bridget handles the documentation on the Apache website, so she can do
> >> the final touches and help it find a home on the website.
> >>
> >> -Original Message-
> >> From: Flavio Pompermaier [mailto:pomperma...@okkam.it]
> >> Sent: Friday, February 02, 2018 9:04 AM
> >> To: u...@drill.apache.org
> >> Cc: James Taylor 
> >> Subject: Re: Apache Phoenix integration
> >>
> >> Eventually I made it to integrate Phoenix with Drill! I debugged
> remotely
> >> the drill-embedded via Eclipse and I discovered that the problem was
> that
> >> you need some extra jars to make it work!
> >> Where can I write some documentation about debugging remotely Drill from
> >> Eclipse and integration with Drill?
> >>
> >> On Fri, Feb 2, 2018 at 5:28 PM, Flavio Pompermaier <
> pomperma...@okkam.it>
> >> wrote:
> >>
> >>> What is the fastest way to debug the JDBC plugin from Eclipse? I don't
> >>> see anything in the logs that could help...
> >>> Is it possible to connect directly to the external embedded drill
> >>> running on my machine if I enable jmx?
> >>> it seems that the JDBC connection is established correctly but Drill
> >>> throws an Exception (that is not well unwrapped by Jersey):
> >>>
> >>> 2018-02-02 16:54:04,520 [qtp159619134-56] INFO
> >>> o.a.p.q.ConnectionQueryServicesImpl
> >>> - HConnection established. Stacktrace for informational purposes:
> >>> hconnection-0x1b9fe9f8
> >>> java.lang.Thread.getStackTrace(Thread.java:1552)
> >>> org.apache.phoenix.util.LogUtil.getCallerStackTrace(LogUtil.java:55)
> >>> org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(
> >>> ConnectionQueryServicesImpl.java:410)
> >>> org.apache.phoenix.query.ConnectionQueryServicesImpl.access$400(
> >>> ConnectionQueryServicesImpl.java:256)
> >>> org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(
> >>> ConnectionQueryServicesImpl.java:2408)
> >>> org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(
> >>> ConnectionQueryServicesImpl.java:2384)
> >>> org.apache.phoenix.util.PhoenixContextExecutor.call(
> >>> PhoenixContextExecutor.java:76)
> >>> org.apache.phoenix.query.ConnectionQueryServicesImpl.init(
> >>> ConnectionQueryServicesImpl.java:2384)
> >>> org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(
> >>> PhoenixDriver.java:255)
> >>> org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(
> >>> PhoenixEmbeddedDriver.java:150)
> >>> org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:221)
> >>> org.apache.commons.dbcp.DriverConnectionFactory.createConnection(
> >>> DriverConnectionFactory.java:38)
> >>> org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(
> >>> PoolableConnectionFactory.java:582)
> >>> org.apache.commons.dbcp.BasicDataSource.validateConnectionFactory(
> >>> BasicDataSource.java:1556)
> >>> org.apache.commons.dbcp.BasicDataSource.createPoolableConnectionFactor
> >>> y(BasicDataSource.java:1545)
> >>> org.apache.commons.d

Re: Apache Phoenix integration

2018-02-02 Thread James Taylor
There's also a much deeper integration between Phoenix + Drill (code named
Drillix) underway that should be possible to complete now that Apache Drill
is off of its fork and on a later version of Apache Calcite. I'm hoping
that the output of this will be a Phoenix adapter in Drill. See
presentation here[1] and WIP code here[2].

Thanks,
James

[1] http://phoenix.apache.org/presentations/Drillix.pdf
[2] https://github.com/jacques-n/drill/tree/phoenix_plugin

On Fri, Feb 2, 2018 at 10:21 AM, Kunal Khatua  wrote:

> That's great, Flavio!
>
> You can create a Google doc for review and share it on the user list.
>
> @Bridget handles the documentation on the Apache website, so she can do
> the final touches and help it find a home on the website.
>
> -Original Message-
> From: Flavio Pompermaier [mailto:pomperma...@okkam.it]
> Sent: Friday, February 02, 2018 9:04 AM
> To: u...@drill.apache.org
> Cc: James Taylor 
> Subject: Re: Apache Phoenix integration
>
> Eventually I made it to integrate Phoenix with Drill! I debugged remotely
> the drill-embedded via Eclipse and I discovered that the problem was that
> you need some extra jars to make it work!
> Where can I write some documentation about debugging remotely Drill from
> Eclipse and integration with Drill?
>
> On Fri, Feb 2, 2018 at 5:28 PM, Flavio Pompermaier 
> wrote:
>
> > What is the fastest way to debug the JDBC plugin from Eclipse? I don't
> > see anything in the logs that could help...
> > Is it possible to connect directly to the external embedded drill
> > running on my machine if I enable jmx?
> > it seems that the JDBC connection is established correctly but Drill
> > throws an Exception (that is not well unwrapped by Jersey):
> >
> > 2018-02-02 16:54:04,520 [qtp159619134-56] INFO
> > o.a.p.q.ConnectionQueryServicesImpl
> > - HConnection established. Stacktrace for informational purposes:
> > hconnection-0x1b9fe9f8
> > java.lang.Thread.getStackTrace(Thread.java:1552)
> > org.apache.phoenix.util.LogUtil.getCallerStackTrace(LogUtil.java:55)
> > org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(
> > ConnectionQueryServicesImpl.java:410)
> > org.apache.phoenix.query.ConnectionQueryServicesImpl.access$400(
> > ConnectionQueryServicesImpl.java:256)
> > org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(
> > ConnectionQueryServicesImpl.java:2408)
> > org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(
> > ConnectionQueryServicesImpl.java:2384)
> > org.apache.phoenix.util.PhoenixContextExecutor.call(
> > PhoenixContextExecutor.java:76)
> > org.apache.phoenix.query.ConnectionQueryServicesImpl.init(
> > ConnectionQueryServicesImpl.java:2384)
> > org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(
> > PhoenixDriver.java:255)
> > org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(
> > PhoenixEmbeddedDriver.java:150)
> > org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:221)
> > org.apache.commons.dbcp.DriverConnectionFactory.createConnection(
> > DriverConnectionFactory.java:38)
> > org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(
> > PoolableConnectionFactory.java:582)
> > org.apache.commons.dbcp.BasicDataSource.validateConnectionFactory(
> > BasicDataSource.java:1556)
> > org.apache.commons.dbcp.BasicDataSource.createPoolableConnectionFactor
> > y(BasicDataSource.java:1545)
> > org.apache.commons.dbcp.BasicDataSource.createDataSource(
> > BasicDataSource.java:1388)
> > org.apache.commons.dbcp.BasicDataSource.getConnection(
> > BasicDataSource.java:1044)
> > org.apache.calcite.adapter.jdbc.JdbcUtils$DialectPool.
> > get(JdbcUtils.java:73)
> > org.apache.calcite.adapter.jdbc.JdbcSchema.createDialect(
> > JdbcSchema.java:138)
> > org.apache.drill.exec.store.jdbc.JdbcStoragePlugin.(
> > JdbcStoragePlugin.java:103)
> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > sun.reflect.NativeConstructorAccessorImpl.newInstance(
> > NativeConstructorAccessorImpl.java:62)
> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> > DelegatingConstructorAccessorImpl.java:45)
> > java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> > org.apache.drill.exec.store.StoragePluginRegistryImpl.create(
> > StoragePluginRegistryImpl.java:346)
> > org.apache.drill.exec.store.StoragePluginRegistryImpl.createOrUpdate(
> > StoragePluginRegistryImpl.java:239)
> > org.apache.drill.exec.server.rest.PluginConfigWrapper.
> > createOrUpdateInStorage(PluginConfigWrapper.java:57)
> > org.apac

Re: ROW_TIMESTAMP

2018-02-02 Thread James Taylor
Hi Alberto,
Sounds like you need PHOENIX-4552. If you agree, let's continue the
discussion over there.
Thanks,
James

On Fri, Feb 2, 2018 at 9:05 AM, Alberto Bengoa 
wrote:

> Hello Folks,
>
> I'm working on a project where we need to identify when a row was changed
> (updated fields). I was wondering if ROW_TIMESTAMP would help me to reach
> this goal.
>
> I created the test table below, and inserted some data:
>
> create table test(
>   a integer not null,
>   b integer,
>   last_update date not null
>   CONSTRAINT PK PRIMARY KEY (a, last_update row_timestamp)
> );
>
> upsert into test (a, b) values (1, 1);
> upsert into test (a, b) values (2, 2);
> upsert into test (a, b) values (3, 4);
>
> 0: jdbc:phoenix:> select * from test;
> +++--+
> | A  | B  |   LAST_UPDATE|
> +++--+
> | 1  | 1  | 2018-02-02 16:33:52.345  |
> | 2  | 2  | 2018-02-02 16:33:56.714  |
> | 3  | 4  | 2018-02-02 16:34:00.281  |
> +++--+
> 3 rows selected (0.041 seconds)
>
> So, I've tried to update B value where A = 3;
>
> 0: jdbc:phoenix:> upsert into test (a, b) values (3, 3);
>
> Then, I have one "new" row, not an updated row as I need:
>
> 0: jdbc:phoenix:> select * from test;
> +++--+
> | A  | B  |   LAST_UPDATE|
> +++--+
> | 1  | 1  | 2018-02-02 16:33:52.345  |
> | 2  | 2  | 2018-02-02 16:33:56.714  |
> | 3  | 4  | 2018-02-02 16:34:00.281  |
> | 3  | 3  | 2018-02-02 16:36:31.890  |
> +++--+
> 4 rows selected (0.052 seconds)
>
> I understand that LAST_UPDATE column is part of the PRIMARY KEY and, from
> this perspective, it in fact should be a NEW row. But, on the other hand,
> this doesn't fit my case, because actually I'll have a new row after each
> "update" (and I have lots of updates).
>
> Is there any alternative to this on the Phoenix side? I was not expecting
> to have to call a now() function from client side all the time to update a
> last_update field.
>
> Maybe another kind of CONSTRAINT that would be used?
>
> Phoenix version 4.7 here.
>
> Thanks in advance!
>
> Cheers,
> Alberto
>
>


Re: HBase Timeout on queries

2018-02-01 Thread James Taylor
I don’t think the HBase row_counter job is going to be faster than a
count(*) query. Both require a full table scan, so neither will be
particularly fast.

A couple of alternatives if you’re ok with an approximate count: 1) enable
stats collection (but you can leave off usage to parallelize queries) and
then do a SUM over the size column for the table using the stats table directly,
or 2) do a count(*) using TABLESAMPLE clause (again enabling stats as
described above) to prevent a full scan.
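
A rough sketch of both options (table name is illustrative; TABLESAMPLE needs Phoenix 4.12 or later and up-to-date guideposts):

-- Refresh guideposts if needed
UPDATE STATISTICS my_schema.my_table;

-- 1) Estimate from the stats table
SELECT SUM(guide_posts_row_count)
FROM SYSTEM.STATS
WHERE physical_name = 'MY_SCHEMA.MY_TABLE';

-- 2) Estimate by scanning only a sample of the table
SELECT COUNT(*) FROM my_schema.my_table TABLESAMPLE(5);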

On Thu, Feb 1, 2018 at 8:11 AM Flavio Pompermaier 
wrote:

> Hi Anil,
> Obviously I'm not using HBase just for the count query..Most of the time I
> do INSERT and selective queries, I was just trying to figure out if my
> HBase + Phoenix installation is robust enough to deal with a huge amount of
> data..
>
> On Thu, Feb 1, 2018 at 5:07 PM, anil gupta  wrote:
>
>> Hey Flavio,
>>
>> IMHO, If most of your app is just doing full table scans then i am not
>> really sure HBase(or any other NoSql) will be a good fit for your
>> solution.(building an OLAP system?) If you have point lookups and short
>> range scans then HBase/Phoenix will work well.
>> Also, if you wanna do select count(*). The HBase row_counter job will be
>> much faster than phoenix queries.
>>
>> Thanks,
>> Anil Gupta
>>
>> On Thu, Feb 1, 2018 at 7:35 AM, Flavio Pompermaier 
>> wrote:
>>
>>> I was able to make it work changing the following params (both on server
>>> and client side and restarting hbase) and now the query answers in about 6
>>> minutes:
>>>
>>> hbase.rpc.timeout (to 60)
>>> phoenix.query.timeoutMs (to 60)
>>> hbase.client.scanner.timeout.period (from 1 m to 10m)
>>> hbase.regionserver.lease.period (from 1 m to 10m)
>>>
>>> However I'd like to know id those performances could be easily improved
>>> or not. Any ideas?
>>>
>>> On Thu, Feb 1, 2018 at 4:30 PM, Vaghawan Ojha 
>>> wrote:
>>>
 I've the same problem, even after I increased the hbase.rpc.timeout the
 result is same. The difference is that I use 4.12.


 On Thu, Feb 1, 2018 at 8:23 PM, Flavio Pompermaier <
 pomperma...@okkam.it> wrote:

> Hi to all,
> I'm trying to use the brand-new Phoenix 4.13.2-cdh5.11.2 over HBase
> and everything was fine until the data was quite small (about few
> millions). As I inserted 170 M of rows in my table I cannot get the row
> count anymore (using ELECT COUNT) because of
> org.apache.hbase.ipc.CallTimeoutException (operationTimeout 6 
> expired).
> How can I fix this problem? I could increase the hbase.rpc.timeout
> parameter but I suspect I could improve a little bit the HBase performance
> first..the problem is that I don't know how.
>
> Thanks in advance,
> Flavio
>


>>>
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>>
>


Re: Is first query to a table region way slower?

2018-01-28 Thread James Taylor
Did you do an rs.next() on the first query? Sounds related to HConnection
establishment. Also, the least expensive query is SELECT 1 FROM T LIMIT 1.

Thanks,
James

On Sun, Jan 28, 2018 at 5:39 PM Pedro Boado  wrote:

> Hi all,
>
> I'm running into issues with a java springboot app that ends up querying a
> Phoenix cluster (from out of the cluster) through the non-thin client.
>
> Basically this application has a high latency - around 2 to 4 seconds -
> for the first query per  primary key to each region of a table with 180M
> records ( and 10 regions ) . Following calls - for different keys within
> the same region - have an average response time of ~60-80ms. No secondary
> indexes involved. No writes to the table at all during these queries.
>
> I don't think it's related to HConnection establishing as it was already
> stablished before the query ran ( a SELECT * FROM table LIMIT 1 is executed
> as soon as the datasource is created )
>
> I've been doing some quick profiling and almost all the time is spent
> inside the actual jdbc call.
>
> So here's the question: in your experience, is this normal behaviour - so
> I have to workaround the problem from application code warming up
> connections during app startup -  or is it something unusual? Any
> experience reducing first query latencies?
>
> Thanks!
>
>


Re: Phoenix executeBatch

2018-01-23 Thread James Taylor
Writing to HDFS with a columnar format like Parquet will always be faster
than writing to HBase. How about random access of a row? If you're not
doing point lookups and small range scans, you probably don't want to use
HBase (& Phoenix). HBase is writing more information than is written when
using Parquet. HBase is essentially maintaining an index by row key and
multiple versions of cells. You can reduce the amount of data by using our
storage formats [1] (though it still won't come close to Parquet in HDFS).

Using executeBatch won't make a performance difference unless you're using
the Phoenix Query Server (in which case the difference will be
substantial). For the thick client, I'd recommend turning auto commit off
and batching together ~100 rows before calling commit. These and other tips
can be found in our Tuning Guide [2].

[1] https://phoenix.apache.org/columnencoding.html
[2] https://phoenix.apache.org/tuning_guide.html

On Tue, Jan 23, 2018 at 3:45 AM, Flavio Pompermaier 
wrote:

> Any answer on this..?
>
> On Fri, Jan 12, 2018 at 10:38 AM, Flavio Pompermaier  > wrote:
>
>> Hi to all,
>> looking at the documentation (https://phoenix.apache.org/tu
>> ning_guide.html), in the writing section, there's the following
>> sentence: "Phoenix uses commit() instead of executeBatch() to control batch
>> updates". Am using a Phoenix connection with autocommit enabled +
>> PreparedStatement.executeBatch(). Doesn't Phoenix handle this
>> correctly...?
>> I'm asking this because writing directly to HDFS (as Parquet) takes 1
>> minute, while UPSERTING into HBASE takes 15 m...what can I do to detect
>> what is slowing down the write?
>>
>> Best,
>> Flavio
>>
>
>
>
> --
> Flavio Pompermaier
> Development Department
>
> OKKAM S.r.l.
> Tel. +(39) 0461 041809 <+39%200461%20041809>
>


Re: [ANNOUNCE] Apache Phoenix 4.13.2 for CDH 5.11.2 released

2018-01-20 Thread James Taylor
On Sat, Jan 20, 2018 at 12:29 PM Pedro Boado  wrote:

> The Apache Phoenix team is pleased to announce the immediate availability
> of the 4.13.2 release for CDH 5.11.2. Apache Phoenix enables SQL-based OLTP
> and operational analytics for Apache Hadoop using Apache HBase as its
> backing store and providing integration with other projects in the Apache
> ecosystem such as Spark, Hive, Pig, Flume, and MapReduce. This release is
> compatible with CDH 5.11.2
>
> Highlights of the release include:
>
> * For the first time, support for CDH due to community ask
> * More than 10 fixes over release 4.13.1-HBase-1.2
>
> Source and binary downloads are available here [1]. Folder 'parcels' can
> be directly used as parcel repository from Cloudera Manager ( follow [2] to
> one of Apache mirrors ) .
>
> Thanks,
> Pedro (on behalf of the Apache Phoenix team)
>
> [1] http://phoenix.apache.org/download.html
> [2]
> http://www.apache.org/dyn/closer.lua/phoenix/apache-phoenix-4.13.2-cdh5.11.2/parcels/
> 
>


Re: phoenix query execution plan

2018-01-18 Thread James Taylor
This is a limitation of our optimizer (see PHOENIX-627). Patches are
welcome. The fix would be isolated to WhereOptimizer.java.

Thanks,
James

On Thu, Jan 18, 2018 at 9:46 AM, abhi1 kumar  wrote:

> Hi All,
>
> I am using phoenix 4.7(hbase 1.1.xx) and came across following case of
> query plans :
>
> Table Schema:
> ---
> PK for table is composed of three cols C1+C2+C3 and there are some
> additional columns as well.
>
> 1. query plan for below select query shows skip scan:
>
>  > select * from table where (C1, C2) in ((x1, y1),(x2,y2))
>
> 2.  however, for below query it prefers server side filters :
>
>  > select * from table where ((C1 = x1 and C2=y1) or (C1 = x2 and
> C2=y2))
>
> Now, i was expecting the both to have similar plan with skip/range scan
> but it seems to be different
> based on used query syntax.I tried using skip scan hint but no luck.
>
> Is there any other way we can enforce  skip/range scan on 2nd queries
> because condition list could
> be really large and doing server side filter based query plan is
> inefficient compared to skip/range scan plan ?
>
> Thanks & Regards,
> Abhi
>


Re: Phoenix 4.13 on Hortonworks

2018-01-17 Thread James Taylor
Hi Sumanta,
Phoenix is an Apache project and not tied to any vendor. At Salesforce we
use the regular Apache Phoenix code base instead of any vendor specific
variation and this has worked out well for us.
Thanks,
James

On Wed, Jan 17, 2018 at 9:25 PM Sumanta Gh  wrote:

> Thanks for the info.
> As a solution integrator company, we had chosen Hortonworks and Phoenix in
> one of our big implementation.
> I am curious to know if Phoenix as a product is tied to any vendor.
>
> Regards
> Sumanta
>
>
> -Artem Ervits  wrote: -
> To: user@phoenix.apache.org
> From: Artem Ervits 
> Date: 01/17/2018 06:29PM
>
> Subject: Re: Phoenix 4.13 on Hortonworks
>
> Vendor specific questions should go to their associated forums. That said,
> 4.13 will ship with next major release.
>
> On Jan 17, 2018 6:24 AM, "Pedro Boado"  wrote:
>
>> Oh sorry I didn't know they hadn't worked on new upgrades for such a long
>> time.
>>
>> On 17 Jan 2018 10:50, "Sumanta Gh"  wrote:
>>
>>> Hortonworks latest HDP gives Phoenix 4.7 out of the box.
>>> I tried to upgrade to 4.13 but it is shutting down the region servers.
>>>
>>>
>>> Regards
>>> Sumanta
>>>
>>>
>>> -Pedro Boado  wrote: -
>>> To: user@phoenix.apache.org
>>> From: Pedro Boado 
>>> Date: 01/17/2018 04:04PM
>>> Subject: Re: Phoenix 4.13 on Hortonworks
>>>
>>> Hi,
>>>
>>> Afaik Hortonworks already includes Apache Phoenix as part of the
>>> platform, doesn't it?
>>>
>>> Cheers.
>>>
>>> On 17 Jan 2018 10:30, "Sumanta Gh"  wrote:
>>>
 I am eager to learn if anyone has installed Phoenix 4.13 on Hortonworks
 HDP cluster.
 Please let me know the version number of HDP that was used.

 Regards
 Sumanta

 =-=-=
 Notice: The information contained in this e-mail
 message and/or attachments to it may contain
 confidential or privileged information. If you are
 not the intended recipient, any dissemination, use,
 review, distribution, printing or copying of the
 information contained in this e-mail message
 and/or attachments to it are strictly prohibited. If
 you have received this communication in error,
 please notify us by reply e-mail or telephone and
 immediately and permanently delete the message
 and any attachments. Thank you




Re: UDF for lateral views

2018-01-17 Thread James Taylor
No, this isn't supported from a UDF. You're looking for PHOENIX-4311 to be
implemented. Let's continue the discussion there.

On Wed, Jan 17, 2018 at 7:16 PM, Krishna  wrote:

> According to this blog (http://phoenix-hbase.blogspot.in/2013/04/how-to-
> add-your-own-built-in-function.html), evaluate(...) is responsible for
> processing the input state of the row and filling up ImmutableBytesWritable
> pointer with transformed row.
> Did not find any references that'll support returning multiple rows for
> each input row. Does anyone know if UDF framework can support that?
>
> On Tue, Jan 16, 2018 at 6:07 PM, Krishna  wrote:
>
>> I would like to convert a column of ARRAY data-type such that each
>> element of the array is returned as a row. Hive supports it via Lateral
>> Views (https://cwiki.apache.org/confluence/display/Hive/LanguageMa
>> nual+LateralView)?
>>
>> Does UDF framework in Phoenix allow for building such functions?
>>
>
>


Re: How to reduce write amplification when exists a few global index tables?

2018-01-16 Thread James Taylor
You can use local indexes to reduce write amplification. In that case, all
index writes are local writes so the impact of multiple secondary indexes
is not as severe. Of course, there's a read penalty you'd pay, so make sure
you're ok with that.
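
For example (illustrative names):

-- Local index: index data is co-located with the data regions, so each write
-- stays local instead of fanning out to other region servers
CREATE LOCAL INDEX IF NOT EXISTS my_table_col1_local_idx ON my_schema.my_table (col1);

-- Global index (for comparison): a separate physical table, so every mutation
-- of the indexed columns adds a cross-server write
CREATE INDEX IF NOT EXISTS my_table_col1_idx ON my_schema.my_table (col1);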

On Tue, Jan 16, 2018 at 12:08 AM, cloud.pos...@gmail.com <
cloud.pos...@gmail.com> wrote:

> For a real-time writing scenario, the throughput will go down when the original
> table has more index tables. Given the implementation of secondary
> indexes, how can we accelerate writing speed?
>


Re: Query optimization

2017-12-27 Thread James Taylor
Looks like the second query is sorting the entire PEOPLE table (though it
seems like that shouldn’t be necessary as it’s probably already sorted by
PEOPLE_ID) while the first one is sorting only part of MY_TABLE (which is
likely less data). Might be a bug as the queries look the same.

Please log a JIRA and thanks for all the details.

On Wed, Dec 27, 2017 at 3:44 PM Flavio Pompermaier 
wrote:

> Ok.  So why the 2nd query requires more memory than the first one
> (nonetheless USE_SORT_MERGE_JOIN is used) and can't complete?
>
>
> On 28 Dec 2017 00:33, "James Taylor"  wrote:
>
> A hash join (the default) will be faster but the tables being cached (last
> or RHS table being joined) must be small enough to fit into memory on the
> region server. If it's too big, you can use the USE_SORT_MERGE_JOIN which
> would not have this restriction.
>
> On Wed, Dec 27, 2017 at 3:16 PM, Flavio Pompermaier 
> wrote:
>
>> Just to summarize things...is the best approach, in terms of required
>> memory, for Apache Phoenix queries to use sort merge join? Should inner
>> queries be avoided?
>>
>>
>> On 22 Dec 2017 22:47, "Flavio Pompermaier"  wrote:
>>
>> MYTABLE is definitely much bigger than PEOPLE table, in terms of
>> cardinality. In terms of cells (rows x columns) PEOPLE is probably bigger
>>
>> On 22 Dec 2017 22:36, "Ethan"  wrote:
>>
>>> I see. I think client side probably hold on to the iterators from the
>>> both sides and crawling forward to do the merge sort. in this case should
>>> be no much memory footprint either way where the filter is performed.
>>>
>>> On December 22, 2017 at 1:04:18 PM, James Taylor (jamestay...@apache.org)
>>> wrote:
>>>
>>> There’s no shipping of any tables with a sort merge join.
>>>
>>> On Fri, Dec 22, 2017 at 1:02 PM Ethan Wang  wrote:
>>>
>>>> I see. Looks like it's possible the rhs (MYTABLE) is too big to ship
>>>> around without get filtered first. Just for experiment, if you took out
>>>> hint USE_SORT_MERGE_JOIN, what will be the plan?
>>>>
>>>>
>>>> On December 22, 2017 at 12:46:25 PM, James Taylor (
>>>> jamestay...@apache.org) wrote:
>>>>
>>>> For sort merge join, both post-filtered table results are sorted on the
>>>> server side and then a merge sort is done on the client-side.
>>>>
>>>> On Fri, Dec 22, 2017 at 12:44 PM, Ethan  wrote:
>>>>
>>>>> Hello Flavio,
>>>>>
>>>>> From the plan looks like to me the second query is doing the filter at
>>>>> parent table (PEOPLE). So what is the size of your PEOPLE and MYTABLE
>>>>> (after filtered) respectively?
>>>>>
>>>>> For sort merge join, anyone knows are the both sides get shipped to
>>>>> client to do the merge sort?
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> On December 22, 2017 at 9:58:30 AM, Flavio Pompermaier (
>>>>> pomperma...@okkam.it) wrote:
>>>>>
>>>>> Any help here...?
>>>>>
>>>>> On 20 Dec 2017 17:58, "Flavio Pompermaier" 
>>>>> wrote:
>>>>>
>>>>>> Hi to all,
>>>>>> I'm trying to find the best query for my use case but I found that
>>>>>> one version work and the other one does not (unless that I don't apply 
>>>>>> some
>>>>>> tuning to timeouts etc like explained in [1]).
>>>>>>
>>>>>> The 2 queries extract the same data but, while the first query
>>>>>> terminates the second does not.
>>>>>> *PS*:  without the USE_SORT_MERGE_JOIN both queries weren't working
>>>>>>
>>>>>> SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*)
>>>>>> FROM PEOPLE ds JOIN MYTABLE l ON ds.PERSON_ID = l.LOCALID
>>>>>> WHERE l.EID IS NULL AND l.DSID = 'PEOPLE' AND l.HAS_CANDIDATES =
>>>>>> FALSE;
>>>>>>
>>>>>> +---+-+++
>>>>>> | PLAN
>>>>>>   | EST_BYTES_READ  |
>>>>>> EST_ROWS_READ  |  EST_INFO_TS   |
>>>>>>
>>>>

Re: Add automatic/default SALT

2017-12-27 Thread James Taylor
There's some information in our Tuning Guide[1] on recommendations of when
to use or not use salted tables. We don't recommend it unless your table
has a monotonically increasing primary key. Understanding why is best
explained with an example. Let's say you have a table with SALT_BUCKETS=20.
When you execute a simple query against that table that might return 10
contiguous rows, you'll be executing 20 scans instead of just one. Each
scan will open a block on the region server - that's 20 block fetches
versus what would otherwise be a single block fetch (assuming that the 10
rows being returned are in the same block since they're contiguous). The
only time you're not hit with this 20x block fetch cost is if you're doing
a point lookup (as the client can precompute the salt byte in that case).
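
An illustrative example of the trade-off (names are made up):

-- Salted into 20 buckets: useful for a write-heavy, monotonically increasing key
CREATE TABLE my_schema.events (
    created_date  DATE NOT NULL,
    event_id      BIGINT NOT NULL,
    host          VARCHAR
    CONSTRAINT pk PRIMARY KEY (created_date, event_id)
) SALT_BUCKETS=20;

-- A small range query like this now fans out into 20 parallel scans (one per
-- salt bucket) instead of a single contiguous scan over one block
SELECT * FROM my_schema.events
WHERE created_date >= TO_DATE('2017-12-26', 'yyyy-MM-dd')
  AND created_date < TO_DATE('2017-12-27', 'yyyy-MM-dd');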

[1] https://phoenix.apache.org/tuning_guide.html

On Wed, Dec 27, 2017 at 3:26 PM, Flavio Pompermaier 
wrote:

> Hi Josh,
> Thanks for the feedback. Do you have any concrete example where salted
> tables are 'evil'? However I really like the idea to enable salting using
> some predefined variable (like number of region servers or something like
> that).
> An example could be:
>
> SALT_BUCKETS = $REGION_SERVERS_COUNT
>
> Best,
> Flavio
>
>
> On 12 Dec 2017 01:45, "Josh Elser"  wrote:
>
> I'm a little hesitant of this for a few things I've noticed from lots of
> various installations:
>
> * Salted tables are *not* always more efficient. In fact, I've found
> myself giving advice to not use salted tables a bit more than expected.
> Certain kinds of queries will require much more work if you have salting
> over not having salting
>
> * Considering salt buckets as a measure of parallelism for a table, it's
> impossible for the system to correctly judge what the parallelism of the
> cluster should be. For example, with 10 RS and 1 Phoenix table, you would
> want to start with 10 salt buckets. However, with 10 RS and 100 Phoenix
> tables, you'd *maybe* want to do 3 salt buckets. It's hard to make system
> wide decisions correctly without a global view of the entire system.
>
> I think James was trying to capture some of this in his use of "relative
> conservative default", but I'd take that even a bit farther to say I
> consider it harmful for Phoenix to do that out of the box.
>
> However, I would flip the question upside down instead: what kind of
> suggestions can Phoenix make as a database to the user to _recommend_ to
> them that they enable salting on a table given its schema and important
> queries?
>
>
> On 12/8/17 12:34 PM, James Taylor wrote:
>
>> Hi Flavio,
>> I like the idea of “adaptable configuration” where you specify a config
>> value as a % of some cluster resource (with relatively conservative
>> defaults). Salting is somewhat of a gray area though as it’s not config
>> based, but driven by your DDL. One solution you could implement on top of
>> Phoenix is scripting for DDL that fills in the salt bucket parameter based
>> on cluster size.
>> Thanks,
>> James
>>
>> On Tue, Dec 5, 2017 at 12:50 AM Flavio Pompermaier > <mailto:pomperma...@okkam.it>> wrote:
>>
>> Hi to all,
>> as stated by at the documentation[1] "for optimal performance,
>> number of salt buckets should match number of region servers".
>> So, why not to add an option AUTO/DEFAULT for salting that defaults
>> this parameter to the number of region servers?
>> Otherwise I have to manually connect to HBase, retrieve that number
>> and pass to Phoenix...
>> What do you think?
>>
>> [1] https://phoenix.apache.org/performance.html#Salting
>>
>> Best,
>> Flavio
>>
>>
>


Re: Query optimization

2017-12-27 Thread James Taylor
A hash join (the default) will be faster but the tables being cached (last
or RHS table being joined) must be small enough to fit into memory on the
region server. If it's too big, you can use the USE_SORT_MERGE_JOIN which
would not have this restriction.

On Wed, Dec 27, 2017 at 3:16 PM, Flavio Pompermaier 
wrote:

> Just to summarize things...is the best approach, in terms of required
> memory, for Apache Phoenix queries to use sort merge join? Should inner
> queries be avoided?
>
>
> On 22 Dec 2017 22:47, "Flavio Pompermaier"  wrote:
>
> MYTABLE is definitely much bigger than PEOPLE table, in terms of
> cardinality. In terms of cells (rows x columns) PEOPLE is probably bigger
>
> On 22 Dec 2017 22:36, "Ethan"  wrote:
>
>> I see. I think client side probably hold on to the iterators from the
>> both sides and crawling forward to do the merge sort. in this case should
>> be no much memory footprint either way where the filter is performed.
>>
>> On December 22, 2017 at 1:04:18 PM, James Taylor (jamestay...@apache.org)
>> wrote:
>>
>> There’s no shipping of any tables with a sort merge join.
>>
>> On Fri, Dec 22, 2017 at 1:02 PM Ethan Wang  wrote:
>>
>>> I see. Looks like it's possible the rhs (MYTABLE) is too big to ship
>>> around without get filtered first. Just for experiment, if you took out
>>> hint USE_SORT_MERGE_JOIN, what will be the plan?
>>>
>>>
>>> On December 22, 2017 at 12:46:25 PM, James Taylor (
>>> jamestay...@apache.org) wrote:
>>>
>>> For sort merge join, both post-filtered table results are sorted on the
>>> server side and then a merge sort is done on the client-side.
>>>
>>> On Fri, Dec 22, 2017 at 12:44 PM, Ethan  wrote:
>>>
>>>> Hello Flavio,
>>>>
>>>> From the plan looks like to me the second query is doing the filter at
>>>> parent table (PEOPLE). So what is the size of your PEOPLE and MYTABLE
>>>> (after filtered) respectively?
>>>>
>>>> For sort merge join, anyone knows are the both sides get shipped to
>>>> client to do the merge sort?
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> On December 22, 2017 at 9:58:30 AM, Flavio Pompermaier (
>>>> pomperma...@okkam.it) wrote:
>>>>
>>>> Any help here...?
>>>>
>>>> On 20 Dec 2017 17:58, "Flavio Pompermaier" 
>>>> wrote:
>>>>
>>>>> Hi to all,
>>>>> I'm trying to find the best query for my use case but I found that one
>>>>> version work and the other one does not (unless that I don't apply some
>>>>> tuning to timeouts etc like explained in [1]).
>>>>>
>>>>> The 2 queries extract the same data but, while the first query
>>>>> terminates the second does not.
>>>>> *PS*:  without the USE_SORT_MERGE_JOIN both queries weren't working
>>>>>
>>>>> SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*)
>>>>> FROM PEOPLE ds JOIN MYTABLE l ON ds.PERSON_ID = l.LOCALID
>>>>> WHERE l.EID IS NULL AND l.DSID = 'PEOPLE' AND l.HAS_CANDIDATES = FALSE;
>>>>> +---
>>>>> +---
>>>>> --+++
>>>>> | PLAN
>>>>>   | EST_BYTES_READ  |
>>>>> EST_ROWS_READ  |  EST_INFO_TS   |
>>>>> +---
>>>>> +---
>>>>> --+++
>>>>> | SORT-MERGE-JOIN (INNER) TABLES
>>>>>   | 14155777900
>>>>> <(415)%20577-7900> | 12077867   | 1513754378759  |
>>>>> | CLIENT 42-CHUNK 6168903 ROWS 1132461 BYTES PARALLEL 3-WAY
>>>>> FULL SCAN OVER PEOPLE | 14155777900 <(415)%20577-7900>
>>>>>| 12077867   | 1513754378759  |
>>>>> | SERVER FILTER BY FIRST KEY ONLY
>>>>>  | 14155777900
>>>>> <(415)%20577-7900> | 12077867   | 1513754378759  |
>>>>> | CLIENT MERGE SORT
>>>>> 

Re: Query optimization

2017-12-22 Thread James Taylor
There’s no shipping of any tables with a sort merge join.

On Fri, Dec 22, 2017 at 1:02 PM Ethan Wang  wrote:

> I see. Looks like it's possible the rhs (MYTABLE) is too big to ship
> around without get filtered first. Just for experiment, if you took out
> hint USE_SORT_MERGE_JOIN, what will be the plan?
>
>
> On December 22, 2017 at 12:46:25 PM, James Taylor (jamestay...@apache.org)
> wrote:
>
> For sort merge join, both post-filtered table results are sorted on the
> server side and then a merge sort is done on the client-side.
>
> On Fri, Dec 22, 2017 at 12:44 PM, Ethan  wrote:
>
>> Hello Flavio,
>>
>> From the plan looks like to me the second query is doing the filter at
>> parent table (PEOPLE). So what is the size of your PEOPLE and MYTABLE
>> (after filtered) respectively?
>>
>> For sort merge join, anyone knows are the both sides get shipped to
>> client to do the merge sort?
>>
>> Thanks,
>>
>>
>> On December 22, 2017 at 9:58:30 AM, Flavio Pompermaier (
>> pomperma...@okkam.it) wrote:
>>
>> Any help here...?
>>
>> On 20 Dec 2017 17:58, "Flavio Pompermaier"  wrote:
>>
>>> Hi to all,
>>> I'm trying to find the best query for my use case but I found that one
>>> version work and the other one does not (unless that I don't apply some
>>> tuning to timeouts etc like explained in [1]).
>>>
>>> The 2 queries extract the same data but, while the first query
>>> terminates the second does not.
>>> *PS*:  without the USE_SORT_MERGE_JOIN both queries weren't working
>>>
>>> SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*)
>>> FROM PEOPLE ds JOIN MYTABLE l ON ds.PERSON_ID = l.LOCALID
>>> WHERE l.EID IS NULL AND l.DSID = 'PEOPLE' AND l.HAS_CANDIDATES = FALSE;
>>>
>>> +---+-+++
>>> | PLAN
>>> | EST_BYTES_READ  | EST_ROWS_READ
>>> |  EST_INFO_TS   |
>>>
>>> +---+-+++
>>> | SORT-MERGE-JOIN (INNER) TABLES
>>> | 14155777900 <(415)%20577-7900>
>>>| 12077867   | 1513754378759  |
>>> | CLIENT 42-CHUNK 6168903 ROWS 1132461 BYTES PARALLEL 3-WAY FULL
>>> SCAN OVER PEOPLE | 14155777900 <(415)%20577-7900> |
>>> 12077867   | 1513754378759  |
>>> | SERVER FILTER BY FIRST KEY ONLY
>>>| 14155777900 <(415)%20577-7900>
>>>| 12077867   | 1513754378759  |
>>> | CLIENT MERGE SORT
>>>| 14155777900 <(415)%20577-7900>
>>>| 12077867   | 1513754378759  |
>>> | AND (SKIP MERGE)
>>> | 14155777900 <(415)%20577-7900>
>>>| 12077867   | 1513754378759  |
>>> | CLIENT 15-CHUNK 5908964 ROWS 2831155679 BYTES PARALLEL 15-WAY
>>> RANGE SCAN OVER MYTABLE [0] - [2]  | 14155777900 <(415)%20577-7900>
>>>  | 12077867   | 1513754378759  |
>>> | SERVER FILTER BY (EID IS NULL AND DSID = 'PEOPLE' AND
>>> HAS_CANDIDATES = false)   | 14155777900
>>> <(415)%20577-7900> | 12077867   | 1513754378759  |
>>> | SERVER SORTED BY [L.LOCALID]
>>> | 14155777900
>>> <(415)%20577-7900> | 12077867   | 1513754378759  |
>>> | CLIENT MERGE SORT
>>>| 14155777900 <(415)%20577-7900>
>>>| 12077867   | 1513754378759  |
>>> | CLIENT AGGREGATE INTO SINGLE ROW
>>> | 14155777900 <(415)%20577-7900>
>>>| 12077867   | 1513754378759  |
>>>
>>> +---+-+++
>>>
>>>
>>> SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*)
>>> FROM (SELECT LOCALID FROM MYTABLE
>>> WHERE EID IS NULL AND DSID = 'PEOPLE' AND HAS_CANDIDATES = FALSE) l JOIN
>>> PEOPLE  ds ON ds.PER

Re: Query optimization

2017-12-22 Thread James Taylor
For sort merge join, both post-filtered table results are sorted on the
server side and then a merge sort is done on the client-side.

On Fri, Dec 22, 2017 at 12:44 PM, Ethan  wrote:

> Hello Flavio,
>
> From the plan looks like to me the second query is doing the filter at
> parent table (PEOPLE). So what is the size of your PEOPLE and MYTABLE
> (after filtered) respectively?
>
> For sort merge join, anyone knows are the both sides get shipped to client
> to do the merge sort?
>
> Thanks,
>
>
> On December 22, 2017 at 9:58:30 AM, Flavio Pompermaier (
> pomperma...@okkam.it) wrote:
>
> Any help here...?
>
> On 20 Dec 2017 17:58, "Flavio Pompermaier"  wrote:
>
>> Hi to all,
>> I'm trying to find the best query for my use case but I found that one
>> version work and the other one does not (unless that I don't apply some
>> tuning to timeouts etc like explained in [1]).
>>
>> The 2 queries extract the same data but, while the first query terminates
>> the second does not.
>> *PS*:  without the USE_SORT_MERGE_JOIN both queries weren't working
>>
>> SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*)
>> FROM PEOPLE ds JOIN MYTABLE l ON ds.PERSON_ID = l.LOCALID
>> WHERE l.EID IS NULL AND l.DSID = 'PEOPLE' AND l.HAS_CANDIDATES = FALSE;
>> +--------------------------------------------------------------------------------------------------+-----------------+----------------+-----------------+
>> | PLAN | EST_BYTES_READ | EST_ROWS_READ | EST_INFO_TS |
>> +--------------------------------------------------------------------------------------------------+-----------------+----------------+-----------------+
>> | SORT-MERGE-JOIN (INNER) TABLES | 14155777900 | 12077867 | 1513754378759 |
>> | CLIENT 42-CHUNK 6168903 ROWS 1132461 BYTES PARALLEL 3-WAY FULL SCAN OVER PEOPLE | 14155777900 | 12077867 | 1513754378759 |
>> | SERVER FILTER BY FIRST KEY ONLY | 14155777900 | 12077867 | 1513754378759 |
>> | CLIENT MERGE SORT | 14155777900 | 12077867 | 1513754378759 |
>> | AND (SKIP MERGE) | 14155777900 | 12077867 | 1513754378759 |
>> | CLIENT 15-CHUNK 5908964 ROWS 2831155679 BYTES PARALLEL 15-WAY RANGE SCAN OVER MYTABLE [0] - [2] | 14155777900 | 12077867 | 1513754378759 |
>> | SERVER FILTER BY (EID IS NULL AND DSID = 'PEOPLE' AND HAS_CANDIDATES = false) | 14155777900 | 12077867 | 1513754378759 |
>> | SERVER SORTED BY [L.LOCALID] | 14155777900 | 12077867 | 1513754378759 |
>> | CLIENT MERGE SORT | 14155777900 | 12077867 | 1513754378759 |
>> | CLIENT AGGREGATE INTO SINGLE ROW | 14155777900 | 12077867 | 1513754378759 |
>> +--------------------------------------------------------------------------------------------------+-----------------+----------------+-----------------+
>>
>>
>> SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*)
>> FROM (SELECT LOCALID FROM MYTABLE
>> WHERE EID IS NULL AND DSID = 'PEOPLE' AND HAS_CANDIDATES = FALSE) l JOIN
>> PEOPLE  ds ON ds.PERSON_ID = l.LOCALID;
>> +--------------------------------------------------------------------------------------------------+-----------------+----------------+-----------------+
>> | PLAN | EST_BYTES_READ | EST_ROWS_READ | EST_INFO_TS |
>> +--------------------------------------------------------------------------------------------------+-----------------+----------------+-----------------+
>> | SORT-MERGE-JOIN (INNER) TABLES | 14155777900 | 12077867 | 1513754378759 |
>> | CLIENT 15-CHUNK 5908964 ROWS 2831155679 BYTES PARALLEL 3-WAY RANGE SCAN OVER MYTABLE [0] - [2] | 14155777900 | 12077867 | 1513754378759 |
>> | SERVER FILTER BY (EID IS NULL AND DSID = 'PEOPLE' AND HAS_CANDIDATES = false) | 14155777900 | 12077867 | 1513754378759 |
>> | CLIENT MERGE SORT | 14155777900 | 12077867 | 1513754378759 |
>> | AND (SKIP MERGE) | 141557

Re: Efficient way to get the row count of a table

2017-12-19 Thread James Taylor
The count would change when a major compaction is done. Back in 4.7, it may
have changed when a split occurs too (but this is no longer the case). I'd
recommend moving to a newer version: 4.7 was release almost two years ago
and is six releases back from the current 4.13 release. FWIW, we're getting
ready to release a 4.13 for CDH.

On Tue, Dec 19, 2017 at 4:27 PM, Jins George  wrote:

> Thank you James and Mujtaba  for the responses.
>
> I am OK with an estimated count, so using the SYSTEM.STATS table looks good
> in my case. But when I tried this, it gave me inconsistent results. For
> example:
>
> Time 1:
> select count(*) from myschema.mytable  => 3474085
> select sum(guide_posts_row_count) from system.stats where physical_name =
> 'myschema.device_data'  => 3348090
>
> Time 2 : ( Time1 + ~10 mins)
> select count(*) from myschema.mytable  => 3474106
> select sum(guide_posts_row_count) from system.stats where physical_name =
> 'myschema.device_data'  => 3348080
>
> So I was expecting the stats count to go up but surprisingly, the count
> went down.  Is there a specific configuration or something else that I am
> missing?
>
> I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.
>
> Thanks,
> Jins George
>
>
>
> On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
>
> Another alternative outside Phoenix is to use the
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html
> M/R job.
>
> On Tue, Dec 19, 2017 at 3:18 PM, James Taylor 
> wrote:
>
>> If it needs to be 100% accurate, then count(*) is the only way. If your
>> data is write-once data, you might be able to track the row count at the
>> application level through some kind of atomic counter in a different table
>> (but this will likely be brittle). If you can live with an estimate, you
>> could enable statistics [1], optionally configuring Phoenix not to use
>> stats for parallelization [2], and query the SYSTEM.STATS table to get an
>> estimate [3].
>>
>> Another interesting alternative if you want the approximate row count
>> when you have a where clause would be to use the new table sampling feature
>> [4]. You'd also want stats enabled for this to be more accurate too.
>>
>> Thanks,
>> James
>>
>>
>> [1] https://phoenix.apache.org/update_statistics.html
>> [2] phoenix.use.stats.parallelization=false
>> [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
>> physical_name='my_schema.my_table'
>>  and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
>> have multiple column families
>> [4] https://phoenix.apache.org/tablesample.html
>>
>> On Tue, Dec 19, 2017 at 2:57 PM, Jins George < 
>> jins.geo...@aeris.net> wrote:
>>
>>> Hi,
>>>
>>> Is there a way to get the total row count of a phoenix table without
>>> running select count(*) from table ?
>>> my use case is to monitor the record count in a table every x minutes,
>>> so didn't want to put load on the system by running a select count(*) query.
>>>
>>> Thanks,
>>> Jins George
>>>
>>
>>
>
>


Re: Efficient way to get the row count of a table

2017-12-19 Thread James Taylor
If it needs to be 100% accurate, then count(*) is the only way. If your
data is write-once data, you might be able to track the row count at the
application level through some kind of atomic counter in a different table
(but this will likely be brittle). If you can live with an estimate, you
could enable statistics [1], optionally configuring Phoenix not to use
stats for parallelization [2], and query the SYSTEM.STATS table to get an
estimate [3].

Another interesting alternative if you want the approximate row count when
you have a where clause would be to use the new table sampling feature [4].
You'd also want stats enabled for this to be more accurate too.

Thanks,
James


[1] https://phoenix.apache.org/update_statistics.html
[2] phoenix.use.stats.parallelization=false
[3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
physical_name='my_schema.my_table'
 and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
have multiple column families
[4] https://phoenix.apache.org/tablesample.html
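
To make [3] concrete, here's a minimal JDBC sketch (untested; the connection
URL and table name are placeholders you'd need to adjust):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RowCountEstimate {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            // Sum the guidepost row counts that stats collection maintains for the table.
            PreparedStatement stmt = conn.prepareStatement(
                "SELECT SUM(GUIDE_POSTS_ROW_COUNT) FROM SYSTEM.STATS WHERE PHYSICAL_NAME = ?");
            stmt.setString(1, "MY_SCHEMA.MY_TABLE");
            try (ResultSet rs = stmt.executeQuery()) {
                if (rs.next()) {
                    System.out.println("Estimated row count: " + rs.getLong(1));
                }
            }
        }
    }
}

Remember to add the COLUMN_FAMILY filter from [3] if the table has more than
one column family, since the guideposts are tracked per family.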

On Tue, Dec 19, 2017 at 2:57 PM, Jins George  wrote:

> Hi,
>
> Is there a way to get the total row count of a phoenix table without
> running select count(*) from table ?
> my use case is to monitor the record count in a table every x minutes, so
> didn't want to put load on the system by running a select count(*) query.
>
> Thanks,
> Jins George
>


Re: Hive UDF for creating row key in HBASE

2017-12-18 Thread James Taylor
Hi Chethan,
As Ethan mentioned, take a look first at the Phoenix/Hive integration. If
that doesn't work for you, the best way to get the row key for a phoenix
table is to execute an UPSERT VALUES against the primary key columns
without committing it. We have a utility function that will return the
Cells that would be submitted to the server, which you can use to get the row
key. You can do this through a "connectionless" JDBC Connection, so you
don't need any RPCs (including executing the CREATE TABLE call so that
Phoenix knows the metadata).

Take a look at ConnectionlessTest.testConnectionlessUpsert() for an example.
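
Roughly, it looks like the sketch below - untested, with a placeholder table,
and note that the connectionless URL ("jdbc:phoenix:none") and the element
type returned by getUncommittedDataIterator() (KeyValue vs. Cell) can vary a
bit between releases:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;
import org.apache.phoenix.util.PhoenixRuntime;

public class RowKeyFromUpsert {
    public static void main(String[] args) throws Exception {
        // Connectionless connection: no RPCs are made to the cluster.
        Connection conn = DriverManager.getConnection("jdbc:phoenix:none");
        // Phoenix only needs the table metadata, so run the DDL once per JVM.
        conn.createStatement().execute(
            "CREATE TABLE IF NOT EXISTS T (PK1 VARCHAR NOT NULL, PK2 VARCHAR NOT NULL, "
            + "CONSTRAINT PK PRIMARY KEY (PK1, PK2))");
        // Upsert only the primary key columns and do NOT commit.
        conn.createStatement().execute("UPSERT INTO T (PK1, PK2) VALUES ('a', 'b')");
        // Inspect the cells that would have been sent to the server.
        Iterator<Pair<byte[], List<KeyValue>>> it =
            PhoenixRuntime.getUncommittedDataIterator(conn);
        while (it.hasNext()) {
            for (KeyValue kv : it.next().getSecond()) {
                System.out.println("row key: " + Bytes.toStringBinary(CellUtil.cloneRow(kv)));
            }
        }
        conn.rollback(); // discard the uncommitted mutation
        conn.close();
    }
}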

Thanks,
James

On Sun, Dec 17, 2017 at 1:19 PM, Ethan  wrote:

>
> Hi Chethan,
>
> When you write data from HDFS, are you planning to use hive to do the ETL?
> Can we do something like reading from HDFS and using Phoenix to write into
> HBase?
>
> There is https://phoenix.apache.org/hive_storage_handler.html; I think it
> enables Hive to read from a Phoenix table, not the other way around.
>
> Thanks,
>
> On December 16, 2017 at 8:09:10 PM, Chethan Bhawarlal (
> cbhawar...@collectivei.com) wrote:
>
> Hi Dev,
>
> Currently I am planning to write data from HDFS to HBASE. And to read data
> I am using Phoenix.
>
> Phoenix concatenates its primary key columns, separated by a zero byte
> ("\x00"), and stores the result in HBase as the row key.
>
> I want to write a custom UDF in Hive to create the HBase row key value in
> such a way that Phoenix will be able to split it back into multiple columns.
>
> Following is the custom UDF code I am trying to write;
>
>
> import org.apache.hadoop.hive.ql.exec.Description;
> import org.apache.hadoop.hive.ql.exec.UDF;
> import org.apache.hadoop.hive.ql.udf.UDFType;
>
> @UDFType(stateful = true)
> @Description(name = "hbasekeygenerator", value = "_FUNC_(existing) - Returns a unique rowkey value for hbase")
> public class CIHbaseKeyGenerator extends UDF {
>
>   // Phoenix separates variable-length primary key columns with a zero byte,
>   // so the separator must be the character '\u0000', not the string "0"
>   // that Byte.toString((byte) 0x00) would produce.
>   public String evaluate(String[] args) {
>     char zerobyte = '\u0000';
>     StringBuilder sb = new StringBuilder();
>
>     for (int i = 0; i < args.length - 1; ++i) {
>       sb.append(args[i]);
>       sb.append(zerobyte);
>     }
>     sb.append(args[args.length - 1]);
>     return sb.toString();
>   }
> }
>
>
> Following are my questions,
>
>
> 1. Is it possible to emulate Phoenix's encoding (so that Phoenix can decode
> the row key) using a custom Hive UDF?
>
>
> 2. If it is possible, what is the best approach for this? It would be great
> if someone could share some pointers on this.
>
>
> Thanks,
>
> Chethan.
>
>
>
>
>
>
>
>
>
>
> Collective[i] dramatically improves sales and marketing performance using
> technology, applications and a revolutionary network designed to provide
> next generation analytics and decision-support directly to business users.
> Our goal is to maximize human potential and minimize mistakes. In most
> cases, the results are astounding. We cannot, however, stop emails from
> sometimes being sent to the wrong person. If you are not the intended
> recipient, please notify us by replying to this email's sender and deleting
> it (and any attachments) permanently from your system. If you are, please
> respect the confidentiality of this communication's contents.
>
>


Re: Add automatic/default SALT

2017-12-08 Thread James Taylor
Hi Flavio,
I like the idea of “adaptable configuration” where you specify a config
value as a % of some cluster resource (with relatively conservative
defaults). Salting is somewhat of a gray area though as it’s not config
based, but driven by your DDL. One solution you could implement on top of
Phoenix is scripting for DDL that fills in the salt bucket parameter based
on cluster size.
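For what it's worth, a rough sketch of such a script (untested; the JDBC URL
and the DDL are placeholders, and it assumes the HBase 1.x client API for
looking up the number of live region servers):

import java.sql.Connection;
import java.sql.DriverManager;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateSaltedTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        int saltBuckets;
        // Ask HBase how many region servers are live right now.
        try (org.apache.hadoop.hbase.client.Connection hbase =
                 ConnectionFactory.createConnection(conf);
             Admin admin = hbase.getAdmin()) {
            saltBuckets = admin.getClusterStatus().getServersSize();
        }
        // Plug that number into the SALT_BUCKETS table property.
        try (Connection phoenix = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            phoenix.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS MY_TABLE (K VARCHAR PRIMARY KEY, V VARCHAR) "
                + "SALT_BUCKETS = " + saltBuckets);
        }
    }
}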
Thanks,
James

On Tue, Dec 5, 2017 at 12:50 AM Flavio Pompermaier 
wrote:

> Hi to all,
> as stated by at the documentation[1] "for optimal performance, number of
> salt buckets should match number of region servers".
> So, why not to add an option AUTO/DEFAULT for salting that defaults this
> parameter to the number of region servers?
> Otherwise I have to manually connect to HBase, retrieve that number and
> pass to Phoenix...
> What do you think?
>
> [1] https://phoenix.apache.org/performance.html#Salting
>
> Best,
> Flavio
>


[ANNOUNCE] Apache Phoenix 4.13.1 released

2017-12-07 Thread James Taylor
The Apache Phoenix team is pleased to announce the immediate availability
of the 4.13.1 release. Apache Phoenix enables SQL-based OLTP and
operational analytics for Apache Hadoop using Apache HBase as its backing
store and providing integration with other projects in the Apache ecosystem
such as Spark, Hive, Pig, Flume, and MapReduce. This release is compatible
with HBase 0.98, 1.1, 1.2 and 1.3.

Highlights of the release include:

* Support again for HBase 1.1 and 1.2 due to community ask
* Inclusion of required server-side jars for new COLLATION_KEY function [1]

Source and binary downloads are available here [2]. Note that in the
future, releases will be done separately for each branch by dedicated
release managers. I'll send out an email on this soon.

Thanks,
James (on behalf of the Apache Phoenix team)

[1] https://issues.apache.org/jira/browse/PHOENIX-4384
[2] http://phoenix.apache.org/download.html


Re: upsert data with specific timestamp using CurrentSCN fail with error java.sql.SQLException: ERROR 518 (25502): Mutations are not permitted for a read-only connection.

2017-12-06 Thread James Taylor
Not possible now, but please file a JIRA as it’s possible that this
restriction could be relaxed if there are no indexes.

On Sun, Dec 3, 2017 at 10:40 PM Bulvik, Noam  wrote:

> Assuming I won’t have index on this table at upsert time (I may build them
> later using the index tool) is there a way to make the connection not read
> only ?
>
>
>
> *From:* James Taylor [mailto:jamestay...@apache.org]
> *Sent:* Sunday, December 3, 2017 5:18 PM
> *To:* user@phoenix.apache.org
> *Subject:* Re: upsert data with specific timestamp using CurrentSCN fail
> with error java.sql.SQLException: ERROR 518 (25502): Mutations are not
> permitted for a read-only connection.
>
>
>
> The CurrentSCN property may only be used for reading, not writing as of
> 4.13. We found that this kind of rewriting of history causes indexes to
> become corrupted. The documentation needs to be updated.
>
>
>
> On Sun, Dec 3, 2017 at 7:07 AM Bulvik, Noam  wrote:
>
> Hi,
>
>
>
> I want to upsert historic data into a table with indexes. I have a TTL
> defined there, and I want to load the data as if it had been loaded at the
> original time so it will be cleaned up automatically by the TTL mechanism.
> I implemented a small Java program to load the data after setting
> "CurrentSCN" to an older date (based on this link
> https://phoenix.apache.org/faq.html#Can_phoenix_work_on_tables_with_arbitrary_timestamp_as_flexible_as_HBase_API
> )
>
>
>
> But I get java.sql.SQLException: ERROR 518 (25502): Mutations are not
> permitted for a read-only connection on each upsert. The same code without
> setting the CurrentSCN property works fine.
>
>
>
> I am using phoenix 4.13.1 – is this a bug? If not, is there something
> missing in the documentation on how to set "CurrentSCN"?
>
>
>
> Regards,
>
>
>
> *Noam *
>
>
>
>
>
>
> --
>
>
> PRIVILEGED AND CONFIDENTIAL
> PLEASE NOTE: The information contained in this message is privileged and
> confidential, and is intended only for the use of the individual to whom it
> is addressed and others who have been specifically authorized to receive
> it. If you are not the intended recipient, you are hereby notified that any
> dissemination, distribution or copying of this communication is strictly
> prohibited. If you have received this communication in error, or if any
> problems occur with transmission, please contact sender. Thank you.
>
>
>


Re: Boolean condition doesn't support IS operator

2017-12-05 Thread James Taylor
In SQL, you use IS with NULL or NOT NULL as a way to test for null since
column = NULL will never return true. I’ve never seen a SQL dialect that
allows column IS true or column IS 4 as some kind of alternative to using
an equality expression. More on that here:
https://stackoverflow.com/questions/859762/is-this-the-proper-way-to-do-boolean-test-in-sql

On Tue, Dec 5, 2017 at 8:28 AM Flavio Pompermaier 
wrote:

> At the moment I'm using VALID = true but IMHO Phoenix should support the
> IS syntax for boolean columns as well..of course this is just a +1 on this
> topic
> If no other one ask for this feature you could ignore this :)
>
> On Tue, Dec 5, 2017 at 4:31 PM, James Taylor 
> wrote:
>
>> How about just using VALID = true or just VALID like this: select * from
>> t where VALID
>>
>> On Tue, Dec 5, 2017 at 2:52 AM Flavio Pompermaier 
>> wrote:
>>
>>> Hi to all,
>>> I'm using Phoenix 4.7 and I cannot use IS operator on boolean values
>>> (e.g. VALID IS TRUE)
>>> Would it be that difficult to support it?
>>>
>>> Best,
>>> Flavio
>>>
>>
>
>
> --
> Flavio Pompermaier
> Development Department
>
> OKKAM S.r.l.
> Tel. +(39) 0461 041809
>


Re: Boolean condition doesn't support IS operator

2017-12-05 Thread James Taylor
How about just using VALID = true or just VALID like this: select * from t
where VALID

On Tue, Dec 5, 2017 at 2:52 AM Flavio Pompermaier 
wrote:

> Hi to all,
> I'm using Phoenix 4.7 and I cannot use IS operator on boolean values (e.g.
> VALID IS TRUE)
> Would it be that difficult to support it?
>
> Best,
> Flavio
>


Re: upsert data with specific timestamp using CurrentSCN fail with error java.sql.SQLException: ERROR 518 (25502): Mutations are not permitted for a read-only connection.

2017-12-03 Thread James Taylor
The CurrentSCN property may only be used for reading, not writing as of
4.13. We found that this kind of rewriting of history causes indexes to
become corrupted. The documentation needs to be updated.

On Sun, Dec 3, 2017 at 7:07 AM Bulvik, Noam  wrote:

> Hi,
>
>
>
> I want to upsert historic data into a table with indexes. I have a TTL
> defined there, and I want to load the data as if it had been loaded at the
> original time so it will be cleaned up automatically by the TTL mechanism.
> I implemented a small Java program to load the data after setting
> "CurrentSCN" to an older date (based on this link
> https://phoenix.apache.org/faq.html#Can_phoenix_work_on_tables_with_arbitrary_timestamp_as_flexible_as_HBase_API
> )
>
>
>
> But I get java.sql.SQLException: ERROR 518 (25502): Mutations are not
> permitted for a read-only connection on each upsert. The same code without
> setting the CurrentSCN property works fine.
>
>
>
> I am using phoenix 4.13.1 – is this a bug? If not, is there something
> missing in the documentation on how to set "CurrentSCN"?
>
>
>
> Regards,
>
>
>
> *Noam *
>
>
>
>
>
> --
>
> PRIVILEGED AND CONFIDENTIAL
> PLEASE NOTE: The information contained in this message is privileged and
> confidential, and is intended only for the use of the individual to whom it
> is addressed and others who have been specifically authorized to receive
> it. If you are not the intended recipient, you are hereby notified that any
> dissemination, distribution or copying of this communication is strictly
> prohibited. If you have received this communication in error, or if any
> problems occur with transmission, please contact sender. Thank you.
>


Re: Help: setting hbase row timestamp in phoenix upserts ?

2017-11-30 Thread James Taylor
The only way I can think of accomplishing this is by using the raw HBase
APIs to write the data, but using our utilities to write it in a
Phoenix-compatible manner. For example, you could run an UPSERT VALUES
statement, use the PhoenixRuntime.getUncommittedDataIterator() method to get
the Cells that would have been written, update the Cell timestamp as needed,
and do an htable.batch() call to commit them.
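
Roughly, that flow would look something like the sketch below (untested; the
table and columns are placeholders, and the element type returned by
getUncommittedDataIterator() has shifted between KeyValue and Cell across
releases, so adjust to your version):

import java.sql.Connection;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Pair;
import org.apache.phoenix.util.PhoenixRuntime;

public class UpsertWithEventTimestamp {
    public static void write(Connection phoenixConn,
                             org.apache.hadoop.hbase.client.Connection hbaseConn,
                             long eventTs) throws Exception {
        // Run the UPSERT but do NOT commit; Phoenix just builds the mutations.
        phoenixConn.createStatement().execute(
            "UPSERT INTO MY_TABLE (ID, COL1) VALUES ('row1', 'value1')");
        Iterator<Pair<byte[], List<KeyValue>>> it =
            PhoenixRuntime.getUncommittedDataIterator(phoenixConn);
        while (it.hasNext()) {
            Pair<byte[], List<KeyValue>> tableAndCells = it.next();
            List<Row> puts = new ArrayList<>();
            for (KeyValue kv : tableAndCells.getSecond()) {
                // Re-create each cell with the CDC event timestamp.
                Put put = new Put(CellUtil.cloneRow(kv));
                put.addColumn(CellUtil.cloneFamily(kv), CellUtil.cloneQualifier(kv),
                        eventTs, CellUtil.cloneValue(kv));
                puts.add(put);
            }
            try (Table table = hbaseConn.getTable(TableName.valueOf(tableAndCells.getFirst()))) {
                table.batch(puts, new Object[puts.size()]);
            }
        }
        phoenixConn.rollback(); // discard the uncommitted Phoenix mutations
    }
}

Note this sketch only handles Put cells; with VERSIONS=1 on the table, the
event timestamp then ensures only the latest change wins.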

On Wed, Nov 29, 2017 at 11:46 AM Pedro Boado  wrote:

> Hi,
>
> I'm looking for a little bit of help trying to get some light over
> ROW_TIMESTAMP.
>
> Some background on the problem (simplified): I'm working on a project
> that needs to create an "enriched" replica of an RDBMS table based on a
> stream of CDC changes from that table.
>
> Each cdc event contains the timestamp of the change plus all the column
> values 'before' and 'after' the change . And each event is pushed to a
> kafka topic.  Because of certain "non-negotiable" design decisions kafka
> guarantees delivering each event at least once, but doesn't guarantee
> ordering for changes over the same row in the source table.
>
> The final step of the kafka-based flow is sinking the information into
> HBase/Phoenix.
>
> As I cannot get an in-order delivery guarantee from Kafka, I need to use the
> CDC event timestamp to ensure that HBase keeps the latest change for a row.
>
> This fits perfectly well with an HBase table design with VERSIONS=1 and
> using the source event timestamp as HBase row/cells timestamp
>
> The thing is that I cannot find a way to define the value of the HBase
> cell from a Phoenix upsert.
>
> I came across the ROW_TIMESTAMP functionality, but I've just found ( I'm
> devastated now ) that the ROW_TIMESTAMP columns store the date in both
> hbase's cell timestamp and in the primary key, meaning that I cannot
> leverage that functionality to keep only the latest change.
>
> Is there a way of defining hbase's row timestamp when doing the UPSERT -
> even by setting it through some obscure hidden jdbc property - ?
>
> I want to avoid by all means doing a checkAndPut as the volume of changes
> is going to be quite big.
>
>
>
> --
> Un saludo.
> Pedro Boado.
>


Re: phoenix 4.13.0 on hbase 1.2

2017-11-28 Thread James Taylor
I've started a vote for Phoenix 4.13.1 which includes support again for
HBase 1.2. This is what the CDH based releases will be based off of
initially (i.e. the work Pedro has done on PHOENIX-4372).

Please check it out and vote on it. Thanks for volunteering to be release
manager going forward, Pedro.

 James

On Mon, Nov 13, 2017 at 10:17 AM, James Taylor 
wrote:

> We discussed whether or not we should continue with Phoenix releases for
> HBase 1.2, but no one showed any interest in being the release manager
> [1], so we concluded that we would stop doing them. It's important to
> remember that the ASF is a volunteer effort and anyone can step up and take
> on this responsibility. That's essentially how contributors build merit to
> become committers and eventually PMC members and the project continues to
> grow. If you're interested, I suggest you start a new DISCUSS thread on the
> dev list and volunteer. Here's what would need to be done:
> - cherry-pick changes from master between 4.12.0 and 4.13.0 release to
> 4.x-HBase-1.2 branch
> - create a pull request with the above and get a +1 from a committer
> - monitor the Jenkins job that'll run with these changes keeping a lookout
> for any test failures
> - assuming there are no test failures, follow the directions here [2] to
> perform a release
>
> Also, please make sure you subscribe to the dev list so you can
> participate in further discussions.
>
> Thanks,
> James
>
>
> [1] https://lists.apache.org/thread.html/ae13def3c024603ce3cdde8
> 71223cbdbae0219b4efe93ed4e48f55d5@%3Cdev.phoenix.apache.org%3E
> [2] https://phoenix.apache.org/release.html
>
> On Sun, Nov 12, 2017 at 11:38 PM, 최재환  wrote:
>
>> There is no compatibility with hbase 1.2 version in ANNOUNCE.
>>
>> i want to use phoenix 4.13.0 on hbase 1.2
>>
>> Are you planning to make hbase 1.2 compatible?
>>
>>
>>
>> *E-mail : prismso...@gmail.com *
>>
>
>


Re: 4.13.0-HBase-1.1 not released?

2017-11-28 Thread James Taylor
I've started a vote for Phoenix 4.13.1 which includes support again for
HBase 1.1. Please try it out and vote on the release. Thanks for
volunteering to be the release manager going forward, Xavier.

James

On Mon, Nov 20, 2017 at 8:09 AM, Xavier Jodoin  wrote:

> Hi James,
>
> Sorry for the delay I wasn't on the dev mailing list, I'm interested to
> help and I can take the lead for the Hbase 1.1 release.
>
> Xavier
> On 2017-11-18 03:22 PM, James Taylor wrote:
>
> FYI, we'll do one final release for Phoenix on HBase 1.1 (look for a
> 4.13.1 release soon). It looks like HBase 1.1 itself is nearing
> end-of-life, so probably good to move off of it. If someone is interested
> in being the RM for continued Phoenix HBase 1.1 releases, please volunteer.
>
> On Mon, Nov 13, 2017 at 10:24 AM, James R. Taylor 
> wrote:
>
>> Hi Xavier,
>> Please see these threads for some discussion. Would be great if you could
>> volunteer to be the release manager for Phoenix released on HBase 1.1.
>>
>> https://lists.apache.org/thread.html/8a73efa27edb70ea5cbc89b
>> 43c312faefaf2b78751c9459834523b81@%3Cuser.phoenix.apache.org%3E
>> https://lists.apache.org/thread.html/04de7c47724d8ef2ed7414d
>> 5bdc51325b2a0eecd324556d9e83f3718@%3Cdev.phoenix.apache.org%3E
>> https://lists.apache.org/thread.html/ae13def3c024603ce3cdde8
>> 71223cbdbae0219b4efe93ed4e48f55d5@%3Cdev.phoenix.apache.org%3E
>>
>> Thanks,
>> James
>>
>> On 2017-11-13 07:51, Xavier Jodoin  wrote:
>> > Hi,
>> >
>> > I would like to know if there is a reason why phoenix wasn't released
>> > for hbase 1.1?
>> >
>> > Thanks
>> >
>> > Xavier Jodoin
>> >
>> >
>>
>
>
>


Re: [ANNOUNCE] Apache Phoenix 4.13 released

2017-11-24 Thread James Taylor
@Kumar - yes, we’ll do a 4.13.1 release shortly for HBase 1.2 out of the
head of the 4.x-HBase-1.2 branch. Pedro is going to be the RM going forward
for this branch and do CDH releases from it. You can track that on
PHOENIX-4372.

@Flavio - Pedro is targeting a CDH 5.11.2 release.

On Fri, Nov 24, 2017 at 8:53 AM Flavio Pompermaier 
wrote:

> Hi to all,
> is there any Parcel available for Phoenix 4.13 and Cloudera CDH 5.9-5.10
> available (HBase 1.2) somewhere?
>
> Best,
> Flavio
>
> On Thu, Nov 23, 2017 at 7:33 AM, Kumar Palaniappan <
> kpalaniap...@marinsoftware.com> wrote:
>
>> @James, are you still planning to release 4.13 for HBase 1.2?
>>
>> On Sun, Nov 19, 2017 at 1:21 PM, James Taylor 
>> wrote:
>>
>>> Hi Kumar,
>>> I started a discussion [1][2] on the dev list to find an RM for the
>>> HBase 1.2 (and HBase 1.1) branch, but no one initially stepped up, so there
>>> were no plans for a release. Subsequently we've heard from a few folks that
>>> they needed it, and Pedro Boado volunteered to do CDH compatible release
>>> (see PHOENIX-4372) which requires an up to date HBase 1.2 based release.
>>>
>>> So I've volunteered to do one more Phoenix 4.13.1 release for HBase 1.2
>>> and 1.1. I'm hoping you, Pedro and others that need 1.2 based releases can
>>> volunteer to be the RM and do further releases.
>>>
>>> One thing is clear, though - folks need to be on the dev and user lists
>>> so they can take part in DISCUSS threads.
>>>
>>> Thanks,
>>> James
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/5b8b44acb1d3608770309767c3cddecbc6484c29452fe6750d8e1516@%3Cdev.phoenix.apache.org%3E
>>> [2]
>>> https://lists.apache.org/thread.html/70cffa798d5f21ef87b02e07aeca8c7982b0b30251411b7be17fadf9@%3Cdev.phoenix.apache.org%3E
>>>
>>> On Sun, Nov 19, 2017 at 12:23 PM, Kumar Palaniappan <
>>> kpalaniap...@marinsoftware.com> wrote:
>>>
>>>> Are there any plans to release Phoenix 4.13 compatible with HBase 1.2?
>>>>
>>>> On Sat, Nov 11, 2017 at 5:57 PM, James Taylor 
>>>> wrote:
>>>>
>>>>> The Apache Phoenix team is pleased to announce the immediate
>>>>> availability of the 4.13.0 release. Apache Phoenix enables SQL-based OLTP
>>>>> and operational analytics for Apache Hadoop using Apache HBase as its
>>>>> backing store and providing integration with other projects in the Apache
>>>>> ecosystem such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases
>>>>> are compatible with HBase 0.98 and 1.3.
>>>>>
>>>>> Highlights of the release include:
>>>>>
>>>>> * Critical bug fix to prevent snapshot creation of SYSTEM.CATALOG when
>>>>> connecting [1]
>>>>> * Numerous bug fixes around handling of row deletion [2]
>>>>> * Improvements to statistics collection [3]
>>>>> * New COLLATION_KEY built-in function for linguistic sort [4]
>>>>>
>>>>> Source and binary downloads are available here [5].
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/PHOENIX-4335
>>>>> [2]
>>>>> https://issues.apache.org/jira/issues/?jql=labels%20%3D%20rowDeletion
>>>>> [3]
>>>>> https://issues.apache.org/jira/issues/?jql=labels%20%3D%20statsCollection
>>>>> [4] https://phoenix.apache.org/language/functions.html#collation_key
>>>>> [5] http://phoenix.apache.org/download.html
>>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> Flavio Pompermaier
> Development Department
>
> OKKAM S.r.l.
> Tel. +(39) 0461 041809
>


Re: [ANNOUNCE] Apache Phoenix 4.13 released

2017-11-19 Thread James Taylor
Hi Kumar,
I started a discussion [1][2] on the dev list to find an RM for the HBase
1.2 (and HBase 1.1) branch, but no one initially stepped up, so there were
no plans for a release. Subsequently we've heard from a few folks that they
needed it, and Pedro Boado volunteered to do CDH compatible release
(see PHOENIX-4372) which requires an up to date HBase 1.2 based release.

So I've volunteered to do one more Phoenix 4.13.1 release for HBase 1.2 and
1.1. I'm hoping you, Pedro and others that need 1.2 based releases can
volunteer to be the RM and do further releases.

One thing is clear, though - folks need to be on the dev and user lists so
they can take part in DISCUSS threads.

Thanks,
James

[1]
https://lists.apache.org/thread.html/5b8b44acb1d3608770309767c3cddecbc6484c29452fe6750d8e1516@%3Cdev.phoenix.apache.org%3E
[2]
https://lists.apache.org/thread.html/70cffa798d5f21ef87b02e07aeca8c7982b0b30251411b7be17fadf9@%3Cdev.phoenix.apache.org%3E

On Sun, Nov 19, 2017 at 12:23 PM, Kumar Palaniappan <
kpalaniap...@marinsoftware.com> wrote:

> Are there any plans to release Phoenix 4.13 compatible with HBase 1.2?
>
> On Sat, Nov 11, 2017 at 5:57 PM, James Taylor 
> wrote:
>
>> The Apache Phoenix team is pleased to announce the immediate availability
>> of the 4.13.0 release. Apache Phoenix enables SQL-based OLTP and
>> operational analytics for Apache Hadoop using Apache HBase as its backing
>> store and providing integration with other projects in the Apache ecosystem
>> such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are
>> compatible with HBase 0.98 and 1.3.
>>
>> Highlights of the release include:
>>
>> * Critical bug fix to prevent snapshot creation of SYSTEM.CATALOG when
>> connecting [1]
>> * Numerous bug fixes around handling of row deletion [2]
>> * Improvements to statistics collection [3]
>> * New COLLATION_KEY built-in function for linguistic sort [4]
>>
>> Source and binary downloads are available here [5].
>>
>> [1] https://issues.apache.org/jira/browse/PHOENIX-4335
>> [2] https://issues.apache.org/jira/issues/?jql=labels%20%3D%20rowDeletion
>> [3] https://issues.apache.org/jira/issues/?jql=labels%20%3D%
>> 20statsCollection
>> [4] https://phoenix.apache.org/language/functions.html#collation_key
>> [5] http://phoenix.apache.org/download.html
>>
>
>


Re: sqlline.py kills all regionservers

2017-11-18 Thread James Taylor
That’s quite an old version of 0.98 and we no longer support Hadoop 1.
Would it be possible for you to upgrade your cluster and use Hadoop 2
instead?

On Sat, Nov 18, 2017 at 1:16 PM Eisenhut, Roman 
wrote:

> Dear Phoenix community,
>
>
>
> I’m trying to implement apache-phoenix-4.13.0-HBase-0.98-bin on
> Hadoop-1.2.1 and HBase-0.98.6.1 on a 4 node cluster, running the following
> services:
>
>- roman-master:
>   - NameNode
>   - SecondaryNamenode
>   - HMaster
>   - HQuorumPeer
>- roman-rs-1:
>   - DataNode
>   - HRegionServer
>- roman-rs-2:
>   - DataNode
>   - HRegionServer
>- roman-rs-3:
>   - DataNode
>   - HRegionServer
>
>
>
> I have copied the corresponding server.jar into the lib directory of my
> HBase installation on every server.
>
>
>
> *ubuntu@roman-master:~$ ls -l /usr/local/hbase-0.98.6.1-hadoop1/lib | grep
> phoenix*
>
> *-rwxrwxrwx 1 ubuntu ubuntu 25706067 Nov 18 18:15
> phoenix-4.13.0-HBase-0.98-server.jar*
>
>
>
> *ubuntu@roman-rs-1:~$ ls -l /usr/local/hbase-0.98.6.1-hadoop1/lib | grep
> phoenix*
>
> *-rwxrwxrwx 1 ubuntu ubuntu 25706067 Nov 18 18:15
> phoenix-4.13.0-HBase-0.98-server.jar*
>
>
>
> *ubuntu@roman-rs-2:/usr/local$ ls -l /usr/local/hbase-0.98.6.1-hadoop1/lib
> | grep phoenix*
>
> *-rwxrwxrwx 1 ubuntu ubuntu 25706067 Nov 18 18:16
> phoenix-4.13.0-HBase-0.98-server.jar*
>
>
>
> *ubuntu@roman-rs-3:/usr/local$ ls -l /usr/local/hbase-0.98.6.1-hadoop1/lib
> | grep phoenix*
>
> *-rwxrwxrwx 1 ubuntu ubuntu 25706067 Nov 18 18:16
> phoenix-4.13.0-HBase-0.98-server.jar*
>
>
>
> I can:
>
>- Start Hadoop
>- Start HBase
>   - Create a table in HBase
>   - Put data
>   - Get data
>- Browse
>   - http://roman-master:50070/dfshealth.jsp
>   - http://roman-master:60010/master-status
>
>
>
> jps shows all the processes are up and running on each machine and the
> http interfaces show everything is running properly. It seems like the
> system is in perfect health. However, as soon as I try to connect to
> Phoenix via */usr/local/apache-phoenix-4.13.0-HBase-0.98-bin/bin/sqlline.py
> roman-master:2181* on roman-master, where roman-master is my ZooKeeper
> quorum, it kills every regionserver on the entire cluster. After a long
> time, I don't know exactly how long, sqlline will return an output (it can
> be found under [4]).
>
>
>
> *Connection call*
>
> ubuntu@roman-master:/usr/local/apache-phoenix-4.13.0-HBase-0.98-bin/bin$
> ./sqlline.py roman-master:2181
>
> Setting property: [incremental, false]
>
> Setting property: [isolation, TRANSACTION_READ_COMMITTED]
>
> issuing: !connect jdbc:phoenix:roman-master:2181 none none
> org.apache.phoenix.jdbc.PhoenixDriver
>
> Connecting to jdbc:phoenix:roman-master:2181
>
> 17/11/18 19:24:33 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
>
>
> Underneath you’ll find the following files
>
>- Conf files for
>   - Phoenix
>  - env.sh
>   - Hadoop
>  - hadoop-env.sh
>  - core-site.xml
>  - hdfs-site.xml
>   - HBase
>  - hbase-site.xml
>  - hbase-env.sh
>   - [1] master log on roman-master
>- [2] regionserver log roman-rs-1
>- [3] zookeeper log on roman-master
>- [4] output of sqlline.py
>
>
>
> I’m pretty sure there’s a reasonable explanation for this behavior.
> Nevertheless, I’ve been trying to get phoenix up and running for days now
> and I can’t figure out what I’m doing wrong. I hope someone can point me
> into the right direction.
>
>
>
> Best regards,
>
> Roman
>
>
>
>
>
>
>
>
>
>
>
> *Phoenix conf:*
>
>
> ubuntu@roman-master:/usr/local/apache-phoenix-4.13.0-HBase-0.98-bin/bin/config$
> cat env.sh
>
> JAVA_HOME=/usr/lib/jvm/java-8-oracle/
>
> HBASE_PATH=/usr/local/hbase-0.98.6.1-hadoop1/
>
>
>
> *Hadoop conf:*
>
> *Hadoop-env.sh*
>
>
>
> export JAVA_HOME=/usr/lib/jvm/java-8-oracle
>
> # The maximum amount of heap to use, in MB. Default is 1000.
>
> # export HADOOP_HEAPSIZE=2000
>
> # Command specific options appended to HADOOP_OPTS when specified
>
> export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote
> $HADOOP_NAMENODE_OPTS"
>
> export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote
> $HADOOP_SECONDARYNAMENODE_OPTS"
>
> export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote
> $HADOOP_DATANODE_OPTS"
>
> export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote
> $HADOOP_BALANCER_OPTS"
>
> export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote
> $HADOOP_JOBTRACKER_OPTS"
>
> _EOF_
>
>
>
> <configuration>
>
> <property>
> <name>fs.default.name</name>
> <value>hdfs://roman-master/</value>
> <final>true</final>
> </property>
>
> <property>
>    <name>hadoop.tmp.dir</name>
>    <value>/usr/local/hadoop_tmp_dir</value>
> </property>
>
> </configuration>
>
>
>
> <configuration>
>
> <property>
>    <name>dfs.name.dir</name>

Re: 4.13.0-HBase-1.1 not released?

2017-11-18 Thread James Taylor
FYI, we'll do one final release for Phoenix on HBase 1.1 (look for a 4.13.1
release soon). It looks like HBase 1.1 itself is nearing end-of-life, so
probably good to move off of it. If someone is interested in being the RM
for continued Phoenix HBase 1.1 releases, please volunteer.

On Mon, Nov 13, 2017 at 10:24 AM, James R. Taylor 
wrote:

> Hi Xavier,
> Please see these threads for some discussion. Would be great if you could
> volunteer to be the release manager for Phoenix released on HBase 1.1.
>
> https://lists.apache.org/thread.html/8a73efa27edb70ea5cbc89b
> 43c312faefaf2b78751c9459834523b81@%3Cuser.phoenix.apache.org%3E
> https://lists.apache.org/thread.html/04de7c47724d8ef2ed7414d
> 5bdc51325b2a0eecd324556d9e83f3718@%3Cdev.phoenix.apache.org%3E
> https://lists.apache.org/thread.html/ae13def3c024603ce3cdde8
> 71223cbdbae0219b4efe93ed4e48f55d5@%3Cdev.phoenix.apache.org%3E
>
> Thanks,
> James
>
> On 2017-11-13 07:51, Xavier Jodoin  wrote:
> > Hi,
> >
> > I would like to know if there is a reason why phoenix wasn't released
> > for hbase 1.1?
> >
> > Thanks
> >
> > Xavier Jodoin
> >
> >
>


Please do not use 4.12.0 release

2017-11-11 Thread James Taylor
FYI, the 4.12.0 release had a critical issue [1] that has been fixed in the
4.13.0 release. Please make sure you do not use the 4.12.0 release and
instead use the 4.13.0 release. Sorry for any inconvenience. More details
on the release may be found here [2].

Thanks,
James

[1] https://issues.apache.org/jira/browse/PHOENIX-4335
[2] https://blogs.apache.org/phoenix/entry/announcing-phoenix-4-13-released


[ANNOUNCE] Apache Phoenix 4.13 released

2017-11-11 Thread James Taylor
The Apache Phoenix team is pleased to announce the immediate availability
of the 4.13.0 release. Apache Phoenix enables SQL-based OLTP and
operational analytics for Apache Hadoop using Apache HBase as its backing
store and providing integration with other projects in the Apache ecosystem
such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are
compatible with HBase 0.98 and 1.3.

Highlights of the release include:

* Critical bug fix to prevent snapshot creation of SYSTEM.CATALOG when
connecting [1]
* Numerous bug fixes around handling of row deletion [2]
* Improvements to statistics collection [3]
* New COLLATION_KEY built-in function for linguistic sort [4]

Source and binary downloads are available here [5].

[1] https://issues.apache.org/jira/browse/PHOENIX-4335
[2] https://issues.apache.org/jira/issues/?jql=labels%20%3D%20rowDeletion
[3]
https://issues.apache.org/jira/issues/?jql=labels%20%3D%20statsCollection
[4] https://phoenix.apache.org/language/functions.html#collation_key
[5] http://phoenix.apache.org/download.html


Re: Spark & UpgradeInProgressException: Cluster is being concurrently upgraded from 4.11.x to 4.12.x

2017-11-11 Thread James Taylor
Hi Stepan,
We discussed whether or not we should continue with Phoenix releases for
HBase 1.1, but no one showed any interest in being the release manager
[1], so we concluded that we would stop doing them. It's important to
remember that the ASF is a volunteer effort and anyone can step up and take
on this responsibility. That's essentially how contributors build merit to
become committers and eventually PMC members and the project continues to
grow. If you're interested, I suggest you start a new DISCUSS thread on the
dev list and volunteer. Here's what would need to be done:
- cherry-pick changes from master between 4.12.0 and 4.13.0 release to
4.x-HBase-1.1 branch
- create a pull request with the above and get a +1 from a committer
- monitor the Jenkins job that'll run with these changes keeping a lookout
for any test failures
- assuming there are no test failures, follow the directions here [2] to
perform a release

Thanks,
James


[1]
https://lists.apache.org/thread.html/ae13def3c024603ce3cdde871223cbdbae0219b4efe93ed4e48f55d5@%3Cdev.phoenix.apache.org%3E
[2] https://phoenix.apache.org/release.html

On Sat, Nov 11, 2017 at 1:02 AM, stepan.migu...@firstlinesoftware.com <
stepan.migu...@firstlinesoftware.com> wrote:

>
>
> On 2017-11-10 22:36, Mujtaba Chohan  wrote:
> > Probably being hit by https://issues.apache.org/jira/browse/PHOENIX-4335
> .
> > Please upgrade to 4.13.0 which will be available by EOD today.
> >
> > On Fri, Nov 10, 2017 at 8:37 AM, Stepan Migunov <
> > stepan.migu...@firstlinesoftware.com> wrote:
> >
> > > Hi,
> > >
> > >
> > >
> > > I have just upgraded my cluster to Phoenix 4.12 and got an issue with
> > > tasks running on Spark 2.2 (yarn cluster mode). Any attempts to use
> method
> > > phoenixTableAsDataFrame to load data from existing database causes an
> > > exception (see below).
> > >
> > >
> > >
> > > The tasks worked fine on version 4.11. I have checked connection with
> > > sqlline - it works now and shows that version is 4.12. Moreover, I have
> > > noticed, that if limit the number of executors to one, the Spark's task
> > > executing successfully too!
> > >
> > >
> > >
> > > It looks like that executors running in parallel "interferes" each
> other’s
> > > and could not acquire version's mutex.
> > >
> > >
> > >
> > > Any suggestions please?
> > >
> > >
> > >
> > > *final Connection connection =
> > > ConnectionUtil.getInputConnection(configuration, overridingProps);*
> > >
> > > *User class threw exception: org.apache.spark.SparkException: Job
> aborted
> > > due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent
> > > failure: Lost task 0.3 in stage 0.0 (TID 36, n7701-hdp005, executor
> 26):
> > > java.lang.RuntimeException:
> > > org.apache.phoenix.exception.UpgradeInProgressException: Cluster is
> being
> > > concurrently upgraded from 4.11.x to 4.12.x. Please retry establishing
> > > connection.*
> > >
> > > *at
> > > org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(
> PhoenixInputFormat.java:201)*
> > >
> > > *at
> > > org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(
> PhoenixInputFormat.java:76)*
> > >
> > > *at
> > > org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.
> scala:180)*
> > >
> > > *at
> > > org.apache.spark.rdd.NewHadoopRDD$$anon$1.(
> NewHadoopRDD.scala:179)*
> > >
> > > *at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)*
> > >
> > > *at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)*
> > >
> > > *at org.apache.phoenix.spark.PhoenixRDD.compute(PhoenixRDD.scala:64)*
> > >
> > > *at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)*
> > >
> > > *at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)*
> > >
> > > *at
> > > org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)*
> > >
> > > *at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)*
> > >
> > > *at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)*
> > >
> > > *at
> > > org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)*
> > >
> > > *at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)*
> > >
> > > *at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)*
> > >
> > > *at
> > > org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)*
> > >
> > > *at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)*
> > >
> > > *at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)*
> > >
> > > *at
> > > org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)*
> > >
> > > *at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)*
> > >
> > > *at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)*
> > >
> > > *at
> > > org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)*
> > >
> > > *at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)*
> > >
> > > *at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)*
> > >
> > > *at
> > > org.apach

Re: Enabling Tracing makes HMaster service fail to start

2017-11-09 Thread James Taylor
Please note that we're no longer doing releases for HBase 1.2 due to lack
of interest. If this is important for you, I suggest you volunteer to be RM
for this branch (4.x-HBase-1.2) and make sure to catch up the branch with
the latest bug fixes from our upcoming 4.13 release (in particular
PHOENIX-4335).

On Thu, Nov 9, 2017 at 8:46 AM, Josh Elser  wrote:

> Phoenix-4.12.0-HBase-1.2 should be compatible with HBase 1.2.x. Similarly,
> HBase 1.2.5 should be compatible with Hadoop 2.7.2.
>
> I'll leave you to dig into the code to understand exactly why you're
> seeing the error. You should be able to find the interface/abstract-class
> that you see the error about and come up with a reason as to your error.
>
> On 11/9/17 4:57 AM, Mallieswari Dineshbabu wrote:
>
>> Hi Elser,
>>
>> Thanks for the update. I have tried with log4j.PROPERTIES as additional
>> option only. Let me remove the changes from log4j.PROPERTIES;
>>
>> Regarding version compatibility, I hope I am using compatible version of
>> Phoenix and HBase. Please find the details below,
>> Hadoop - 2.7.2
>> HBase - 1.2.5
>> Phoenix - apache-phoenix-4.12.0-HBase-1.2/
>>
>> Query:
>> Could you please suggest the compatible version of Phoenix for Hadoop
>> 2.7.2 and HBase 1.2.5?
>>
>> Regarding the classpath, I have ensured that the required classpath entries
>> are updated properly by running phoenix_utils.py; except for
>> phoenix_classpath, all other variables have proper values.
>>
>> Query:
>> Could you please tell what else I miss here regarding classpath?
>>
>> Regards,
>> Mallieswari D
>>
>> On Thu, Nov 9, 2017 at 12:00 AM, Josh Elser > els...@apache.org>> wrote:
>>
>> Please note that there is a difference between Phoenix Tracing and
>> the TRACE log4j level.
>>
>> It appears that you're using a version of Phoenix which is
>> incompatible with the version of HBase/Hadoop that you're running.
>> The implementation of PhoenixMetricsSink is incompatible with the
>> interface/abstract-class that HBase/Hadoop is expecting.
>>
>> This may be a classpath or Phoenix version issue, or you may have
>> stumbled onto a bug.
>>
>> On 11/8/17 6:33 AM, Mallieswari Dineshbabu wrote:
>>
>> Hi All,
>>
>> I am working with HBase-Phoenix, /everything works fine/. In
>> addition trying to enable Tracing
>> > > in Phoenix with the
>> following steps,
>>
>>   1. Copy ‘hadoop-metrics2-hbase.PROPERTIES’ from Phoenix
>> package to
>>  HBase conf folder.
>>   2. ‘hadoop-metrics2-phoenix.PROPERTIES’ file will be in
>> ‘Phoenix/bin’
>>  location by default. So I left it as it is.
>>   3. Add the following property to phoenix configuration,
>>
>> 
>>
>>  phoenix.trace.frequency
>>
>> always
>>
>> 
>>
>> After doing the above, HBase’s HMaster fails to start with the
>> following exception; Please tell if you have any suggestion on
>> this,
>>
>> 2017-11-08 16:46:56,118 INFO  [main] regionserver.RSRpcServices:
>> master/Selfuser-VirtualBox/172.16.203.117:6
>>  
>> server-side HConnection retries=140
>>
>> 2017-11-08 16:46:56,520 INFO  [main] ipc.SimpleRpcScheduler:
>> Using deadline as user call queue, count=3
>>
>> 2017-11-08 16:46:56,554 INFO  [main] ipc.RpcServer:
>> master/Selfuser-VirtualBox/192.16.203.117:6
>>  :
>>
>> started 10 reader(s) listening on port=6
>>
>> *2017-11-08 16:46:56,839 INFO  [main] impl.MetricsConfig: loaded
>> properties from hadoop-metrics2-hbase.properties*
>>
>> *2017-11-08 16:46:56,926 INFO  [main] trace.PhoenixMetricsSink:
>> Writing tracing metrics to phoenix table*
>>
>> *2017-11-08 16:46:56,933 ERROR [main] master.HMasterCommandLine:
>> Master exiting*
>>
>> *java.lang.RuntimeException: Failed construction of Master:
>> class org.apache.hadoop.hbase.master.HMaster. *
>>
>> *at
>> org.apache.hadoop.hbase.master.HMaster.constructMaster(HMast
>> er.java:2512)*
>>
>>
>> at
>> org.apache.hadoop.hbase.master.HMasterCommandLine.startMaste
>> r(HMasterCommandLine.java:231)
>>
>> at
>> org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMaste
>> rCommandLine.java:137)
>>
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>
>> at
>> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(Server
>> CommandLine.java:126)
>>
>> at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2522)
>>
>> Caused by: java.lang.AbstractMethodError:
>> org.apache.phoenix.trace.PhoenixMetricsSink.init(Lorg/apache
>> /commons/configuratio

Re: Cloudera parcel update

2017-11-09 Thread James Taylor
I agree with JMS and there is interest from the PMC, but no bandwidth to do
the work - we’d look toward others like you to do the work of putting
together an initial pull request, regular pulls to keep things in sync,
RMing releases, etc. These types of contributions would earn merit toward a
committership and eventual nomination for PMC (that’s how the ASF works).
The files would of course be hosted on Apache infrastructure (just like our
current releases and repos).

On Thu, Nov 9, 2017 at 6:15 AM Jean-Marc Spaggiari 
wrote:

> We "simply" need to have a place to host the file, right? From a code
> perspective, it can be another branch in the same repo?
>
> 2017-11-09 8:48 GMT-05:00 Flavio Pompermaier :
>
>> No interest from Phoenix PMCs to provide support to the creation of
>> official Cloudera parcels (at least from Phoenix side)...?
>>
>> On Tue, Oct 31, 2017 at 8:09 AM, Flavio Pompermaier > > wrote:
>>
>>> Anyone from Phoenix...?
>>>
>>> On 27 Oct 2017 16:47, "Pedro Boado"  wrote:
>>>
 For creating a CDH parcel repository the only thing needed is a web
 server where the parcels and the manifest.json is published. But we need
 one.

 I'm in of course. Who can help onboarding this changes and publishing
 etc and getting users to push changes to the project? How do you do this in
 Phoenix? Via another mail list, right?

 Defining regression strategy is probably the most complex bit. And
 automating it is even more complex I think. This is where more work is
 needed.

 On 27 Oct 2017 15:20, "Jean-Marc Spaggiari" 
 wrote:

> See below.
>
> 2017-10-27 8:45 GMT-04:00 Flavio Pompermaier :
>
>> I just need someone who tells which git repository to use, the
>> branching/tagging policy, what should be done to release a parcel (i.e.
>> compile, test ok, update docs, etc). For example, I need someone who 
>> says:
>> to release a Phoenix CDH  parcel the process is this:
>>
>> 1. use this repo (e.g.https://github.com/pboado/phoenix-for-cloudera
>> 
>>  or https://github.com/apahce/phoenix)
>>
>
> Well, if Apache Phoenix maintains it, I feel this should be moved
> under the Apache Phoenix git repository, right?
>
>
>
>> 2. create one or more branches for each supported release (i.e.
>> 4.11-cdh-5.10 and 4.11-cdh-5.11)
>> - this imply to create an official compatibility
>> matrix...obviously it doesn't make sense to issue a 4.11-cdh-4.1 for
>> example)
>>
>
> Indeed.
>
>
>> 3. The test that should be passed to consider a parcel ok for a
>> release
>>
>
> Ha! good idea. Don't know if this can be automated, but deploying the
> parcel, doing rolling upgrades, minor versions and major versions upgrades
> tests, etc. We might be able to come with a list of things to test, and
> increase/improve the list as we move forward...
>
>
>> 4. Which documentation to write
>>
>
> Most probably documenting what has changed between the Apache core
> branch and the updated parce branch? And how to build?
>
>
>> 5. Who is responsible to update Phoenix site and announcements etc?
>>
>
> You? ;)
>
>
>> 6. Call for contributors when a new release is needed and coordinate
>> them
>>
>
> I'm already in! I have one CDH cluster and almost all CDH versions
> VMs... So I can do a lot of tests, as long as it doesn't required a month
> of my time every month ;)
>
> JMS
>
>
>>
>> Kind of those things :)
>>
>> On Fri, Oct 27, 2017 at 2:33 PM, Jean-Marc Spaggiari <
>> jean-m...@spaggiari.org> wrote:
>>
>>> FYI, you can also count on me for that. At least to perform some
>>> testing or gather information, communication, etc.
>>>
>>> Flavio, what can you leading do you need there?
>>>
>>> James, I am also interested ;) So count me in... (My very personal
>>> contribution)
>>>
>>> To setup a repo we just need to have a folder on the existing file
>>> storage with the correct parcel structure so people can point to it. 
>>> That's
>>> not a big deal...
>>>
>>> JMS
>>>
>>> 2017-10-27 5:08 GMT-04:00 Flavio Pompermaier :
>>>
 I can give it a try..is there someone who can lead this thing?

>>>
>>>
>>
>>
>
>>
>


Re: SELECT + ORDER BY vs self-join

2017-10-30 Thread James Taylor
Please file a JIRA and include the explain plan for each of the queries. I
suspect your index is not being used in the first query due to the
selection of all the columns. You can try hinting the query to force your
index to be used. See
https://phoenix.apache.org/secondary_indexing.html#Index_Usage
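
For reference, a hinted version of your first query would look roughly like
this from JDBC (the index name is hypothetical - substitute the actual name of
your functional index on UPPER(number)):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class HintedQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            // UPPER_NUMBER_IDX is a placeholder for your real index name.
            String sql = "SELECT /*+ INDEX(traces UPPER_NUMBER_IDX) */ * FROM traces "
                       + "WHERE UPPER(number) LIKE 'PO %' "
                       + "ORDER BY time DESC, rowkey LIMIT 101";
            try (ResultSet rs = conn.createStatement().executeQuery(sql)) {
                while (rs.next()) {
                    // process the row...
                }
            }
        }
    }
}

Running EXPLAIN on the hinted and unhinted versions will also show which plan
is being chosen.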

Thanks,
James

On Mon, Oct 30, 2017 at 7:02 AM, Marcin Januszkiewicz <
januszkiewicz.mar...@gmail.com> wrote:

> We have a wide table with 100M records created with the following DDL:
>
> CREATE TABLE traces (
>   rowkey VARCHAR PRIMARY KEY,
>   time VARCHAR,
>   number VARCHAR,
>   +40 more columns)
>
> We want to select a large (~30M records) subset of this data with the
> query:
>
> SELECT *all columns*
>   FROM traces
>   WHERE (UPPER(number) LIKE 'PO %')
>   ORDER BY time DESC, ROWKEY
>   LIMIT 101;
>
> This times out after 15 minutes and puts a huge load on our cluster.
> We have an alternate way of selecting this data:
>
> SELECT t.rowkey, *all columns*
> FROM TRACES t
> JOIN (
>   SELECT rowkey
>   FROM TRACES
>   WHERE (UPPER(number) LIKE 'PO %')
>   ORDER BY time DESC, ROWKEY
>   LIMIT 101
> ) ix
> ON t.ROWKEY = ix.ROWKEY
> order by t.ROWKEY;
>
> Which completes in just under a minute.
> Is there a better way to construct this query?
> When is using the self-join a worse choice than the simple select?
> Given that we have a functional index on UPPER(number), could this
> potentially be a statistics-based optimizer decision?
>
> --
> Pozdrawiam,
> Marcin Januszkiewicz
>


Re: Indexes not used when ordering by primary key.

2017-10-30 Thread James Taylor
You can see the logic in QueryOptimizer.orderPlansBestToWorst():
- choose plan that optimizes out an ORDER BY
- otherwise, choose plan that uses the most leading columns in the primary
key (i.e. smallest range scan)
- if multiple options, choose the one in which the GROUP BY can be done in
place
- choose smaller table (index) over larger table

We're looking at adding some simple cost-based decisions to improve this as
we know how many bytes will be scanned for each possible table/index being
used. This won't be perfect, but it'll help most cases (and you can always
fallback to hints if you know better).

Thanks,
James

On Mon, Oct 30, 2017 at 8:46 AM, Marcin Januszkiewicz <
katamaran...@gmail.com> wrote:

> On a possibly related note, how does Phoenix choose which index to use
> if there multiple indexes are applicable? Right now it seems that the
> one with the lower sequence number is used. If this is the case,
> similar optimizations could be made to choose the index that will scan
> over a smaller dataset.
>
> On Sat, Oct 14, 2017 at 8:26 AM, James Taylor 
> wrote:
> > Couple of follow up comments:
> > - if you use c1=‘X0’ the index should be used without a hint,  because
> it’s
> > still ordered by the PK when using index.
> > - this wouldn’t necessarily be the case for c1 LIKE 'X0%'.
> >
> >
> > On Fri, Oct 13, 2017 at 8:33 PM James Taylor 
> wrote:
> >>
> >> Yes, this is expected behavior. Phoenix can either optimize based on the
> >> filter in the WHERE clause or the ORDER BY. Since it's not cost based,
> >> Phoenix always chooses to optimize out the ORDER BY (because in general
> it's
> >> more expensive and it doesn't know how much data will be filtered out
> by the
> >> WHERE clause). By using the data table, we know that rows are already
> >> returned in PK order, so there's no reordering required. The hint is
> >> available to override this decision.
> >>
> >> It wouldn't be difficult to introduce some simple cost-based decisions
> if
> >> statistics collection is enabled. In that case, we can get an estimate
> at
> >> compile-time on how much data would be scanned when the index is used.
> If
> >> the amount is low enough, the optimizer could choose to use the index
> and
> >> reorder the results.
> >>
> >> Please file a JIRA and we can discuss further.
> >>
> >> Thanks,
> >> James
> >>
> >> On Fri, Oct 13, 2017 at 6:47 AM, Marcin Januszkiewicz
> >>  wrote:
> >>>
> >>> Small correction the index is local:
> >>> CREATE LOCAL INDEX t_c1_ix ON t (c1);
> >>>
> >>> On Fri, Oct 13, 2017 at 3:43 PM, Marcin Januszkiewicz
> >>>  wrote:
> >>> > Hi,
> >>> >
> >>> > we have some data in a phoenix table that we always want to fetch in
> >>> > the order determined by the primary key:
> >>> >
> >>> > CREATE TABLE t (
> >>> >   rowkey VARCHAR PRIMARY KEY,
> >>> >   c1 VARCHAR,
> >>> >   c2 VARCHAR,
> >>> > )
> >>> >
> >>> > SELECT rowkey, c1, c2 FROM t where c1 LIKE 'X0%' ORDER BY rowkey;
> >>> >
> >>> > We wanted to speed up searches using an index on c1:
> >>> >
> >>> > CREATE INDEX t_c1_ix ON t (c1);
> >>> >
> >>> > However, this index is only used if we drop the ORDER BY clause or
> use
> >>> > a query hint. If we sort by any other field, such as c2, the index
> >>> > will be used.
> >>> >
> >>> > Is this expected behavior? Is there any way of influencing phoenix to
> >>> > use the indexes without using an index hint? The actual table has
> more
> >>> > columns & indexes, and queries are creating programatically. Adding
> >>> > code which would decide which hint to generate would be a little
> >>> > problematic.
> >>
> >>
> >
>


Re: Upserting in batch into a column of all rows by concatenating multiple columns

2017-10-28 Thread James Taylor
See http://phoenix.apache.org/language/index.html#upsert_select
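
For example, assuming ID is the primary key of your table (adjust the names to
your schema), something along these lines should do it from JDBC; with auto
commit on and the target matching the source table, Phoenix can run the
UPSERT SELECT entirely on the server side:

import java.sql.Connection;
import java.sql.DriverManager;

public class FillBillId {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            conn.setAutoCommit(true);
            // Rewrite bill_id for every row from other columns of the same row.
            // If product_id/store_id aren't VARCHAR you may need to CAST them.
            conn.createStatement().executeUpdate(
                "UPSERT INTO my_table (id, bill_id) "
                + "SELECT id, product_id || '-' || store_id FROM my_table");
        }
    }
}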

On Sat, Oct 28, 2017 at 4:25 AM Vaghawan Ojha  wrote:

> Hi,
>
> I want to update a column's value in all rows by concatenating values from
> the multiple columns of the rows. In plain sql it's possible to do that but
> I'm not sure if that can be done in Phoenix.
>
> For say I've four columns in a table:
>
> (id, product_id, price, store_id, bill_id)
>
> Now I already have the values of id, product_id, price and store_id in the
> table. But now I want to set the values of bill_id in all rows to something
> like this:
>
> product_id || '-' || store_id
>
> Is it possible? If it is, then how do I do it. I don't know if that can be
> done with Atomic upsert.
>
> Any help would be great.
>
> Thanks
> Vaghawan
>


Re: Cloudera parcel update

2017-10-27 Thread James Taylor
Yes, that's the idea I was floating - if you and Pedro (and others) are
interested, you could contribute the CDH compatible versions and these
could be bundled in our source and binary releases. The community,
including yourselves, would support it (in the same way we support Phoenix
today). Not sure what's involved in setting up a repo of Phoenix parcels,
but perhaps that can be done on Apache infrastructure too.


On Fri, Oct 27, 2017 at 1:22 AM, Flavio Pompermaier 
wrote:

> I totally agree wrt some compatibility matrix. I think that CDH
> compatibility and parcel releases should be something that the Apache
> Phoenix community could support somehow... am I wrong?
>
> On Fri, Oct 27, 2017 at 10:06 AM, Pedro Boado 
> wrote:
>
>> There are problems regarding version support and bugfixes. As HBase in
>> CDH has been "enhanced" by Cloudera, we'd need to mark certain specific CDH
>> versions as supported (up to the major.minor.fix version). For instance, CDH
>> 5.11.1 doesn't pass tests because of an unsolved issue in Cloudera's HBase.
>>
>> I think this is not about sharing a parcel now and then stopping, right?
>> Also there are "financial" issues in terms of servers that I'd probably
>> need help with.
>>
>> On 27 Oct 2017 08:49, "Flavio Pompermaier"  wrote:
>>
>>> If you need a hand I can help, because we also need Phoenix on CDH. Maybe
>>> I could write some documentation about its installation and usage, in the
>>> README or on the official Phoenix site. Let's set up an unofficial (but
>>> working) repo of Phoenix parcels!
>>>
>>> On Fri, Oct 27, 2017 at 9:12 AM, Pedro Boado 
>>> wrote:
>>>
>>>> Well, I'd say it's the only option left for some of us to use Phoenix in
>>>> our clients/companies. Most of them only go for commercial distributions of
>>>> Hadoop for obvious reasons and cannot afford to keep up with frequent
>>>> platform updates. In our case each platform migration is bigger (in terms of
>>>> effort etc.) than some projects.
>>>>
>>>> Cloudera stopped porting it to CDH for strategic reasons.
>>>> Basically they say that Kudu does everything that Phoenix does... Marketing
>>>> strategy, I guess. And we cannot stay with Phoenix 4.5 and CDH 5.5 (and no
>>>> bugfixes) forever.
>>>>
>>>> On the other hand, Hortonworks already keeps Phoenix within their own
>>>> distribution, so in this case the vendor itself sees a strategic advantage
>>>> in this.
>>>>
>>>> As long as I work for the same company I'll have to port Phoenix to
>>>> CDH. So I could help with it guys.
>>>>
>>>> Maybe my only issue with publishing the parcels is getting web hosting for
>>>> them.
>>>>
>>>>
>>>>
>>>> On 27 Oct 2017 06:35, "James Taylor"  wrote:
>>>>
>>>>> This is great, Pedro. Thanks so much for porting everything over.
>>>>> Would it make sense to try to have a CDH compatible release with each
>>>>> Phoenix release? Who would sign up to do this? Same question for HDP
>>>>> releases.
>>>>>
>>>>> Thanks,
>>>>> James
>>>>>
>>>>> On Thu, Oct 26, 2017 at 2:43 PM, Pedro Boado 
>>>>> wrote:
>>>>>
>>>>>> Sorry, it's provided "as is". Try a "mvn clean package -D
>>>>>> skipTests=true".
>>>>>>
>>>>>> And grab the parcel from phoenix-parcel/target
>>>>>>
>>>>>> On 26 Oct 2017 22:21, "Flavio Pompermaier" 
>>>>>> wrote:
>>>>>>
>>>>>>> And how do you use the parcel? Where is it generated? Any
>>>>>>> documentation about this?
>>>>>>>
>>>>>>> On 26 Oct 2017 20:37, "Pedro Boado"  wrote:
>>>>>>>
>>>>>>>> I've done it for Phoenix 4.11 and CDH 5.11.2 based on previous work
>>>>>>>> from chiastic-security.
>>>>>>>>
>>>>>>>> https://github.com/pboado/phoenix-for-cloudera/tree/4.11-cdh5.11.2
>>>>>>>>
>>>>>>>> All integrations tests running, and I've added a parcel module for
>>>>>>>> parcel-generation in rhel6.
>>>>>>>>
>>>>>>>> Contributions are welcome for supporting other parcel formats.

Re: Cloudera parcel update

2017-10-26 Thread James Taylor
This is great, Pedro. Thanks so much for porting everything over. Would it
make sense to try to have a CDH compatible release with each Phoenix
release? Who would sign up to do this? Same question for HDP releases.

Thanks,
James

On Thu, Oct 26, 2017 at 2:43 PM, Pedro Boado  wrote:

> Sorry, it's provided "as is". Try a "mvn clean package -D skipTests=true".
>
> And grab the parcel from phoenix-parcel/target
>
> On 26 Oct 2017 22:21, "Flavio Pompermaier"  wrote:
>
>> And how do you use the parcel? Where is it generated? Any documentation
>> about this?
>>
>> On 26 Oct 2017 20:37, "Pedro Boado"  wrote:
>>
>>> I've done it for Phoenix 4.11 and CDH 5.11.2 based on previous work from
>>> chiastic-security.
>>>
>>> https://github.com/pboado/phoenix-for-cloudera/tree/4.11-cdh5.11.2
>>>
>>> All integrations tests running, and I've added a parcel module for
>>> parcel-generation in rhel6.
>>>
>>> Contributions are welcome for supporting other parcel formats.
>>>
>>> I've also tried to compile Phoenix 4.12 but some IT are failing and I
>>> haven't looked further into them yet.
>>>
>>> On 26 Oct 2017 18:42, "Flavio Pompermaier"  wrote:
>>>
>>> I could give it a try... any reference about it? Where can I find this
>>> latest parcel you produced? Any feedback from Cloudera?
>>>
>>>
>>> On 26 Oct 2017 18:38, "Jean-Marc Spaggiari" 
>>> wrote:
>>>
>>> It is. The parcel is not just a packaging of the Phoenix code into a
>>> different format. It requires some modifications. However, it's doable...
>>> Andrew applied those modifications to a later version and we packaged it
>>> into a parcel. So it's definitely doable. It might be interesting to do that
>>> for the latest version, but it will require some effort...
>>>
>>> 2017-10-26 12:07 GMT-04:00 Flavio Pompermaier :
>>>
 Looking at [1] they say that "Attempt to run the official
 apache-phoenix-4.10.0-HBase-1.2 on CDH 5.12 failed due to tight
 coupling of CLAB_PHOENIX with HBase, and what looks like subtle API
 incompatibilities between Phoenix-embedded and CDH HBase classes.
 Namespace mapping partially worked, however". Is that true?

 [1] https://community.cloudera.com/t5/Cloudera-Labs/Apache-Phoenix-Support-for-CDH-5-8-x-5-9x-5-10-x-5-11-x/m-p/58729/highlight/true#M378

 On Thu, Oct 26, 2017 at 12:13 AM, Flavio Pompermaier <
 pomperma...@okkam.it> wrote:

> I'll take care of it ;)
>
> On 25 Oct 2017 23:45, "Sergey Soldatov" 
> wrote:
>
>> Hi Flavio,
>>
>> It looks like you need to ask the vendor, not the community, about
>> their plans for further releases.
>>
>> Thanks,
>> Sergey
>>
>> On Wed, Oct 25, 2017 at 2:21 PM, Flavio Pompermaier <
>> pomperma...@okkam.it> wrote:
>>
>>> Hi to all,
>>> the latest Phoenix Cloudera parcel I can see is 4.7...any plan to
>>> release a newer version?
>>>
>>> I'd need at least Phoenix 4.9... anyone using it?
>>>
>>> Best,
>>> Flavio
>>>
>>
>>



>>>
>>>
>>>


Re: Load HFiles in Apache Phoenix

2017-10-20 Thread James Taylor
If you put together a nice example, we can post a link to it from the FAQ.
Sorry, but with open source, the answer is often "go look at the source
code". :-)

On Fri, Oct 20, 2017 at 2:13 PM, snhir...@gmail.com 
wrote:

>
>
> On 2017-10-20 17:07, James Taylor  wrote:
> > Load Phoenix into Eclipse and search for references to
> > PhoenixRuntime.getUncommittedDataIterator(). There's even a unit test
> > that does this.
> >
>
> Ok, I appreciate the response.  But I've already encountered the source
> code during my searches and it really isn't very enlightening in terms of
> how one simply uses it.  I'll take your advice and go after the unit test
> next.
>


Re: Load HFiles in Apache Phoenix

2017-10-20 Thread James Taylor
Load Phoenix into Eclipse and search for references to
PhoenixRuntime.getUncommittedDataIterator(). There's even a unit test that
does this.

On Fri, Oct 20, 2017 at 2:04 PM, snhir...@gmail.com 
wrote:

>
>
> On 2017-10-20 16:49, James Taylor  wrote:
> > Here's a little more info:
> > https://phoenix.apache.org/faq.html#Why_empty_key_value
> >
> > Lots of hits here too:
> > http://search-hadoop.com/?project=Phoenix&q=empty+key+value
> >
> > On Fri, Oct 20, 2017 at 1:45 PM, sn5  wrote:
> >
> > > It would be very helpful to see a complete, working example (preferably
> > > with some comments) of this HFile load technique.  Apparently it's a known
> > > idiom, but I've spent most of the afternoon searching Google and cannot
> > > find a single reference other than this thread.  In particular I do not
> > > understand what is meant by "...loading the empty column".
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/
> > >
> >
>
> Thanks, but that answers only a part of my question.  I would like to see
> a reference to the entire idiom of using the uncommitted data from a
> transaction that will be subsequently rolled back.  I can sort of infer
> what's going on from that original post, but cannot find any further
> references or examples.
>
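
[Editor's note] Since the full idiom asked about here is hard to find written down, below is a hedged sketch of it: upsert through Phoenix with auto-commit off, harvest the KeyValues Phoenix would have written (these already include the empty key value column every row needs, plus any index table mutations), and then roll back so nothing is sent to the region servers. The JDBC URL and table are placeholders, and the signature shown (returning Pair<byte[], List<KeyValue>>) is the Phoenix 4.x-era one; later versions return Cell instead. Treat it as a sketch, not a drop-in implementation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;
import org.apache.phoenix.util.PhoenixRuntime;

public class UncommittedDataSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
            // Keep auto-commit off so the upserted rows stay client-side.
            conn.setAutoCommit(false);

            // Placeholder table and columns; upsert as many rows as you like.
            try (PreparedStatement stmt =
                     conn.prepareStatement("UPSERT INTO my_table (id, c1) VALUES (?, ?)")) {
                stmt.setString(1, "row1");
                stmt.setString(2, "value1");
                stmt.executeUpdate();
            }

            // Ask Phoenix for the KeyValues it *would* have sent to HBase,
            // grouped by physical table (data table plus any index tables).
            // These cells already include Phoenix's empty key value column.
            Iterator<Pair<byte[], List<KeyValue>>> it =
                    PhoenixRuntime.getUncommittedDataIterator(conn);
            while (it.hasNext()) {
                Pair<byte[], List<KeyValue>> tableCells = it.next();
                String tableName = Bytes.toString(tableCells.getFirst());
                for (KeyValue kv : tableCells.getSecond()) {
                    // Placeholder step: hand each KeyValue to an HFile writer,
                    // e.g. via HFileOutputFormat2 in a MapReduce or Spark job.
                    System.out.println(tableName + ": " + kv);
                }
            }

            // Discard the client-side mutations so they are never written twice.
            conn.rollback();
        }
    }
}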
>

