Re: PySpark and Phoenix Dynamic Columns

2017-02-24 Thread Josh Mahonin
Hi Craig,

I think this is an open issue in PHOENIX-2648 (
https://issues.apache.org/jira/browse/PHOENIX-2648)

There seems to be a workaround by using a 'VIEW' instead, as mentioned in
that ticket.
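
For reference, a rough sketch of that approach (the view name is just an
example, and it assumes the dynamic column lives in the table's default
column family):

   -- run in sqlline; exposes the dynamic column as a regular view column
   CREATE VIEW MYTABLE_VIEW ("dynamic_column" VARCHAR)
   AS SELECT * FROM MYTABLE;

You can then point the connector at the view, i.e. .option("table",
"MYTABLE_VIEW"), and the column should show up in the DataFrame schema.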

Good luck,

Josh

On Thu, Feb 23, 2017 at 11:56 PM, Craig Roberts 
wrote:

> Hi all,
>
> I've got a (very) basic Spark application in Python that selects some
> basic information from my Phoenix table. I can't quite figure out how (or
> even if I can) select dynamic columns through this, however.
>
> Here's what I have;
>
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext
>
> conf = SparkConf().setAppName("pysparkPhoenixLoad").setMaster("local")
> sc = SparkContext(conf=conf)
> sqlContext = SQLContext(sc)
>
> df = sqlContext.read.format("org.apache.phoenix.spark") \
>.option("table", """MYTABLE("dyamic_column" VARCHAR)""") \
>.option("zkUrl", "127.0.0.1:2181:/hbase-unsecure") \
>.load()
>
> df.show()
> df.printSchema()
>
>
> I get a "org.apache.phoenix.schema.TableNotFoundException:" error for the
> above.
>
> If I register the data frame as a table and try to query it with SQL:
>
> sqlContext.registerDataFrameAsTable(df, "test")
> sqlContext.sql("""SELECT * FROM test("dynamic_column" VARCHAR)""")
>
>
> I get a bit of a strange exception:
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o37.sql.
> : java.lang.RuntimeException: [1.19] failure: ``union'' expected but `('
> found
>
> SELECT * FROM test("dynamic_column" VARCHAR)
>
>
>
> Does anybody have a pointer on whether this is supported and how I might
> be able to query a dynamic column? I haven't found much information on the
> wider Internet about Spark + Phoenix integration for this kind of
> thing... Simple selects are working. Final note: I have (rather stupidly)
> lower-cased my column names in Phoenix, so I need to quote them when I
> execute a query (I'll be changing this as soon as possible).
>
> Any assistance would be appreciated :)
> *-- Craig*
>


Re: Select date range performance issue

2017-02-24 Thread NaHeon Kim
Jonathan,

Such helpful explanations, thank you!
A skip scan, and an index that includes every column used in the query,
make a big difference. I'll experiment more with ROW_TIMESTAMP and see what
happens.

Thanks,
NaHeon

2017-02-24 13:25 GMT+09:00 Jonathan Leech :

> 1. No, I am not confused. A skip scan would "skip" over entire ranges of
> obj_id, and over all of the create_dt values for each one. This will only
> be effective if there are far fewer distinct values of obj_id than there
> are total rows. If there are too many distinct obj_ids, it either won't
> speed the query up at all, or not enough, but it's simple to try it and see.
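>
> As a sketch, the hint just goes right after SELECT in the query from the
> original message; whether it helps depends entirely on the obj_id
> cardinality:
>
> -- tells Phoenix to skip between obj_id ranges instead of scanning every row
> SELECT /*+ SKIP_SCAN */ KEYWORD, SUM(COUNT)
> FROM MY_TABLE
> WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
> AND CREATE_DT < to_timestamp('2016-04-01 00:00')
> GROUP BY KEYWORD;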
>
> 2. Your index isn't used because it doesn't contain the other columns used
> in the query, i.e. your query isn't "covered". You get the column(s)
> defined in the index + anything in the rowkey. You can also use the
> "include" keyword to add other columns to the index, as sketched below.
> Alternatively, you can look at "local" indexes, or it may be possible to
> nest the query with a sub-select that fetches the desired primary key
> values from the index by create_dt and the rest from the main table.
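>
> For example, a covered version of the index for that query might look like
> this (index name is just illustrative; KEYWORD is already carried in the
> index rowkey because it is part of the table's primary key):
>
> -- CREATE_DT is the indexed column, COUNT is carried as a covered column
> CREATE INDEX IDX_CREATE_DT_COVERED ON MY_TABLE (CREATE_DT)
> INCLUDE (COUNT);
>
> With COUNT included, the aggregation can be served entirely from the index,
> and EXPLAIN should then show a range scan over it rather than a full scan.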
>
> 3. No, not at all. Phoenix will assign the internal HBase timestamp of
> the row to whatever value you set for create_dt. It can also automatically
> set it to the current time when you create the row, if you want it to. This
> has other implications, e.g. if you set a TTL, versions, etc. in HBase. It
> can speed up queries, especially those that run over the most recently
> written data, prior to HBase compaction. Advanced stuff, and performance is
> highly dependent on your specific use case and HBase compaction settings...
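>
> As a rough sketch of the declaration (the Phoenix docs' examples put the
> ROW_TIMESTAMP column first in the primary key, which also matches your
> observation below that leading with CREATE_DT gives a range scan):
>
> CREATE TABLE MY_TABLE (
>    CREATE_DT timestamp not null,
>    OBJ_ID varchar(20) not null,
>    KEYWORD varchar(100) not null,
>    COUNT integer,
>    -- CREATE_DT also becomes the internal HBase cell timestamp
>    CONSTRAINT PK PRIMARY KEY (CREATE_DT ROW_TIMESTAMP, OBJ_ID, KEYWORD)
> );
>
> It can only be declared at table creation time, and reordering the key this
> way changes how lookups by OBJ_ID behave, so weigh it against your other
> access patterns.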
>
> On Feb 23, 2017, at 7:59 PM, NaHeon Kim  wrote:
>
> Thanks for suggestion.
>
> Here's further questions:
> 1. create_dt (not obj_id; I think you mixed them up) has a large set of
> distinct dates, so the SKIP_SCAN hint might not be useful.
>
> 2. I created a secondary index on create_dt:
>    create index IDX1_CREATE_DT on MY_TABLE(CREATE_DT);
>
> However, EXPLAIN still shows a FULL SCAN query plan.
> Giving an index hint on the SELECT doesn't help either.
>
> CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER
> NEWS_KEYWORD_COUNT
>SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND
> CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
>SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
> CLIENT MERGE SORT
> CLIENT 100 ROW LIMIT
>
> 3. ROW_TIMESTAMP is the time the query is executed, right?
> Then it's not the right choice. :-(
>
>
> 2017-02-24 1:54 GMT+09:00 Jonathan Leech :
>
>> If there aren't too many distinct values of obj_id, try a SKIP_SCAN hint.
>> Otherwise, the secondary index should work; make sure it's actually used
>> via EXPLAIN. Finally, you might try the ROW_TIMESTAMP feature if it fits
>> your use case.
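>>
>> For example, checking (and if necessary forcing) the index would look
>> roughly like this, with the index and table names used elsewhere in this
>> thread; whether it actually speeds things up still depends on the index
>> covering the query:
>>
>> EXPLAIN
>> SELECT /*+ INDEX(MY_TABLE IDX1_CREATE_DT) */ KEYWORD, SUM(COUNT)
>> FROM MY_TABLE
>> WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
>> AND CREATE_DT < to_timestamp('2016-04-01 00:00')
>> GROUP BY KEYWORD;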
>>
>> On Feb 22, 2017, at 11:30 PM, NaHeon Kim  wrote:
>>
>> Hi all,
>> I've seen a performance problem when selecting rows within a date range.
>>
>> My table schema is:
>>
>> CREATE TABLE MY_TABLE (
>>OBJ_ID varchar(20) not null,
>>CREATE_DT timestamp not null,
>>KEYWORD varchar(100) not null,
>>COUNT integer,
>>CONSTRAINT PK PRIMARY KEY (OBJ_ID,CREATE_DT,KEYWORD)
>> );
>>
>> MY_TABLE has almost 5,200,000 rows,
>> and CREATE_DT spans about 6 months.
>>
>> And executed query:
>>
>> SELECT KEYWORD, SUM(COUNT)
>> FROM MY_TABLE
>> WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
>> AND CREATE_DT < to_timestamp('2016-04-01 00:00')
>> GROUP BY KEYWORD;
>>
>> It took 46 seconds, much slower than expected, since CREATE_DT is part of
>> the row key.
>> I created a secondary index on CREATE_DT but there's no improvement.
>>
>> Query plan looks weird:
>> CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER
>> NEWS_KEYWORD_COUNT
>>  SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000'
>> AND CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
>>  SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
>> CLIENT MERGE SORT
>> CLIENT 100 ROW LIMIT
>>
>> BUT if CREATE_DT comes first in the row key, the plan says a range scan
>> will be done.
>>
>> Any suggestion? : )
>>
>> Thanks,
>> NaHeon
>>
>>
>