Re: phoenix spark options not supporting query in dbtable

2017-08-17 Thread Josh Mahonin
You're mostly at the mercy of HBase and Phoenix to ensure that your data is evenly distributed in the underlying regions. You could look at pre-splitting or salting [1] your tables, as well as adjusting the guidepost parameters [2] if you need finer-tuned control. If you end up with more idle Spark…
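A minimal sketch of both knobs, issued through the Phoenix JDBC driver; the ZooKeeper quorum, table name, bucket count, and guidepost width below are illustrative placeholders, not recommendations:

    import java.sql.DriverManager

    // Connect through the Phoenix JDBC driver (quorum is a placeholder).
    val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
    val stmt = conn.createStatement()

    // Salting pre-splits the table into SALT_BUCKETS regions by
    // prepending a salt byte to the row key, spreading rows evenly.
    stmt.execute(
      """CREATE TABLE IF NOT EXISTS TABLE1 (
        |  ID BIGINT NOT NULL PRIMARY KEY,
        |  COL1 VARCHAR
        |) SALT_BUCKETS = 8""".stripMargin)

    // A smaller per-table guidepost width (in bytes) yields more
    // guideposts, and so more parallel scans, once statistics are next
    // collected; the cluster-wide default comes from the
    // phoenix.stats.guidepost.width setting in hbase-site.xml.
    stmt.execute("ALTER TABLE TABLE1 SET GUIDE_POSTS_WIDTH = 100000000")

    conn.close()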

Re: phoenix spark options not supporting query in dbtable

2017-08-17 Thread Kanagha
Thanks for the details. I tested it out and saw that the number of partitions equals the number of parallel scans run upon DataFrame load in Phoenix 4.10. Also, how can we ensure that the read gets evenly distributed as tasks across the number of executors set for the job? I'm running phoenixTableAsDataFrame…
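One way to confirm that observation, as a sketch: load the table and compare the DataFrame's partition count against the planner's parallel scans. TABLE1 and its columns are placeholders, and hbase-site.xml on the classpath is assumed to supply the ZooKeeper quorum.

    import org.apache.hadoop.conf.Configuration
    import org.apache.phoenix.spark._

    val configuration = new Configuration()
    val df = sqlContext.phoenixTableAsDataFrame(
      "TABLE1", Array("ID", "COL1"), conf = configuration)

    // Should match the number of parallel scans issued at load time.
    println(s"partitions = ${df.rdd.getNumPartitions}")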

Re: phoenix spark options not supporting query in dbtable

2017-08-17 Thread Josh Mahonin
Hi, Phoenix is able to parallelize queries based on the underlying HBase region splits, as well as its own internal guideposts based on statistics collection [1]. The phoenix-spark connector exposes those splits to Spark for the RDD / DataFrame parallelism. In order to test this out, you can try r…
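For instance, one hedged way to watch the guideposts take effect, reusing the sqlContext and configuration from the sketch above: re-collect statistics over JDBC, reload, and compare partition counts.

    import java.sql.DriverManager

    // Re-collect statistics so the query planner sees fresh guideposts.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
    conn.createStatement().execute("UPDATE STATISTICS TABLE1")
    conn.close()

    // More guideposts should surface as more DataFrame partitions.
    val reloaded = sqlContext.phoenixTableAsDataFrame(
      "TABLE1", Array("ID", "COL1"), conf = configuration)
    println(s"partitions after stats = ${reloaded.rdd.getNumPartitions}")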

Re: phoenix spark options not supporting query in dbtable

2017-08-17 Thread Kanagha
Also, I'm using the phoenixTableAsDataFrame API to read from a pre-split Phoenix table. How can we ensure the read is parallelized across all executors? Would salting/pre-splitting tables help in providing parallelism? Appreciate any inputs. Thanks, Kanagha. On Wed, Aug 16, 2017 at 10:16 PM, kanagha wrote…
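If the load produces fewer partitions than the job has executor cores, some executors will idle, since Spark schedules one task per partition. A rough workaround, continuing the sketch above with illustrative numbers, is to trade a shuffle for a wider spread:

    // Illustrative sizing: total cores = executors * cores per executor.
    val executors = 10
    val coresEach = 4

    // repartition() shuffles rows across all executors; coalesce() would
    // avoid the shuffle but can only reduce the partition count.
    val spread = df.repartition(executors * coresEach)
    spread.count() // any action; tasks now land on every executor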

Re: phoenix spark options not supporting query in dbtable

2017-08-16 Thread kanagha
Hi Josh, per your previous post, it is mentioned that "The phoenix-spark parallelism is based on the splits provided by the Phoenix query planner, and has no requirements on specifying partition columns or upper/lower bounds." Does it depend upon the region splits of the input table for parallelism? C…
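For contrast, it is Spark's generic JDBC source that needs those hints; a rough sketch of the difference, with placeholder connection details and bounds:

    // Generic JDBC source: parallelism must be declared by hand.
    val jdbcDf = sqlContext.read.format("jdbc")
      .option("url", "jdbc:phoenix:zk-host:2181")
      .option("dbtable", "TABLE1")
      .option("partitionColumn", "ID")
      .option("lowerBound", "0")
      .option("upperBound", "1000000")
      .option("numPartitions", "16")
      .load()

    // phoenix-spark: splits come from the query planner, no hints needed.
    val phoenixDf = sqlContext.phoenixTableAsDataFrame(
      "TABLE1", Array("ID", "COL1"), conf = configuration)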

Re: phoenix spark options not supporting query in dbtable

2016-06-09 Thread Josh Mahonin
SparkContext("local", "phoenix-test")* > > *val sqlContext = new SQLContext(sc)* > > > > *// Load the columns 'ID' and 'COL1' from TABLE1 as a DataFrame* > > *val df = sqlContext.phoenixTableAsDataFrame(* > > * "TABLE1&q

RE: phoenix spark options not supporting query in dbtable

2016-06-09 Thread Long, Xindian
"TABLE1", Array("ID", "COL1"), conf = configuration ) df.show From: Josh Mahonin [mailto:jmaho...@gmail.com] Sent: 2016年6月9日 9:44 To: user@phoenix.apache.org Subject: Re: phoenix spark options not supporint query in dbtable Hi Xindian, The phoenix-spark integration

Re: phoenix spark options not supporting query in dbtable

2016-06-09 Thread Josh Mahonin
Hi Xindian, The phoenix-spark integration is based on the Phoenix MapReduce layer, which doesn't support aggregate functions. However, as you mentioned, both filtering and pruning predicates are pushed down to Phoenix. With an RDD or DataFrame loaded, all of Spark's various aggregation methods are…
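A short sketch of that split in practice, with a DataFrame loaded as in the example above: the filter is pushed down to Phoenix as a predicate, while the aggregation itself runs in Spark.

    import org.apache.spark.sql.functions._

    df.filter(col("ID") > 100) // predicate pushed down to Phoenix
      .groupBy(col("COL1"))    // aggregation computed by Spark
      .count()
      .show()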