Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-24 Thread John Leach
Thanks Mujtaba… Regards, John > On Aug 24, 2016, at 2:29 PM, Mujtaba Chohan wrote: > > That sounds about right for loading CSV directly on a 5-8 node cluster. As > Gabriel/James mentioned in another thread, CSVBulkLoadTool with pre-split > table might offer significantly

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-24 Thread Mujtaba Chohan
That sounds about right for loading CSV directly on a 5-8 node cluster. As Gabriel/James mentioned in another thread, CSVBulkLoadTool with pre-split table might offer significantly better performance for large datasets. On Tue, Aug 23, 2016 at 2:17 PM, John Leach wrote:
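A minimal sketch of how the bulk-load invocation mentioned above might be assembled. The tool's class name in Phoenix is `org.apache.phoenix.mapreduce.CsvBulkLoadTool`; the jar name, table name, HDFS path, and ZooKeeper quorum below are placeholders and do not come from this thread. The helper only builds the command line and does not run anything.

```python
import shlex

# Placeholders (assumptions): adjust to the local Phoenix install.
CLIENT_JAR = "phoenix-client.jar"
TOOL_CLASS = "org.apache.phoenix.mapreduce.CsvBulkLoadTool"


def bulk_load_command(table, hdfs_input, zk_quorum=None):
    """Build the `hadoop jar` argument list for a CSV bulk load."""
    cmd = [
        "hadoop", "jar", CLIENT_JAR, TOOL_CLASS,
        "--table", table,
        "--input", hdfs_input,
    ]
    if zk_quorum:
        # Optional: point the tool at a specific ZooKeeper quorum.
        cmd += ["--zookeeper", zk_quorum]
    return cmd


# Print the command for inspection instead of executing it.
print(shlex.join(bulk_load_command("LINEITEM", "/data/tpch/lineitem")))
```

The table would be created and pre-split before the job is submitted, so that the generated HFiles are spread across regions from the start.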

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-23 Thread John Leach
So to load a TB of data would take around 2 days? Does that seem right to you? Regards, John > On Aug 23, 2016, at 3:07 PM, Mujtaba Chohan wrote: > > Since there are 100 files on which this 600M row data is split. 5 separate > psql script running in parallel on single
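The two-day figure can be checked with a rough back-of-the-envelope calculation. The sketch below uses the 600M rows in 5.5 hours reported elsewhere in this thread and assumes, as a labeled assumption that the thread does not state, that the 1 TB TPC-H scale holds roughly 6 billion LINEITEM rows and that load time scales linearly with row count.

```python
# Figures reported in this thread: 600M rows loaded in 5.5 hours.
ROWS_LOADED = 600_000_000
LOAD_HOURS = 5.5

# Assumption (not from the thread): about 6 billion LINEITEM rows at the
# 1 TB scale, i.e. 10x the 600M-row data set.
ROWS_AT_1TB = 6_000_000_000


def rows_per_second(rows, hours):
    """Average ingest rate over the whole load."""
    return rows / (hours * 3600)


def estimated_days(target_rows, rows, hours):
    """Linear extrapolation of load time to a larger row count."""
    return target_rows / rows * hours / 24


rate = rows_per_second(ROWS_LOADED, LOAD_HOURS)
days = estimated_days(ROWS_AT_1TB, ROWS_LOADED, LOAD_HOURS)
print(f"{rate:,.0f} rows/s, about {days:.1f} days at the 1 TB scale")
```

Under these assumptions the estimate comes out a little over two days, which is in line with the question above. Real load times need not scale linearly, so this is only a sanity check.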

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-23 Thread Mujtaba Chohan
The 600M rows of data are split across 100 files. Five separate psql scripts ran in parallel on a single machine, loading data from files 1-20, 21-40, 41-60, 61-80, and 81-100. Performance is affected because the keys in these files are in sequence, which leads to hot-spotting of the region servers (RS); for this
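A toy model of why sequential keys cause hot-spotting in this setup. It assumes a hypothetical table evenly pre-split into 8 key-range regions and keys numbered 0 to 599,999,999 in file order; neither detail is stated in the thread. It reports which regions receive writes when the five clients start on their first files.

```python
from collections import Counter

NUM_REGIONS = 8          # hypothetical: evenly pre-split key ranges
KEY_SPACE = 600_000_000  # keys 0..599,999,999, matching the 600M rows
NUM_FILES = 100
CLIENTS = 5


def region_for(key):
    """Range partitioning: each region owns a contiguous slice of keys."""
    return key * NUM_REGIONS // KEY_SPACE


def first_key_of_file(file_no):
    """Files hold consecutive key ranges; file numbers start at 1."""
    return (file_no - 1) * KEY_SPACE // NUM_FILES


# Each client starts on the first file of its group: 1, 21, 41, 61, 81.
start_files = [1 + i * (NUM_FILES // CLIENTS) for i in range(CLIENTS)]
busy = Counter(region_for(first_key_of_file(f)) for f in start_files)

print("regions receiving writes:", sorted(busy))
print("idle regions:", NUM_REGIONS - len(busy))
```

In this model each client keeps writing to the same region until its keys cross a region boundary, so at most five regions are active at once and the rest sit idle. A table that is not pre-split would be worse still, since all writes would initially land on a single region.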

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-23 Thread John Leach
Mujtaba, Not following the import process. The 5 parallel psql clients means that you manually split the data into 5 buckets/files/directories and then run 5 import scripts simultaneously? If we wanted to benchmark import performance, what would be the right model for that? Thanks this is
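One way to picture the import described elsewhere in this thread (five psql clients on one machine, each loading 20 of the 100 files) is the sketch below. The psql.py path, ZooKeeper quorum, and CSV file names are hypothetical placeholders; the `-t TABLE quorum file.csv` argument order follows the usual psql.py usage. Nothing runs until `load_all()` is called.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

PSQL = "bin/psql.py"     # placeholder path inside a Phoenix install
ZK_QUORUM = "localhost"  # placeholder ZooKeeper quorum
TABLE = "TPCH.LINEITEM"  # table queried elsewhere in this thread


def file_groups(num_files=100, clients=5):
    """Split file numbers 1..num_files into equal consecutive groups."""
    size = num_files // clients
    return [range(i * size + 1, (i + 1) * size + 1) for i in range(clients)]


def commands_for(group):
    """One psql.py invocation per CSV file (file names are hypothetical)."""
    return [[PSQL, "-t", TABLE, ZK_QUORUM, f"lineitem_{n:03d}.csv"]
            for n in group]


def run_group(group):
    """A single client: load its files one after another."""
    for cmd in commands_for(group):
        subprocess.run(cmd, check=True)


def load_all():
    """Run the five clients in parallel, one worker thread per group."""
    groups = file_groups()
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        list(pool.map(run_group, groups))
```

Threads are sufficient here because each worker only waits on an external process. For a benchmark, the grouping and the client count are the knobs one would likely vary.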

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-23 Thread Mujtaba Chohan
FYI re-loaded with Phoenix 4.8/HBase 0.98.20 on an 8-node cluster with 64G total/12G HBase heap.

*Data Load*
* 5.5 hours for 600M rows
* Method: Direct CSV load using psql.py script
* # client machines: 1
* Batch size 1K
* Key order: *Sequential*
* 5 parallel psql clients
* No missing rows due to

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-22 Thread John Leach
It looks like you guys already have most of the TPCH queries running based on Enis’s talk in Ireland this year. Very cool. (Slide 20: Phoenix can execute most of the TPC-H queries!) Regards, John Leach > On Aug 19, 2016, at 8:28 PM, Nick Dimiduk wrote: > > It's

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-19 Thread Andrew Purtell
> Maybe there's such a test harness that already exists for TPC? TPC provides tooling but it's all proprietary. The generated data can be kept separately (Druid does it at least - http://druid.io/blog/2014/03/17/benchmarking-druid.html). I'd say there would be one time setup: generation of

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-19 Thread James Taylor
On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell wrote: > > I have a long interest in 'canned' loadings. Interesting ones are hard to > > come by. If Phoenix ran any or a subset of TPCs, I'd like to try it. > > Likewise > > > But I don't want to be the first to try it. I am

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-19 Thread larsh
for this. -- Lars From: John Leach <jle...@splicemachine.com> To: dev@phoenix.apache.org; la...@apache.org Sent: Friday, August 19, 2016 2:34 PM Subject: Re: Issues while Running Apache Phoenix against TPC-H data Sorry for the delay on this end… Each Region Server has 24 Gigs

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-19 Thread Andrew Purtell
> I have a long interest in 'canned' loadings. Interesting ones are hard to > come by. If Phoenix ran any or a subset of TPCs, I'd like to try it. Likewise > But I don't want to be the first to try it. I am not a Phoenix expert. Same here, I'd just email dev@phoenix with a report that TPC query

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-19 Thread Stack
On Fri, Aug 19, 2016 at 1:19 PM, James Taylor wrote: > On Fri, Aug 19, 2016 at 11:37 AM, Stack wrote: > > > On Thu, Aug 18, 2016 at 5:54 PM, James Taylor > > wrote: > > > > > The data loaded fine for us. > > > > > > Mind

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-19 Thread John Leach
> sizes, etc). Phoenix runs inside of the region server, and hence their > configuration is extremely important. > -- Lars > > From: James Taylor <jamestay...@apache.org> > To: "dev@phoenix.apache.org" <dev@phoenix.apache.org> > Sent: Friday, Augus

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-19 Thread larsh
tay...@apache.org> To: "dev@phoenix.apache.org" <dev@phoenix.apache.org> Sent: Friday, August 19, 2016 1:19 PM Subject: Re: Issues while Running Apache Phoenix against TPC-H data On Fri, Aug 19, 2016 at 11:37 AM, Stack <st...@duboce.net> wrote: > On Thu, Aug 18, 2016 at

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-19 Thread James Taylor
On Fri, Aug 19, 2016 at 11:37 AM, Stack wrote: > On Thu, Aug 18, 2016 at 5:54 PM, James Taylor > wrote: > > > The data loaded fine for us. > > > Mind describing what you did to get it to work and with what versions and > configurations and with what TPC

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-19 Thread Stack
On Thu, Aug 18, 2016 at 5:54 PM, James Taylor wrote: > The data loaded fine for us. Mind describing what you did to get it to work and with what versions and configurations and with what TPC loading and how much of the workload was supported? Was it a one-off project?

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-18 Thread James Taylor
The data loaded fine for us. If TPC is not representative of real workloads, I'm not sure there's value in spending a lot of time running them. But if it's important to the dev/user community and gets contributed, that'd be great too. I guess that's one of the great things about open source.

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-18 Thread Stack
On Thu, Aug 18, 2016 at 3:00 PM, James Taylor wrote: > On Thu, Aug 18, 2016 at 10:48 AM, Stack wrote: > ... > I'm not sure how the TPC benchmarks map to the real world use > cases of our user community. I'd think the TPC loadings would be worth

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-18 Thread Peter Conrad
Got it. So in the meantime I will try to keep my eyes on the questions as they come in and I'll figure out a way to capture the answers. I'm pretty focused on the Tuning Guide for now, but maybe I'll start looking at other ways to improve the docs (unless I get swamped by other priorities). Have

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-18 Thread James Taylor
Thanks, Peter. The main means of interaction at Apache are email and JIRAs. These can then lead to commits (including website updates). I think it's less about the medium of communication and more about defining the right processes, coordination, workflow, and automation that would need to be

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-18 Thread Peter Conrad
James: Is there a formalized way that people from the community can get me information that I can then collate, restructure, and rewrite into docs? I am on the email lists, and I'm doing what I can to collect information from there, but a more focused effort might also be productive. Peter

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-18 Thread James Taylor
On Thu, Aug 18, 2016 at 10:48 AM, Stack wrote: > > > Would be cool if there was a page on how to do tpc-h along with what works > and what does not from the suite, even if it was just for the latest > release. Yes, agreed. That'd be a good first contribution - a one pager on

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-18 Thread Stack
It was 4.7 phoenix on what version of hadoop/hbase Amit? Seven hours seems too long to load the data. > It varies greatly on a use case by use case basis and requires experimentation. Would be cool if there was a page on how to do tpc-h along with what works and what does not from the suite,

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-16 Thread Amit Mudgal
Hi Teams, Apologies for the late reply, but I was trying to upload the data into the LINEITEM table and my experience was not very good with the older version of Phoenix (4.7), though we did have a beefy cluster, as pointed out by my colleague earlier. After the jobs completed I have seen some erratic

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-15 Thread Aaron Molitor
James, I am working with Amit on this task. We have switched to a 9-node (8 RS) cluster running HDP 2.4.2 with a mostly vanilla install. I think our next steps are to incorporate Mujtaba's changes into our cluster config and re-run; we'll factor in your suggestions as well. Is there a

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-15 Thread James Taylor
Hi Aaron, For commercial distros, you need to talk to the vendor. HDP 2.4.2 has a very old version of Phoenix - 4.4 which is 4 minor releases back (an eon in OS time). If you need something with commercial support, maybe you can get an early access of the next HDP release, but I'd recommend just

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-15 Thread James Taylor
Hi Amit, Couple more performance tips on top of what Mujtaba already mentioned: - Use the latest Phoenix (4.8.0). There are some great performance enhancements in here, especially around usage of DISTINCT. We've also got some new encoding schemes to reduce table sizes in our encodecolumns branch

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-12 Thread Mujtaba Chohan
Hi Amit,

* What's the heap size of each of your region servers?
* Do you see a huge amount of disk reads when you do a select count(*) from tpch.lineitem? If yes, then try setting snappy compression on your table followed by a major compaction
* Were there any deleted rows in this table? What's the

Issues while Running Apache Phoenix against TPC-H data

2016-08-12 Thread Amit Mudgal
Hi Dev team, I was evaluating Apache Phoenix against the TPC-H data based on the presentation given at Hadoop Summit in June stating that most TPC-H queries should run. Here are the setup details I have in my local environment: 1. One master node and 3 region servers with