Thanks Mujtaba…
Regards,
John
> On Aug 24, 2016, at 2:29 PM, Mujtaba Chohan wrote:
>
> That sounds about right for loading CSV directly on a 5-8 node cluster. As
> Gabriel/James mentioned in another thread, CSVBulkLoadTool with a pre-split
> table might offer significantly better performance for large datasets.
That sounds about right for loading CSV directly on a 5-8 node cluster. As
Gabriel/James mentioned in another thread, CSVBulkLoadTool with a pre-split
table might offer significantly better performance for large datasets.
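For reference, a rough sketch of what that could look like end to end. The
table definition is abbreviated, and the ZooKeeper quorum, HDFS path, and
salt-bucket count below are placeholders, not recommendations:

# Pre-split the table at creation time so the load spreads across region
# servers instead of hot-spotting one region. SALT_BUCKETS is the simplest
# way to pre-split; explicit SPLIT ON points work too.
cat > create_lineitem.sql <<'EOF'
CREATE TABLE TPCH.LINEITEM (
    L_ORDERKEY BIGINT NOT NULL,
    L_LINENUMBER INTEGER NOT NULL,
    L_QUANTITY DECIMAL,
    L_SHIPDATE DATE,
    CONSTRAINT PK PRIMARY KEY (L_ORDERKEY, L_LINENUMBER)
) SALT_BUCKETS = 16;
EOF
sqlline.py zk-host:2181 create_lineitem.sql

# Bulk load via MapReduce: builds HFiles directly and hands them to HBase,
# bypassing the per-row UPSERT path that psql.py goes through.
hadoop jar phoenix-<version>-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table TPCH.LINEITEM \
    --input /hdfs/path/to/lineitem \
    --zookeeper zk-host:2181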
On Tue, Aug 23, 2016 at 2:17 PM, John Leach wrote:
So loading a TB of data would take around 2 days? Does that seem right to you?
Regards,
John
> On Aug 23, 2016, at 3:07 PM, Mujtaba Chohan wrote:
>
> Since this 600M-row dataset is split across 100 files, 5 separate psql
> scripts ran in parallel on a single machine …
Since this 600M-row dataset is split across 100 files, 5 separate psql
scripts ran in parallel on a single machine, loading files 1-20, 21-40,
41-60, 61-80, and 81-100. Performance gets affected because the keys in
these files are in sequence, which leads to hot-spotting of region servers;
for this …
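In shell terms, the load looked roughly like this; the file layout, table
name, and ZooKeeper quorum are invented for illustration:

#!/bin/bash
# 5 psql.py clients on one machine, each loading a disjoint fifth of the
# 100 CSV files (1-20, 21-40, 41-60, 61-80, 81-100).
ZK=zk-host:2181
for start in 1 21 41 61 81; do
  files=$(seq -f "/data/lineitem/part-%03g.csv" "$start" "$((start + 19))")
  psql.py -t TPCH.LINEITEM "$ZK" $files &
done
wait  # block until all 5 loaders finish

Since each file holds keys in sequence, every client still funnels into one
region at a time, which is the hot-spotting described above; salting or
pre-splitting the table is what spreads that load.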
Mujtaba,
I'm not following the import process.
Do the 5 parallel psql clients mean that you manually split the data into 5
buckets/files/directories and then ran 5 import scripts simultaneously?
If we wanted to benchmark import performance, what would be the right model
for that?
Thanks, this is …
FYI, re-loaded with Phoenix 4.8/HBase 0.98.20 on an 8-node cluster with 64G
total/12G HBase heap.
*Data Load*
* 5.5 hours for 600M rows
* Method: Direct CSV load using psql.py script
* # client machines: 1
* Batch size 1K
* Key order: *Sequential*
* 5 parallel psql clients
* No missing rows due to …
It looks like you guys already have most of the TPC-H queries running, based
on Enis’s talk in Ireland this year. Very cool.
(Slide 20: Phoenix can execute most of the TPC-H queries!)
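For a sense of what "most of the TPC-H queries" covers, here's a sketch of
Q1 (the pricing-summary query), trimmed to a few aggregates and run through
sqlline; the connection string is a placeholder and the schema is assumed
to follow the standard TPC-H LINEITEM layout:

cat > q1.sql <<'EOF'
-- TPC-H Q1, abbreviated: a scan-heavy GROUP BY over LINEITEM.
SELECT l_returnflag, l_linestatus,
       SUM(l_quantity)      AS sum_qty,
       SUM(l_extendedprice) AS sum_base_price,
       AVG(l_discount)      AS avg_disc,
       COUNT(*)             AS count_order
FROM tpch.lineitem
WHERE l_shipdate <= TO_DATE('1998-09-02')
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
EOF
sqlline.py zk-host:2181 q1.sql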
Regards,
John Leach
> On Aug 19, 2016, at 8:28 PM, Nick Dimiduk wrote:
>
> It's …
> Maybe such a test harness already exists for TPC?
TPC provides tooling, but it's all proprietary. The generated data can be
kept separately (Druid does this, at least:
http://druid.io/blog/2014/03/17/benchmarking-druid.html).
I'd say there would be a one-time setup: generation of the data.
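Something along these lines, using the TPC-H dbgen tool; the scale factor
and HDFS paths are arbitrary examples:

# One-time setup: build dbgen from the TPC-H tools (tpc.org; build steps
# vary by platform), then generate the dataset once. Scale factor 100 is
# roughly 100 GB of raw data.
make -C tpch-dbgen
(cd tpch-dbgen && ./dbgen -s 100)

# dbgen writes pipe-delimited .tbl files with a trailing '|' on each row;
# strip it, then stage the files in HDFS so every run reuses the same data.
sed 's/|$//' tpch-dbgen/lineitem.tbl > lineitem.tbl.clean
hdfs dfs -mkdir -p /benchmarks/tpch/sf100
hdfs dfs -put lineitem.tbl.clean /benchmarks/tpch/sf100/
# Both psql.py and CsvBulkLoadTool accept -d '|' for the pipe delimiter.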
On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell wrote:
> > I have a long interest in 'canned' loadings. Interesting ones are hard to
> > come by. If Phoenix ran any or a subset of TPCs, I'd like to try it.
>
> Likewise
>
> > But I don't want to be the first to try it. I am not a Phoenix expert.
… for this.
-- Lars
From: John Leach <jle...@splicemachine.com>
To: dev@phoenix.apache.org; la...@apache.org
Sent: Friday, August 19, 2016 2:34 PM
Subject: Re: Issues while Running Apache Phoenix against TPC-H data
Sorry for the delay on this end…
Each region server has 24 GB.
> I have a long interest in 'canned' loadings. Interesting ones are hard to
> come by. If Phoenix ran any or a subset of TPCs, I'd like to try it.
Likewise
> But I don't want to be the first to try it. I am not a Phoenix expert.
Same here, I'd just email dev@phoenix with a report that TPC query …
On Fri, Aug 19, 2016 at 1:19 PM, James Taylor wrote:
> On Fri, Aug 19, 2016 at 11:37 AM, Stack wrote:
>
> > On Thu, Aug 18, 2016 at 5:54 PM, James Taylor wrote:
> >
> > > The data loaded fine for us.
> >
> >
> > Mind …
> … sizes, etc.). Phoenix runs inside the region servers, and hence their
> configuration is extremely important.
> -- Lars
>
> From: James Taylor <jamestay...@apache.org>
> To: "dev@phoenix.apache.org" <dev@phoenix.apache.org>
> Sent: Friday, August 19, 2016 1:19 PM
From: James Taylor <jamestay...@apache.org>
To: "dev@phoenix.apache.org" <dev@phoenix.apache.org>
Sent: Friday, August 19, 2016 1:19 PM
Subject: Re: Issues while Running Apache Phoenix against TPC-H data
On Fri, Aug 19, 2016 at 11:37 AM, Stack <st...@duboce.net> wrote:
> On Thu, Aug 18, 2016 at 5:54 PM, James Taylor wrote:
On Fri, Aug 19, 2016 at 11:37 AM, Stack wrote:
> On Thu, Aug 18, 2016 at 5:54 PM, James Taylor wrote:
>
> > The data loaded fine for us.
>
>
> Mind describing what you did to get it to work, with what versions and
> configurations, with what TPC loading, and how much of the workload was
> supported? Was it a one-off project?
On Thu, Aug 18, 2016 at 5:54 PM, James Taylor wrote:
> The data loaded fine for us.
Mind describing what you did to get it to work, with what versions and
configurations, with what TPC loading, and how much of the workload was
supported? Was it a one-off project?
The data loaded fine for us. If the TPC benchmarks are not representative of
real workloads, I'm not sure there's value in spending a lot of time running
them. But if it's important to the dev/user community and gets contributed,
that'd be great too. I guess that's one of the great things about open
source.
On Thu, Aug 18, 2016 at 3:00 PM, James Taylor wrote:
> On Thu, Aug 18, 2016 at 10:48 AM, Stack wrote:
>
...
> I'm not sure how the TPC benchmarks map to the real-world use
> cases of our user community.
I'd think the TPC loadings would be worth …
Got it. So in the meantime I will try to keep my eyes on the questions as
they come in, and I'll figure out a way to capture the answers. I'm pretty
focused on the Tuning Guide for now, but maybe I'll start looking at other
ways to improve the docs (unless I get swamped by other priorities).
Have …
Thanks, Peter. The main means of interaction at Apache are email and JIRAs.
These can then lead to commits (including website updates). I think it's
less about the medium of communication and more about defining the right
processes, coordination, workflow, and automation that would need to be
put in place.
James:
Is there a formalized way that people from the community can get me
information that I can then collate, restructure, and rewrite into docs? I
am on the email lists, and I'm doing what I can to collect information from
there, but a more focused effort might also be productive.
Peter
On Thu, Aug 18, 2016 at 10:48 AM, Stack wrote:
>
>
> Would be cool if there was a page on how to do TPC-H along with what works
> and what does not from the suite, even if it was just for the latest
> release.
Yes, agreed. That'd be a good first contribution - a one-pager on how to run
TPC-H.
It was Phoenix 4.7 on what version of Hadoop/HBase, Amit?
Seven hours seems too long to load the data.
> It varies greatly on a use-case-by-use-case basis and requires
> experimentation.
Would be cool if there was a page on how to do TPC-H along with what works
and what does not from the suite, even if it was just for the latest
release.
Hi team,
Apologies for the late reply, but I was trying to upload the data into the
LINEITEM table, and my experience was not very good with the older version
of Phoenix (4.7), though we did have a beefy cluster, as pointed out by my
colleague earlier.
After the jobs completed, I saw some erratic …
James,
I am working with Amit on this task. We have switched to a 9-node (8 RS)
cluster running HDP 2.4.2 with a mostly vanilla install. I think our next
steps are to incorporate Mujtaba's changes into our cluster config and
re-run; we'll factor in your suggestions as well.
Is there a …
Hi Aaron,
For commercial distros, you need to talk to the vendor. HDP 2.4.2 has a very
old version of Phoenix (4.4), which is 4 minor releases back (an eon in
open-source time). If you need something with commercial support, maybe you
can get early access to the next HDP release, but I'd recommend just …
Hi Amit,
A couple more performance tips on top of what Mujtaba already mentioned:
- Use the latest Phoenix (4.8.0). There are some great performance
enhancements in there, especially around usage of DISTINCT. We've also got
some new encoding schemes to reduce table sizes in our encodecolumns branch.
Hi Amit,
* What's the heap size of each of your region servers?
* Do you see a huge amount of disk reads when you do a SELECT COUNT(*) FROM
tpch.lineitem? If yes, then try setting Snappy compression on your table,
followed by a major compaction (see the sketch after this list).
* Were there any deleted rows in this table? What's the …
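Roughly, from the HBase shell, assuming the data lives in Phoenix's default
column family '0' (adjust if your schema differs):

# Enable Snappy on the column family, then rewrite the existing HFiles
# compressed by forcing a major compaction. On HBase 0.98, altering an
# enabled table may be disallowed, hence the disable/enable around it.
echo "disable 'TPCH.LINEITEM'
alter 'TPCH.LINEITEM', {NAME => '0', COMPRESSION => 'SNAPPY'}
enable 'TPCH.LINEITEM'
major_compact 'TPCH.LINEITEM'" | hbase shell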
>
> Hi Dev team,
>
> I was evaluating Apache Phoenix against the TPC-H data, based on the
> presentation given at Hadoop Summit in June stating that most TPC-H
> queries should run.
> Here are the setup details I have in my local environment:
>
> 1. One master node and 3 region servers with …