Re: GSOC 2018 Proposal review

2018-03-26 Thread Garima Natani
Hi Mark,

Thanks a lot for your feedback towards improving my proposal.
I have updated timeline as per changes suggested by you. Can you please
review it again?
Below is the link to shared google docs.
*https://docs.google.com/document/d/1W2N8zRXvylj5cxW8aKHvPDKFgj0Qk59UIJr9SPwFsa8/edit#
*


Regards,
Garima Natani
Mtech, First year
Geoinformatics and Natural Resources Engineering
Centre of Studies in Resources Engineering
Indian Institute of Technology, Bombay.
Have a nice day!!





On Tue, Mar 27, 2018 at 7:34 AM, Mark Wong  wrote:

> Hi Garima,
>
> On Tue, Mar 20, 2018 at 10:36:05PM +0530, Garima Natani wrote:
> > Hi All,
> >
> > I am looking forward to working with PostgreSQL organization In GSoC
> 2018.
>
> Glad to see you're interested in this project!
>
> > I have created Proposal for "Develop Performance Farm Database and
> Website"
> > Project. Please,
> >
> > Please, can you review it send comments?
>
> I think the general idea looks ok.  The biggest item I'd adjust is that
> the performance farm plants (to distinguish them from the build farm
> animals) shouldn't connect directly to the database.  Some of the django
> web app work needs to be moved up into phase one, because I'd propose
> the web app should also have an API that the build farm plants should
> use to transfer results.
>
> Good luck!
>
> Regards,
> Mark
>
> --
> Mark Wonghttp://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
>


Re: GSOC 2018 Proposal review

2018-03-26 Thread Mark Wong
Hi Garima,

On Tue, Mar 20, 2018 at 10:36:05PM +0530, Garima Natani wrote:
> Hi All,
> 
> I am looking forward to working with PostgreSQL organization In GSoC 2018.

Glad to see you're interested in this project!

> I have created Proposal for "Develop Performance Farm Database and Website"
> Project. Please,
> 
> Please, can you review it send comments?

I think the general idea looks ok.  The biggest item I'd adjust is that
the performance farm plants (to distinguish them from the build farm
animals) shouldn't connect directly to the database.  Some of the django
web app work needs to be moved up into phase one, because I'd propose
the web app should also have an API that the build farm plants should
use to transfer results.

Good luck!

Regards,
Mark

-- 
Mark Wonghttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services



Re: [GSoC 2018] Proposal Draft

2018-03-18 Thread Andrey Borodin
Hi Kefar!

> 18 марта 2018 г., в 5:34, Kefan Yang  написал(а):
> 
> I am Kefan Yang, a third-year Computing Science student from Simon Fraser 
> University, Canada. I am very interested in the sorting algorithm 
> benchmarking and implementation issue you mentioned on the idealist of Google 
> Summer of Code 2018. 
> 
> I am currently working on my proposal, but I am a little bit worried about 
> whether I am on the right track. I've encountered some problems in the 
> implementation of the benchmark of those sorting algorithms. I briefly 
> described my current ideas in the proposal draft, but they may change a bit 
> as I read more research papers. I sincerely hope to get some feedback from 
> you to improve my proposal, especially for the implementation part.
> 
> I've attached a proposal draft to this email. It's a brief one but I guess 
> it's enough to show my current ideas:)

Peter have already added some context, I'll comment a bit on contents of your 
proposal.

I expect implementation of both timsort (dual pivot qsort) and introsort 
(depth-limited qsort). MySQL recently switched to introsort (they just used STL 
version instead of his own qsort).
Sorting routines are important part of many DBMS parts. But, indeed, sorting of 
actual user data is related to qsort very distantly - often DBMS deals with 
external sorting.

There is built-in Postgres way to benchmark - pgbench performance. But I do not 
think that it's output will be statistically conclusive. I think that 
benchmarking will be at least as time consuming as implementation itself.

BTW I think it is good to add Postgres hash table implementation to this 
project too. Currently, we use Larson's linear hashing (HTAB struct, 
dynahash.c), which is somewhat seldom used nowadays. It would be good to tests 
cuckoo hashing and some open addressing against HTAB interface.

Please do not forget to add to your proposal review of other's patches. If you 
will produce some useful output in a patch form and you will want it committed 
you will have to review someone else patch. This is required by Postgres dev 
process.

Best regards, Andrey Borodin.

Re: [GSoC 2018] Proposal Draft

2018-03-17 Thread Peter Geoghegan
On Sat, Mar 17, 2018 at 7:00 PM, Kefan Yang  wrote:
> What I am trying to say here is that similar optimizations can be applied to
> novel algorithms or other implementations of quicksort.

A novel algorithm is something to avoid here, because novel techniques
tend to only work out for specific datasets and data distributions. In
general, we care about a wide variety of cases, and are very keen to
avoid regressions.

> The paper about "Dual-Pivot Quicksort" is really helpful and it has a Java
> implementation already. I can definitely make use of that.

Cool. Please be careful to pick source material that has a license
that is compatible with the PostgreSQL license.

> Also, I am wondering what's the normal approach to testing if a certain
> sorting implementation brings performance gain in this project?

It varies quite a bit. You can search the lists archives for some of
this. For example, Tomas Vondra often tests my sorting patches by
subjecting them to a variety of tests with varied distributions and
data types. His results are then collated in a spreadsheet for easy
analysis. Maybe you can replicate that.

The only obvious trend is that everything is tested using real SQL
statements (including CREATE INDEX statements). In the past,
enhancements that were successfully incorporated tended to either
obviously have little or no downside, or have a very significant
upside that made it worth talking about accepting a small regression
in a minority of less representative cases.

> More
> specifically,  You mentioned that little progress has been made with novel
> algorithmic enhancement. How can we get that conclusion?

That's simply been the trend. It isn't particularly likely that
Postgres will be a project that pioneers some kind of algorithmic
enhancement that has broad applicability, that could just as easily
benefit a general purpose library sort routine. If you're interested
in algorithmic enhancements in the sense that I understand the term,
then that's the high bar that would have to be met -- that would
amount to a new insight in an area that has already been researched in
enormous detail by many people, most of whom don't know much about
database systems in particular.

On the other hand, techniques like abbreviated keys work well because
they effectively exploit things like the first normal form, and the
fact that a high start up cost for the sort is already very likely. It
is a technique specifically tailored for a database system, and modern
hardware characteristics.

-- 
Peter Geoghegan



Re: [GSoC 2018] Proposal Draft

2018-03-17 Thread Kefan Yang
Thanks for your quick feedback!

"""
Industrial implementation of selected sorting algorithm:
The industrial version is basically an optimization based on the benchmark
implementation. I plan to use optimizations like checking if input
array is already sorted
or applying insertion sort directly for short arrays to see if the
performance can be
improved. I am still looking for other common optimization methods.

"""
What I am trying to say here is that similar optimizations can be applied
to novel algorithms or other implementations of quicksort.

The paper about "Dual-Pivot Quicksort" is really helpful and it has a Java
implementation already. I can definitely make use of that.

Also, I am wondering what's the normal approach to testing if a certain
sorting implementation brings performance gain in this project? More
specifically,  You mentioned that little progress has been made with novel
algorithmic enhancement. How can we get that conclusion?

I am still working on my proposal and I will post a new version soon:)

Regards,
Kefan


2018-03-17 18:05 GMT-07:00 Peter Geoghegan :

> On Sat, Mar 17, 2018 at 5:34 PM, Kefan Yang  wrote:
> > I am Kefan Yang, a third-year Computing Science student from Simon Fraser
> > University, Canada. I am very interested in the sorting algorithm
> > benchmarking and implementation issue you mentioned on the idealist of
> > Google Summer of Code 2018.
>
> > I've attached a proposal draft to this email. It's a brief one but I
> guess
> > it's enough to show my current ideas:)
>
> The proposal reads:
>
> """
> Industrial implementation of selected sorting algorithm:
> The industrial version is basically an optimization based on the benchmark
> implementation. I plan to use optimizations like checking if input
> array is already sorted
> or applying insertion sort directly for short arrays to see if the
> performance can be
> improved. I am still looking for other common optimization methods.
>
> """
>
> But our qsort implementation, which is the Bentley & McIlroy
> implementation with a couple of customizations, already does
> everything you mention (I refer to the precheck algorithmic
> customization, and the way that we check to see which side of a pivot
> to use real recursion with to avoid stack overflows). I suggest that
> you read the paper [1] -- the code that we use is almost directly
> lifted from the paper. The opaque sounding variable names are the
> same, and the C code from the paper is structured in exactly the same
> way.
>
> I think that this won't be a particularly easy project to get
> committed. I suggest that if you go forward with it that you
> specifically look into integrating "Dual-Pivot Quicksort" [2] as a
> whole cloth replacement for the B implementation. It seems like this
> has some chance of working out, because it's more or less acknowledged
> by Bentley himself to be a kind of successor to his own industrial
> quicksort implementation [3] -- it's derived from it. Note that Java 7
> uses dual pivot quicksort when sorting integers.
>
> In general, we've had a lot of success with optimizing sorting in the
> past few years by focusing on things like avoiding cache misses in
> comparators. There has been much less success with algorithmic
> improvements, and no success at all with novel algorithmic
> enhancements. PostgreSQL development just isn't a great way to do that
> sort of thing.
>
> BTW, I noticed that you go on to say this:
>
> """
> However,
> since disk operation is much expensive than in-memory sorting, I am
> not sure if we can
> observe a significant difference in this way.
>
> """
>
> I think that you'd be surprised at how often this isn't true these
> days, at least when sorting enough data for spilling to disk to be
> relevant. The main reason for that is that the cost of writing out
> runs increases linearly, and therefore eventually becomes very small
> compared to the costs of sorting itself, which increases at a
> linearithmic rate.
>
> [1] http://cs.fit.edu/~pkc/classes/writing/samples/
> bentley93engineering.pdf
> [2] http://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf
> [3] http://mail.openjdk.java.net/pipermail/core-libs-dev/2009-
> September/002630.html
> --
> Peter Geoghegan
>


Re: [GSoC 2018] Proposal Draft

2018-03-17 Thread Peter Geoghegan
On Sat, Mar 17, 2018 at 5:34 PM, Kefan Yang  wrote:
> I am Kefan Yang, a third-year Computing Science student from Simon Fraser
> University, Canada. I am very interested in the sorting algorithm
> benchmarking and implementation issue you mentioned on the idealist of
> Google Summer of Code 2018.

> I've attached a proposal draft to this email. It's a brief one but I guess
> it's enough to show my current ideas:)

The proposal reads:

"""
Industrial implementation of selected sorting algorithm:
The industrial version is basically an optimization based on the benchmark
implementation. I plan to use optimizations like checking if input
array is already sorted
or applying insertion sort directly for short arrays to see if the
performance can be
improved. I am still looking for other common optimization methods.

"""

But our qsort implementation, which is the Bentley & McIlroy
implementation with a couple of customizations, already does
everything you mention (I refer to the precheck algorithmic
customization, and the way that we check to see which side of a pivot
to use real recursion with to avoid stack overflows). I suggest that
you read the paper [1] -- the code that we use is almost directly
lifted from the paper. The opaque sounding variable names are the
same, and the C code from the paper is structured in exactly the same
way.

I think that this won't be a particularly easy project to get
committed. I suggest that if you go forward with it that you
specifically look into integrating "Dual-Pivot Quicksort" [2] as a
whole cloth replacement for the B implementation. It seems like this
has some chance of working out, because it's more or less acknowledged
by Bentley himself to be a kind of successor to his own industrial
quicksort implementation [3] -- it's derived from it. Note that Java 7
uses dual pivot quicksort when sorting integers.

In general, we've had a lot of success with optimizing sorting in the
past few years by focusing on things like avoiding cache misses in
comparators. There has been much less success with algorithmic
improvements, and no success at all with novel algorithmic
enhancements. PostgreSQL development just isn't a great way to do that
sort of thing.

BTW, I noticed that you go on to say this:

"""
However,
since disk operation is much expensive than in-memory sorting, I am
not sure if we can
observe a significant difference in this way.

"""

I think that you'd be surprised at how often this isn't true these
days, at least when sorting enough data for spilling to disk to be
relevant. The main reason for that is that the cost of writing out
runs increases linearly, and therefore eventually becomes very small
compared to the costs of sorting itself, which increases at a
linearithmic rate.

[1] http://cs.fit.edu/~pkc/classes/writing/samples/bentley93engineering.pdf
[2] http://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf
[3] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2009-September/002630.html
-- 
Peter Geoghegan



[GSoC 2018] Proposal Draft

2018-03-17 Thread Kefan Yang
Hi everyone,

I am Kefan Yang, a third-year Computing Science student from Simon Fraser
University, Canada. I am very interested in the *sorting algorithm
benchmarking and implementation* issue you mentioned on the idealist of
Google Summer of Code 2018.

I am currently working on my proposal, but I am a little bit worried about
whether I am on the right track. I've encountered some problems in the
implementation of the benchmark of those sorting algorithms. I briefly
described my current ideas in the proposal draft, but they may change a bit
as I read more research papers. I sincerely hope to get some feedback from
you to improve my proposal, especially for the implementation part.

I've attached a proposal draft to this email. It's a brief one but I guess
it's enough to show my current ideas:)

Sincerely,
Kefan Yang


proposal_draft.docx
Description: MS-Word 2007 document


Re: GSOC 2018 proposal

2018-03-12 Thread Aleksander Alekseev
Hello Charles,

> I am currently preparing a proposal for pg_thrift project. I noticed
> that there are several protocols supported by thrift, which ones do we
> have higher priority? I mean which ones I need to implement during
> this project?

Binary protocols, i.e. TBinaryProtocol and TCompactProtocol. The first
one is a bit faster but more redundant, the second one is slower but
more compact. It's your choice which one to implement first, but at
least one binary protocol should be fully supported (ideally - both).

As far as I'm aware other protocols are rarely used and are not fully
implemented in most existing libraries.

-- 
Best regards,
Aleksander Alekseev


signature.asc
Description: PGP signature


GSOC 2018 proposal

2018-03-11 Thread Charles Cui
Hi Aleksander,

   I am currently preparing a proposal for pg_thrift project. I noticed
that there are several protocols supported by thrift, which ones do we have
higher priority? I mean which ones I need to implement during this project?





Thanks, Charles.