Re: [HACKERS] GSOC - TOAST'ing in slices

2017-03-16 Thread George Papadrosou
Hello all, 

thank you for your replies.  I agree with Alexander Korotkov that it is 
important to have a quality patch at the end of the summer. 

Stephen, you mentioned PostGIS, but the conversation seems to lean towards 
JSONB. What are your thoughts?

Also, if I am to include some ideas/approaches in the proposal, it seems I 
should really focus on understanding how a specific data type is used, queried 
and indexed, which is a lot of exploring for a newcomer to the Postgres code.

In the meantime, I am trying to find out how jsonb is indexed and queried. Once I 
grasp the current situation I will be able to think about new approaches.

Regards,
George 

> On 15 Mar 2017, at 15:53, Tom Lane <t...@sss.pgh.pa.us> wrote:
> 
> Robert Haas <robertmh...@gmail.com> writes:
>> On Tue, Mar 14, 2017 at 10:03 PM, George Papadrosou
>> <gpapadro...@gmail.com> wrote:
>>> The project’s idea is to implement different slicing approaches according to
>>> the value’s datatype. For example a text field could be split upon character
>>> boundaries while a JSON document would be split in a way that allows fast
>>> access to its keys or values.
> 
>> Hmm.  So if you had a long text field containing multibyte characters,
>> and you split it after, say, every 1024 characters rather than after
>> every N bytes, then you could do substr() without detoasting the whole
>> field.  On the other hand, my guess is that you'd waste a fair amount
>> of space in the TOAST table, because it's unlikely that the chunks
>> would be exactly the right size to fill every page of the table
>> completely.  On balance it seems like you'd be worse off, because
>> substr() probably isn't all that common an operation.
> 
> Keep in mind also that slicing on "interesting" boundaries rather than
> with the current procrustean-bed approach could save you at most one or
> two chunk fetches per access.  So the upside seems limited.  Moreover,
> how are you going to know whether a given toast item has been stored
> according to your newfangled approach?  I doubt we're going to accept
> forcing a dump/reload for this.
> 
> IMO, the real problem here is to be able to predict which chunk(s) to
> fetch at all, and I'd suggest focusing on that part of the problem rather
> than changes to physical storage.  It's hard to see how to do anything
> very smart for text (except in the single-byte-encoding case, which is
> already solved).  But the JSONB format was designed with some thought
> to this issue, so you might be able to get some traction there.
> 
>   regards, tom lane
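
To make the "predict which chunk(s) to fetch" point concrete, here is a minimal
sketch of the fixed-size, byte-offset case (the single-byte-encoding situation
described above). It is an illustration only, not PostgreSQL's implementation:
the function name and the CHUNK_SIZE value are hypothetical, and the value is
assumed to be stored uncompressed in equally sized chunks.

```python
# Hypothetical chunk size in bytes; NOT PostgreSQL's actual constant.
CHUNK_SIZE = 2000


def chunks_for_byte_range(offset: int, length: int,
                          chunk_size: int = CHUNK_SIZE) -> range:
    """Return the chunk numbers that cover bytes [offset, offset + length).

    Assumes fixed-size chunks numbered from 0 and an uncompressed value,
    so a byte offset maps directly to a chunk number.
    """
    if offset < 0 or length <= 0 or chunk_size <= 0:
        raise ValueError("offset must be >= 0; length and chunk_size must be > 0")
    first = offset // chunk_size                  # chunk holding the first byte
    last = (offset + length - 1) // chunk_size    # chunk holding the last byte
    return range(first, last + 1)
```

With a chunk size of 1000, a 20-byte read starting at byte 990 touches chunks 0
and 1, while the same read starting at byte 0 touches only chunk 0. That is the
most a boundary-aware split could save for such a read. With a multibyte
encoding, a character offset does not map to a byte offset without scanning
the data, which is why this simple arithmetic does not carry over.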



Re: [HACKERS] GSOC - TOAST'ing in slices

2017-03-14 Thread George Papadrosou
Hello!

Thank you for your message. I was just about to send this email when I got 
yours. 

> I don't recall seeing an email from you about this yet?  My apologies if
> I missed it

My apologies for the inconvenience. I wish I could have started on this earlier, 
but there was so much coursework reaching its deadline.

I have prepared a very basic proposal for the TOAST project which I am 
attaching below.  You will notice that the proposal is too basic. I would 
appreciate some guidance on how we could dive more into the project’s details 
so I can elaborate more in the proposal.

Also, I haven’t considered the PostGIS project when thinking of toast’able data 
types, so I will study it a bit in the meantime. 

Please find the proposal draft below. 
Thanks!
George

Abstract

In PostgreSQL, a large field value is compressed and then stored in chunks in a 
separate table called the TOAST table [1]. Currently there is no indication of 
which piece of the original data ended up in which chunk of the TOAST table. If 
a subset of the value is needed, all of the chunks have to be fetched, 
reassembled and decompressed to recover the original value.

The project’s idea is to implement different slicing approaches according to the 
value’s datatype. For example a text field could be split upon character 
boundaries while a JSON document would be split in a way that allows fast 
access to its keys or values.
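
As a rough illustration of splitting text upon character boundaries, the sketch
below cuts a UTF-8 string into chunks of at most a given number of bytes without
ever cutting a multibyte character in half. The function name and the byte limit
are hypothetical and chosen for illustration; this is not how TOAST currently
stores data.

```python
def split_on_char_boundaries(text: str, max_chunk_bytes: int) -> list[bytes]:
    """Split UTF-8 encoded text into chunks of at most max_chunk_bytes bytes,
    each holding a whole number of characters."""
    if max_chunk_bytes < 4:
        # A UTF-8 character can take up to 4 bytes.
        raise ValueError("max_chunk_bytes must be at least 4")
    data = text.encode("utf-8")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_chunk_bytes, len(data))
        # If the cut lands on a continuation byte (bit pattern 10xxxxxx),
        # move it back to the start of that character.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end])
        start = end
    return chunks
```

For the string "αβγδε" (two bytes per character) and a 5-byte limit, the chunks
are 4, 4 and 2 bytes long. Every chunk decodes on its own, but most chunks are
shorter than the limit, so some space in each chunk goes unused.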

Benefits to the PostgreSQL Community

Knowing about the data that each chunk holds, we could keep important chunks 
closer to computations as well as store them in indices.

Project details

?

Deliverables 

- Implement “semantic” slicing for datatypes that support slicing into TOAST 
tables. These datatypes will be the Text, Array, JSON/JSONB and XML data types.

- Include the important chunks in the indices? (Not really sure about the data 
that indices contain at this time)

Timeline

- Until May 30: Study Postgres internals and on-disk data structures, review 
relevant code and algorithms used, define slicing approaches and agree on 
implementation details.

- Until June 26: Implement the slicing approaches for the Text, Array, 
JSON/JSONB and XML data types

- June 26 - 30: Student/Mentor evaluations period and safety buffer 

- Until July 24: Make indices take advantage of the new slicing approaches

- July 24 - 28: Student/Mentor evaluations period and safety buffer

- Until August 21: Improve testing and documentation

- August 21 - 29: Submit code and final evaluations

Bio 

Contact 
Name, email, phone etc




> On 15 Mar 2017, at 03:39, Stephen Frost <sfr...@snowman.net> wrote:
> 
> George,
> 
> * George Papadrosou (gpapadro...@gmail.com) wrote:
>> I understand your efforts and I am willing to back down. This is not the 
>> only project that appeals to me :)
> 
> Thank you very much for your willingness to adapt. :)
> 
>> Mr. Frost, Mr. Munro, thank you for your suggestions. I am now deciding between 
>> the TOAST’ing in slices and the predicate locking projects. I like the fact that 
>> the “toasting” project is related to on-disk data structures, so I will 
>> probably send you an email about that later today.
> 
> .  I have added Alexander Korotkov to the CC list as he was
> also listed as a possible mentor for TOAST'ing in slices.
> 
> As it relates to TOAST'ing in slices, it would be good to think through
> how we would represent and store the information about how a particular
> object has been split up.  Note that PostgreSQL is very extensible in
> its type system and therefore we would need a way for new data types
> which are added to the system to be able to define how data of that data
> type is to be split and a way to store the information they need to
> regarding such a split.
> 
> In particular, the PostGIS project adds multiple data types which are
> variable in length and often end up TOAST'd because they are large
> geospatial objects; anything we come up with for TOAST'ing in slices
> will need to be something that the PostGIS project could leverage.
> 
>> In general, I would like to undertake a project interesting enough and 
>> important for Postgres. Also, I could take into account if you favor one 
>> over another, so please let me know. I understand that these projects should 
>> be strictly defined to fit in the GSOC period, however the potential for 
>> future improvements or challenges is what drives and motivates me.
> 
> We are certainly very interested in having you continue on and work with
> the PostgreSQL community moving forward, though we do need to be sure to
> scope the project goals within the GSOC requirements.
> 
> Thanks!
> 
> Stephen



Re: [HACKERS] GSOC Introduction / Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions

2017-03-10 Thread George Papadrosou
Hi all and thank you for your quick replies.

> [two people interested in the same GSoC project]

Mr. Grittner, thank you for sharing this ahead of time.


Liu (is this your first name?),

> I have been concentrating on it for a long time, reading papers, reading 
> source codes, and discussing details with Mr Grittner.  

I understand your efforts and I am willing to back down. This is not the only 
project that appeals to me :)


Mr. Frost, Mr. Munro, thank you for your suggestions. I am now deciding between 
the TOAST’ing in slices and the predicate locking projects. I like the fact that 
the “toasting” project is related to on-disk data structures, so I will probably 
send you an email about that later today.

In general, I would like to undertake a project interesting enough and 
important for Postgres. Also, I could take into account if you favor one over 
another, so please let me know. I understand that these projects should be 
strictly defined to fit in the GSOC period, however the potential for future 
improvements or challenges is what drives and motivates me.

Thank you!
George




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] GSOC Introduction / Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions

2017-03-09 Thread George Papadrosou
Hello psql hackers,

My name is George Papadrosou. This is my first semester as a graduate student at 
Georgia Tech, and I would like to submit a proposal to Google Summer of Code for 
the project "Eliminate O(N^2) scaling from rw-conflict tracking in serializable 
transactions".

A short bio: I have a CS undergraduate degree from Athens University of 
Economics and Business. I took two database courses there. The first one 
covered SQL, relational algebra, XPath and generally using an RDBMS, while the 
second one was more about the internals, such as storage, indexing, query 
planning and transactions.

I have 3+ years of professional experience in web technologies with a focus on 
the backend, and I recently started my master's with a specialization in 
computing systems. One of my first courses, which I am finishing this May, is 
High Performance Computing (parallel algorithms), which seems to be closely 
related to this GSOC project.

I have not done any research on databases yet, but I regard this project as an 
opportunity to make initial contact with Postgres' internals before I dive 
deeper into database algorithms. My future goal is to work on databases full
time.

I am going to prepare a draft proposal for this project and share it with you 
soon. The project’s description is pretty clear. Do you think it should be more 
strictly defined in the proposal?

Until then, I would like to familiarize myself a bit with the codebase and fix 
a bug or todo item. I didn’t find many [E] marked tasks in the todo list, so the 
task I was thinking of is "\s without arguments (display history) fails with 
libedit, doesn't use pager either - psql \s not working on OSX". However, it 
works on my OSX El Capitan laptop with Postgres 9.4.4. Would you suggest some 
other starter task?

Looking forward to hearing from you soon.
Best regards,
George

