Re: Using Google code review?

2010-01-13 Thread Robert Newson
Consider Crucible? It integrates with Jira and is free for OSS projects.

http://www.atlassian.com/software/crucible/pricing.jsp

Crucible is free for use by official non-profit organisations,
charities and open source projects.

On Wed, Jan 13, 2010 at 12:53 PM, Noah Slater nsla...@apache.org wrote:
 Hey,

 Do you think there's any way to integrate Google code review with JIRA?

 Here's an example I just plucked from the front page:

        http://codereview.appspot.com/186119/show

 Thoughts?

 Noah



[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-13 Thread Paul Joseph Davis (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799769#action_12799769
 ] 

Paul Joseph Davis commented on COUCHDB-620:
---

The error reporting issue is that if you've got four docs in the pipeline and
the process dies, it's hard to tell which document caused the error. And
generally retrying will just cause another error.

 Generating views is extremely slow - makes CouchDB hard to use with 
 non-trivial number of docs
 --

 Key: COUCHDB-620
 URL: https://issues.apache.org/jira/browse/COUCHDB-620
 Project: CouchDB
  Issue Type: Improvement
  Components: Infrastructure
Affects Versions: 0.10
 Environment: Ubuntu 9.10 64 bit, CouchDB 0.10
Reporter: Roger Binns
Assignee: Damien Katz
 Attachments: pipelining.jpg


 Generating views is extremely slow.  For example adding 10 million documents
 takes less than 10 minutes, but generating some simple views on the same docs
 takes over 4 hours.

 Using top you can see that CouchDB (erlang) and couchjs between them cannot
 even saturate a single CPU, let alone the I/O system.  Under ideal conditions
 performance should be limited by CPU, disk or memory.  This implies that the
 processes are doing simple things in lockstep, accumulating latencies in each
 process and in the communication between them, which, multiplied by the
 number of documents, can amount to a lot.
 Some suggestions:
 * Run as many couchjs instances as there are processor cores and scatter work
 amongst them (see the sketch after this list)
 * Have some sort of pipelining in the erlang so that the moment the first 
 byte of response is received from couchjs the data is sent for the next 
 request (the JSON conversion, HTTP headers etc should all have been assembled 
 already) to reduce latencies.  Do whatever is most similar in couchjs (eg use 
 separate threads to read requests, process them and write responses).
 * Use the equivalent of HTTP pipelining when talking to couchjs so that it 
 always has a doc ready to work on rather than having to transmit an entire 
 response and then wait for erlang to think and provide an entire new request
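
 To make the first suggestion concrete, here is a minimal sketch (not
 CouchDB's actual code) of the classic Erlang scatter/gather idiom: spawn a
 worker per item and collect the results in order.

     %% Apply F to every element of L in parallel and return the results
     %% in the original order.  A real dispatcher would cap the number of
     %% workers at the number of couchjs instances rather than spawning
     %% one per item.
     pmap(F, L) ->
         Parent = self(),
         Pids = [spawn_link(fun() -> Parent ! {self(), F(X)} end) || X <- L],
         [receive {Pid, Result} -> Result end || Pid <- Pids].
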
 A simple test of success is to have a database with a million or so documents 
 with a trivial view and have view creation max out the CPU, memory or disk.
 Some things in CouchDB make this a particularly nasty problem.  View data is 
 not replicated so replicating documents can lead the view data by a large 
 margin on the recipient database.  This can lead to inconsistencies.  You 
 also can't expect users to then wait minutes (or hours) for a request to 
 complete because the view generation got that far behind.  (My own plans now 
 are to not use replication and instead create the database file on another 
 couchdb instance and then rsync the binary database file over instead!)
 Although stale=ok is available, you still have no idea if the response will 
 be quick or take however long view generation does.  (Sure I could add some 
 sort of timeout and complicate the code but then what value do I pick?  If I 
 have a user waiting I want an answer ASAP or I have to give them some 
 horrible error message.  Taking a long wait and then giving a timeout is even 
 worse!)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client

2010-01-13 Thread Paul Joseph Davis (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799829#action_12799829
 ] 

Paul Joseph Davis commented on COUCHDB-583:
---

Just some quick thoughts reading through the diff:

I'm not a fan of the file containing a list of compressible types. There are
too many types, which will just make that configuration hard to maintain. Not
to mention that exposing an entirely new API endpoint to work with those types
is also needlessly complex.

I'd prefer to see an automatic test that tries to compress the first 4K or so
of an attachment and uses a heuristic to determine whether it compressed enough
to justify compressing the entire attachment. If that's not doable, the
compressible type system should be integrated into the current configuration
mechanism.
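
A minimal sketch of that heuristic, assuming gzip via the stdlib zlib module
and an arbitrary 90% ratio threshold (both the sample size and the threshold
would need tuning):

    %% Compress a sample of the attachment; only gzip the whole thing
    %% if the sample shrank below the threshold.
    maybe_compress(<<>>) ->
        {false, <<>>};
    maybe_compress(Att) when is_binary(Att) ->
        SampleSize = lists:min([4096, byte_size(Att)]),
        <<Sample:SampleSize/binary, _/binary>> = Att,
        Ratio = byte_size(zlib:gzip(Sample)) / SampleSize,
        case Ratio < 0.9 of
            true  -> {true, zlib:gzip(Att)};  % worth it, store compressed
            false -> {false, Att}             % not worth it, store as-is
        end.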

For testing from Firefox it might be best to expose an "attachment is stored in
compressed form" attribute in the _attachments member.

Passing around the "Y" and "N" binaries as a flag for an attachment
being compressed is un-Erlangy. The true and false atoms would be better.
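
For example (the #att record and its field names are assumed here, purely for
illustration):

    -record(att, {data, compressed = false}).

    %% Idiomatic: a boolean atom, not a <<"Y">>/<<"N">> binary.
    is_compressed(#att{compressed = Compressed}) when is_boolean(Compressed) ->
        Compressed.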

Test code does not belong in couch_httpd.erl.

Is there something I'm missing on why we need to leak couch_util:gzip* 
functions into couch_httpd_db.erl instead of putting all of that logic into 
couch_stream.erl?

Is there nothing in mochiweb to handle accept-encoding parsing?
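
If not, the parsing is small enough to sketch with just the stdlib (this
ignores q-values, which a complete implementation would have to honour):

    %% Does an Accept-Encoding value such as "gzip, deflate;q=0.5"
    %% allow gzip? Each element is split off its parameters and trimmed.
    accepts_gzip(undefined) ->
        false;
    accepts_gzip(Header) ->
        Encodings = [string:strip(hd(string:tokens(Elem, ";")))
                     || Elem <- string:tokens(Header, ",")],
        lists:member("gzip", Encodings).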

Instead of naming tests test1 - test17 with comments above each test, just use
descriptive test names. It might help to group related tests as well so that
tests are easier to find.

Data in the etap tests shouldn't be stored inline when it's that big. Create
data files and use the test helpers to reference the filenames and read from
disk.

 storing attachments in compressed form and serving them in compressed form if 
 accepted by the client
 

 Key: COUCHDB-583
 URL: https://issues.apache.org/jira/browse/COUCHDB-583
 Project: CouchDB
  Issue Type: New Feature
  Components: Database Core, HTTP Interface
 Environment: CouchDB trunk
Reporter: Filipe Manana
 Attachments: couchdb-583-trunk-3rd-try.patch, 
 couchdb-583-trunk-4th-try-trunk.patch, couchdb-583-trunk-5th-try.patch, 
 couchdb-583-trunk-6th-try.patch, couchdb-583-trunk-7th-try.patch, 
 couchdb-583-trunk-8th-try.patch, couchdb-583-trunk-9th-try.patch, 
 jira-couchdb-583-1st-try-trunk.patch, jira-couchdb-583-2nd-try-trunk.patch


 This feature allows Couch to gzip compress attachments as they are being 
 received and store them in compressed form.
 When a client asks to download an attachment (e.g. GET
 somedb/somedoc/attachment.txt), the attachment is sent in compressed form if
 the client's http request has gzip specified as a valid transfer encoding for
 the response (using the http header Accept-Encoding). Otherwise couch
 decompresses the attachment before sending it back to the client.
 Attachments are compressed only if their MIME type matches one of those 
 listed in a separate config file. Compression level is also configurable in 
 the default.ini file.
 This follows Damien's suggestion from 30 November:
 "Perhaps we need a separate user editable ini file to specify compressable or
 non-compressable files (would probably be too big for the regular ini file).
 What do other web servers do?"
 Also, a potential optimization is to compress the file while writing to disk, 
 and serve the compressed bytes directly to clients that can handle it, and 
 decompressed for those that can't. For compressable types, it's a win for 
 both disk IO for reads and writes, and CPU on read.
 Patch attached.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Roger Binns (JIRA)
File format for views is space and time inefficient - use a better one
--

 Key: COUCHDB-623
 URL: https://issues.apache.org/jira/browse/COUCHDB-623
 Project: CouchDB
  Issue Type: Improvement
  Components: Database Core
Affects Versions: 0.10
Reporter: Roger Binns


This was discussed on the dev mailing list over the last few days and noted 
here so it isn't forgotten.

The main database file format is optimised for data integrity - not losing or 
mangling documents - and rightly so.

That same append-only format is also used for views, where it is a poor fit.
The more random the ordering of the data supplied, the larger the btree.  The
larger the keys (in bytes), the larger the btree.  As an example, my 2GB of raw
JSON data turns into a 3.9GB CouchDB database but a 27GB view file (before
compacting to 900MB).  Since views are not replicated, this requires a
disproportionate amount of disk space on each receiving server (not to mention
I/O load).  The format also affects view generation performance.  By loading my
documents into CouchDB ordered by the most commonly emitted view value I was
able to reduce load time from 75 minutes to 40 minutes, with the view file size
being 15GB instead of 27GB, but still very distant from the 900MB post
compaction.

Views are a performance enhancement.  They save you from having to visit every
document when doing some queries.  The data within a view is generated, and
hence the only consequence of losing view data is a performance one; the view
can be regenerated anyway.  Consequently the file format should be one that is
optimised for performance and size.  The only integrity feature needed is the
ability to tell that the view is potentially corrupt (eg the power failed
while it was being generated/updated).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: openid 1.1 authentication handler

2010-01-13 Thread Matteo Caprari
Hi.

As suggested by Chris on the user list, it could be interesting to
integrate the openid handler into the couch.

I guess before going any further there should be some discussion here
and I'll probably need some suggestions about code guidelines and
prepping the makefiles.

I should also mention that the goal is to have the couch work both as
openid client and endpoint.

cheers

On Wed, Jan 13, 2010 at 6:03 PM, Chris Anderson jch...@apache.org wrote:
 On Wed, Jan 13, 2010 at 9:20 AM, Matteo Caprari
 matteo.capr...@gmail.com wrote:
 Hi.

 I've released an authentication handler that adds support for
 authenticating with openid 1.1.
 It works but needs to be stressed a bit.

 Source and readme:
 http://github.com/mcaprari/couchdb-openid

 blogged (copied the readme):
 http://caprazzi.net/posts/openid-authentication-handler-for-couchdb/

 This looks really cool. If you want to work on getting it into CouchDB
 (might require some cleanup) you should bring it up on the dev list,
 and put the patch into Jira:

 http://issues.apache.org/jira/browse/COUCHDB

 Chris


 cheers
 --
 :Matteo Caprari
 matteo.capr...@gmail.com




 --
 Chris Anderson
 http://jchrisa.net
 http://couch.io




-- 
:Matteo Caprari
matteo.capr...@gmail.com


[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Chris Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799864#action_12799864
 ] 

Chris Anderson commented on COUCHDB-623:


It's worth noting that another advantage of using the storage btrees is the
MVCC guarantees. This means that a slow client can take its sweet time to
traverse the view index, and is not affected by ongoing writes or deletes.

This is crucial for the consistency guarantees views make.

It is not very hard to create alternate view index systems (like 
CouchDB-Lounge) and the overhead of running as an external is negligible. One 
fine way to prototype a view system that optimizes for different things would 
be as an external.
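
For reference, the _external protocol is line-based JSON over stdio, so a
prototype is tiny. A minimal sketch (request parsing elided; a real handler
would decode the request with something like mochijson2):

    %% Read one JSON request per line from stdin, write one JSON
    %% response per line to stdout, until CouchDB closes the pipe.
    loop() ->
        case io:get_line("") of
            eof ->
                ok;
            _RequestJson ->
                io:format("{\"code\": 200, \"json\": {\"ok\": true}}~n"),
                loop()
        end.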


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Roger Binns (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799878#action_12799878
 ] 

Roger Binns commented on COUCHDB-623:
-

What are the consistency guarantees that views make?  I can't find any 
documentation about it anywhere!  (There is plenty about the main db, but 
nothing about views.)

I can't see any guarantees that you can make, as the view data is derived from
the documents and the documents can be changed at any point.  For example,
while the first row of a view is being returned, the corresponding document
could have been deleted.  The slow client example can also lead to inconsistent
data - for example, a client may update a document on one connection, then
access the view on a second connection and, due to timing, end up with the
view not including that document.

The only consistency guarantee I can see is that if you do not
add/change/delete documents from shortly before view retrieval until the view
is completely retrieved, then the view will reflect the documents correctly at
that time.  If there is any form of concurrency between document changes and
view reads then there cannot be guarantees unless CouchDB introduced a
transaction system.

I do see how the append-only btree/MVCC format makes the view retrieval code
easier to write, but users of CouchDB do not care how hard the code is to write
:-)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Paul Joseph Davis (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799891#action_12799891
 ] 

Paul Joseph Davis commented on COUCHDB-623:
---

The consistency guarantee refers to the file format: it provides on-disk
consistency guarantees the same as the main database file does (ie, tail-append
MVCC style). It's not a reference to figuring out the sync between the main db
and the view. As you point out, doing things like querying with stale=ok can
give you a view result that does not reflect the most recent changes to the
database, or reflects changes from other clients, etc etc.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Adam Kocoloski (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799896#action_12799896
 ] 

Adam Kocoloski commented on COUCHDB-623:


I believe by consistency guarantees Chris meant that a view request uses a 
single snapshot of the view index for the entire response.  Even if documents 
are changed in the interim, and even if someone else has triggered a view 
update, your response will still accurately reflect the state of the DB at a 
single moment in time.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Chris Anderson
On Wed, Jan 13, 2010 at 11:34 AM, Adam Kocoloski (JIRA) j...@apache.org wrote:

    [ 
 https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799896#action_12799896
  ]

 Adam Kocoloski commented on COUCHDB-623:
 

 I believe by consistency guarantees Chris meant that a view request uses a 
 single snapshot of the view index for the entire response.  Even if documents 
 are changed in the interim, and even if someone else has triggered a view 
 update, your response will still accurately reflect the state of the DB at a 
 single moment in time.

Thanks Adam, that's exactly what I'm talking about.







-- 
Chris Anderson
http://jchrisa.net
http://couch.io


[jira] Closed: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Damien Katz (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Katz closed COUCHDB-623.
---

Resolution: Invalid
  Assignee: Damien Katz

Closing as Invalid; this has no objective criteria for being resolved.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Created: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-13 Thread Damien Katz
Let's have this discussion on the dev mailing list. That's what it's for.

-Damien


 



[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Roger Binns (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799916#action_12799916
 ] 

Roger Binns commented on COUCHDB-623:
-

Not again Damien :-)

Simple criteria - the size of the view file should be proportionate to the
data in a view on initial generation.  If you want raw numbers, the view file
should be no larger than double the sum of the JSON-encoded key, value and _id
for each row.
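
Expressed as code, the proposed bound is just (a sketch; rows are assumed to
be already JSON-encoded binaries):

    %% Upper bound on an acceptable view file size for a set of rows.
    max_view_size(Rows) ->
        2 * lists:sum([byte_size(Id) + byte_size(Key) + byte_size(Value)
                       || {Id, Key, Value} <- Rows]).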

The current multiplier is 15 to 27 times as much, which is ludicrous.  Even
post compaction the file is a little on the large side.  And because the view
results are not replicated, the overhead has to be incurred on every machine
that replication happens to.

Or put another way: if people are planning on deploying CouchDB, how much
space would you advise them to provision?

When I started, the answer for 10 million documents/2.5GB of raw JSON was 72GB:

  23GB for DB, another 21GB for the compacted version, 27+GB for view file, 
another 1+GB for compacted view file

By shortening ids to 4 bytes instead of 16 we get:

  4GB for DB, another 4GB for compacted, 27GB for view file, another 1GB for 
compacted view file

By being able to sort my documents to be ordered by the most commonly emitted 
view key:
 
  4GB for DB, another 4GB for compacted, 15GB for view file, another 1GB for 
compacted view file

Since the view/DB coexists with its compacted copy during compaction, you need
space for both simultaneously.  10 million documents/2GB of data is not
something that makes any existing database system sweat.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client

2010-01-13 Thread Filipe Manana (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799922#action_12799922
 ] 

Filipe Manana commented on COUCHDB-583:
---

Hi Paul,

thanks for your feedback.

"Passing around the Y and N binaries as a flag for an attachment
being compressed is un-erlangy. true and false atoms would be better."

Well, this was mostly because I read somewhere in Armstrong's book that 
binaries are preferred (more efficient) for IO operations (network, disk 
storage). But I agree, using true / false atoms is more readable.

"Is there nothing in mochiweb to handle accept-encoding parsing?"

I don't think so, at least not in the mochiweb included with couch. It's
probably better to move these accept-encoding parsing functions, and the
respective test functions, into the mochiweb sources.

I'll get back to work and enhance the patch following your remarks.

cheers


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client

2010-01-13 Thread Damien Katz (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799937#action_12799937
 ] 

Damien Katz commented on COUCHDB-583:
-

I haven't looked at the patch, but I agree with most of Paul's comments, except
for figuring out when to compress files. Lots of compressed files might have
uncompressed headers in the file, leading to unnecessary compression. MP3s with
id3v2 tags immediately come to mind.
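
If the heuristic route is taken anyway, sampling from past the head of the
file would sidestep that (a sketch; the 64KB skip is an arbitrary assumption,
and id3v2 tags can be larger):

    %% Take the 4KB after the first 64KB when the attachment is large
    %% enough, so uncompressed headers don't skew the test.
    sample(Bin) when byte_size(Bin) >= 69632 ->
        <<_:65536/binary, Sample:4096/binary, _/binary>> = Bin,
        Sample;
    sample(Bin) ->
        Bin.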


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client

2010-01-13 Thread Filipe Manana (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799947#action_12799947
 ] 

Filipe Manana commented on COUCHDB-583:
---

Hum, 

Let's open a vote :)

1) use a heuristic, as suggested by Paul

2) or a file listing the mime types worth compressing

3) some other alternative?

cheers


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client

2010-01-13 Thread Paul Joseph Davis (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799948#action_12799948
 ] 

Paul Joseph Davis commented on COUCHDB-583:
---

Hrm, 4KiB of headers even? That is a good point though. But I'd still be quite
hesitant to make it a whitelist of content types to compress. Unless maybe we
allowed text/* or similar. Or perhaps it should be a blacklist that could do
the * match?
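
The * match itself is only a few lines (a sketch; the pattern syntax is an
assumption):

    %% A trailing-* glob: "audio/*" matches "audio/mpeg"; anything
    %% without a * must match exactly.
    mime_matches("*", _Type) ->
        true;
    mime_matches(Pattern, Type) ->
        case lists:reverse(Pattern) of
            "*/" ++ RevPrefix ->
                lists:prefix(lists:reverse(RevPrefix) ++ "/", Type);
            _ ->
                Pattern =:= Type
        end.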


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Roger Binns (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799965#action_12799965
 ] 

Roger Binns commented on COUCHDB-623:
-

The view consistency stuff is a red herring.  If you are not making changes to 
the DB then any file format will work and give consistent results.

If you are making changes to the docs then no scheme short of 
transactions/locking will ensure that the view is consistent with the 
documents.  It will always be possible for documents to be referenced by the 
view that are not in the DB and for documents to be in the DB that are not in 
the view.  I see no point in trying to even make the view consistent with a 
point in time while DB changes are happening since it gives no performance 
efficiency nor any space efficiency - in fact the extreme opposites.

The point of views is to give me information fast that I could only otherwise 
obtain by visiting all the documents.  That is what they should be optimized 
for.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Chris Anderson
On Wed, Jan 13, 2010 at 2:11 PM, Roger Binns (JIRA) j...@apache.org wrote:

    [ 
 https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799965#action_12799965
  ]

 Roger Binns commented on COUCHDB-623:
 -

 The view consistency stuff is a red herring.  If you are not making changes 
 to the DB then any file format will work and give consistent results.

 If you are making changes to the docs then no scheme short of 
 transactions/locking will ensure that the view is consistent with the 
 documents.  It will always be possible for documents to be referenced by the 
 view that are not in the DB and for documents to be in the DB that are not in 
 the view.  I see no point in trying to even make the view consistent with a 
 point in time while DB changes are happening since it gives no performance 
 efficiency nor any space efficiency - in fact the extreme opposites.

 The point of views is to give me information fast that I could only otherwise 
 obtain by visiting all the documents.  That is what they should be optimized 
 for.

The current views are optimized for your red herring. Where it actually
matters is the ability to give transactional information about things like
bank account balances.

see: http://books.couchdb.org/relax/reference/recipes for the Banking recipe

without MVCC views, there's no way to query accurately at all when
inserts are underway (short of blocking reads during writes).

If you need something with less consistency, you are encouraged to
wrap your own indexing system around couchdb's map reduce runtime, or
even build your own runtime.

has anyone used Hadoop as an external yet?

Chris








-- 
Chris Anderson
http://jchrisa.net
http://couch.io


Re: [jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Chris Anderson wrote:
 see: http://books.couchdb.org/relax/reference/recipes for Banking
 
 without MVCC views, there's no way to query accurately at all when
 inserts are underway (short of blocking reads during writes).

I am afraid I do not understand what you are saying.  Sure the scheme listed
in the book makes sense, but only if a transaction maps exactly to one
document (which I guess is the point).  Even then I still don't see the
relevance.  Things would only break down if the view returned partial
information (eg if a single document caused two view rows to be emitted but
only one of those was returned).  BTW views do not return the update_seq, so
as an end user you still do not know how up to date it is.

The file format does not need to protect each view row, but does need to do
so for the main database where the unit is a document.

For example the view file format could use an atomic unit of 10,000
documents' view output or some number of megabytes.  That unit can still be
regenerated if something bad happens (a rare circumstance such as an untimely
power failure).

 If you need something with less consistency, you are encouraged to
 wrap your own indexing system around couchdb's map reduce runtime, or
 even build your own runtime.

I am becoming very tempted to just dump CouchDB for SQLite with a trivial
REST front end, since it appears that CouchDB is just not capable of handling
10 million documents/2GB of data in anything resembling a sensible amount of
disk space or compute time for the foreseeable future.

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAktOTyUACgkQmOOfHg372QRoZwCgqMCpYfZT3aHYXGMfqfMzXpk6
1UIAoN+CV+wtsyOW8Ndiq7c/qM5Qt4+Y
=7gg2
-END PGP SIGNATURE-



Objective criteria

2010-01-13 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

[I wrote a personal query to Damien which he asked me to repeat here.]

Both 620 and 623 were closed by Damien because they lacked objective
criteria.  For both tickets there are criteria I consider objective :-)
There are also suggestions on how to address the issues.

Consequently my query is: if the criteria are not objective enough, does
Damien not agree with them, not care about the underlying issues, think that a
10 million document/2GB raw JSON data set is outside the scope of what CouchDB
should cope with, want this stuff in the wiki, etc?

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAktOUP4ACgkQmOOfHg372QS/hQCfSP9Edy+wrZRFwItFmDD3mNcN
yyIAn2z9XvJigm2xKk/r4CgAUqZp1t/i
=JMLG
-END PGP SIGNATURE-



Re: [jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Chris Anderson
On Wed, Jan 13, 2010 at 2:54 PM, Roger Binns rog...@rogerbinns.com wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Chris Anderson wrote:
 see: http://books.couchdb.org/relax/reference/recipes for Banking

 without MVCC views, there's no way to query accurately at all when
 inserts are underway (short of blocking reads during writes).

 I am afraid I do not understand what you are saying.  Sure the scheme listed
 in the book makes sense, but only if a transaction maps exactly to one
 document (which I guess is the point).  Even then I still don't see the
 relevance.  Things would only break down if the view returned partial
 information (eg if a single document caused two view rows to be emitted but
 only one of those was returned).  BTW views do not return the update_seq, so
 as an end user you still do not know how up to date it is.

If that would help, I think there are people working on an update_seq
patch for views.


 The file format does not need to protect each view row, but does need to do
 so for the main database where the unit is a document.

A reduce giving a balance for a particular account could be affected
by documents being inserted anywhere in the db. The current map reduce
system guarantees that the balance returned reflects a consistent
snapshot of the database, even if other operations are ongoing. (eg a
given transfer will appear consistently, even if those same accounts
are undergoing concurrent operations for other transfers.)
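
As a concrete sketch (hypothetical names again - a "bank" db whose
"balance" view maps each transfer to signed amounts per account with a sum
reduce), a reader asking for balances sees one consistent snapshot of the
index:

import json, urllib.request

# Query the (hypothetical) balance view with group=true to collapse the
# emitted rows into one balance per account.  The whole response is
# computed against a single snapshot of the view index, so a transfer's
# debit and credit rows are either both included or both absent.
url = ("http://localhost:5984/bank/_design/accounts"
       "/_view/balance?group=true")
with urllib.request.urlopen(url) as resp:
    for row in json.load(resp)["rows"]:
        print(row["key"], row["value"])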

We can't atomically prevent overdrafts, but how many banks do that, anyway?


 For example the view file format could use an atomic unit of 10,000
 documents' view output or some number of megabytes.  That unit can still be
 regenerated if something bad happens (a rare circumstance such as an
 untimely power failure).

There are alternate storage systems which either use locking or make no
consistency guarantees at all.


 If you need something with less consistency, you are encouraged to
 wrap your own indexing system around couchdb's map reduce runtime, or
 even build your own runtime.

 I am becoming very tempted to just dump CouchDB for SQLite with a trivial
 REST front end, since it appears that CouchDB is just not capable of
 handling 10 million documents/2GB of data in anything resembling a sensible
 amount of disk space or compute time for the foreseeable future.


It sounds like Couch does just fine if you run compaction. Perhaps we
should recommend view compaction more aggressively.
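
Something along these lines, for instance (hypothetical db and design doc
names; the request returns immediately and compaction runs in the
background):

import http.client

# Trigger view compaction for design doc "accounts" in database "bank"
# (hypothetical names).  The server answers 202 Accepted and compacts
# the view file in the background.
conn = http.client.HTTPConnection("localhost", 5984)
conn.request("POST", "/bank/_compact/accounts", "",
             {"Content-Type": "application/json"})
resp = conn.getresponse()
print(resp.status, resp.read())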

Chris

 Roger





-- 
Chris Anderson
http://jchrisa.net
http://couch.io


Re: Objective criteria

2010-01-13 Thread Roger Binns

Chris Anderson wrote:
 The ticketing system should be for smaller scope issues, I think.

I see it more as a "don't forget about this" plus somewhere for others to
say "this also affects me" or "here is additional information/angles".
Obviously there is a fine line between that kind of thing and a discussion.
My big concern is that the issue was hashed out here over a few days, then
the thread goes dead, and the issue is forgotten.  A JIRA report of open
issues should be a todo list of bugs to fix and improvements to make.

 Optimizing the view server is an agreed goal of the community.

Maybe in people's heads, but it wasn't written down anywhere such as the
tracker or the roadmap.  In fact the front page of couchdb.org claims that
Erlang allows for the CouchDB design to be scalable, and the overview page
makes an "efficient" claim in its last sentence.  The current implementation
is neither of these.

 Probably the best way to help is to take a look at all the work
 Damien's done in trunk (the pipelining) and perhaps the parallel
 writers optimization he has. 

BTW I have been using trunk for over a week.  It is better than the 0.10 I
was using before, but not that much of an improvement.  And changes in the
way I generate some of my data have hurt me again (I can order either by
_id or by view keys, but not both at the same time), so my initial DB has
now gone from 4GB to 15GB (I optimized for views).
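
For illustration, this is the kind of loader I mean (a sketch with a
hypothetical doc shape, not my real code) - sorting by _id keeps the by-id
B-tree appends mostly sequential, but says nothing about the order the view
engine will see the derived keys in:

import json, http.client

# Hypothetical bulk load: zero-padded sequential _ids make inserts into
# the by-id B-tree mostly append-only, but the view keys derived from
# these docs can still arrive in an order that bloats the view file.
docs = [{"_id": "%010d" % i, "key": i % 97} for i in range(10000)]
docs.sort(key=lambda d: d["_id"])  # a no-op here, but the point in general

conn = http.client.HTTPConnection("localhost", 5984)
conn.request("POST", "/mydb/_bulk_docs", json.dumps({"docs": docs}),
             {"Content-Type": "application/json"})
print(conn.getresponse().status)  # expect 201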

 We could really use a way to take the
 benchmarks you ran, and put them into the buildbot.

Sadly I can't do it with my real data because it belongs to someone else.
However I hereby commit to produce a representative benchmark that is
substantially similar in performance and data within the next two weeks.
(Also note that there is nothing special about what I am doing - anyone with
similar numbers of documents has similar issues.)

I'm hoping that more can be done about the size issues soon too.  (I think
that addressing the size issues will help a lot since it will require way
less CPU and I/O to produce and use smaller files.)

Roger



Re: Objective criteria

2010-01-13 Thread Mikeal Rogers
I have been putting together some stuff that seems pertinent to this
discussion.

I'm working on a performance suite that tests a variety of concurrent
performance scenarios.

I have the client code written but I'm still working on the automated
build/test code. Once that is finished I plan to do some GitHub integration
and some charting.

The idea here is to chart the performance differences between a GitHub
branch at a certain commit compared to the performance of the latest release
and the latest trunk.

If someone has an idea of how they might increase performance they could
point this tool at their GitHub branch and reference the differences in
performance between their code and the latest release and trunk.
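
The client side is not much more than this kind of thing - a stripped-down
sketch (hypothetical db name and doc shape, far simpler than the real
suite) that times a batch of concurrent writers:

import json, time, threading, urllib.request

# Minimal sketch of one scenario: 10 concurrent writers, 100 docs each,
# against a (hypothetical) "bench" database that already exists.
DB = "http://localhost:5984/bench"

def writer(worker, count):
    for i in range(count):
        doc = json.dumps({"worker": worker, "seq": i}).encode()
        req = urllib.request.Request(
            DB, data=doc, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()

threads = [threading.Thread(target=writer, args=(w, 100)) for w in range(10)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print("1000 docs in %.2fs (%.0f docs/s)" % (elapsed, 1000 / elapsed))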

I'll send another email once I have some pretty graphs to show off :)

-Mikeal

On Wed, Jan 13, 2010 at 5:19 PM, Damien Katz dam...@apache.org wrote:


 On Jan 13, 2010, at 3:37 PM, Roger Binns wrote:

 
  Chris Anderson wrote:
  The ticketing system should be for smaller scope issues, I think.
 
  I see it more as a "don't forget about this" plus somewhere for others to
  say "this also affects me" or "here is additional information/angles".
  Obviously there is a fine line between that kind of thing and a discussion.
  My big concern is that the issue was hashed out here over a few days, then
  the thread goes dead, and the issue is forgotten.  A JIRA report of open
  issues should be a todo list of bugs to fix and improvements to make.

 That's fine, the issue is that bugs saying "it's too slow" are always true
 for someone.  Many people find the view indexing performance just fine;
 many do not.

 Since CouchDB makes no performance or size guarantees, you can't call the
 general performance a bug.  Unless you have a specific bug to fix, or an
 enhancement to make, don't use JIRA.  Use the dev mailing list to make your
 case, and see if someone will produce a patch or find a measurable
 bottleneck that can be addressed.  JIRA is not the place for discussions
 about the design of CouchDB components.  Neither is IRC.

 Also, if you want something without view performance problems but similar
 to CouchDB, you should look at MongoDB.

 -Damien


 
  Optimizing the view server is an agreed goal of the community.
 
  Maybe in people's heads, but it wasn't written down anywhere such as the
  tracker or the roadmap.  In fact the front page of couchdb.org claims that
  Erlang allows for the CouchDB design to be scalable, and the overview page
  makes an "efficient" claim in its last sentence.  The current
  implementation is neither of these.
 
  Probably the best way to help is to take a look at all the work
  Damien's done in trunk (the pipelining) and perhaps the parallel
  writers optimization he has.
 
  BTW I have been using trunk for over a week.  It is better than the 0.10 I
  was using before, but not that much of an improvement.  And changes in the
  way I generate some of my data have hurt me again (I can order either by
  _id or by view keys, but not both at the same time), so my initial DB has
  now gone from 4GB to 15GB (I optimized for views).
 
  We could really use a way to take the
  benchmarks you ran, and put them into the buildbot.
 
  Sadly I can't do it with my real data because it belongs to someone else.
  However I hereby commit to produce a representative benchmark that is
  substantially similar in performance and data within the next two weeks.
  (Also note that there is nothing special about what I am doing - anyone
  with similar numbers of documents has similar issues.)
 
  I'm hoping that more can be done about the size issues soon too.  (I think
  that addressing the size issues will help a lot since it will require way
  less CPU and I/O to produce and use smaller files.)
 
  Roger
 




Re: Objective criteria

2010-01-13 Thread Roger Binns

Damien Katz wrote:
 That's fine, the issue is that bugs saying "it's too slow" are always true
 for someone.

I did give specific numbers - ie 10 million documents, 2GB of JSON data etc
and the amount of time taken as well as space.  I doubt you'd find anyone
that considers 4 hours or 27GB to be reasonable numbers for that :-)

 Many people find the view indexing performance just fine; many do not.

True.  For example one of my other projects currently has 100 documents and
I have no issue with any part of CouchDB for that.

What isn't clear is a statement of reasonable expectations - should I be
able to handle 10 million documents in CouchDB?  Will it ever handle that?
Do you as the project leader care about that?  Everything has a sweet spot,
and I am not asking you to make 10 million documents be encompassed by the
sweet spot, but clearly if you never intend for CouchDB to handle that much
data then I need to go elsewhere.

 Since CouchDB makes no performance or size guarantees,

How about publishing some?  Not guarantees, but rather some expectations.
For example if someone has 1GB of JSON data in 1 million documents, what
would be an expectation of size?  The bugs can then be about substantial
divergences from that.

 Unless you have a specific bug to fix, or an enhancement to make, don't use JIRA.

The issues you closed listed specific enhancements (pipelining, multiple
instances, a different file format etc).  I do acknowledge that I didn't
supply code, but I can't do everything :-)  All my personal projects are
open source - it isn't like I am trying to take and never give.

 Also, if you want something without view performance problems but similar to 
 CouchDB, you should look at MongoDB.

I did research the alternatives I could find.  CouchDB is the only solution
that was designed for replication (and hence offline working, occasional
disconnection, any topology for replication etc).  CouchDB is also the only
one that allows for indices/views on data that is calculated rather than
just extracting a particular value statically from the docs.  (That can be
worked around by calculating values and shoving them into the docs, but it
is less elegant.)

Other than that, MongoDB seemed to be the nicest.  But I really want
CouchDB to take over the world.  The concepts are right.  The replication
point of view is right, etc.  Its not handling millions of documents in a
reasonable amount of space and time is not right IMHO, but I still don't
know what the project opinion is.

Roger



Re: Objective criteria

2010-01-13 Thread Roger Binns

Chris Anderson wrote:
 I know your data is on the large side and CouchDB doesn't auto-cluster

Ah, the clue!  I don't think my data is large by any measure (10 million
docs, 2GB of JSON).  SQLite (note *lite* in the name) doesn't break a
sweat.  It only occupies 40% of a DVD.  Picking a random low end machine
from Dell shows that they ship with a minimum of 1GB of RAM and ideally
want you to buy 2GB.  Something that fits in the RAM of a $350 machine from
Dell is not what I would consider large!  The data fits in my machine's RAM
4 times over.  Can you even buy USB sticks or SD cards these days smaller
than 2GB?  You could fit 15 copies of my data and an operating system in
the smallest SSD drives.

GMail's initial quota however many years ago was 1GB.  Keith Packard's email
is half a million messages but 5GB of data - http://keithp.com/blogs/notmuch/

My machine has 350,000 files and directories (excluding backups, which
duplicate many of those multiple times over).  This is a similar order of
magnitude to my data set (and several times larger if counting backups).
(Note I am just talking about what you would get if you constructed a
database of file and directory names, information about them etc - not the
contents.)

My deployment plans are the opposite of clustering and partitioning, as my
data set is so small!  I wanted to put a copy of CouchDB on each and every
server and have them replicate to each other, rather than use dedicated
networked data servers.
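
ie on each box, a small cron job doing something like this for every peer
(hypothetical host names; one-shot replication shown, which blocks until it
completes):

import json, http.client

# Hypothetical mesh: each server pulls the "mydb" database from every
# peer.  One-shot replication; run it periodically from each box.
PEERS = ["server-a.example.com", "server-b.example.com"]

for peer in PEERS:
    conn = http.client.HTTPConnection("localhost", 5984)
    body = json.dumps({"source": "http://%s:5984/mydb" % peer,
                       "target": "mydb"})
    conn.request("POST", "/_replicate", body,
                 {"Content-Type": "application/json"})
    resp = conn.getresponse()
    print(peer, resp.status)
    resp.read()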

If this kind of (trivial!) size means clustering, partitioning etc then
CouchDB is not remotely appropriate for my circumstances, and probably not
for people recording file and email databases.  I only wish there was
documentation somewhere saying what normal sizes are for CouchDB and what
the expectations for them are.

Roger