Re: [v8-users] Streams and InternalPackedArray queue scalability

2017-01-19 Thread 'Adam Rice' via v8-users
Thanks for the detailed response! It makes sense of all the stuff I was 
seeing.

It's too early to say whether small chunk sizes will see a significant 
amount of use or not. I'd like to support them anyway.

Thank you for the cursor idea. That will give us protection against the 
possible removal of left-trimming from V8 in the future.

On Wednesday, 18 January 2017 19:26:53 UTC+9, Jakob Kummerow wrote:
>
> Interesting case :-)
>
> The implementation of InternalPackedArray is essentially the same as plain 
> JavaScript "Array"; the biggest difference is that InternalPackedArray is 
> not exposed to websites and therefore not monkey-patchable.
> Array.shift() is inherently slow, because it requires all remaining 
> elements to be moved. (Array.push()/Array.pop() are much faster.)
> V8 uses a nasty trick to make Array.shift() fast: object left-trimming, 
> where the object header is moved, so that the elements themselves can stay 
> in place. This is an ugly hack and we would like to get rid of it.
> Left-trimming is not possible in "large object space", a special area of 
> V8's heap for large objects. The limit for what constitutes a "large 
> object" is the regular heap's page size, which is 512KB.
> Arrays over-allocate their backing store in steps; when you start with 
> "new Array()" and keep push()ing onto it, the sequence of backing store 
> capacities is 4, 23, 52, 95, 160, 257, 403, 622, 950, 1442, 2180, 3287, 
> 4948, 7439, 11176, 16781, 25189, 37801, *56719*, 85096, 127661, ...
> So when the 56720th element is pushed, the backing store is grown to 85096 
> * 8 bytes = 680768 bytes > 512KB, so the new backing store must be 
> allocated in large-object space. Every subsequent shift() on that array is 
> an O(n) operation.
> The threshold is four times as high on your Android tablet because we ship 
> 32-bit Chrome builds to Android (so twice as many pointers fit in the same 
> space), and older V8 versions use 1MB pages (so regular pages can hold 
> objects twice as large).
>
> Long story short:
> • your patch looks fine
> • worrying about deopts is probably unnecessary, but only measuring will 
> tell for sure
> • then again switching between two implementations is probably also 
> unnecessary (encapsulating the behavior you want into one class is probably 
> cleaner)
> • you can split at 32768, sure, or at any random integer between 1 and 
> 5 (I wouldn't rely on the specific numbers I quoted above, because 
> that's an implementation detail and could change at any time)
> • ideally you don't use Array.shift() at all, and instead have a "cursor" 
> index for the next element to be consumed; drop the Queue's "front" when 
> cursor === front.length (this would probably work well with a chunk size in 
> the [100..1000] range or so)
> • if the chunks you're queuing are indeed typically 1K or bigger, then you 
> probably don't need to bother with arrays at all, and can just use a plain 
> simple linked list (because a few pointers of overhead per element are 
> negligible if they give you guaranteed O(1) push/shift performance 
> regardless of queue length), like so:
>
> class Queue {
>   constructor() {
>     this.size = 0;
>     this.front = null;  // head node; next element to be shifted
>     this.back = null;   // tail node; most recently pushed element
>   }
>
>   get length() { return this.size; }
>
>   push(element) {
>     ++this.size;
>     if (this.size === 1) {
>       this.front = this.back = { value: element, next: null };
>     } else {
>       var node = { value: element, next: null };
>       this.back.next = node;
>       this.back = node;
>     }
>   }
>
>   shift() {
>     // Assumes the queue is non-empty; callers check length first.
>     --this.size;
>     var result = this.front.value;
>     this.front = this.front.next;
>     if (this.front === null) this.back = null;
>     return result;
>   }
> }
>
> On Wed, Jan 18, 2017 at 8:24 AM, Jochen Eisinger wrote:
>
>> +Domenic Denicola  
>>
>> On Wed, Jan 18, 2017 at 4:25 AM Adam Rice wrote:
>>
>>> I work on the Chrome implementation of ReadableStream and WritableStream. 
>>> They are implemented in JavaScript using V8 extras.
>>>
>>> They currently use an InternalPackedArray to implement a queue 
>>> structure, however I have found a scalability issue. The benchmarks and 
>>> repro can be found at http://crbug.com/681493 ("ReadableStream, 
>>> WritableStream get dramatically slower when queue grows to 56720 chunks").
>>>
>>> I have a proposed fix at http://crrev.com/2637863002. If possible, I 
>>> would like someone who is familiar with the implementation of 
>>> InternalPackedArray to review it.
>>>
>>> Particular things I'd like to know:
>>>
>>>- Is it worth worrying about deopt or would it be better to use 
>>>InternalPackedArray for small queues and only switch to Queue for larger 
>>>ones?
>>>- Is 32768 a good size to split the arrays at? Can we be reasonably 
>>>sure that it is small enough to get good behaviour on all platforms and 
>>>architectures?
>>>
>>> and if possible
>>>
>>>- why does performance change so dramatically at a threshold?

Re: [v8-users] Streams and InternalPackedArray queue scalability

2017-01-18 Thread Jakob Kummerow
Interesting case :-)

The implementation of InternalPackedArray is essentially the same as plain
JavaScript "Array"; the biggest difference is that InternalPackedArray is
not exposed to websites and therefore not monkey-patchable.
Array.shift() is inherently slow, because it requires all remaining
elements to be moved. (Array.push()/Array.pop() are much faster.)
V8 uses a nasty trick to make Array.shift() fast: object left-trimming,
where the object header is moved, so that the elements themselves can stay
in place. This is an ugly hack and we would like to get rid of it.
Left-trimming is not possible in "large object space", a special area of
V8's heap for large objects. The limit for what constitutes a "large
object" is the regular heap's page size, which is 512KB.
Arrays over-allocate their backing store in steps; when you start with "new
Array()" and keep push()ing onto it, the sequence of backing store
capacities is 4, 23, 52, 95, 160, 257, 403, 622, 950, 1442, 2180, 3287,
4948, 7439, 11176, 16781, 25189, 37801, *56719*, 85096, 127661, ...
So when the 56720th element is pushed, the backing store is grown to 85096
* 8 bytes = 680768 bytes > 512KB, so the new backing store must be
allocated in large-object space. Every subsequent shift() on that array is
an O(n) operation.
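As a quick check on the arithmetic, the quoted capacity sequence can be scanned for the first backing store that exceeds a 512KB page (assuming 8 bytes per element, as on a 64-bit build; the capacities themselves are an implementation detail and may change between V8 versions):

```javascript
// Backing-store capacities quoted above (a V8 implementation detail).
const capacities = [4, 23, 52, 95, 160, 257, 403, 622, 950, 1442, 2180, 3287,
                    4948, 7439, 11176, 16781, 25189, 37801, 56719, 85096, 127661];

const PAGE_SIZE = 512 * 1024;  // regular heap page size, in bytes
const ELEMENT_SIZE = 8;        // one tagged pointer on a 64-bit build

// Find the first capacity whose backing store no longer fits in a regular page.
const firstLarge = capacities.find(c => c * ELEMENT_SIZE > PAGE_SIZE);
const lastSmall = capacities[capacities.indexOf(firstLarge) - 1];

console.log(firstLarge);                 // 85096
console.log(firstLarge * ELEMENT_SIZE);  // 680768 bytes > 524288
console.log(lastSmall + 1);              // 56720: pushing this element triggers the growth
```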
The threshold is four times as high on your Android tablet because we ship
32-bit Chrome builds to Android (so twice as many pointers fit in the same
space), and older V8 versions use 1MB pages (so regular pages can hold
objects twice as large).
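The 4x factor can be checked directly, assuming one tagged pointer per element (4 bytes on a 32-bit build, 8 bytes on a 64-bit build):

```javascript
// Largest element count that still fits in a regular heap page is
// page_size / element_size; the two 2x factors multiply to 4x.
const fits64 = (512 * 1024) / 8;   // 64-bit build, 512KB pages
const fits32 = (1024 * 1024) / 4;  // 32-bit build, older 1MB pages
console.log(fits64);               // 65536
console.log(fits32);               // 262144
console.log(fits32 / fits64);      // 4
```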

Long story short:
• your patch looks fine
• worrying about deopts is probably unnecessary, but only measuring will
tell for sure
• then again switching between two implementations is probably also
unnecessary (encapsulating the behavior you want into one class is probably
cleaner)
• you can split at 32768, sure, or at any random integer between 1 and
5 (I wouldn't rely on the specific numbers I quoted above, because
that's an implementation detail and could change at any time)
• ideally you don't use Array.shift() at all, and instead have a "cursor"
index for the next element to be consumed; drop the Queue's "front" when
cursor === front.length (this would probably work well with a chunk size in
the [100..1000] range or so)
• if the chunks you're queuing are indeed typically 1K or bigger, then you
probably don't need to bother with arrays at all, and can just use a plain
simple linked list (because a few pointers of overhead per element are
negligible if they give you guaranteed O(1) push/shift performance
regardless of queue length), like so:

class Queue {
  constructor() {
    this.size = 0;
    this.front = null;  // head node; next element to be shifted
    this.back = null;   // tail node; most recently pushed element
  }

  get length() { return this.size; }

  push(element) {
    ++this.size;
    if (this.size === 1) {
      this.front = this.back = { value: element, next: null };
    } else {
      var node = { value: element, next: null };
      this.back.next = node;
      this.back = node;
    }
  }

  shift() {
    // Assumes the queue is non-empty; callers check length first.
    --this.size;
    var result = this.front.value;
    this.front = this.front.next;
    if (this.front === null) this.back = null;
    return result;
  }
}
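The "cursor" idea from the bullet list above could be sketched as follows. This is illustrative only, not taken from the actual Chrome patch; the class name and CHUNK_SIZE are assumptions (256 is within the suggested [100..1000] range):

```javascript
// Sketch of the cursor approach: consume elements by advancing an index
// instead of calling Array.shift(), and drop the front chunk wholesale
// once every element in it has been read. All names here are illustrative.
const CHUNK_SIZE = 256;

class ChunkedQueue {
  constructor() {
    this.chunks = [[]];  // list of chunk arrays; the last one receives pushes
    this.cursor = 0;     // index of the next unread element in chunks[0]
    this.size = 0;
  }

  get length() { return this.size; }

  push(element) {
    let back = this.chunks[this.chunks.length - 1];
    if (back.length === CHUNK_SIZE) {
      back = [];
      this.chunks.push(back);
    }
    back.push(element);
    ++this.size;
  }

  shift() {
    // Assumes the queue is non-empty; callers check length first.
    const front = this.chunks[0];
    const result = front[this.cursor];
    front[this.cursor++] = undefined;  // let the element be collected
    --this.size;
    // Drop the front chunk once the cursor has walked past its end.
    // chunks.shift() is O(number of chunks), not O(number of elements),
    // so this stays cheap even for very long queues.
    if (this.cursor === front.length) {
      this.chunks.shift();
      this.cursor = 0;
      if (this.chunks.length === 0) this.chunks.push([]);
    }
    return result;
  }
}
```

No chunk ever grows past CHUNK_SIZE elements, so no backing store reaches large-object space and nothing depends on left-trimming.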

On Wed, Jan 18, 2017 at 8:24 AM, Jochen Eisinger wrote:

> +Domenic Denicola 
>
> On Wed, Jan 18, 2017 at 4:25 AM Adam Rice  wrote:
>
>> I work on the Chrome implementation of ReadableStream and WritableStream.
>> They are implemented in JavaScript using V8 extras.
>>
>> They currently use an InternalPackedArray to implement a queue structure,
>> however I have found a scalability issue. The benchmarks and repro can be
>> found at http://crbug.com/681493 ("ReadableStream, WritableStream get
>> dramatically slower when queue grows to 56720 chunks").
>>
>> I have a proposed fix at http://crrev.com/2637863002. If possible, I
>> would like someone who is familiar with the implementation of
>> InternalPackedArray to review it.
>>
>> Particular things I'd like to know:
>>
>>- Is it worth worrying about deopt or would it be better to use
>>InternalPackedArray for small queues and only switch to Queue for larger
>>ones?
>>- Is 32768 a good size to split the arrays at? Can we be reasonably
>>sure that it is small enough to get good behaviour on all platforms and
>>architectures?
>>
>> and if possible
>>
>>- why does performance change so dramatically at a threshold?
>>
>> Thanks,
>> Adam Rice
>>
>> --
>> --
>> v8-users mailing list
>> v8-users@googlegroups.com
>> http://groups.google.com/group/v8-users
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "v8-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to v8-users+unsubscr...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>

Re: [v8-users] Streams and InternalPackedArray queue scalability

2017-01-17 Thread Jochen Eisinger
+Domenic Denicola 

On Wed, Jan 18, 2017 at 4:25 AM Adam Rice  wrote:

> I work on the Chrome implementation of ReadableStream and WritableStream.
> They are implemented in JavaScript using V8 extras.
>
> They currently use an InternalPackedArray to implement a queue structure,
> however I have found a scalability issue. The benchmarks and repro can be
> found at http://crbug.com/681493 ("ReadableStream, WritableStream get
> dramatically slower when queue grows to 56720 chunks").
>
> I have a proposed fix at http://crrev.com/2637863002. If possible, I
> would like someone who is familiar with the implementation of
> InternalPackedArray to review it.
>
> Particular things I'd like to know:
>
>- Is it worth worrying about deopt or would it be better to use
>InternalPackedArray for small queues and only switch to Queue for larger
>ones?
>- Is 32768 a good size to split the arrays at? Can we be reasonably
>sure that it is small enough to get good behaviour on all platforms and
>architectures?
>
> and if possible
>
>- why does performance change so dramatically at a threshold?
>
> Thanks,
> Adam Rice
>



[v8-users] Streams and InternalPackedArray queue scalability

2017-01-17 Thread Adam Rice
I work on the Chrome implementation of ReadableStream and WritableStream.
They are implemented in JavaScript using V8 extras.

They currently use an InternalPackedArray to implement a queue structure,
however I have found a scalability issue. The benchmarks and repro can be
found at http://crbug.com/681493 ("ReadableStream, WritableStream get
dramatically slower when queue grows to 56720 chunks").

I have a proposed fix at http://crrev.com/2637863002. If possible, I would
like someone who is familiar with the implementation of InternalPackedArray
to review it.

Particular things I'd like to know:

   - Is it worth worrying about deopt or would it be better to use
   InternalPackedArray for small queues and only switch to Queue for larger
   ones?
   - Is 32768 a good size to split the arrays at? Can we be reasonably sure
   that it is small enough to get good behaviour on all platforms and
   architectures?

and if possible

   - why does performance change so dramatically at a threshold?

Thanks,
Adam Rice
