[jira] [Created] (ARROW-5284) [Rust] Replace libc with std::alloc for memory allocation

2019-05-07 Thread Chao Sun (JIRA)
Chao Sun created ARROW-5284:
---

 Summary: [Rust] Replace libc with std::alloc for memory allocation
 Key: ARROW-5284
 URL: https://issues.apache.org/jira/browse/ARROW-5284
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Chao Sun
Assignee: Chao Sun








Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-05-07 Thread Fan Liya
Hi Jacques,

Thanks a lot for your comments.

I have evaluated the assembly code of the original Arrow API, as well as the
unsafe API in our PR.
Generally, the assembly code generated by the JIT for both APIs is of high
quality, and in most cases the assembly code is almost the same.

However, some checks can be further removed. The following figures give an
example (the figures are too big to be attached, so I have attached them in
a JIRA comment. Please see the comment there. Sorry for the inconvenience):

The first figure shows the code of the original Arrow API, while the second
shows the code for the unsafe API.
It can be observed that for the unsafe API, the amounts of source, byte
and assembly code are all smaller, so the unsafe API can be expected to
perform better.

Concerning this particular example for the Float8Vector, I think it is
reasonable to further remove the check in the get method:
before we call the get method, we must check whether the value is null, so the
check in the get method is redundant. This is a typical pattern when
using the Arrow API (check and then get), at least in our scenario (please
take a look at our benchmark in the PR).

Concerning the other question, about the real algorithm in our scenario, I
want to make two points:

1. SQL engines are performance critical, so 30% is a large number for us.
Over the past year, it took our team several months just to improve the
performance of our runtime engine by around 15%.

2. The performance of the engine heavily depends on the performance of Arrow.
Most SQL engines are memory-intensive, so the performance of the get/set
methods is the key. To get a flavor of the algorithms in our engine, please
refer to the PR. That is the core
algorithm of our operator, which is executed many times during the
processing of a SQL query. You can see that the computation is relatively
simple, and most method calls are memory accesses.

Best,
Liya Fan

On Mon, May 6, 2019 at 5:52 PM Jacques Nadeau  wrote:

> I am still asking the same question: can you please analyze the assembly
> the JIT is producing and look to identify why the disabled bounds checking
> is at 30% and what types of things we can do to address it. For example, we
> have talked before about a bytecode transformer that simply removes the
> bounds checking when loading Arrow if you want that behavior. If necessary,
> that may be a big win from a code maintenance standpoint over having
> duplicate interfaces.
>
> The static block seems like a non-problem. You could simply load another
> class that sets that system property before loading any Arrow code. If you're
> proposing a code change to solve your problem today, this seems just as
> feasible.
>
> The other question: in a real algorithm, how much does that 30% matter?
> Your benchmarks are entirely about this one call whereas real workloads are
> impacted by many things and the time in writing/reading vectors is
> minuscule versus other things.
>
> On Mon, May 6, 2019 at 1:16 PM Fan Liya  wrote:
>
> > Hi Jacques,
> >
> > Thank you so much for your kind reminder.
> >
> > To come up with some performance data, I have set up an environment and
> > run some micro-benchmarks.
> > The server runs Linux, has 64 cores and has 256 GB memory.
> > The benchmarks are simple iterations over some double vectors (the source
> > file is attached):
> >
> >   @Benchmark
> >   @BenchmarkMode(Mode.AverageTime)
> >   @OutputTimeUnit(TimeUnit.MICROSECONDS)
> >   public void testSafe() {
> > safeSum = 0;
> > for (int i = 0; i < VECTOR_LENGTH; i++) {
> >   safeVector.set(i, i + 10.0);
> >   safeSum += safeVector.get(i);
> > }
> >   }
> >
> >   @Benchmark
> >   @BenchmarkMode(Mode.AverageTime)
> >   @OutputTimeUnit(TimeUnit.MICROSECONDS)
> >   public void testUnSafe() {
> > unSafeSum = 0;
> > for (int i = 0; i < VECTOR_LENGTH; i++) {
> >   unsafeVector.set(i, i + 10.0);
> >   unSafeSum += unsafeVector.get(i);
> > }
> >   }
> >
> > The safe vector in the testSafe benchmark is from the original Arrow
> > implementation, whereas the unsafe vector in the testUnsafe benchmark is
> > based on our initial implementation in PR
> >  (This is not the final
> > version. However, we believe much overhead has been removed).
> > The evaluation is based on the JMH framework (thanks to the suggestion from
> > Jacques Nadeau). The benchmarks are run enough times by the framework that
> > the effects of JIT are well accounted for.
> >
> > In the first experiment, we use the default configuration (boundary
> > checking enabled), and the original Arrow vector is about 4 times slower:
> >
> > Benchmark   Mode  Cnt  

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-05-07 Thread Wes McKinney
I have started working on this some to assess what is involved.

My present plan is to:

* add FixedDictionaryType and FixedDictionaryArray
* add VariableDictionaryType and VariableDictionaryArray
* deprecate (?) the current DictionaryType/DictionaryArray names, for
  clarity (thoughts about this would be welcome -- this will make the
  patch diff much larger)

Given that dictionaries can change in IPC streams, I believe the
correct approach is to change IPC read/write paths to deal only in
variable dictionary arrays.

It has occurred to me to question whether it is worth maintaining two
variants versus having only the single general purpose variable
dictionary form. I'm not totally sure -- in the fixed/static case you
can assume that the dictionary is a fixed quantity and avoid any
checking when working with multiple arrays. On the flip side, if you
have multiple arrays all having the same dictionary, then verifying
this fact is cheap (if the dictionary in each case is always _the same
object_, so dict_a->Equals(dict_b) is cheap). If I could start the
project over, I think that I would have preferred to only have the
variable form and wait for more use cases for the less flexible fixed
case -- in the case of interop with tools like R and Python pandas
that have built-in categorical (factor) types, generally only a single
piece of array data is being worked with, and so fixed and variable
are equivalent when you only have one array.
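
For concreteness, here is a sketch of that cheap verification using the
current C++ DictionaryArray API (the helper itself is illustrative, not an
existing or proposed Arrow function):

  #include <arrow/array.h>
  #include <memory>

  // Two dictionary-encoded arrays can be combined without re-encoding if
  // they share a dictionary.  When both point at the same dictionary
  // object the check is a pointer comparison; otherwise fall back to an
  // element-wise Equals().
  bool HaveSameDictionary(const arrow::DictionaryArray& a,
                          const arrow::DictionaryArray& b) {
    std::shared_ptr<arrow::Array> dict_a = a.dictionary();
    std::shared_ptr<arrow::Array> dict_b = b.dictionary();
    if (dict_a.get() == dict_b.get()) {
      return true;  // same object: trivially the same dictionary
    }
    return dict_a->Equals(dict_b);  // element-wise comparison as a fallback
  }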

In any case, I will at least endeavor to disentangle logic that makes
assumptions about whether the dictionary is knowable from the type
object and put up a patch for discussion, probably later this week or
first thing next week (since I am speaking at a conference later this
week)

- Wes

On Wed, May 1, 2019 at 11:38 AM Hatem Helal  wrote:
>
> Thanks Wes, your proposed additional data type makes more sense to me.
>
> > As a first use case for this I would be personally looking to address 
> > reads of encoded data from
> > Parquet format without an intermediate pass through dense format
> > (which can be slow and wasteful for heavily compressed string data)
>
> Feel free to grab ARROW-3772 off of me...I had hoped to work on it after 
> finishing ARROW-3769 but it seems that introducing this additional data type 
> will be necessary to make progress on that issue.
>
>
>


[jira] [Created] (ARROW-5283) [C++][Plasma] Server crash when creating an aborted object 3 times

2019-05-07 Thread shengjun.li (JIRA)
shengjun.li created ARROW-5283:
--

 Summary: [C++][Plasma] Server crash when creating an aborted 
object 3 times
 Key: ARROW-5283
 URL: https://issues.apache.org/jira/browse/ARROW-5283
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.13.0
Reporter: shengjun.li
 Fix For: 0.14.0


cpp/CMakeLists.txt
  option(ARROW_PLASMA "Build the plasma object store along with Arrow" ON)

sequence:
(1) call PlasmaClient::Create(id_object, data_size, 0, 0, &buff, 0)
(2) call PlasmaClient::Release(id_object)
(3) call PlasmaClient::Abort(id_object)

(4) call PlasmaClient::Create(id_object, data_size, 0, 0, &buff, 0) // where 
the id_object is the same as (1)
(5) call PlasmaClient::Release(id_object)
(6) call PlasmaClient::Abort(id_object)

(7) call PlasmaClient::Create(id_object, data_size, 0, 0, &buff, 0) // where 
the id_object is the same as (1)
server crash!
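
A consolidated repro sketch of the steps above (assuming the PlasmaClient API of
Arrow 0.13; the socket path and object id are placeholders):

{code:cpp}
#include <plasma/client.h>
#include <arrow/buffer.h>
#include <arrow/status.h>
#include <memory>
#include <string>

int main() {
  plasma::PlasmaClient client;
  // Placeholder socket path; use whatever the plasma_store was started with.
  arrow::Status st = client.Connect("/tmp/plasma", "");

  // Any fixed 20-byte id works; it only has to be the same for every call.
  plasma::ObjectID id_object = plasma::ObjectID::from_binary(std::string(20, '0'));
  const int64_t data_size = 100;
  std::shared_ptr<arrow::Buffer> buff;

  // Steps (1)-(6): create, release, and abort the same object twice.
  for (int i = 0; i < 2; ++i) {
    st = client.Create(id_object, data_size, /*metadata=*/nullptr, 0, &buff, 0);
    st = client.Release(id_object);
    st = client.Abort(id_object);
  }

  // Step (7): a third Create with the same id makes the server fail the
  // LRUCache::Add check shown in the log below.
  st = client.Create(id_object, data_size, /*metadata=*/nullptr, 0, &buff, 0);
  return 0;
}
{code}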


F0508 10:03:09.546859 32587 eviction_policy.cc:27]  Check failed: it == 
item_map_.end() 
*** Check failure stack trace: ***
*** Aborted at 1557280989 (unix time) try "date -d @1557280989" if you are 
using GNU date ***
PC: @ 0x7f5403a46428 gsignal
*** SIGABRT (@0x3e87f4b) received by PID 32587 (TID 0x7f5406950f80) from 
PID 32587; stack trace: ***
    @ 0x7f5403dec390 (unknown)
    @ 0x7f5403a46428 gsignal
    @ 0x7f5403a4802a abort
    @ 0x7f5405780f69 google::logging_fail()
    @ 0x7f5405782a3d google::LogMessage::Fail()
    @ 0x7f5405785054 google::LogMessage::SendToLog()
    @ 0x7f540578255b google::LogMessage::Flush()
    @ 0x7f5405782779 google::LogMessage::~LogMessage()
    @ 0x7f54053f98bd arrow::util::ArrowLog::~ArrowLog()
    @   0x4afcae plasma::LRUCache::Add()
    @   0x4b00f1 plasma::EvictionPolicy::ObjectCreated()
    @   0x4b61e0 plasma::PlasmaStore::CreateObject()
    @   0x4babcc plasma::PlasmaStore::ProcessMessage()
    @   0x4b95c3 _ZZN6plasma11PlasmaStore13ConnectClientEiENKUliE_clEi
    @   0x4bdb80 
_ZNSt17_Function_handlerIFviEZN6plasma11PlasmaStore13ConnectClientEiEUliE_E9_M_invokeERKSt9_Any_dataOi
    @   0x4aba58 std::function<>::operator()()
    @   0x4aaf67 plasma::EventLoop::FileEventCallback()
    @   0x4dc1bd aeProcessEvents
    @   0x4dc37e aeMain
    @   0x4ab25b plasma::EventLoop::Start()
    @   0x4c00c1 plasma::PlasmaStoreRunner::Start()
    @   0x4bc77b plasma::StartServer()
    @   0x4bd3eb main
    @ 0x7f5403a31830 __libc_start_main
    @   0x49e9f9 _start
    @    0x0 (unknown)
Aborted (core dumped)





[jira] [Created] (ARROW-5282) Can't read data from parquet file in C++ library

2019-05-07 Thread worker24h (JIRA)
worker24h created ARROW-5282:


 Summary: Can't read data from parquet file in C++ library
 Key: ARROW-5282
 URL: https://issues.apache.org/jira/browse/ARROW-5282
 Project: Apache Arrow
  Issue Type: Bug
Reporter: worker24h


When I specify the second parameter *parquet::ReaderProperties* to
parquet::ParquetFileReader::Open, it doesn't work.
The following code:
{code:cpp}
parquet::ReaderProperties _properties = parquet::ReaderProperties();
_properties.enable_buffered_stream();  // use a buffered stream; buffer size left at its default
parquet_reader = parquet::ParquetFileReader::Open(_parquet, _properties);
...
int32_t value;
parquet::Int32Reader* int32_reader =
    static_cast<parquet::Int32Reader*>(column_reader.get());
int32_reader->Skip(_current_line_of_group);  // skip the rows already processed
rows_read = int32_reader->ReadBatch(1, nullptr, nullptr, &value, &values_read);
{code}
The *Skip* call throws an exception:

{{Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.}}

 





[jira] [Created] (ARROW-5281) [Rust] [Parquet] Move DataPageBuilder to test_common

2019-05-07 Thread Renjie Liu (JIRA)
Renjie Liu created ARROW-5281:
-

 Summary: [Rust] [Parquet] Move DataPageBuilder to test_common
 Key: ARROW-5281
 URL: https://issues.apache.org/jira/browse/ARROW-5281
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Renjie Liu
Assignee: Renjie Liu


DataPageBuilder is a helpful tool for mocking test page data. It is worth
moving it to test_common so that other parts can reuse it.





[jira] [Created] (ARROW-5280) [C++] Find a better solution to the conda compilers macOS issue

2019-05-07 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-5280:
--

 Summary: [C++] Find a better solution to the conda compilers macOS 
issue
 Key: ARROW-5280
 URL: https://issues.apache.org/jira/browse/ARROW-5280
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Developer Tools
Reporter: Neal Richardson


See [https://github.com/apache/arrow/pull/4231#pullrequestreview-234617308] and 
https://issues.apache.org/jira/browse/ARROW-4935. Conda's `compilers` package requires
an old macOS SDK, which makes installation awkward at best. We can _almost_ 
build on macOS without conda `compilers`, but the jemalloc failure remains. As 
Uwe says, "Maybe we can figure out a way in conda-forge to use newer compilers 
than the ones referenced by the {{compilers}} package."





[jira] [Created] (ARROW-5279) [C++] Support reading delta dictionaries in IPC streams

2019-05-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5279:
---

 Summary: [C++] Support reading delta dictionaries in IPC streams
 Key: ARROW-5279
 URL: https://issues.apache.org/jira/browse/ARROW-5279
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.14.0


This JIRA covers the read path for delta dictionaries. The write path is a bit 
more of a can of worms (since the deltas must be computed).
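
Not part of the issue text, just an illustrative sketch of what the read path
has to accumulate: a dictionary batch carrying the isDelta flag appends to
whatever has already been read for that dictionary id, while a non-delta batch
replaces it. The class and names below are placeholders, not the eventual C++
API.

{code:cpp}
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Placeholder for a decoded dictionary values array.
struct DecodedArray {};

// Accumulates per-id dictionary state while reading an IPC stream.
class DictionaryAccumulator {
 public:
  void OnDictionaryBatch(int64_t dictionary_id, bool is_delta,
                         std::shared_ptr<DecodedArray> values) {
    auto& chunks = dictionaries_[dictionary_id];
    if (!is_delta) {
      chunks.clear();  // replacement dictionary: drop previously read entries
    }
    chunks.push_back(std::move(values));  // delta: indices keep growing
  }

  // The effective dictionary for an id is the concatenation of its chunks.
  const std::vector<std::shared_ptr<DecodedArray>>& chunks(int64_t id) const {
    return dictionaries_.at(id);
  }

 private:
  std::unordered_map<int64_t, std::vector<std::shared_ptr<DecodedArray>>>
      dictionaries_;
};
{code}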





Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread Wes McKinney
On Tue, May 7, 2019 at 12:26 PM John Muehlhausen  wrote:
>
> Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` already reads
> the future Feather format? If not, how will the future format differ?  I
> will work on my access pattern with this format instead of the current
> feather format.  Sorry I was not clear on that earlier.
>

Yes, under the hood those will use the same zero-copy binary protocol
code paths to read the file.

> Micah, thank you!
>
> On Tue, May 7, 2019 at 11:44 AM Micah Kornfield 
> wrote:
>
> > Hi John,
> > To give a specific pointer [1] describes how the streaming protocol is
> > stored to a file.
> >
> > [1] https://arrow.apache.org/docs/format/IPC.html#file-format
> >
> > On Tue, May 7, 2019 at 9:40 AM Wes McKinney  wrote:
> >
> > > hi John,
> > >
> > > As soon as the R folks can install the Arrow R package consistently,
> > > the intent is to replace the Feather internals with the plain Arrow
> > > IPC protocol where we have much better platform support all around.
> > >
> > > If you'd like to experiment with creating an API for pre-allocating
> > > fixed-size Arrow protocol blocks and then mutating the data and
> > > metadata on disk in-place, please be our guest. We don't have the
> > > tools developed yet to do this for you
> > >
> > > - Wes
> > >
> > > On Tue, May 7, 2019 at 11:25 AM John Muehlhausen  wrote:
> > > >
> > > > Thanks Wes:
> > > >
> > > > "the current Feather format is deprecated" ... yes, but there will be a
> > > > future file format that replaces it, correct?  And my discussion of
> > > > immutable "portions" of Arrow buffers, rather than immutability of the
> > > > entire buffer, applies to IPC as well, right?  I am only championing
> > the
> > > > idea that this future file format have the convenience that certain
> > > > preallocated rows can be ignored based on a metadata setting.
> > > >
> > > > "I recommend batching your writes" ... this is extremely inefficient
> > and
> > > > adds unacceptable latency, relative to the proposed solution.  Do you
> > > > disagree?  Either I have a batch length of 1 to minimize latency, which
> > > > eliminates columnar advantages on the read side, or else I add latency.
> > > > Neither works, and it seems that a viable alternative is within sight?
> > > >
> > > > If you don't agree that there is a core issue and opportunity here, I'm
> > > not
> > > > sure how to better make my case
> > > >
> > > > -John
> > > >
> > > > On Tue, May 7, 2019 at 11:02 AM Wes McKinney 
> > > wrote:
> > > >
> > > > > hi John,
> > > > >
> > > > > On Tue, May 7, 2019 at 10:53 AM John Muehlhausen 
> > wrote:
> > > > > >
> > > > > > Wes et al, I completed a preliminary study of populating a Feather
> > > file
> > > > > > incrementally.  Some notes and questions:
> > > > > >
> > > > > > I wrote the following dataframe to a feather file:
> > > > > > ab
> > > > > > 0  0123456789  0.0
> > > > > > 1  0123456789  NaN
> > > > > > 2  0123456789  NaN
> > > > > > 3  0123456789  NaN
> > > > > > 4None  NaN
> > > > > >
> > > > > > In re-writing the flatbuffers metadata (flatc -p doesn't
> > > > > > support --gen-mutable! yuck! C++ to the rescue...), it seems that
> > > > > > read_feather is not affected by NumRows?  It seems to be driven
> > > entirely
> > > > > by
> > > > > > the per-column Length values?
> > > > > >
> > > > > > Also, it seems as if one of the primary usages of NullCount is to
> > > > > determine
> > > > > > whether or not a bitfield is present?  In the initialization data
> > > above I
> > > > > > was careful to have a null value in each column in order to
> > generate
> > > a
> > > > > > bitfield.
> > > > >
> > > > > Per my prior e-mails, the current Feather format is deprecated, so
> > I'm
> > > > > only willing to engage on a discussion of the official Arrow binary
> > > > > protocol that we use for IPC (memory mapping) and RPC (Flight).
> > > > >
> > > > > >
> > > > > > I then wiped the bitfields in the file and set all of the string
> > > indices
> > > > > to
> > > > > > one past the end of the blob buffer (all strings empty):
> > > > > >   a   b
> > > > > > 0  None NaN
> > > > > > 1  None NaN
> > > > > > 2  None NaN
> > > > > > 3  None NaN
> > > > > > 4  None NaN
> > > > > >
> > > > > > I then set the first record to some data by consuming some of the
> > > string
> > > > > > blob and row 0 and 1 indices, also setting the double:
> > > > > >
> > > > > >ab
> > > > > > 0  Hello, world!  5.0
> > > > > > 1   None  NaN
> > > > > > 2   None  NaN
> > > > > > 3   None  NaN
> > > > > > 4   None  NaN
> > > > > >
> > > > > > As mentioned above, NumRows seems to be ignored.  I tried adjusting
> > > each
> > > > > > column Length to mask off higher rows and it worked for 4 (hide
> > last
> > > row)
> > > > > > but not for less than 4.  I take this to be due to math used to
> > > figure
> > > > > out
> > > > > > where the buffers are relative to one anoth

RE: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-05-07 Thread Malakhov, Anton
> From: Jed Brown [mailto:j...@jedbrown.org]
> Sent: Monday, May 6, 2019 16:35

> Nice paper, thanks!  Did you investigate latency impact from the IPC counting
> semaphore?  Is your test code available?
Not that deep. Basically, I was only looking at whether its positive effect is
enough to overcome the impact of oversubscription. It is, but not in all
cases. It is also hard to separate one impact/effect from another; e.g., some
parallel regions ask for all the threads but use only a few, which results in
undersubscription when serializing parallel regions in OpenMP. IPC for
coordinating TBB processes solves the resource exhaustion problem and gives
additional performance in some cases. However, Linux is usually good enough at
scheduling multiple multithreaded processes. I guess that is because it sees how
threads are grouped, which is not the case for multiple concurrent parallel
regions with OpenMP threads in the same process.
All the results from the blog, paper, talks, and demo are available at 
https://github.com/IntelPython/composability_bench

Regards
// Anton


Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread John Muehlhausen
Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` already reads
the future Feather format? If not, how will the future format differ?  I
will work on my access pattern with this format instead of the current
feather format.  Sorry I was not clear on that earlier.

Micah, thank you!

On Tue, May 7, 2019 at 11:44 AM Micah Kornfield 
wrote:

> Hi John,
> To give a specific pointer [1] describes how the streaming protocol is
> stored to a file.
>
> [1] https://arrow.apache.org/docs/format/IPC.html#file-format
>
> On Tue, May 7, 2019 at 9:40 AM Wes McKinney  wrote:
>
> > hi John,
> >
> > As soon as the R folks can install the Arrow R package consistently,
> > the intent is to replace the Feather internals with the plain Arrow
> > IPC protocol where we have much better platform support all around.
> >
> > If you'd like to experiment with creating an API for pre-allocating
> > fixed-size Arrow protocol blocks and then mutating the data and
> > metadata on disk in-place, please be our guest. We don't have the
> > tools developed yet to do this for you
> >
> > - Wes
> >
> > On Tue, May 7, 2019 at 11:25 AM John Muehlhausen  wrote:
> > >
> > > Thanks Wes:
> > >
> > > "the current Feather format is deprecated" ... yes, but there will be a
> > > future file format that replaces it, correct?  And my discussion of
> > > immutable "portions" of Arrow buffers, rather than immutability of the
> > > entire buffer, applies to IPC as well, right?  I am only championing
> the
> > > idea that this future file format have the convenience that certain
> > > preallocated rows can be ignored based on a metadata setting.
> > >
> > > "I recommend batching your writes" ... this is extremely inefficient
> and
> > > adds unacceptable latency, relative to the proposed solution.  Do you
> > > disagree?  Either I have a batch length of 1 to minimize latency, which
> > > eliminates columnar advantages on the read side, or else I add latency.
> > > Neither works, and it seems that a viable alternative is within sight?
> > >
> > > If you don't agree that there is a core issue and opportunity here, I'm
> > not
> > > sure how to better make my case
> > >
> > > -John
> > >
> > > On Tue, May 7, 2019 at 11:02 AM Wes McKinney 
> > wrote:
> > >
> > > > hi John,
> > > >
> > > > On Tue, May 7, 2019 at 10:53 AM John Muehlhausen 
> wrote:
> > > > >
> > > > > Wes et al, I completed a preliminary study of populating a Feather
> > file
> > > > > incrementally.  Some notes and questions:
> > > > >
> > > > > I wrote the following dataframe to a feather file:
> > > > > ab
> > > > > 0  0123456789  0.0
> > > > > 1  0123456789  NaN
> > > > > 2  0123456789  NaN
> > > > > 3  0123456789  NaN
> > > > > 4None  NaN
> > > > >
> > > > > In re-writing the flatbuffers metadata (flatc -p doesn't
> > > > > support --gen-mutable! yuck! C++ to the rescue...), it seems that
> > > > > read_feather is not affected by NumRows?  It seems to be driven
> > entirely
> > > > by
> > > > > the per-column Length values?
> > > > >
> > > > > Also, it seems as if one of the primary usages of NullCount is to
> > > > determine
> > > > > whether or not a bitfield is present?  In the initialization data
> > above I
> > > > > was careful to have a null value in each column in order to
> generate
> > a
> > > > > bitfield.
> > > >
> > > > Per my prior e-mails, the current Feather format is deprecated, so
> I'm
> > > > only willing to engage on a discussion of the official Arrow binary
> > > > protocol that we use for IPC (memory mapping) and RPC (Flight).
> > > >
> > > > >
> > > > > I then wiped the bitfields in the file and set all of the string
> > indices
> > > > to
> > > > > one past the end of the blob buffer (all strings empty):
> > > > >   a   b
> > > > > 0  None NaN
> > > > > 1  None NaN
> > > > > 2  None NaN
> > > > > 3  None NaN
> > > > > 4  None NaN
> > > > >
> > > > > I then set the first record to some data by consuming some of the
> > string
> > > > > blob and row 0 and 1 indices, also setting the double:
> > > > >
> > > > >ab
> > > > > 0  Hello, world!  5.0
> > > > > 1   None  NaN
> > > > > 2   None  NaN
> > > > > 3   None  NaN
> > > > > 4   None  NaN
> > > > >
> > > > > As mentioned above, NumRows seems to be ignored.  I tried adjusting
> > each
> > > > > column Length to mask off higher rows and it worked for 4 (hide
> last
> > row)
> > > > > but not for less than 4.  I take this to be due to math used to
> > figure
> > > > out
> > > > > where the buffers are relative to one another since there is only
> one
> > > > > metadata offset for all of: the (optional) bitset, index column and
> > (if
> > > > > string) blobs.
> > > > >
> > > > > Populating subsequent rows would proceed in a similar way until all
> > of
> > > > the
> > > > > blob storage has been consumed, which may come before the
> > pre-allocated
> > > > > rows have been consumed.
> > > > >
> > > > > So what does this mean for 

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread Wes McKinney
hi John,

As soon as the R folks can install the Arrow R package consistently,
the intent is to replace the Feather internals with the plain Arrow
IPC protocol where we have much better platform support all around.

If you'd like to experiment with creating an API for pre-allocating
fixed-size Arrow protocol blocks and then mutating the data and
metadata on disk in-place, please be our guest. We don't have the
tools developed yet to do this for you

- Wes

On Tue, May 7, 2019 at 11:25 AM John Muehlhausen  wrote:
>
> Thanks Wes:
>
> "the current Feather format is deprecated" ... yes, but there will be a
> future file format that replaces it, correct?  And my discussion of
> immutable "portions" of Arrow buffers, rather than immutability of the
> entire buffer, applies to IPC as well, right?  I am only championing the
> idea that this future file format have the convenience that certain
> preallocated rows can be ignored based on a metadata setting.
>
> "I recommend batching your writes" ... this is extremely inefficient and
> adds unacceptable latency, relative to the proposed solution.  Do you
> disagree?  Either I have a batch length of 1 to minimize latency, which
> eliminates columnar advantages on the read side, or else I add latency.
> Neither works, and it seems that a viable alternative is within sight?
>
> If you don't agree that there is a core issue and opportunity here, I'm not
> sure how to better make my case
>
> -John
>
> On Tue, May 7, 2019 at 11:02 AM Wes McKinney  wrote:
>
> > hi John,
> >
> > On Tue, May 7, 2019 at 10:53 AM John Muehlhausen  wrote:
> > >
> > > Wes et al, I completed a preliminary study of populating a Feather file
> > > incrementally.  Some notes and questions:
> > >
> > > I wrote the following dataframe to a feather file:
> > > ab
> > > 0  0123456789  0.0
> > > 1  0123456789  NaN
> > > 2  0123456789  NaN
> > > 3  0123456789  NaN
> > > 4None  NaN
> > >
> > > In re-writing the flatbuffers metadata (flatc -p doesn't
> > > support --gen-mutable! yuck! C++ to the rescue...), it seems that
> > > read_feather is not affected by NumRows?  It seems to be driven entirely
> > by
> > > the per-column Length values?
> > >
> > > Also, it seems as if one of the primary usages of NullCount is to
> > determine
> > > whether or not a bitfield is present?  In the initialization data above I
> > > was careful to have a null value in each column in order to generate a
> > > bitfield.
> >
> > Per my prior e-mails, the current Feather format is deprecated, so I'm
> > only willing to engage on a discussion of the official Arrow binary
> > protocol that we use for IPC (memory mapping) and RPC (Flight).
> >
> > >
> > > I then wiped the bitfields in the file and set all of the string indices
> > to
> > > one past the end of the blob buffer (all strings empty):
> > >   a   b
> > > 0  None NaN
> > > 1  None NaN
> > > 2  None NaN
> > > 3  None NaN
> > > 4  None NaN
> > >
> > > I then set the first record to some data by consuming some of the string
> > > blob and row 0 and 1 indices, also setting the double:
> > >
> > >ab
> > > 0  Hello, world!  5.0
> > > 1   None  NaN
> > > 2   None  NaN
> > > 3   None  NaN
> > > 4   None  NaN
> > >
> > > As mentioned above, NumRows seems to be ignored.  I tried adjusting each
> > > column Length to mask off higher rows and it worked for 4 (hide last row)
> > > but not for less than 4.  I take this to be due to math used to figure
> > out
> > > where the buffers are relative to one another since there is only one
> > > metadata offset for all of: the (optional) bitset, index column and (if
> > > string) blobs.
> > >
> > > Populating subsequent rows would proceed in a similar way until all of
> > the
> > > blob storage has been consumed, which may come before the pre-allocated
> > > rows have been consumed.
> > >
> > > So what does this mean for my desire to incrementally write these
> > > (potentially memory-mapped) pre-allocated files and/or Arrow buffers in
> > > memory?  Some thoughts:
> > >
> > > - If a single value (such as NumRows) were consulted to determine the
> > table
> > > row dimension then updating this single value would serve to tell a
> > reader
> > > which rows are relevant.  Assuming this value is updated after all other
> > > mutations are complete, and assuming that mutations are only appends
> > > (addition of rows), then concurrency control involves only ensuring that
> > > this value is not examined while it is being written.
> > >
> > > - NullCount presents a concurrency problem if someone reads the file
> > after
> > > this field has been updated, but before NumRows has exposed the new
> > record
> > > (or vice versa).  The idea previously mentioned that there will "likely
> > > [be] more statistics in the future" feels like it might be scope creep to
> > > me?  This is a data representation, not a calculation framework?  If
> > > NullCount had its genesis i

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread Micah Kornfield
Hi John,
To give a specific pointer [1] describes how the streaming protocol is
stored to a file.

[1] https://arrow.apache.org/docs/format/IPC.html#file-format

On Tue, May 7, 2019 at 9:40 AM Wes McKinney  wrote:

> hi John,
>
> As soon as the R folks can install the Arrow R package consistently,
> the intent is to replace the Feather internals with the plain Arrow
> IPC protocol where we have much better platform support all around.
>
> If you'd like to experiment with creating an API for pre-allocating
> fixed-size Arrow protocol blocks and then mutating the data and
> metadata on disk in-place, please be our guest. We don't have the
> tools developed yet to do this for you
>
> - Wes
>
> On Tue, May 7, 2019 at 11:25 AM John Muehlhausen  wrote:
> >
> > Thanks Wes:
> >
> > "the current Feather format is deprecated" ... yes, but there will be a
> > future file format that replaces it, correct?  And my discussion of
> > immutable "portions" of Arrow buffers, rather than immutability of the
> > entire buffer, applies to IPC as well, right?  I am only championing the
> > idea that this future file format have the convenience that certain
> > preallocated rows can be ignored based on a metadata setting.
> >
> > "I recommend batching your writes" ... this is extremely inefficient and
> > adds unacceptable latency, relative to the proposed solution.  Do you
> > disagree?  Either I have a batch length of 1 to minimize latency, which
> > eliminates columnar advantages on the read side, or else I add latency.
> > Neither works, and it seems that a viable alternative is within sight?
> >
> > If you don't agree that there is a core issue and opportunity here, I'm
> not
> > sure how to better make my case
> >
> > -John
> >
> > On Tue, May 7, 2019 at 11:02 AM Wes McKinney 
> wrote:
> >
> > > hi John,
> > >
> > > On Tue, May 7, 2019 at 10:53 AM John Muehlhausen  wrote:
> > > >
> > > > Wes et al, I completed a preliminary study of populating a Feather
> file
> > > > incrementally.  Some notes and questions:
> > > >
> > > > I wrote the following dataframe to a feather file:
> > > > ab
> > > > 0  0123456789  0.0
> > > > 1  0123456789  NaN
> > > > 2  0123456789  NaN
> > > > 3  0123456789  NaN
> > > > 4None  NaN
> > > >
> > > > In re-writing the flatbuffers metadata (flatc -p doesn't
> > > > support --gen-mutable! yuck! C++ to the rescue...), it seems that
> > > > read_feather is not affected by NumRows?  It seems to be driven
> entirely
> > > by
> > > > the per-column Length values?
> > > >
> > > > Also, it seems as if one of the primary usages of NullCount is to
> > > determine
> > > > whether or not a bitfield is present?  In the initialization data
> above I
> > > > was careful to have a null value in each column in order to generate
> a
> > > > bitfield.
> > >
> > > Per my prior e-mails, the current Feather format is deprecated, so I'm
> > > only willing to engage on a discussion of the official Arrow binary
> > > protocol that we use for IPC (memory mapping) and RPC (Flight).
> > >
> > > >
> > > > I then wiped the bitfields in the file and set all of the string
> indices
> > > to
> > > > one past the end of the blob buffer (all strings empty):
> > > >   a   b
> > > > 0  None NaN
> > > > 1  None NaN
> > > > 2  None NaN
> > > > 3  None NaN
> > > > 4  None NaN
> > > >
> > > > I then set the first record to some data by consuming some of the
> string
> > > > blob and row 0 and 1 indices, also setting the double:
> > > >
> > > >ab
> > > > 0  Hello, world!  5.0
> > > > 1   None  NaN
> > > > 2   None  NaN
> > > > 3   None  NaN
> > > > 4   None  NaN
> > > >
> > > > As mentioned above, NumRows seems to be ignored.  I tried adjusting
> each
> > > > column Length to mask off higher rows and it worked for 4 (hide last
> row)
> > > > but not for less than 4.  I take this to be due to math used to
> figure
> > > out
> > > > where the buffers are relative to one another since there is only one
> > > > metadata offset for all of: the (optional) bitset, index column and
> (if
> > > > string) blobs.
> > > >
> > > > Populating subsequent rows would proceed in a similar way until all
> of
> > > the
> > > > blob storage has been consumed, which may come before the
> pre-allocated
> > > > rows have been consumed.
> > > >
> > > > So what does this mean for my desire to incrementally write these
> > > > (potentially memory-mapped) pre-allocated files and/or Arrow buffers
> in
> > > > memory?  Some thoughts:
> > > >
> > > > - If a single value (such as NumRows) were consulted to determine the
> > > table
> > > > row dimension then updating this single value would serve to tell a
> > > reader
> > > > which rows are relevant.  Assuming this value is updated after all
> other
> > > > mutations are complete, and assuming that mutations are only appends
> > > > (addition of rows), then concurrency control involves only ensuring
> that
> > > > this value is not exami

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread John Muehlhausen
Thanks Wes:

"the current Feather format is deprecated" ... yes, but there will be a
future file format that replaces it, correct?  And my discussion of
immutable "portions" of Arrow buffers, rather than immutability of the
entire buffer, applies to IPC as well, right?  I am only championing the
idea that this future file format have the convenience that certain
preallocated rows can be ignored based on a metadata setting.

"I recommend batching your writes" ... this is extremely inefficient and
adds unacceptable latency, relative to the proposed solution.  Do you
disagree?  Either I have a batch length of 1 to minimize latency, which
eliminates columnar advantages on the read side, or else I add latency.
Neither works, and it seems that a viable alternative is within sight?

If you don't agree that there is a core issue and opportunity here, I'm not
sure how to better make my case

-John

On Tue, May 7, 2019 at 11:02 AM Wes McKinney  wrote:

> hi John,
>
> On Tue, May 7, 2019 at 10:53 AM John Muehlhausen  wrote:
> >
> > Wes et al, I completed a preliminary study of populating a Feather file
> > incrementally.  Some notes and questions:
> >
> > I wrote the following dataframe to a feather file:
> > ab
> > 0  0123456789  0.0
> > 1  0123456789  NaN
> > 2  0123456789  NaN
> > 3  0123456789  NaN
> > 4None  NaN
> >
> > In re-writing the flatbuffers metadata (flatc -p doesn't
> > support --gen-mutable! yuck! C++ to the rescue...), it seems that
> > read_feather is not affected by NumRows?  It seems to be driven entirely
> by
> > the per-column Length values?
> >
> > Also, it seems as if one of the primary usages of NullCount is to
> determine
> > whether or not a bitfield is present?  In the initialization data above I
> > was careful to have a null value in each column in order to generate a
> > bitfield.
>
> Per my prior e-mails, the current Feather format is deprecated, so I'm
> only willing to engage on a discussion of the official Arrow binary
> protocol that we use for IPC (memory mapping) and RPC (Flight).
>
> >
> > I then wiped the bitfields in the file and set all of the string indices
> to
> > one past the end of the blob buffer (all strings empty):
> >   a   b
> > 0  None NaN
> > 1  None NaN
> > 2  None NaN
> > 3  None NaN
> > 4  None NaN
> >
> > I then set the first record to some data by consuming some of the string
> > blob and row 0 and 1 indices, also setting the double:
> >
> >ab
> > 0  Hello, world!  5.0
> > 1   None  NaN
> > 2   None  NaN
> > 3   None  NaN
> > 4   None  NaN
> >
> > As mentioned above, NumRows seems to be ignored.  I tried adjusting each
> > column Length to mask off higher rows and it worked for 4 (hide last row)
> > but not for less than 4.  I take this to be due to math used to figure
> out
> > where the buffers are relative to one another since there is only one
> > metadata offset for all of: the (optional) bitset, index column and (if
> > string) blobs.
> >
> > Populating subsequent rows would proceed in a similar way until all of
> the
> > blob storage has been consumed, which may come before the pre-allocated
> > rows have been consumed.
> >
> > So what does this mean for my desire to incrementally write these
> > (potentially memory-mapped) pre-allocated files and/or Arrow buffers in
> > memory?  Some thoughts:
> >
> > - If a single value (such as NumRows) were consulted to determine the
> table
> > row dimension then updating this single value would serve to tell a
> reader
> > which rows are relevant.  Assuming this value is updated after all other
> > mutations are complete, and assuming that mutations are only appends
> > (addition of rows), then concurrency control involves only ensuring that
> > this value is not examined while it is being written.
> >
> > - NullCount presents a concurrency problem if someone reads the file
> after
> > this field has been updated, but before NumRows has exposed the new
> record
> > (or vice versa).  The idea previously mentioned that there will "likely
> > [be] more statistics in the future" feels like it might be scope creep to
> > me?  This is a data representation, not a calculation framework?  If
> > NullCount had its genesis in the optional nature of the bitfield, I would
> > suggest that perhaps NullCount can be dropped in favor of always
> supplying
> > the bitfield, which in any event is already contemplated by the spec:
> > "Implementations may choose to always allocate one anyway as a matter of
> > convenience."  If the concern is space savings, Arrow is already an
> > extremely uncompressed format.  (Compression is something I would also
> > consider to be scope creep as regards Feather... compressed filesystems
> can
> > be employed and there are other compressed dataframe formats.)  However,
> if
> > protecting the 4 bytes required to update NumRows turns out to be no
> easier
> > than protecting all of the statistical bytes as well a

[jira] [Created] (ARROW-5278) [C#] ArrowBuffer should either implement IEquatable correctly or not at all

2019-05-07 Thread Eric Erhardt (JIRA)
Eric Erhardt created ARROW-5278:
---

 Summary: [C#] ArrowBuffer should either implement IEquatable 
correctly or not at all
 Key: ARROW-5278
 URL: https://issues.apache.org/jira/browse/ARROW-5278
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Eric Erhardt


See the discussion 
[here|https://github.com/apache/arrow/pull/3925/#discussion_r281378027].

ArrowBuffer currently implements IEquatable, but doesn't override `GetHashCode`.

We should either implement IEquatable correctly by overriding Equals and
GetHashCode, or remove IEquatable altogether.

Looking at ArrowBuffer's [Equals 
implementation|https://github.com/apache/arrow/blob/08829248fd540b7e3bd96b980e357f8a4db7970e/csharp/src/Apache.Arrow/ArrowBuffer.cs#L66-L69],
 it compares each value in the buffer, which is not very efficient. Also, this 
implementation is not consistent with how `Memory` implements IEquatable - 
[https://source.dot.net/#System.Private.CoreLib/shared/System/Memory.cs,500].

If we continue implementing IEquatable on ArrowBuffer, we should consider 
implementing it in the same fashion as Memory does.





Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread Wes McKinney
hi John,

On Tue, May 7, 2019 at 10:53 AM John Muehlhausen  wrote:
>
> Wes et al, I completed a preliminary study of populating a Feather file
> incrementally.  Some notes and questions:
>
> I wrote the following dataframe to a feather file:
> ab
> 0  0123456789  0.0
> 1  0123456789  NaN
> 2  0123456789  NaN
> 3  0123456789  NaN
> 4None  NaN
>
> In re-writing the flatbuffers metadata (flatc -p doesn't
> support --gen-mutable! yuck! C++ to the rescue...), it seems that
> read_feather is not affected by NumRows?  It seems to be driven entirely by
> the per-column Length values?
>
> Also, it seems as if one of the primary usages of NullCount is to determine
> whether or not a bitfield is present?  In the initialization data above I
> was careful to have a null value in each column in order to generate a
> bitfield.

Per my prior e-mails, the current Feather format is deprecated, so I'm
only willing to engage on a discussion of the official Arrow binary
protocol that we use for IPC (memory mapping) and RPC (Flight).

>
> I then wiped the bitfields in the file and set all of the string indices to
> one past the end of the blob buffer (all strings empty):
>   a   b
> 0  None NaN
> 1  None NaN
> 2  None NaN
> 3  None NaN
> 4  None NaN
>
> I then set the first record to some data by consuming some of the string
> blob and row 0 and 1 indices, also setting the double:
>
>ab
> 0  Hello, world!  5.0
> 1   None  NaN
> 2   None  NaN
> 3   None  NaN
> 4   None  NaN
>
> As mentioned above, NumRows seems to be ignored.  I tried adjusting each
> column Length to mask off higher rows and it worked for 4 (hide last row)
> but not for less than 4.  I take this to be due to math used to figure out
> where the buffers are relative to one another since there is only one
> metadata offset for all of: the (optional) bitset, index column and (if
> string) blobs.
>
> Populating subsequent rows would proceed in a similar way until all of the
> blob storage has been consumed, which may come before the pre-allocated
> rows have been consumed.
>
> So what does this mean for my desire to incrementally write these
> (potentially memory-mapped) pre-allocated files and/or Arrow buffers in
> memory?  Some thoughts:
>
> - If a single value (such as NumRows) were consulted to determine the table
> row dimension then updating this single value would serve to tell a reader
> which rows are relevant.  Assuming this value is updated after all other
> mutations are complete, and assuming that mutations are only appends
> (addition of rows), then concurrency control involves only ensuring that
> this value is not examined while it is being written.
>
> - NullCount presents a concurrency problem if someone reads the file after
> this field has been updated, but before NumRows has exposed the new record
> (or vice versa).  The idea previously mentioned that there will "likely
> [be] more statistics in the future" feels like it might be scope creep to
> me?  This is a data representation, not a calculation framework?  If
> NullCount had its genesis in the optional nature of the bitfield, I would
> suggest that perhaps NullCount can be dropped in favor of always supplying
> the bitfield, which in any event is already contemplated by the spec:
> "Implementations may choose to always allocate one anyway as a matter of
> convenience."  If the concern is space savings, Arrow is already an
> extremely uncompressed format.  (Compression is something I would also
> consider to be scope creep as regards Feather... compressed filesystems can
> be employed and there are other compressed dataframe formats.)  However, if
> protecting the 4 bytes required to update NumRows turns out to be no easier
> than protecting all of the statistical bytes as well as part of the same
> "critical section" (locks: yuck!!) then statistics pose no issue.  I have a
> feeling that the availability of an atomic write of 4 bytes will depend on
> the storage mechanism... memory vs memory map vs write() etc.
>
> - The elephant in the room appears to be the presumptive binary yes/no on
> mutability of Arrow buffers.  Perhaps the thought is that certain batch
> processes will be wrecked if anyone anywhere is mutating buffers in any
> way.  But keep in mind I am not proposing general mutability, only
> appending of new data.  *A huge amount of batch processing that will take
> place with Arrow is on time-series data (whether financial or otherwise).
> It is only natural that architects will want the minimal impedance mismatch
> when it comes time to grow those time series as the events occur going
> forward.*  So rather than say that I want "mutable" Arrow buffers, I would
> pitch this as a call for "immutable populated areas" of Arrow buffers
> combined with the concept that the populated area can grow up to whatever
> was preallocated.  This will not affect anyone who has "memoized" a
> dimension and wants to 

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread John Muehlhausen
Wes et al, I completed a preliminary study of populating a Feather file
incrementally.  Some notes and questions:

I wrote the following dataframe to a feather file:
            a    b
0  0123456789  0.0
1  0123456789  NaN
2  0123456789  NaN
3  0123456789  NaN
4        None  NaN

In re-writing the flatbuffers metadata (flatc -p doesn't
support --gen-mutable! yuck! C++ to the rescue...), it seems that
read_feather is not affected by NumRows?  It seems to be driven entirely by
the per-column Length values?

Also, it seems as if one of the primary usages of NullCount is to determine
whether or not a bitfield is present?  In the initialization data above I
was careful to have a null value in each column in order to generate a
bitfield.

I then wiped the bitfields in the file and set all of the string indices to
one past the end of the blob buffer (all strings empty):
  a   b
0  None NaN
1  None NaN
2  None NaN
3  None NaN
4  None NaN

I then set the first record to some data by consuming some of the string
blob and row 0 and 1 indices, also setting the double:

               a    b
0  Hello, world!  5.0
1           None  NaN
2           None  NaN
3           None  NaN
4           None  NaN

As mentioned above, NumRows seems to be ignored.  I tried adjusting each
column Length to mask off higher rows and it worked for 4 (hide last row)
but not for less than 4.  I take this to be due to math used to figure out
where the buffers are relative to one another since there is only one
metadata offset for all of: the (optional) bitset, index column and (if
string) blobs.

Populating subsequent rows would proceed in a similar way until all of the
blob storage has been consumed, which may come before the pre-allocated
rows have been consumed.

So what does this mean for my desire to incrementally write these
(potentially memory-mapped) pre-allocated files and/or Arrow buffers in
memory?  Some thoughts:

- If a single value (such as NumRows) were consulted to determine the table
row dimension then updating this single value would serve to tell a reader
which rows are relevant.  Assuming this value is updated after all other
mutations are complete, and assuming that mutations are only appends
(addition of rows), then concurrency control involves only ensuring that
this value is not examined while it is being written.

- NullCount presents a concurrency problem if someone reads the file after
this field has been updated, but before NumRows has exposed the new record
(or vice versa).  The idea previously mentioned that there will "likely
[be] more statistics in the future" feels like it might be scope creep to
me?  This is a data representation, not a calculation framework?  If
NullCount had its genesis in the optional nature of the bitfield, I would
suggest that perhaps NullCount can be dropped in favor of always supplying
the bitfield, which in any event is already contemplated by the spec:
"Implementations may choose to always allocate one anyway as a matter of
convenience."  If the concern is space savings, Arrow is already an
extremely uncompressed format.  (Compression is something I would also
consider to be scope creep as regards Feather... compressed filesystems can
be employed and there are other compressed dataframe formats.)  However, if
protecting the 4 bytes required to update NumRows turns out to be no easier
than protecting all of the statistical bytes as well as part of the same
"critical section" (locks: yuck!!) then statistics pose no issue.  I have a
feeling that the availability of an atomic write of 4 bytes will depend on
the storage mechanism... memory vs memory map vs write() etc.
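
For concreteness, a minimal sketch of this single-field publish/consume idea
(the assumptions are spelled out in the comments; none of it is existing
Arrow API):

  #include <atomic>
  #include <cstdint>

  // Sketch only: newly appended rows are "published" by updating a single
  // 4-byte row-count field last, with release/acquire ordering.  Assumes
  // the count lives at a fixed offset in a memory-mapped, pre-allocated
  // block and that treating that slot as a std::atomic<uint32_t> is valid
  // on the target platform -- an assumption, not something the Arrow
  // libraries provide today.

  // Writer: fill the column buffers for rows [old_count, new_count) first,
  // then publish the new count.
  void PublishRowCount(void* mapped_count_slot, uint32_t new_count) {
    auto* count = reinterpret_cast<std::atomic<uint32_t>*>(mapped_count_slot);
    count->store(new_count, std::memory_order_release);
  }

  // Reader: read the count once, then only touch rows below it.
  uint32_t VisibleRowCount(const void* mapped_count_slot) {
    auto* count =
        reinterpret_cast<const std::atomic<uint32_t>*>(mapped_count_slot);
    return count->load(std::memory_order_acquire);
  }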

- The elephant in the room appears to be the presumptive binary yes/no on
mutability of Arrow buffers.  Perhaps the thought is that certain batch
processes will be wrecked if anyone anywhere is mutating buffers in any
way.  But keep in mind I am not proposing general mutability, only
appending of new data.  *A huge amount of batch processing that will take
place with Arrow is on time-series data (whether financial or otherwise).
It is only natural that architects will want the minimal impedance mismatch
when it comes time to grow those time series as the events occur going
forward.*  So rather than say that I want "mutable" Arrow buffers, I would
pitch this as a call for "immutable populated areas" of Arrow buffers
combined with the concept that the populated area can grow up to whatever
was preallocated.  This will not affect anyone who has "memoized" a
dimension and wants to continue to consider the then-current data as
immutable... it will be immutable and will always be immutable according to
that then-current dimension.

Thanks in advance for considering this feedback!  I absolutely require
efficient row-wise growth of an Arrow-like buffer to deal with time series
data in near real time.  I believe that preallocation is (by far) the most
efficient way to accomplish this.  I hope to be able to use Arrow!  If I
cann

[jira] [Created] (ARROW-5277) [C#] MemoryAllocator.Allocate(length: 0) should not return null

2019-05-07 Thread Eric Erhardt (JIRA)
Eric Erhardt created ARROW-5277:
---

 Summary: [C#] MemoryAllocator.Allocate(length: 0) should not 
return null
 Key: ARROW-5277
 URL: https://issues.apache.org/jira/browse/ARROW-5277
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#
Reporter: Eric Erhardt


See the conversation 
[here|https://github.com/apache/arrow/pull/3925#discussion_r281187184].

We should change MemoryAllocator to not return `null` when the requested memory 
length is `0`. Instead, we should create a cached "NullObject" IMemoryOwner 
that has a no-op `Dispose` method, and always returns `Memory.Empty`.

This way consuming code doesn't need to check for `null` being returned from 
MemoryAllocator.Allocate.





[jira] [Created] (ARROW-5276) [C#] NativeMemoryAllocator expose an option for clearing allocated memory

2019-05-07 Thread Eric Erhardt (JIRA)
Eric Erhardt created ARROW-5276:
---

 Summary: [C#] NativeMemoryAllocator expose an option for clearing 
allocated memory
 Key: ARROW-5276
 URL: https://issues.apache.org/jira/browse/ARROW-5276
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#
Reporter: Eric Erhardt


See the discussion 
[here|https://github.com/apache/arrow/pull/3925#discussion_r281192698].

We should expose an option on NativeMemoryAllocator for controlling whether the 
allocated memory is cleared or not.

Maybe we should make the default not to clear the memory, so that it performs
best by default.





[jira] [Created] (ARROW-5275) [C++] Write generic filesystem tests

2019-05-07 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5275:
-

 Summary: [C++] Write generic filesystem tests
 Key: ARROW-5275
 URL: https://issues.apache.org/jira/browse/ARROW-5275
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


We need a suite of implementation-agnostic tests for filesystem 
implementations, to make it easy to validate each implementation against the 
expected semantics.
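
Illustration only (not part of the issue): one way to get
implementation-agnostic coverage is a typed test suite instantiated once per
filesystem implementation. The filesystem classes below are placeholders for
whatever concrete implementations end up existing.

{code:cpp}
#include <gtest/gtest.h>

// Placeholder implementations; the real filesystem classes would go here.
struct LocalFileSystemUnderTest {};
struct InMemoryFileSystemUnderTest {};

// One fixture, parameterized by the concrete filesystem type.
template <typename FileSystemType>
class GenericFileSystemTest : public ::testing::Test {
 protected:
  FileSystemType fs_;
};

// Every implementation listed here runs the same expected-semantics tests.
using Implementations =
    ::testing::Types<LocalFileSystemUnderTest, InMemoryFileSystemUnderTest>;
TYPED_TEST_CASE(GenericFileSystemTest, Implementations);

TYPED_TEST(GenericFileSystemTest, CreateDirThenStat) {
  // Assert behavior every implementation must share, e.g. that a directory
  // created through the API is visible when stat-ing it afterwards.
  SUCCEED();
}
{code}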


