[jira] [Created] (ARROW-2189) [C++] Seg. fault on make_shared

2018-02-19 Thread Rares Vernica (JIRA)
Rares Vernica created ARROW-2189:


 Summary: [C++] Seg. fault on make_shared
 Key: ARROW-2189
 URL: https://issues.apache.org/jira/browse/ARROW-2189
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.8.0
 Environment: Debian jessie in a Docker container
libarrow-dev 0.8.0-2 (Ubuntu trusty)

Reporter: Rares Vernica


When creating a {{PoolBuffer}}, I get a {{Segmentation fault}} when I use 
{{make_shared}}. If I use the {{shared_ptr}} constructor or {{reset}}, it works 
fine. Here is an example:
{code:java}
#include <arrow/api.h>

int main()
{
  arrow::MemoryPool* pool = arrow::default_memory_pool();

  arrow::Int64Builder builder(pool);
  builder.Append(1);

  // #1
  // std::shared_ptr<arrow::PoolBuffer> buffer(new arrow::PoolBuffer(pool));
  // #2
  // std::shared_ptr<arrow::PoolBuffer> buffer;
  // buffer.reset(new arrow::PoolBuffer(pool));
  // #3
  auto buffer = std::make_shared<arrow::PoolBuffer>(pool);
}
{code}
{code:java}
> g++-4.9 -std=c++11 -larrow foo.cpp && ./a.out
Segmentation fault (core dumped)
{code}
The example works fine with the {{#1}} or {{#2}} options. It also works if the 
builder is commented out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2188) [JS] Error on Travis-CI during gulp build

2018-02-19 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2188:


 Summary: [JS] Error on Travis-CI during gulp build
 Key: ARROW-2188
 URL: https://issues.apache.org/jira/browse/ARROW-2188
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: 0.8.0
Reporter: Phillip Cloud


Failing builds:

https://travis-ci.org/apache/arrow/jobs/343649349
https://travis-ci.org/apache/arrow/jobs/343649353

Error message:

{code}
Error: potentially unsafe regular expression: ^(?:(?!(?:[\[!*+?$^"'.\\/]+)).)+
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2187) RFC: Organize language implementations in a top-level lib/ directory

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2187:
---

 Summary: RFC: Organize language implementations in a top-level 
lib/ directory
 Key: ARROW-2187
 URL: https://issues.apache.org/jira/browse/ARROW-2187
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney


As we acquire more Arrow implementations, the number of top-level directories 
may grow significantly. We might consider nesting these implementations under a 
new top-level directory, similar to Apache Thrift: 
https://github.com/apache/thrift (see the "lib/" directory)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2186) [C++] Clean up architecture specific compiler flags

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2186:
---

 Summary: [C++] Clean up architecture specific compiler flags
 Key: ARROW-2186
 URL: https://issues.apache.org/jira/browse/ARROW-2186
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


I noticed that {{-maltivec}} is being passed to the compiler on Linux with an 
x86_64 processor. That seemed odd to me, and it prompted me to look more 
generally at our compiler flags related to hardware optimizations. We have the 
ability to pass {{-msse3}}, but there is an {{ARROW_USE_SSE}} option which is 
only used as a define in some headers. There is {{ARROW_ALTIVEC}}, but no 
option to pass {{-march}}. Nothing related to AVX/AVX2/AVX512. I think this 
could do with an overhaul.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2185) Remove CI directives from squashed commit messages

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2185:
---

 Summary: Remove CI directives from squashed commit messages
 Key: ARROW-2185
 URL: https://issues.apache.org/jira/browse/ARROW-2185
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney
 Fix For: 0.9.0


In our PR squash tool, we are potentially picking up CI directives like {{[skip 
appveyor]}} from intermediate commits. We should regex these away and instead 
use directives in the PR title if we wish the commit to master to behave in a 
certain way.
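
For illustration, a minimal sketch of the kind of filtering meant here (the 
directive list and function name are assumptions, not the actual merge-tool 
code):

{code:python}
import re

# Hypothetical directive list -- the real tool may recognize others.
CI_DIRECTIVE = re.compile(r"\[skip\s+(?:ci|appveyor|travis)\]", re.IGNORECASE)

def strip_ci_directives(message):
    """Drop CI directives from a squashed commit message."""
    cleaned = CI_DIRECTIVE.sub("", message)
    # Collapse the double spaces left behind by the removal.
    return re.sub(r" {2,}", " ", cleaned).strip()

# "ARROW-1234: Fix flaky test [skip appveyor]" -> "ARROW-1234: Fix flaky test"
{code}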



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: C++ OutputStream for both StdoutStream and FileOutputStream

2018-02-19 Thread Wes McKinney
hi Rares,

I agree this is a rough edge. I opened
https://issues.apache.org/jira/browse/ARROW-2184 so we can review and
be more consistent about using the base interfaces

For your use case I would recommend doing

std::shared_ptr<FileOutputStream> tmp;
RETURN_NOT_OK(FileOutputStream::Open(file_name, &tmp));
f = tmp;
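
Putting both branches together (a sketch only; this assumes the same
using-declarations and 0.8-era signatures as the snippets in this thread):

std::shared_ptr<OutputStream> f;
if (fn == "stdout") {
  f.reset(new StdoutStream());
} else {
  std::shared_ptr<FileOutputStream> tmp;
  RETURN_NOT_OK(FileOutputStream::Open(fn, false, &tmp));
  f = tmp;  // implicit upcast to the base OutputStream
}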

- Wes

On Mon, Feb 19, 2018 at 6:54 PM, Rares Vernica  wrote:
> Hi,
>
> This might be more a C++ question, but I'm trying to have one variable
> store the output stream for both StdoutStream and FileOutputStream. I do
> this:
>
> shared_ptr<OutputStream> f;
> if (fn == "stdout")
> f.reset(new StdoutStream());
> else
> FileOutputStream::Open(fn, false, &f);
>
> As is, the code does not work because Open expects
> shared_ptr<FileOutputStream>. If I do a cast:
>
> FileOutputStream::Open(fn, false,
> &(dynamic_pointer_cast<FileOutputStream>(f)));
>
> I get an error: taking address of temporary [-fpermissive]
>
> What would be a good way of having one variable for both branches of the if
> statement?
>
> Thanks!
> Rares


[jira] [Created] (ARROW-2184) [C++] Add static ctor for FileOutputStream returning shared_ptr to base OutputStream

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2184:
---

 Summary: [C++] Add static ctor for FileOutputStream returning 
shared_ptr to base OutputStream
 Key: ARROW-2184
 URL: https://issues.apache.org/jira/browse/ARROW-2184
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.9.0


It would be useful for most IO ctors to return pointers to the base interface 
that they implement rather than to the subclass. Whether we deprecate the 
current ones will vary on a case-by-case basis.
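
For instance, a free-function sketch of the idea (the helper name is 
hypothetical, not the committed API):

{code}
#include <arrow/io/file.h>
#include <arrow/status.h>

// Hypothetical helper: open a file for writing but hand back the base
// OutputStream interface rather than the concrete FileOutputStream.
arrow::Status OpenOutputStream(const std::string& path, bool append,
                               std::shared_ptr<arrow::io::OutputStream>* out) {
  std::shared_ptr<arrow::io::FileOutputStream> file;
  RETURN_NOT_OK(arrow::io::FileOutputStream::Open(path, append, &file));
  *out = file;  // implicit upcast to the base interface
  return arrow::Status::OK();
}
{code}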



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


C++ OutputStream for both StdoutStream and FileOutputStream

2018-02-19 Thread Rares Vernica
Hi,

This might be more a C++ question, but I'm trying to have one variable
store the output stream for both StdoutStream and FileOutputStream. I do
this:

shared_ptr<OutputStream> f;
if (fn == "stdout")
f.reset(new StdoutStream());
else
FileOutputStream::Open(fn, false, &f);

As is, the code does not work because Open expects
shared_ptr<FileOutputStream>. If I do a cast:

FileOutputStream::Open(fn, false,
&(dynamic_pointer_cast<FileOutputStream>(f)));

I get an error: taking address of temporary [-fpermissive]

What would be a good way of having one variable for both branches of the if
statement?

Thanks!
Rares


[jira] [Created] (ARROW-2183) [C++] Add helper CMake function for globbing the right header files

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2183:
---

 Summary: [C++] Add helper CMake function for globbing the right 
header files 
 Key: ARROW-2183
 URL: https://issues.apache.org/jira/browse/ARROW-2183
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney


Brought up by discussion in https://github.com/apache/arrow/pull/1631 on 
ARROW-2179. We should collect header files but not install the ones whose 
names match patterns reserved for non-public headers, like {{-internal}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2182) [Python] ASV benchmark setup does not account for C++ library changing

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2182:
---

 Summary: [Python] ASV benchmark setup does not account for C++ 
library changing
 Key: ARROW-2182
 URL: https://issues.apache.org/jira/browse/ARROW-2182
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


See https://github.com/apache/arrow/blob/master/python/README-benchmarks.md

Perhaps we could create a helper script that will run all the currently-defined 
benchmarks for a specific commit, and ensure that we are running against 
pristine, up-to-date release builds of Arrow (and any other dependencies, like 
parquet-cpp) at that commit? 

cc [~pitrou]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Merge multiple record batches

2018-02-19 Thread Wes McKinney
The function is just pyarrow.concat_tables. It's missing from the API
reference and ought to have a small section in the documentation.
Patches welcome.

https://issues.apache.org/jira/browse/ARROW-2181
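
For example, a minimal sketch (assuming the 0.8-era pa.open_stream used in
the question below):

import pyarrow as pa

# Read each stream fully, then concatenate; all inputs must share a schema.
tables = [pa.open_stream(f).read_all() for f in ['foo', 'bar']]
table_all = pa.concat_tables(tables)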

On Mon, Feb 19, 2018 at 5:04 PM, Bryan Cutler  wrote:
> Hi Rares,
>
> I'm not sure what version of Arrow you are using, but pyarrow.Table has a
> function to concat multiple tables together so the usage would be something
> like this:
>
> table_all = pa.Table.concat_tables([table1, table2])
>
> On Wed, Feb 14, 2018 at 4:01 AM, ALBERTO Bocchinfuso <
> alberto_boc...@hotmail.it> wrote:
>
>> Hi,
>> I don’t think I understood your point perfectly, but I’ll try to give you the
>> answer that looks simplest to me.
>> In your code there isn’t any operation on tables 1 and 2 separately; it
>> just looks like you want to merge all those RecordBatches.
>> Now I think that:
>>
>>   1.  you can use the to_batches() operation reported in the API for
>> Table, though I never tried it myself. In this way you create 2 tables, create
>> batches from these tables, and put the batches together.
>>   2.  I would rather store ALL the BATCHES from the two streams in the SAME
>> python LIST, and then create a single table using from_batches() as you
>> suggested. That’s because in your code you create two tables even though
>> you don’t seem to care about them.
>>
>> I didn’t try, but I think you can go both ways and then tell us if
>> the result is the same and if one of the two is faster than the other.
>>
>> Alberto
>>
>> From: Rares Vernica
>> Sent: Wednesday, February 14, 2018 05:13
>> To: dev@arrow.apache.org
>> Subject: Merge multiple record batches
>>
>> Hi,
>>
>> If I have multiple RecordBatchStreamReader inputs, what is the recommended
>> way to get all the RecordBatch from all the inputs together, maybe in a
>> Table? They all have the same schema. The source for the readers are
>> different files.
>>
>> So, I do something like:
>>
>> reader1 = pa.open_stream('foo')
>> table1 = reader1.read_all()
>>
>> reader2 = pa.open_stream('bar')
>> table2 = reader2.read_all()
>>
>> # table_all = ???
>> # OR maybe I don't need to create table1 and table2
>> # table_all = pa.Table.from_batches( ??? )
>>
>> Thanks!
>> Rares
>>
>>


[jira] [Created] (ARROW-2181) [Python] Add concat_tables to API reference, add documentation on use

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2181:
---

 Summary: [Python] Add concat_tables to API reference, add 
documentation on use
 Key: ARROW-2181
 URL: https://issues.apache.org/jira/browse/ARROW-2181
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


This omission of documentation was mentioned on the mailing list on February 
13. The documentation should illustrate the contrast between 
{{Table.from_batches}} and {{concat_tables}}.
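
A sketch of what the documentation could show (illustrative only, not the
current docs):

{code:python}
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ['x'])

# Table.from_batches: assemble a single Table from RecordBatch objects.
table1 = pa.Table.from_batches([batch])
table2 = pa.Table.from_batches([batch, batch])

# concat_tables: combine already-built Tables that share a schema.
table_all = pa.concat_tables([table1, table2])
{code}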



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2180) [C++] Remove APIs deprecated in 0.8.0 release

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2180:
---

 Summary: [C++] Remove APIs deprecated in 0.8.0 release
 Key: ARROW-2180
 URL: https://issues.apache.org/jira/browse/ARROW-2180
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.9.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Merge multiple record batches

2018-02-19 Thread Bryan Cutler
Hi Rares,

I'm not sure what version of Arrow you are using, but pyarrow.Table has a
function to concat multiple tables together so the usage would be something
like this:

table_all = pa.Table.concat_tables([table1, table2])

On Wed, Feb 14, 2018 at 4:01 AM, ALBERTO Bocchinfuso <
alberto_boc...@hotmail.it> wrote:

> Hi,
> I don’t think I understood your point perfectly, but I’ll try to give you the
> answer that looks simplest to me.
> In your code there isn’t any operation on tables 1 and 2 separately; it
> just looks like you want to merge all those RecordBatches.
> Now I think that:
>
>   1.  you can use the to_batches() operation reported in the API for
> Table, though I never tried it myself. In this way you create 2 tables, create
> batches from these tables, and put the batches together.
>   2.  I would rather store ALL the BATCHES from the two streams in the SAME
> python LIST, and then create a single table using from_batches() as you
> suggested. That’s because in your code you create two tables even though
> you don’t seem to care about them.
>
> I didn’t try, but I think you can go both ways and then tell us if
> the result is the same and if one of the two is faster than the other.
>
> Alberto
>
> From: Rares Vernica
> Sent: Wednesday, February 14, 2018 05:13
> To: dev@arrow.apache.org
> Subject: Merge multiple record batches
>
> Hi,
>
> If I have multiple RecordBatchStreamReader inputs, what is the recommended
> way to get all the RecordBatch from all the inputs together, maybe in a
> Table? They all have the same schema. The source for the readers are
> different files.
>
> So, I do something like:
>
> reader1 = pa.open_stream('foo')
> table1 = reader1.read_all()
>
> reader2 = pa.open_stream('bar')
> table2 = reader2.read_all()
>
> # table_all = ???
> # OR maybe I don't need to create table1 and table2
> # table_all = pa.Table.from_batches( ??? )
>
> Thanks!
> Rares
>
>


Re: problems with statically linked Boost

2018-02-19 Thread Wes McKinney
> How would you plan to update Boost even within the ecosystem?

Yep, maintainers have been automatically updating the version pins
across the ecosystem. Example pin:
https://github.com/conda-forge/arrow-cpp-feedstock/pull/44

Just to put this sentiment out there publicly, and it's not really an
"Arrow problem" -- it seems inevitable (on a 20 year horizon -- who
knows how long it will actually take) that we will see something like
conda-forge develop with the following features:

* Is essentially a monorepo build system for a hybrid OSS development
setup like you're describing
* Can be run on your own cloud infrastructure / VPC, with support
tooling for building in an air-gapped / no-internet environment
* Reproducible runtime environments a la conda
* Build open source dependencies and your closed source dependencies
using the same build toolchain. Updating a package version triggers
all downstream dependencies to rebuild
* Support for nightly / bleeding edge builds of any group of
interdependent packages, installable from a $FOO-nightly "channel" to
use "conda" lingo
* Any package can be configured to run unit tests as part of the
package validation process
* Non-hostile UX for developers -- conda-forge is rife with problems
and a routine time suck for all of us. For example, conducting a
cascade of interdependent changes is a nightmare (Arrow C++ -> Parquet
C++ -> Arrow Python)

Basically, this should be something like Bazel + Docker (with some
alternative solution for Windows users) + conda, with nice tooling and
UI. Or, to put it another way, a DIY, self-governed version of what
Anaconda Inc. and conda-forge are collectively providing to the
community.

I have long argued that centralized packaging systems are a liability
to OSS consumers ([1], [2]). But it's hard to build your own toolchain
from the bottom up. I don't see why it *needs* to be so hard.

LargeCos like Google have already solved this problem for themselves,
but their solutions are not easily accessible to most. It's great that
the build systems (like Bazel) are being open sourced, but there's a
lot more stuff that needs to get built to make things easier for shops
with less devops resources.

If someone wants to build this "Monorepo System of your Dreams",
please take this idea and run with it and you'll have me as a happy
customer someday. I'm not passionate enough about packaging and devops
alone to work on it personally, though the tooling is causing me and
so many people I know so much pain I'm not sure where to turn right
now.

- Wes

[1]: http://wesmckinney.com/blog/conda-forge-centos-moment/
[2]: http://wesmckinney.com/blog/the-problem-with-conda-forge-right-now/

On Sat, Feb 17, 2018 at 4:33 PM, Alex Samuel  wrote:
> Sorry, I think I wasn't clear.  I mean the broader issue of C++ extension
> code.  In principle it should be possible to mix C++ extension modules in
> the same Python process, at least in some cases, especially if they link
> their own dependencies statically.  While "just do it our way, use our
> tools" is fine for most cases, it might not be fine for some, for a wide
> variety of reasons.  If everyone could agree on C++ compilers and libraries,
> Linux distros would have standardized ages ago I suppose.  This isn't a
> conda-specific problem; however for conda it becomes a runtime problem
> rather than a build-time one.  But I concede it's a hard problem.  Mainly,
> what I'm suggesting is that the policies and use cases be explicit.  Thanks
> for clarifying.
>
> Regarding Boost particularly, all I meant was that the Boost dependency
> wasn't apparent in the usual ways, which made debugging problems harder.
>
> How would you plan to update Boost even within the ecosystem?  Bump the
> version in the toolchain, rebuild the world, and update all environments?
> Without explicit dependency, how do you prevent someone from running parquet
> statically linked to Boost N from running in an environment with
> boost-cpp==N+1 installed?
>
> Thanks,
> Alex
>
>
>
>
>
> On 02/17/2018 04:20 PM, Wes McKinney wrote:
>>>
>>> However, extension modules are always going to have to share the Python
>>> process, so this policy kind of says, you can't use external C++ extension
>>> code with conda.
>>
>>
>> This is a bit too extreme. What I meant is that you should try not to
>> mix C++ build toolchains. I think this is good advice even without
>> conda/conda-forge in the loop. If conda-forge were supplying the
>> library / build toolchain for the rest of your projects, then
>> everything would be OK.
>>
>>> Given the policy, it seems slightly better to link Boost dynamically.
>>
>>
>> We could do this, but it seems like a last resort workaround to the
>> core problem, which is the mixed build toolchain issue. I don't know
>> what Boost's ABI guarantees are, but dynamic linking isn't guaranteed
>> to solve using two libraries built against different versions of Boost
>> in the same process. The boost-cpp package is a pretty chunky runt

[jira] [Created] (ARROW-2179) [C++] arrow/util/io-util.h missing from libarrow-dev

2018-02-19 Thread Rares Vernica (JIRA)
Rares Vernica created ARROW-2179:


 Summary: [C++] arrow/util/io-util.h missing from libarrow-dev
 Key: ARROW-2179
 URL: https://issues.apache.org/jira/browse/ARROW-2179
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Rares Vernica


{{arrow/util/io-util.h}} is missing from the {{libarrow-dev}} package 
(ubuntu/trusty): 
{code:java}
> ls -1 /usr/include/arrow/util/
bit-stream-utils.h
bit-util.h
bpacking.h
compiler-util.h
compression.h
compression_brotli.h
compression_lz4.h
compression_snappy.h
compression_zlib.h
compression_zstd.h
cpu-info.h
decimal.h
hash-util.h
hash.h
key_value_metadata.h
logging.h
macros.h
parallel.h
rle-encoding.h
sse-util.h
stl.h
type_traits.h
variant
variant.h
visibility.h
{code}

{code:java}
> apt-cache show libarrow-dev
Package: libarrow-dev
Architecture: amd64
Version: 0.8.0-2
Multi-Arch: same
Priority: optional
Section: libdevel
Source: apache-arrow
Maintainer: Kouhei Sutou 
Installed-Size: 5696
Depends: libarrow0 (= 0.8.0-2)
Filename: pool/trusty/universe/a/apache-arrow/libarrow-dev_0.8.0-2_amd64.deb
Size: 602716
MD5sum: de5f2bfafd90ff29e4b192f4e5d26605
SHA1: e3d9146b30f07c07b62f8bdf9f779d0ee5d05a75
SHA256: 30a89b2ac6845998f22434e660b1a7c9d91dc8b2ba947e1f4333b3cf74c69982
SHA512: 
99f511bee6645a68708848a58b4eba669a2ec8c45fb411c56ed2c920d3ff34552c77821eff7e428c886d16e450bdd25cc4e67597972f77a4255f302a56d1eac8
Homepage: https://arrow.apache.org/
Description: Apache Arrow is a data processing library for analysis
 .
 This package provides header files.
Description-md5: e4855d5dbadacb872bf8c4ca67f624e3
{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Arrow JavaScript 0.3.0 - RC0

2018-02-19 Thread Wes McKinney
+1 (binding)

Ran dev/release/js-verify-release-candidate.sh with Node 9.2. Looks good

On Mon, Feb 19, 2018 at 3:54 PM, Wes McKinney  wrote:
> Hello all,
>
> I'd like to propose the 1st release candidate (rc0) of Apache
> Arrow JavaScript version 0.3.0.  This will be the second JavaScript
> release, made separately from the main project releases.
>
> The source release rc0 is hosted at [1].
>
> This release candidate is based on commit
> 7d992de1de7dd276eb9aeda349376e79b62da11c
>
> Please download, verify checksums and signatures, run the unit tests, and vote
> on the release. The easiest way is to use the JavaScript-specific release
> verification script dev/release/js-verify-release-candidate.sh.
>
> The vote will be open for at least 24 hours and will close once
> enough PMCs have approved the release.
>
> [ ] +1 Release this as Apache Arrow JavaScript 0.3.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow JavaScript 0.3.0 because...
>
> Thanks,
> Wes
>
> How to validate a release signature:
> https://httpd.apache.org/dev/verification.html
>
> [1]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.3.0-rc0/
> [2]: 
> https://github.com/apache/arrow/tree/7d992de1de7dd276eb9aeda349376e79b62da11c


[VOTE] Release Apache Arrow JavaScript 0.3.0 - RC0

2018-02-19 Thread Wes McKinney
Hello all,

I'd like to propose the 1st release candidate (rc0) of Apache
Arrow JavaScript version 0.3.0.  This will be the second JavaScript
release, made separately from the main project releases.

The source release rc0 is hosted at [1].

This release candidate is based on commit
7d992de1de7dd276eb9aeda349376e79b62da11c

Please download, verify checksums and signatures, run the unit tests, and vote
on the release. The easiest way is to use the JavaScript-specific release
verification script dev/release/js-verify-release-candidate.sh.

The vote will be open for at least 24 hours and will close once
enough PMCs have approved the release.

[ ] +1 Release this as Apache Arrow JavaScript 0.3.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow JavaScript 0.3.0 because...

Thanks,
Wes

How to validate a release signature:
https://httpd.apache.org/dev/verification.html

[1]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.3.0-rc0/
[2]: 
https://github.com/apache/arrow/tree/7d992de1de7dd276eb9aeda349376e79b62da11c


[jira] [Created] (ARROW-2178) [JS] Fix JS html FileReader example

2018-02-19 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2178:


 Summary:  [JS] Fix JS html FileReader example
 Key: ARROW-2178
 URL: https://issues.apache.org/jira/browse/ARROW-2178
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Paul Taylor
 Fix For: JS-0.3.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2177) [C++] Remove support for specifying negative scale values in DecimalType

2018-02-19 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2177:


 Summary: [C++] Remove support for specifying negative scale values 
in DecimalType
 Key: ARROW-2177
 URL: https://issues.apache.org/jira/browse/ARROW-2177
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Affects Versions: 0.8.0
Reporter: Phillip Cloud






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2176) [C++] Extend DictionaryBuilder to support delta dictionaries

2018-02-19 Thread Dimitri Vorona (JIRA)
Dimitri Vorona created ARROW-2176:
-

 Summary: [C++] Extend DictionaryBuilder to support delta 
dictionaries
 Key: ARROW-2176
 URL: https://issues.apache.org/jira/browse/ARROW-2176
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Dimitri Vorona
 Fix For: 0.9.0


[The IPC format|https://arrow.apache.org/docs/ipc.html] specifies the possibility 
of sending additional dictionary batches with a previously seen id and an 
isDelta flag to extend the existing dictionaries with new entries. Right now, 
the DictionaryBuilder (as well as the IPC writer and reader) does not support 
generation of delta dictionaries.

This pull request contains a basic implementation of the DictionaryBuilder with 
delta dictionary support. The usage API can be seen in the dictionary tests 
(e.g. 
[here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]).
 The basic idea is that the user just reuses the builder object after calling 
Finish(Array*) for the first time. Subsequent calls to Append will create new 
entries only for unseen elements and reuse ids from previous dictionaries for 
the seen ones.

Some considerations:
 # The API is pretty implicit; an additional flag for Finish which explicitly 
indicates a desire to use the builder for delta dictionary generation might be 
expedient from an error-avoidance point of view.
 # Right now the implementation uses an additional "overflow dictionary" to 
store the seen items. This adds a copy on each Finish call and an additional 
lookup at each GetItem or Append call. I assume we might get away with 
returning Array slices at Finish, which would remove the need for an additional 
overflow dictionary. If the gist of the PR is approved, I can look into further 
optimizations.

The Writer and Reader extensions would be pretty simple, since the 
DictionaryBuilder API remains basically the same.
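
Based on the description above, the intended usage looks roughly like this (a 
sketch against the PR's proposed API, not a committed interface):

{code}
// Sketch only -- API as proposed in the PR, subject to change.
arrow::DictionaryBuilder<arrow::StringType> builder(arrow::default_memory_pool());
std::shared_ptr<arrow::Array> dict1, dict2;

builder.Append("foo");
builder.Append("bar");
builder.Finish(&dict1);  // first dictionary batch: foo -> 0, bar -> 1

builder.Append("bar");   // already seen: reuses id 1
builder.Append("baz");   // unseen: assigned the next id
builder.Finish(&dict2);  // delta dictionary batch containing only "baz"
{code}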



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2175) [Python] arrow_ep build is triggering during parquet-cpp build in Travis CI

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2175:
---

 Summary: [Python] arrow_ep build is triggering during parquet-cpp 
build in Travis CI
 Key: ARROW-2175
 URL: https://issues.apache.org/jira/browse/ARROW-2175
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


See e.g. https://travis-ci.org/apache/arrow/jobs/342781531#L5546. This may be 
related to upstream changes in Parquet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2174) [JS] Export format and schema enums

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2174:
---

 Summary: [JS] Export format and schema enums
 Key: ARROW-2174
 URL: https://issues.apache.org/jira/browse/ARROW-2174
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Wes McKinney
Assignee: Paul Taylor
 Fix For: JS-0.3.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2173) [Python] NumPyBuffer destructor should hold the GIL

2018-02-19 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2173:
-

 Summary: [Python] NumPyBuffer destructor should hold the GIL
 Key: ARROW-2173
 URL: https://issues.apache.org/jira/browse/ARROW-2173
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Failure to hold the GIL can lead to crashes, depending on the presence of 
multiple threads and on whatever the object allocator needs to do.
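
For illustration, the standard pattern (a simplified sketch with a hypothetical 
class name, not the actual patch):

{code}
#include <Python.h>

// Simplified sketch: a buffer that owns a reference to a Python object.
// Its destructor may run on a non-Python thread, so it must take the GIL
// before touching the refcount.
class NumPyBufferSketch {
 public:
  explicit NumPyBufferSketch(PyObject* obj) : obj_(obj) { Py_XINCREF(obj_); }

  ~NumPyBufferSketch() {
    PyGILState_STATE state = PyGILState_Ensure();
    Py_XDECREF(obj_);
    PyGILState_Release(state);
  }

 private:
  PyObject* obj_;
};
{code}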



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2172) [Python] Incorrect conversion from Numpy array when stride % itemsize != 0

2018-02-19 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2172:
-

 Summary: [Python] Incorrect conversion from Numpy array when 
stride % itemsize != 0
 Key: ARROW-2172
 URL: https://issues.apache.org/jira/browse/ARROW-2172
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


In the example below, the input array has a stride (5 bytes, the record width) 
that's not a multiple of the itemsize (4 bytes for the {{x}} field):

{code:python}
>>> data = np.array([(42, True), (43, False)],
...                 dtype=[('x', np.int32), ('y', np.bool_)])
>>> data['x']
array([42, 43], dtype=int32)
>>> pa.array(data['x'], type=pa.int32())
<pyarrow.lib.Int32Array object at 0x...>
[
  42,
  11009
]
{code}

(11009 is {{0x2B01}}, which is consistent with the conversion advancing by the 
4-byte itemsize rather than the 5-byte stride: the second value is read at byte 
offset 4, picking up the {{True}} byte followed by the low bytes of 43.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2171) [Python] OwnedRef is fragile

2018-02-19 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2171:
-

 Summary: [Python] OwnedRef is fragile
 Key: ARROW-2171
 URL: https://issues.apache.org/jira/browse/ARROW-2171
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Some uses of OwnedRef can implicitly invoke its (default) copy constructor, 
which will lead to extraneous decrefs.
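
A simplified illustration of the hazard and the usual fix (sketch only, with a 
hypothetical class name; the real OwnedRef lives in the Python bridge code):

{code}
#include <Python.h>

class OwnedRefSketch {
 public:
  explicit OwnedRefSketch(PyObject* obj) : obj_(obj) {}
  ~OwnedRefSketch() { Py_XDECREF(obj_); }

  // Without these deletions, the implicit copy constructor duplicates
  // obj_ and the reference gets decref'd once per copy -- the bug above.
  OwnedRefSketch(const OwnedRefSketch&) = delete;
  OwnedRefSketch& operator=(const OwnedRefSketch&) = delete;

 private:
  PyObject* obj_;
};
{code}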



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)