Job posting

2018-06-27 Thread Ryan Murray
Hey All,

Apologies ahead of time for the potential spam. I have been working with
Jacques and co. for about six months to bring Dremio to our front office
trading organisation at UBS. We are now expanding the team, and I am looking
to hire some killer devs in London to work on Arrow, Gandiva, and other
cutting-edge toys.

For my money this is the most exciting team in any investment bank, and it's
the only one I know of where one can actually contribute both to open
source projects and to a front office trading system. Here[1] is the link,
or message me for details.

Looking forward to showing off some of our work on this list very soon.

Best,
Ryan

[1]
https://jobs.ubs.com/TGnewUI/Search/home/HomeWithPreLoad?PageType=JobDetails&jobId=176009&PartnerId=25008&SiteId=5012&JobReqLang=1&JobSiteId=5012&JobSiteInfo=176009_5012&phid=88648&codes=ILINKEDCH&nonloginid=#jobDetails=176009_5012


[jira] [Created] (ARROW-2758) [Plasma] Use Scope enum in Plasma

2018-06-27 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2758:
-

 Summary: [Plasma] Use Scope enum in Plasma
 Key: ARROW-2758
 URL: https://issues.apache.org/jira/browse/ARROW-2758
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Philipp Moritz
 Fix For: 0.10.0


Modernize our usage of enums in Plasma:
 # Add the "--scoped-enums" option to the FlatBuffers compiler (flatc) invocation.
 # Change the old-style C++ enums to C++11 scoped enums.
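
For context, a minimal sketch of the difference between the two enum styles
(illustrative C++ only; the real definitions are generated by flatc from the
Plasma FlatBuffers schema, and the names here are hypothetical):

```
// Old-style (unscoped) enum: enumerators leak into the enclosing scope
// and convert implicitly to int.
enum PlasmaErrorOld {
  PLASMA_ERROR_OK = 0,
  PLASMA_ERROR_OBJECT_EXISTS = 1,
};

// C++11 scoped enum, as emitted with flatc's --scoped-enums flag:
// enumerators are qualified by the type and do not convert implicitly.
enum class PlasmaError : int {
  OK = 0,
  ObjectExists = 1,
};

int main() {
  PlasmaError e = PlasmaError::ObjectExists;  // must qualify with the type
  // int i = e;               // would not compile: no implicit conversion
  return e == PlasmaError::ObjectExists ? 0 : 1;
}
```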



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2757) [Plasma] Huge pages test failing

2018-06-27 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2757:
-

 Summary: [Plasma] Huge pages test failing
 Key: ARROW-2757
 URL: https://issues.apache.org/jira/browse/ARROW-2757
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


See

```
=== FAILURES ===
_ test_use_huge_pages _

@pytest.mark.skipif(not os.path.exists("/mnt/hugepages"),
                    reason="requires hugepage support")
def test_use_huge_pages():
    import pyarrow.plasma as plasma
    with plasma.start_plasma_store(
            plasma_store_memory=DEFAULT_PLASMA_STORE_MEMORY,
            plasma_directory="/mnt/hugepages",
            use_hugepages=True) as (plasma_store_name, p):
        plasma_client = plasma.connect(plasma_store_name, "", 64)
>       create_object(plasma_client, 1)

pyarrow/tests/test_plasma.py:773:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/tests/test_plasma.py:79: in create_object
    seal=seal)
pyarrow/tests/test_plasma.py:68: in create_object_with_id
    memory_buffer = client.create(object_id, data_size, metadata)
pyarrow/_plasma.pyx:300: in pyarrow._plasma.PlasmaClient.create
    check_status(self.client.get().Create(object_id.data, data_size,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   raise PlasmaStoreFull(message)
E   PlasmaStoreFull: /home/travis/build/apache/arrow/cpp/src/plasma/client.cc:375
    code: ReadCreateReply(buffer.data(), buffer.size(), &id, &object,
    &store_fd, &mmap_size)
E   object does not fit in the plasma store
```

This test seems to be failing consistently since
https://github.com/apache/arrow/pull/2062 (which is unrelated).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[DISCUSS] Joda Time -> Java8 Time

2018-06-27 Thread Li Jin
Hi,

There has been a recent PR (2171) for ARROW-2015 to replace Joda Time with
Java 8 time (java.time).

I think this change is good as we move toward newer Java versions, and I
wonder whether we should include it in the 0.10 release.

The biggest concern is that this is a breaking change and could impact
downstream projects like Dremio. What do people think?

Li


Re: Arrow sync at 12pm Eastern today

2018-06-27 Thread Phillip Cloud
I won't be able to make it today.

On Wed, Jun 27, 2018 at 10:48 AM Wes McKinney  wrote:

> https://meet.google.com/vtm-teks-phx
>


Re: Some initial GPU questions

2018-06-27 Thread Anthony Scopatz
Hello Wes, Antoine,

Thanks for your very detailed responses!

It is really good to know that what is in arrow/gpu now is already set up to
integrate with various GPU producers / consumers.

The other responses made sense (assume in-memory and rely on orchestration,
explicit over implicit, roadmap discussions on Confluence, integrating CIs).

Thanks again!
Be Well
Anthony

On Tue, Jun 26, 2018 at 1:04 PM Wes McKinney  wrote:

> hi Anthony,
>
> Antoine is right that a Device abstraction is needed. I hadn't seen
> ARROW-2447 (I was on vacation in April) but I will comment there.
>
> It would be helpful to collect more requirements from GPU users -- one
> of the reasons that I set up the arrow/gpu project to begin with was
> to help catalyze collaborations with the GPU community. Unfortunately,
> that hasn't really happened yet after nearly a year, so hopefully we
> can get more folks involved in the near future.
>
> Some answers to your questions inline:
>
> On Tue, Jun 26, 2018 at 11:55 AM, Anthony Scopatz 
> wrote:
> > Hello All,
> >
> > As some of you may know, a few of us at Quansight (in partnership with
> > NVIDIA) have started looking at Arrow's GPU capabilities.
> > We are excited to help improve and expand Arrow's GPU support, but we did
> > have a few initial scoping questions.
> >
> > Feel free to break these out into separate discussion threads if needed.
> > Hopefully, some of them will be easy enough to answer.
> >
> >
> >1. What is the status of the GPU code in arrow now? E.g.
> >https://github.com/apache/arrow/tree/master/cpp/src/arrow/gpu
> >Is anyone actively working on this part of the code base? Are there
> >other folks working on GPU support? I'd love to chat, if so!
>
> The code there is basically waiting for one or more stakeholder users
> to get involved and help drive the roadmap. What is there now is
> pretty basic.
>
> To give you some context, I observed that some parts of this project
> (IPC / data structure reconstruction on GPUs) were being reimplemented
> in https://github.com/gpuopenanalytics/libgdf. So I started by setting
> up basic abstractions to plug the CUDA driver API into Arrow's various
> abstract interfaces for memory management and IO. I then implemented
> GPU-specialized IPC read and write functions so that these code paths
> in arrow/ipc can function without having the data be addressable in
> CPU memory. See the GPU IPC unit tests here:
>
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/gpu/cuda-test.cc#L311
>
> I contributed some patches to MapD and hoped to rework more of their
> Arrow interop to use these functions, but didn't get 100% the way
> there last winter.
>
> With MapD, libgdf, BlazingDB, and other current and future GPU Arrow
> producers and consumers, I think there's plenty of components like
> these that it would make sense to develop here.
>
> >2. Should Arrow compute assume that everything fits in memory? Arrow
> >seems to handle data that is larger than memory via the Buffer API. Are
> >there restrictions implied by using Buffers that we should be aware
> >of?
>
> This is a large question. Many database systems work on
> larger-than-memory datasets by splitting the problem into fragments
> that do fit into memory. I think it would be reasonable to start by
> developing computational functions that operate on in-memory data,
> then leaving it up to a task scheduler implementation to orchestrate
> an algorithm on larger-than-memory datasets. This is similar to how
> Dask has used pandas to work in an out-of-core and distributed
> setting.
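
As a minimal sketch of this fragment-at-a-time pattern (illustrative C++
only; the kernel and driver here are hypothetical, not Arrow APIs):

```
#include <cstdint>
#include <numeric>
#include <vector>

// An in-memory kernel: sums one fragment that fits in RAM.
int64_t SumFragment(const std::vector<int32_t>& fragment) {
  return std::accumulate(fragment.begin(), fragment.end(), int64_t{0});
}

// A stand-in for the task scheduler: feeds fragments of a
// larger-than-memory dataset through the kernel one at a time (in a
// real system each fragment would be loaded, processed, and released).
int64_t SumDataset(const std::vector<std::vector<int32_t>>& fragments) {
  int64_t total = 0;
  for (const auto& fragment : fragments) {
    total += SumFragment(fragment);
  }
  return total;
}

int main() {
  std::vector<std::vector<int32_t>> fragments = {{1, 2, 3}, {4, 5}};
  return SumDataset(fragments) == 15 ? 0 : 1;
}
```
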
>
> >3. What is the imagined interface between pyarrow and a GPU DataFrame?
> >One idea is to make the selection of main memory vs. the GPU totally
> >transparent to the user. Another possible suggestion is to be explicit
> >to the user about where the data lives, for example:
> >
> >>>> import pyarrow as pa
> >>>> a = pa.array(..., type=...) # create pyarrow array instance
> >>>> a_g = a.to_gpu() # send `a` to GPU
> >>>> def foo(a): ... return ... # a function doing operations with `a`
> >>>> r = foo(a) # perform operations with `a`, runs on CPU
> >>>> r_g = foo(a_g) # perform operations with `a_g`, runs on GPU
> >>>> assert r == r_g.to_mem() # results are the same
>
> Data frames are kind of a semantic construct. As an example, pandas
> utilizes data structures and a mix of low-level algorithms that run
> against NumPy arrays to define the semantics for what is a "pandas
> DataFrame". But, since the Arrow columnar format was born from the
> needs of analytic database systems and in-memory analytics systems
> like pandas, we've captured more of the semantics of data frames than
> in a generic array computing library.
>
> In the case of Arrow, we have strived to be "front end agnostic", so
> if the objective is to develop front ends for Python programmers, then
> our goal would be to provide within pyarrow the

Arrow sync at 12pm Eastern today

2018-06-27 Thread Wes McKinney
https://meet.google.com/vtm-teks-phx


[jira] [Created] (ARROW-2756) [Python] Reduce redundant imports and minor fixes in parquet tests

2018-06-27 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-2756:
--

 Summary: [Python] Reduce redundant imports and minor fixes in 
parquet tests
 Key: ARROW-2756
 URL: https://issues.apache.org/jira/browse/ARROW-2756
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2755) [Python] Allow using Ninja to build extension

2018-06-27 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-2755:
--

 Summary: [Python] Allow using Ninja to build extension
 Key: ARROW-2755
 URL: https://issues.apache.org/jira/browse/ARROW-2755
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Developing a standard memory layout for in-memory records / "row-oriented" data

2018-06-27 Thread Wes McKinney
> I'm not sure this makes sense as an external stable api. I definitely think 
> it is useful as an internal representation for use within a particular 
> algorithm. I also think that can be informed by the particular algorithm that 
> you're working on.

I agree that this is definitely needed in certain algorithms, e.g.
certain types of hashing. And the memory layout that's best for a
given algorithm might change. Since we have a number of support
structures already in place for columnar data, like dictionary
encoding, it would be easier to implement these things for
row-oriented data.

I think the question is really about open standards. Our original
focus when we started the project was to develop an open standard for
columnar data. It seems valuable to have one for row-oriented data.
Given how many systems have developed their own internal formats, it
seems like an inevitability. This raises the questions: if it does not
happen here, then where? And if not now, then when?

That being said, it's hard to say how feasible the project would be
until we gather more requirements and non-requirements.

On Wed, Jun 27, 2018 at 3:20 AM, Siddharth Teotia  wrote:
> I am wondering if this can be considered an opportunity to implement
> support in Arrow for building high-performance in-memory row stores for
> low-latency, high-throughput key-based queries. In other words, we can
> design the in-memory record format with efficient RDMA reads as one of
> the goals too. Consider two data structures in memory -- a hash table and
> a row store comprising records in the Arrow row format. The hash table
> points into the row store, and information can be read from both data
> structures without interrupting the CPU on the server. This client-server
> code path could also be incorporated into Arrow Flight.
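
A minimal sketch of the two-structure design described above (illustrative
C++ only; the record layout and index are hypothetical, not an Arrow API):

```
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical fixed-width record in a contiguous row store. With a
// fixed record size, a reader that knows the base address and a row
// offset can fetch a record in a single read -- e.g. one RDMA read.
struct Record {
  int64_t key;
  double value;
};

int main() {
  std::vector<Record> row_store;                   // contiguous records
  std::unordered_map<int64_t, std::size_t> index;  // key -> row position

  row_store.push_back({42, 3.14});
  index[42] = row_store.size() - 1;

  // A key-based lookup touches the index, then one record; over RDMA,
  // both reads could bypass the server's CPU entirely.
  const Record& r = row_store[index.at(42)];
  return r.key == 42 ? 0 : 1;
}
```
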
>
> On Tue, Jun 26, 2018 at 7:49 PM, Jacques Nadeau  wrote:
>
>> I'm not sure this makes sense as an external stable api. I definitely think
>> it is useful as an internal representation for use within a particular
>> algorithm. I also think that can be informed by the particular algorithm
>> that you're working on.
>>
>> We definitely had this requirement in Dremio and came up with an internal
>> representation that we are happy with for the use in hash tables. I'll try
>> to dig up the design docs we had around this but the actual
>> pivoting/unpivoting code that we developed can be seen here: [1], [2].
>>
>> Our main model is two blocks: a fixed width block and a variable width
>> block (with the fixed width block also carrying address & length of the
>> variable data). Fixed width is randomly accessible and variable width is
>> randomly accessible through fixed width.
>>
>> [1]
>> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java
>> [2]
>> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Unpivots.java
>>
>> On Tue, Jun 26, 2018 at 10:20 AM, Wes McKinney 
>> wrote:
>>
>> > hi Antoine,
>> >
>> > On Sun, Jun 24, 2018 at 1:06 PM, Antoine Pitrou 
>> > wrote:
>> > >
>> > > Hi Wes,
>> > >
>> > > Le 24/06/2018 à 08:24, Wes McKinney a écrit :
>> > >>
>> > >> If this sounds interesting to the community, I could help to kickstart
>> > >> a design process which would likely take a significant amount of time.
>> > >> The requirements could be complex (i.e. we might want to support
>> > >> variable-size record fields while also providing random access
>> > >> guarantees).
>> > >
>> > > What do you call "variable-sized" here? A scheme where the length of a
>> > > record's field is determined by the value of another field in the same
>> > > record?
>> >
>> > As an example, here is a fixed size record
>> >
>> > record foo {
>> >   a: int32;
>> >   b: float64;
>> >   c: uint8;
>> > }
>> >
>> > With padding suppose this is 16 bytes per record; so if we have a
>> > column of these, then random accessing any value in any record is
>> > simple.
>> >
>> > Here's a variable-length record:
>> >
>> > record bar {
>> >   a: string;
>> >   b: list;
>> > }
>> >
>> > What I've seen done to represent this in memory is to have a fixed
>> > size record followed by a sidecar containing the variable-length data,
>> > so the fixed size portion might look something like
>> >
>> > a_offset: int32;
>> > a_length: int32;
>> > b_offset: int32;
>> > b_length: int32;
>> >
>> > So from this, you can do random access into the record. If you wanted
>> > to do random access on a _column_ of such records, it is similar to
>> > our current variable-length Binary type. So it might be that the
>> > underlying Arrow memory layout would be FixedSizeBinary for fixed-size
>> > records and variable Binary for variable-size records.
>> >
>> > - Wes
>> >
>> > >
>> > >
>> > >
>> > > Regards
>> > >
>> > > Antoine.
>> >
>>
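
To make the fixed-size and variable-size record layouts Wes sketches above
concrete, here is a minimal illustration (hypothetical C++ structs, not an
Arrow API; field names follow his examples):

```
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Fixed-size record (like "record foo" above): with a known record
// width, element i of a column lives at base + i * sizeof(Foo).
struct Foo {
  int32_t a;
  double b;
  uint8_t c;  // padding makes sizeof(Foo) a fixed, known constant
};

// Variable-size record (like "record bar" above): a fixed-width header
// of (offset, length) pairs, followed by a sidecar holding the
// variable-length bytes, so the header stays randomly accessible.
struct BarHeader {
  int32_t a_offset;
  int32_t a_length;
  int32_t b_offset;
  int32_t b_length;
};

// Random access into a column of fixed-size records.
const Foo& GetFoo(const std::vector<Foo>& column, std::size_t i) {
  return column[i];
}

// Reading field "a" of one variable-size record via its header.
std::string GetBarA(const BarHeader& h, const std::vector<char>& sidecar) {
  return std::string(sidecar.data() + h.a_offset, h.a_length);
}

int main() {
  std::vector<Foo> column = {{1, 2.0, 3}, {4, 5.0, 6}};
  std::vector<char> sidecar = {'h', 'i'};
  BarHeader h{0, 2, 2, 0};  // "a" is bytes [0, 2); "b" is empty
  return (GetFoo(column, 1).a == 4 && GetBarA(h, sidecar) == "hi") ? 0 : 1;
}
```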


[jira] [Created] (ARROW-2754) [Python] When installing pyarrow via pip, a debug build is created

2018-06-27 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2754:
---

 Summary: [Python] When installing pyarrow via pip, a debug build 
is created
 Key: ARROW-2754
 URL: https://issues.apache.org/jira/browse/ARROW-2754
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.10.0


I noticed this in the log for https://github.com/apache/arrow/issues/2173. We
should probably change the default build type to {{release}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2753) [GLib] Add garrow_schema_*_field()

2018-06-27 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-2753:
---

 Summary: [GLib] Add garrow_schema_*_field()
 Key: ARROW-2753
 URL: https://issues.apache.org/jira/browse/ARROW-2753
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
 Fix For: 0.10.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Developing a standard memory layout for in-memory records / "row-oriented" data

2018-06-27 Thread Siddharth Teotia
I am wondering if this can be considered an opportunity to implement
support in Arrow for building high-performance in-memory row stores for
low-latency, high-throughput key-based queries. In other words, we can
design the in-memory record format with efficient RDMA reads as one of
the goals too. Consider two data structures in memory -- a hash table and
a row store comprising records in the Arrow row format. The hash table
points into the row store, and information can be read from both data
structures without interrupting the CPU on the server. This client-server
code path could also be incorporated into Arrow Flight.

On Tue, Jun 26, 2018 at 7:49 PM, Jacques Nadeau  wrote:

> I'm not sure this makes sense as an external stable api. I definitely think
> it is useful as an internal representation for use within a particular
> algorithm. I also think that can be informed by the particular algorithm
> that you're working on.
>
> We definitely had this requirement in Dremio and came up with an internal
> representation that we are happy with for the use in hash tables. I'll try
> to dig up the design docs we had around this but the actual
> pivoting/unpivoting code that we developed can be seen here: [1], [2].
>
> Our main model is two blocks: a fixed width block and a variable width
> block (with the fixed width block also carrying address & length of the
> variable data). Fixed width is randomly accessible and variable width is
> randomly accessible through fixed width.
>
> [1]
> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java
> [2]
> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Unpivots.java
>
> On Tue, Jun 26, 2018 at 10:20 AM, Wes McKinney 
> wrote:
>
> > hi Antoine,
> >
> > On Sun, Jun 24, 2018 at 1:06 PM, Antoine Pitrou 
> > wrote:
> > >
> > > Hi Wes,
> > >
> > > Le 24/06/2018 à 08:24, Wes McKinney a écrit :
> > >>
> > >> If this sounds interesting to the community, I could help to kickstart
> > >> a design process which would likely take a significant amount of time.
> > >> The requirements could be complex (i.e. we might want to support
> > >> variable-size record fields while also providing random access
> > >> guarantees).
> > >
> > > What do you call "variable-sized" here? A scheme where the length of a
> > > record's field is determined by the value of another field in the same
> > > record?
> >
> > As an example, here is a fixed size record
> >
> > record foo {
> >   a: int32;
> >   b: float64;
> >   c: uint8;
> > }
> >
> > With padding suppose this is 16 bytes per record; so if we have a
> > column of these, then random accessing any value in any record is
> > simple.
> >
> > Here's a variable-length record:
> >
> > record bar {
> >   a: string;
> >   b: list;
> > }
> >
> > What I've seen done to represent this in memory is to have a fixed
> > size record followed by a sidecar containing the variable-length data,
> > so the fixed size portion might look something like
> >
> > a_offset: int32;
> > a_length: int32;
> > b_offset: int32;
> > b_length: int32;
> >
> > So from this, you can do random access into the record. If you wanted
> > to do random access on a _column_ of such records, it is similar to
> > our current variable-length Binary type. So it might be that the
> > underlying Arrow memory layout would be FixedSizeBinary for fixed-size
> > records and variable Binary for variable-size records.
> >
> > - Wes
> >
> > >
> > >
> > >
> > > Regards
> > >
> > > Antoine.
> >
>


[jira] [Created] (ARROW-2752) [GLib] Document garrow_decimal_data_type_new()

2018-06-27 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-2752:
---

 Summary: [GLib] Document garrow_decimal_data_type_new()
 Key: ARROW-2752
 URL: https://issues.apache.org/jira/browse/ARROW-2752
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
 Fix For: 0.10.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)