Re: [Format] Semantics for dictionary batches in streams

2019-09-09 Thread Micah Kornfield
Yes, I opened a JIRA, I'm going to try to make a proposal that consolidates
all the recent dictionary discussions.

On Mon, Sep 9, 2019 at 12:21 PM Wes McKinney  wrote:

> hi Micah,
>
> I think we should formulate changes to format/Columnar.rst and have a
> vote, what do you think?
>
> On Thu, Aug 29, 2019 at 2:23 AM Micah Kornfield 
> wrote:
> >>
> >>
> >> > I was thinking the file format must satisfy one of two conditions:
> >> > 1.  Exactly one DictionaryBatch per encoded column
> >> > 2.  DictionaryBatches are interleaved correctly.
> >>
> >> Could you clarify?
> >
> > I think you clarified it very well :) My motivation for suggesting the
> additional complexity is I see two use-cases for the file format.  These
> roughly correspond with the two options I suggested:
> > 1.  We are encoding data from scratch.  In this case, it seems like all
> dictionaries would be built incrementally, not need replacement and we
> write them at the end of the file [1]
> >
> > 2.  The data being written out is essentially a "tee" off of some stream
> that is generating new dictionaries requiring replacement on the fly (i.e.
> reading back two parquet files).
> >
> >>  It might be better to disallow replacements
> >> in the file format (which does introduce semantic slippage between the
> >> file and stream formats as Antoine was saying).
> >
> > It is certainly possible to accept the slippage from the stream
> format for now and add this capability later, since it should be forward
> compatible.
> >
> > Thanks,
> > Micah
> >
> > [1] There is also a medium-complexity option where we require one
> non-delta dictionary and as many delta dictionaries as the user wants.
> >
> > On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney 
> wrote:
> >>
> >> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield 
> wrote:
> >> >
> >> > I was thinking the file format must satisfy one of two conditions:
> >> > 1.  Exactly one DictionaryBatch per encoded column
> >> > 2.  DictionaryBatches are interleaved correctly.
> >>
> >> Could you clarify? In the first case, there is no issue with
> >> dictionary replacements. I'm not sure about the second case -- if a
> >> dictionary id appears twice, then you'll see it twice in the file
> >> footer. I suppose you could look at the file offsets to determine
> >> whether a dictionary batch precedes a particular record batch block
> >> (to know which dictionary you should be using), but that's rather
> >> complicated to implement. It might be better to disallow replacements
> >> in the file format (which does introduce semantic slippage between the
> >> file and stream formats as Antoine was saying).
> >>
> >> >
> >> > On Tuesday, August 27, 2019, Wes McKinney 
> wrote:
> >> >
> >> > > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou 
> wrote:
> >> > > >
> >> > > >
> >> > > > On 27/08/2019 at 22:31, Wes McKinney wrote:
> >> > > > > So the current situation we have right now in C++ is that if we
> tried
> >> > > > > to create an IPC stream from a sequence of record batches that
> don't
> >> > > > > all have the same dictionary, we'd run into two scenarios:
> >> > > > >
> >> > > > > * Batches that either have a prefix of a prior-observed
> dictionary, or
> >> > > > > the prior dictionary is a prefix of their dictionary. For
> example,
> >> > > > > suppose that the dictionary sent for an id was ['A', 'B', 'C']
> and
> >> > > > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E'].
> In
> >> > > > > such case we could compute and send a delta batch
> >> > > > >
> >> > > > > * Batches with a dictionary that is a permutation of values, and
> >> > > > > possibly new unique values.
> >> > > > >
> >> > > > > In this latter case, without the option of replacing an
> existing ID in
> >> > > > > the stream, we would have to do a unification / permutation of
> indices
> >> > > > > and then also possibly send a delta batch. We should probably
> have
> >> > > > > code at some point that deals with both cases, but in the
> meantime I
> >> > > > > would like to allow dictionaries to be redefined in this case.
> Seems
> >> > > > > like we might need a vote to formalize this?
> >> > > >
> >> > > > Isn't the stream format deviating from the file format then?  In
> the
> >> > > > file format, IIUC, dictionaries can appear after the respective
> record
> >> > > > batches, so there's no way to tell whether the original or
> redefined
> >> > > > version of a dictionary is being referred to.
> >> > >
> >> > > You make a good point -- we can consider changes to the file format
> to
> >> > > allow for record batches to have different dictionaries. Even
> handling
> >> > > delta dictionaries with the current file format would be a bit
> tedious
> >> > > (though not indeterminate)
> >> > >
> >> > > > Regards
> >> > > >
> >> > > > Antoine.
> >> > >
>
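The prefix/delta and replacement cases discussed in this thread, plus the file-footer offset comparison Wes mentions, can be sketched in plain Python. This is illustrative only: lists stand in for dictionary arrays, and the function names are ours, not Arrow APIs.

```python
def plan_dictionary_update(prev, new):
    """Decide how to transmit a changed dictionary for an id that was
    already sent on the stream.

    Returns one of:
      ("none", [])       -- dictionary unchanged, nothing to send
      ("delta", suffix)  -- prev is a prefix of new; only the trailing
                            values need to be sent as a delta batch
      ("replace", new)   -- values were permuted or removed; the whole
                            dictionary must be re-sent as a replacement
    """
    if new == prev:
        return ("none", [])
    if len(new) > len(prev) and new[: len(prev)] == prev:
        return ("delta", new[len(prev):])
    return ("replace", list(new))


def resolve_dictionary(dictionary_blocks, batch_offset):
    """Given (file_offset, dictionary) pairs recorded in a file footer
    for a single dictionary id, pick the last dictionary written before
    the record batch at batch_offset -- the offset-comparison scheme
    described above for supporting replacements in the file format."""
    chosen = None
    for offset, dictionary in sorted(dictionary_blocks):
        if offset < batch_offset:
            chosen = dictionary
    return chosen
```

For example, sending ['A', 'B', 'C'] followed by ['A', 'B', 'C', 'D', 'E'] yields a delta of ['D', 'E'], while a permuted dictionary forces a full replacement.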


Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)

2019-09-09 Thread Bryan Cutler
Sounds good to me also and I don't think we need a vote either.

On Sat, Sep 7, 2019 at 7:36 PM Micah Kornfield 
wrote:

> +1 on this, I also don't think a vote is necessary as long as we make the
> change before 0.15.0
>
> On Saturday, September 7, 2019, Wes McKinney  wrote:
>
> > I see, thank you for catching this nuance.
> >
> > I agree that using {0xFFFFFFFF, 0x00000000} for EOS will resolve the
> > issue while allowing implementations to be backwards compatible (i.e.
> > handling the 4-byte EOS from older payloads).
> >
> > I'm not sure that we need to have a vote about this, what do others
> think?
> >
> > On Sat, Sep 7, 2019 at 12:47 AM Ji Liu 
> wrote:
> > >
> > > Hi all,
> > >
> > > During the Java code review [1], it seems there is a problem with the
> > > current implementations (C++/Java, etc.) when reaching EOS: the new
> > > format EOS is 8 bytes, but the reader only reads 4 bytes when it
> > > reaches the end of the stream, so the additional 4 bytes are never
> > > consumed, which causes problems for subsequent reads.
> > >
> > > There are two suggested options [2] below; we should reach consensus
> > > and fix this problem before the 0.15 release.
> > > i. For the new format, an 8-byte EOS token should look like
> > > {0xFFFFFFFF, 0x00000000}: we read the continuation token first, and
> > > then know to read the next 4 bytes, which are then 0 to signal EOS.
> > > ii. The reader just remembers the state, so if it read the
> > > continuation token at the beginning, it then reads all 8 bytes at the
> > > end.
> > >
> > > Thanks,
> > > Ji Liu
> > >
> > > [1] https://github.com/apache/arrow/pull/5229
> > > [2] https://github.com/apache/arrow/pull/5229#discussion_r321715682
> > >
> > >
> > >
> > >
> > > --
> > > From:Eric Erhardt 
> > > Send Time: Thursday, September 5, 2019 07:16
> > > To:dev@arrow.apache.org ; Ji Liu <
> > niki...@aliyun.com>
> > > Cc:emkornfield ; Paul Taylor <
> ptay...@apache.org>
> > > Subject:RE: [RESULT] [VOTE] Alter Arrow binary protocol to address
> > 8-byte Flatbuffer alignment requirements (2nd vote)
> > >
> > > The C# PR is up.
> > >
> > > https://github.com/apache/arrow/pull/5280
> > >
> > > Eric
> > >
> > > -Original Message-
> > > From: Eric Erhardt 
> > > Sent: Wednesday, September 4, 2019 10:12 AM
> > > To: dev@arrow.apache.org; Ji Liu 
> > > Cc: emkornfield ; Paul Taylor <
> ptay...@apache.org
> > >
> > > Subject: RE: [RESULT] [VOTE] Alter Arrow binary protocol to address
> > 8-byte Flatbuffer alignment requirements (2nd vote)
> > >
> > > I'm working on a PR for the C# bindings. I hope to have it up in the
> > next day or two. Integration tests for C# would be a great addition at
> some
> > point - it's been on my backlog. For now I plan on manually testing it.
> > >
> > > -Original Message-
> > > From: Wes McKinney 
> > > Sent: Tuesday, September 3, 2019 10:17 PM
> > > To: Ji Liu 
> > > Cc: emkornfield ; dev ;
> > Paul Taylor 
> > > Subject: Re: [RESULT] [VOTE] Alter Arrow binary protocol to address
> > 8-byte Flatbuffer alignment requirements (2nd vote)
> > >
> > > hi folks,
> > >
> > > We now have patches up for Java, JS, and Go. How are we doing on the
> > code reviews for getting these in?
> > >
> > > Since C# implements the binary protocol, the C# developers might want
> to
> > look at this before the 0.15.0 release also. Absent integration tests
> it's
> > difficult to verify the C# library, though
> > >
> > > Thanks
> > >
> > > On Thu, Aug 29, 2019 at 8:13 AM Ji Liu  wrote:
> > > >
> > > > Here is the Java implementation
> > > >
> > > > https://github.com/apache/arrow/pull/5229
> > > >
> > > > cc @Wes McKinney @emkornfield
> > > >
> > > > Thanks,
> > > > Ji Liu
> > > >
> > > > --
> > > > From: Ji Liu  Send Time: Wednesday, August 28, 2019
> > > > 17:34 To:emkornfield ; dev
> > > >  Cc:Paul Taylor 
> > > > Subject:Re: [RESULT] [VOTE] Alter Arrow binary protocol to address
> > > > 8-byte Flatbuffer alignment requirements (2nd vote)
> > > >
> > > > I could take the Java implementation and will take a close watch on
> > this issue in the next few days.
> > > >
> > > > Thanks,
> > > > Ji Liu
> > > >
> > > >
> > > > --
> > > > From: Micah Kornfield  Send
> Time: Wednesday, August 28, 2019
> > > > 17:14 To:dev  Cc:Paul Taylor
> > > > 
> > > > Subject:Re: [RESULT] [VOTE] Alter Arrow binary protocol to address
> > > > 8-byte Flatbuffer alignment requirements (2nd vote)
> > > >
> > > > I should have integration tests with 0.14.1 generated binaries in the
> > > > next few days.  I think the one remaining unassigned piece 
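The EOS handling debated in this thread can be sketched in plain Python with the `struct` module. This is an illustrative reader for suggestion (i) only — the function name is ours, and it is not the actual code in any Arrow implementation.

```python
import struct

def read_metadata_length(stream):
    """Read one message length prefix from an Arrow IPC stream, handling
    both the pre-0.15 4-byte prefix and the new 8-byte form (0xFFFFFFFF
    continuation token followed by a 4-byte little-endian length).

    Returns the flatbuffer metadata length, or None at end-of-stream.
    """
    first = stream.read(4)
    if len(first) < 4:
        return None                                 # truncated stream
    (word,) = struct.unpack("<I", first)
    if word == 0xFFFFFFFF:                          # new-format continuation token
        (length,) = struct.unpack("<I", stream.read(4))
        return length if length != 0 else None      # 8-byte EOS: {0xFFFFFFFF, 0}
    return word if word != 0 else None              # legacy 4-byte EOS: 0
```

Reading the continuation token first and only then consuming the next 4 bytes is what keeps the reader from leaving half of an 8-byte EOS token unread in the stream.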

[Discuss] [Java] DateMilliVector.getObject() return type (LocalDateTime vs LocalDate)

2019-09-09 Thread Micah Kornfield
Yongbo Zhang opened a pull request [1] to have DateMilliVector return a
LocalDate instead of a LocalDateTime object.

Do people have opinions on whether this breaking change is worth the correctness improvement?

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/5315

On Sat, Sep 7, 2019 at 4:14 PM Yongbo Zhang 
wrote:

> Summary: [Java] DateMilliVector.getObject() should return a LocalDate, not
> a LocalDateTime
> Key: ARROW-1984
> URL: https://issues.apache.org/jira/browse/ARROW-1984
> Pull Request: https://github.com/apache/arrow/pull/5315
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Reporter: Vanco Buca
> Assignee: Yongbo Zhang
> Fix For: 0.15.0
>
> This is an API-breaking change, therefore we may want to discuss it
> before merging any PRs.
>


[jira] [Created] (ARROW-6504) [Python][Packaging] Add mimalloc to Windows conda packages for better performance

2019-09-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6504:
---

 Summary: [Python][Packaging] Add mimalloc to Windows conda 
packages for better performance
 Key: ARROW-6504
 URL: https://issues.apache.org/jira/browse/ARROW-6504
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.15.0






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: Plasma scenarios

2019-09-09 Thread Sutou Kouhei
If we build the GLib-based library with MSVC, it requires
neither MSYS nor Cygwin. It just requires MSVC.

In 
 

  "RE: Plasma scenarios" on Mon, 9 Sep 2019 22:05:26 +,
  Eric Erhardt  wrote:

> I don't think the C# bindings would use the Glib-based libraries on Windows 
> if it requires installing MSYS2 or Cygwin on the end-user's Windows machine. 
> So don't go through the work of building the GLib-based libraries with MSVC on 
> account of the C# library.
> 
> -Original Message-
> From: Sutou Kouhei  
> Sent: Monday, September 9, 2019 4:43 PM
> To: dev@arrow.apache.org
> Subject: Re: Plasma scenarios
> 
> Hi,
> 
>> In theory you could use the GLib-based library with MSVC, the main 
>> requirement is gobject-introspection
>> 
>> https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst
> 
> Generally, we can use the GLib-based library without GObject Introspection if 
> we write bindings by hand. (We can generate bindings automatically with 
> GObject Introspection.)
> 
> But we need to do some tasks to build the GLib-based library with MSVC. I'll 
> work on it in a few months.
> 
> 
> Thanks,
> --
> kou
> 
> In 
>   "Re: Plasma scenarios" on Mon, 9 Sep 2019 12:00:00 -0500,
>   Wes McKinney  wrote:
> 
>> hi Eric,
>> 
>> On Fri, Sep 6, 2019 at 5:09 PM Eric Erhardt 
>>  wrote:
>>>
>>> I was looking for the high level scenarios for the Plasma In-Memory Object 
>>> Store. A colleague of mine suggested we could use it to pass data between a 
>>> C# process and a Python process.
>>>
>>> I've read the intro blog [0] on Plasma, which describes using the same data 
>>> set from multiple processes - which sounds like the same scenario as above.
>>>
>>> I am trying to prioritize creating C# bindings for the Plasma client. So 
>>> I'd like to know all the scenarios that could be enabled with Plasma.
>>>
>>> For example:
>>> - could using Plasma speed up Pandas UDFs in PySpark? Because the data 
>>> wouldn't have to go across the socket between Java and Python, but instead 
>>> would be memory-mapped. We have similar functionality in .NET for Apache 
>>> Spark.
>> 
>> Memory still would need to be copied into the memory-mappable file, so 
>> it's unclear whether this would be faster than passing the data 
>> through a socket as it's being done now.
>> 
>>> - Is Plasma being used by Nvidia RAPIDS?
>> 
>> AFAIK it is not. It doesn't seem out of the question, though, given 
>> that we have some level of CUDA support in Plasma now.
>> 
>>>
>>> I know Plasma today is not supported on Windows, but I think support could 
>>> be added since Windows supports memory mapped files (through a different 
>>> API than mmap) and it now supports Unix Domain Sockets [1].
>>>
>>> Also - side question about the c_glib bindings. I assume those will only 
>>> ever work on Windows with something like Cygwin or MSYS2, right? Would 
>>> people be opposed to adding pure "C" exports to the plasma library so the 
>>> C# bindings could use it? (similar to the JNI support today).
>>>
>> 
>> In theory you could use the GLib-based library with MSVC, the main 
>> requirement is gobject-introspection
>> 
>> https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst
>> 
>> Note that GLib itself is LGPL-licensed -- since it is an optional 
>> component in Apache Arrow, it is OK for optional components to have an 
>> LGPL dependency (though ASF projects aren't allowed to have 
>> mandatory/hard dependencies on LGPL). So if you do go that route just 
>> beware the possible issues you might have down the road.
>> 
>> I have no objection to adding a "plasma/plasma-c.h" with C exports.
>> 
>>> Eric
>>>
>>> [0] https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html
>>> [1] https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/

[jira] [Created] (ARROW-6503) [C++] Add an argument of memory pool object to SparseTensorConverter

2019-09-09 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-6503:
---

 Summary: [C++] Add an argument of memory pool object to 
SparseTensorConverter
 Key: ARROW-6503
 URL: https://issues.apache.org/jira/browse/ARROW-6503
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


According to the comment at 
https://github.com/apache/arrow/pull/5290#discussion_r322244745, we need to 
have variants of some functions for supplying a memory pool object to the 
SparseTensorConverter function.





[jira] [Created] (ARROW-6502) [GLib][CI] MinGW failure in CI

2019-09-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6502:
---

 Summary: [GLib][CI] MinGW failure in CI
 Key: ARROW-6502
 URL: https://issues.apache.org/jira/browse/ARROW-6502
 Project: Apache Arrow
  Issue Type: Bug
  Components: GLib
Reporter: Wes McKinney
 Fix For: 0.15.0


This failure seems to have crept into master

https://ci.appveyor.com/project/wesm/arrow/build/job/ocfkn9m0a3ux1ur5#L2288





[jira] [Created] (ARROW-6501) [Format][C++] Remove non_zero_length field from SparseIndex

2019-09-09 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-6501:
---

 Summary: [Format][C++] Remove non_zero_length field from 
SparseIndex
 Key: ARROW-6501
 URL: https://issues.apache.org/jira/browse/ARROW-6501
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Format
Reporter: Kenta Murata
Assignee: Kenta Murata


We can remove the non_zero_length field from SparseIndex because it can be 
derived from the shape of the indices tensor.
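The redundancy is easy to see with a toy COO conversion in plain Python (illustrative only, not the Arrow C++ code): the indices tensor has shape (nnz, ndim), so the non-zero count is just its first dimension.

```python
def dense_to_coo(matrix):
    """Convert a dense 2-D list into COO sparse form: an indices tensor
    (a list of [row, col] pairs) plus the matching values. The non-zero
    count never needs to be stored separately -- it is the first
    dimension of the indices tensor's shape, i.e. len(indices)."""
    indices, values = [], []
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            if v != 0:
                indices.append([i, j])  # one [row, col] pair per non-zero
                values.append(v)
    return indices, values
```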





Re: Can the R interface to write_parquet accept strings?

2019-09-09 Thread Wes McKinney
I'm referring to the arrow-devel and parquet-devel packages, which are C++
packages. If you built the R library (using install.packages()) against
version 0.14.0 and then upgraded arrow-devel to 0.14.1 without rebuilding
the R library, you could have this issue.

I would recommend reinstalling the R package and see if the problem goes
away.

On Mon, Sep 9, 2019, 6:34 PM Daniel Feenberg  wrote:

>
>
>
> On Mon, 9 Sep 2019, Wes McKinney wrote:
>
> > I'm a bit confused by the error message
> >
> > "
> > Error in write_parquet_file(to_arrow(table), file) :
> >   Arrow error: IOError: Metadata contains Thrift LogicalType that is
> >   not recognized.
> > "
> >
> > This error comes from
> >
> >
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.cc#L455
> >
> > This function should not be called at all during the execution of
> > "write_parquet_file".
> >
> > Daniel, is it possible you changed the C++ library installed after
> > building the "arrow" R package? The R package must generally be
> > recompiled when the C++ library is upgraded
> >
>
> We are not aware of changing anything in C++. It is just as yum left it.
> We didn't compile the R arrow package at all, just used what yum supplied
> from the distribution. Are you suggesting we compile the R package
> ourselves, that the Scientific Linux distribution packages are
> inconsistent? Note that the default C++ is rather old and it would be a
> problem to update it, since so many other packages depend on it. But we
> could update Arrow, I suppose.
>
> Daniel Feenberg
>


Re: Can the R interface to write_parquet accept strings?

2019-09-09 Thread Daniel Feenberg





On Mon, 9 Sep 2019, Wes McKinney wrote:


I'm a bit confused by the error message

"
Error in write_parquet_file(to_arrow(table), file) :
  Arrow error: IOError: Metadata contains Thrift LogicalType that is
  not recognized.
"

This error comes from

https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.cc#L455

This function should not be called at all during the execution of
"write_parquet_file".

Daniel, is it possible you changed the C++ library installed after
building the "arrow" R package? The R package must generally be
recompiled when the C++ library is upgraded



We are not aware of changing anything in C++. It is just as yum left it. 
We didn't compile the R arrow package at all, just used what yum supplied 
from the distribution. Are you suggesting we compile the R package 
ourselves, that the Scientific Linux distribution packages are 
inconsistent? Note that the default C++ is rather old and it would be a 
problem to update it, since so many other packages depend on it. But we 
could update Arrow, I suppose.


Daniel Feenberg


Re: [Discuss][FlightRPC] Extensions to Flight: middleware and DoPut tickets

2019-09-09 Thread David Li
I'm happy to start a new thread to focus on DoPut specifically.
Middleware for Java has been in review.

Best,
David

On 9/9/19, Wes McKinney  wrote:
> Ah, I think I'm referring to the format change around DoPut, for which
> there is not a PR yet. Sorry for my confusion
>
> Do we want to start a separate discussion thread about that?
>
> https://docs.google.com/document/d/1hrwxNwPU1aOD_1ciRUOaGeUCyXYOmu6IxxCfY6Stj6w/edit?usp=sharing
>
> On Mon, Sep 9, 2019 at 3:51 PM Antoine Pitrou  wrote:
>>
>>
>> Isn't a middleware an implementation-specific concern?  Does it need a
>> formal vote?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> > On 09/09/2019 at 22:49, Wes McKinney wrote:
>> > It seems like there is positive feedback on the PR. Do we want to have
>> > a vote about this?
>> >
>> > On Mon, Aug 12, 2019 at 7:54 AM David Li  wrote:
>> >>
>> >> I've (finally) put up a draft implementation of middleware for Java:
>> >> https://github.com/apache/arrow/pull/5068
>> >>
>> >> Hopefully this helps clarify how the proposal works.
>> >>
>> >> Best,
>> >> David
>> >>
>> >> On 7/25/19, David Li  wrote:
>> >>> Thanks for the feedback, Antoine. That would be a natural method to
>> >>> have - then the server could deny uploads (as you mention) or note
>> >>> that the stream already exists. I've updated the proposal to reflect
>> >>> that, leaving more detailed semantics (e.g. append vs overwrite)
>> >>> application-defined.
>> >>>
>> >>> Best,
>> >>> David
>> >>>
>> >>> On 7/25/19, Antoine Pitrou  wrote:
>> 
>  On 08/07/2019 at 16:33, David Li wrote:
>> > Hi all,
>> >
>> > I've put together two more proposals for Flight, motivated by
>> > projects
>> > we've been working on. I'd appreciate any comments on the
>> > design/reasoning; I'm already working on the implementation,
>> > alongside
>> > some other improvements to Flight.
>> >
>> > The first is to modify the DoPut call to follow the same request
>> > pattern as DoGet. This is a format change and would require a vote.
>> >
>> > https://docs.google.com/document/d/1hrwxNwPU1aOD_1ciRUOaGeUCyXYOmu6IxxCfY6Stj6w/edit?usp=sharing
>> 
>>  It seems it would be useful to introduce a GetPutInfo (or
>>  GetUploadInfo)
>>  so as to allow differential behaviour between getting and putting.
>> 
>>  (one trivial case would be to disallow uploading altogether :-)))
>> 
>>  Regards
>> 
>>  Antoine.
>> 
>> >>>
>


[jira] [Created] (ARROW-6500) [Java] How to use RootAllocator in a low memory setting?

2019-09-09 Thread Andong Zhan (Jira)
Andong Zhan created ARROW-6500:
--

 Summary: [Java] How to use RootAllocator in a low memory setting?
 Key: ARROW-6500
 URL: https://issues.apache.org/jira/browse/ARROW-6500
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 0.13.0
Reporter: Andong Zhan


When I run this simple code with JVM setting: "-Xmx64m"
{code:java}
package com.snowflake;
import org.apache.arrow.memory.RootAllocator;

public class TestArrow
{
  public static void main(String args[]) throws Exception
  {
new RootAllocator(Integer.MAX_VALUE);
  }
}
{code}
and got the following error
{code:java}
Picked up JAVA_TOOL_OPTIONS: -Djavax.net.ssl.trustStore=/etc/pki/ca-trust/extracted/java/cacerts
Picked up JAVA_TOOL_OPTIONS: -Djavax.net.ssl.trustStore=/etc/pki/ca-trust/extracted/java/cacerts
Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.apache.arrow.memory.BaseAllocator.createEmpty(BaseAllocator.java:263)
	at org.apache.arrow.memory.BaseAllocator.<init>(BaseAllocator.java:89)
	at org.apache.arrow.memory.RootAllocator.<init>(RootAllocator.java:34)
	at org.apache.arrow.memory.RootAllocator.<init>(RootAllocator.java:30)
	at com.snowflake.TestArrow.main(TestArrow.java:13)
Caused by: java.lang.NullPointerException
	at io.netty.buffer.PooledByteBufAllocatorL$InnerAllocator.<init>(PooledByteBufAllocatorL.java:145)
	at io.netty.buffer.PooledByteBufAllocatorL.<init>(PooledByteBufAllocatorL.java:49)
	at org.apache.arrow.memory.AllocationManager.<clinit>(AllocationManager.java:61)
	... 5 more

Process finished with exit code 1
{code}
So how do we use RootAllocator in such a low-memory case?





RE: Plasma scenarios

2019-09-09 Thread Eric Erhardt
I don't think the C# bindings would use the Glib-based libraries on Windows if 
it requires installing MSYS2 or Cygwin on the end-user's Windows machine. So 
don't go through the work of building the GLib-based libraries with MSVC on 
account of the C# library.

-Original Message-
From: Sutou Kouhei  
Sent: Monday, September 9, 2019 4:43 PM
To: dev@arrow.apache.org
Subject: Re: Plasma scenarios

Hi,

> In theory you could use the GLib-based library with MSVC, the main 
> requirement is gobject-introspection
> 
> https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst

Generally, we can use the GLib-based library without GObject Introspection if 
we write bindings by hand. (We can generate bindings automatically with GObject 
Introspection.)

But we need to do some tasks to build the GLib-based library with MSVC. I'll work 
on it in a few months.


Thanks,
--
kou

In 
  "Re: Plasma scenarios" on Mon, 9 Sep 2019 12:00:00 -0500,
  Wes McKinney  wrote:

> hi Eric,
> 
> On Fri, Sep 6, 2019 at 5:09 PM Eric Erhardt 
>  wrote:
>>
>> I was looking for the high level scenarios for the Plasma In-Memory Object 
>> Store. A colleague of mine suggested we could use it to pass data between a 
>> C# process and a Python process.
>>
>> I've read the intro blog [0] on Plasma, which describes using the same data 
>> set from multiple processes - which sounds like the same scenario as above.
>>
>> I am trying to prioritize creating C# bindings for the Plasma client. So I'd 
>> like to know all the scenarios that could be enabled with Plasma.
>>
>> For example:
>> - could using Plasma speed up Pandas UDFs in PySpark? Because the data 
>> wouldn't have to go across the socket between Java and Python, but instead 
>> would be memory-mapped. We have similar functionality in .NET for Apache 
>> Spark.
> 
> Memory still would need to be copied into the memory-mappable file, so 
> it's unclear whether this would be faster than passing the data 
> through a socket as it's being done now.
> 
>> - Is Plasma being used by Nvidia RAPIDS?
> 
> AFAIK it is not. It doesn't seem out of the question, though, given 
> that we have some level of CUDA support in Plasma now.
> 
>>
>> I know Plasma today is not supported on Windows, but I think support could 
>> be added since Windows supports memory mapped files (through a different API 
>> than mmap) and it now supports Unix Domain Sockets [1].
>>
>> Also - side question about the c_glib bindings. I assume those will only 
>> ever work on Windows with something like Cygwin or MSYS2, right? Would 
>> people be opposed to adding pure "C" exports to the plasma library so the C# 
>> bindings could use it? (similar to the JNI support today).
>>
> 
> In theory you could use the GLib-based library with MSVC, the main 
> requirement is gobject-introspection
> 
> https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst
> 
> Note that GLib itself is LGPL-licensed -- since it is an optional 
> component in Apache Arrow, it is OK for optional components to have an 
> LGPL dependency (though ASF projects aren't allowed to have 
> mandatory/hard dependencies on LGPL). So if you do go that route just 
> beware the possible issues you might have down the road.
> 
> I have no objection to adding a "plasma/plasma-c.h" with C exports.
> 
>> Eric
>>
>> [0] https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html
>> [1] https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/


[jira] [Created] (ARROW-6499) [C++] Add support for bundled Boost with MSVC

2019-09-09 Thread Sutou Kouhei (Jira)
Sutou Kouhei created ARROW-6499:
---

 Summary: [C++] Add support for bundled Boost with MSVC
 Key: ARROW-6499
 URL: https://issues.apache.org/jira/browse/ARROW-6499
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei








Re: Plasma scenarios

2019-09-09 Thread Sutou Kouhei
Hi,

> In theory you could use the GLib-based library with MSVC, the main
> requirement is gobject-introspection
> 
> https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst

Generally, we can use the GLib-based library without GObject
Introspection if we write bindings by hand. (We can generate
bindings automatically with GObject Introspection.)

But we need to do some tasks to build the GLib-based library
with MSVC. I'll work on it in a few months.


Thanks,
--
kou

In 
  "Re: Plasma scenarios" on Mon, 9 Sep 2019 12:00:00 -0500,
  Wes McKinney  wrote:

> hi Eric,
> 
> On Fri, Sep 6, 2019 at 5:09 PM Eric Erhardt
>  wrote:
>>
>> I was looking for the high level scenarios for the Plasma In-Memory Object 
>> Store. A colleague of mine suggested we could use it to pass data between a 
>> C# process and a Python process.
>>
>> I've read the intro blog [0] on Plasma, which describes using the same data 
>> set from multiple processes - which sounds like the same scenario as above.
>>
>> I am trying to prioritize creating C# bindings for the Plasma client. So I'd 
>> like to know all the scenarios that could be enabled with Plasma.
>>
>> For example:
>> - could using Plasma speed up Pandas UDFs in PySpark? Because the data 
>> wouldn't have to go across the socket between Java and Python, but instead 
>> would be memory-mapped. We have similar functionality in .NET for Apache 
>> Spark.
> 
> Memory still would need to be copied into the memory-mappable file, so
> it's unclear whether this would be faster than passing the data
> through a socket as it's being done now.
> 
>> - Is Plasma being used by Nvidia RAPIDS?
> 
> AFAIK it is not. It doesn't seem out of the question, though, given
> that we have some level of CUDA support in Plasma now.
> 
>>
>> I know Plasma today is not supported on Windows, but I think support could 
>> be added since Windows supports memory mapped files (through a different API 
>> than mmap) and it now supports Unix Domain Sockets [1].
>>
>> Also - side question about the c_glib bindings. I assume those will only 
>> ever work on Windows with something like Cygwin or MSYS2, right? Would 
>> people be opposed to adding pure "C" exports to the plasma library so the C# 
>> bindings could use it? (similar to the JNI support today).
>>
> 
> In theory you could use the GLib-based library with MSVC, the main
> requirement is gobject-introspection
> 
> https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst
> 
> Note that GLib itself is LGPL-licensed -- since it is an optional
> component in Apache Arrow, it is OK for optional components to have an
> LGPL dependency (though ASF projects aren't allowed to have
> mandatory/hard dependencies on LGPL). So if you do go that route just
> beware the possible issues you might have down the road.
> 
> I have no objection to adding a "plasma/plasma-c.h" with C exports.
> 
>> Eric
>>
>> [0] 
>> https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html
>> [1] https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/


Re: Can the R interface to write_parquet accept strings?

2019-09-09 Thread Wes McKinney
I'm a bit confused by the error message

"
 Error in write_parquet_file(to_arrow(table), file) :
   Arrow error: IOError: Metadata contains Thrift LogicalType that is
   not recognized.
"

This error comes from

https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.cc#L455

This function should not be called at all during the execution of
"write_parquet_file".

Daniel, is it possible you changed the C++ library installed after
building the "arrow" R package? The R package must generally be
recompiled when the C++ library is upgraded

On Mon, Sep 9, 2019 at 4:29 PM Daniel Feenberg  wrote:
>
>
>
> On Mon, 9 Sep 2019, Neal Richardson wrote:
>
> > Hi Daniel,
> > This works on my machine:
> >
> >> library(arrow)
> >> write_parquet(data.frame(y = c("a", "b", "c"), stringsAsFactors=FALSE), 
> >> file= "string.parquet")
> >> read_parquet("string.parquet")
> >  y
> > 1 a
> > 2 b
> > 3 c
> >>
> >
> > (The function masking warnings are all from library(tidyverse) and
> > aren't relevant here.)
> >
> > What OS are you on, and how did you install the arrow package? I'm on
> > macOS and installed arrow from CRAN, but if that's not the case for
> > you, then your C++ library may have different capabilities.
>
> Here are the details of our installation:
>
> 1) OS:
> --
> Scientific Linux 7
> uname: Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Mon Nov 26 12:36:06 CST 2018 
> x86_64 x86_64 x86_64 GNU/Linux
>
> 2) gcc version:
> 
> # gcc --version
> gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
>
>
> 3) arrow and parquet library installation:
> --
> yum install arrow-devel parquet-devel
>
> versions:
> arrow-devel: yum info arrow-devel
> Installed Packages
> Name: arrow-devel
> Arch: x86_64
> Version : 0.14.1
> Release : 1.el7
> Size: 20 M
> Repo: installed
> From repo   : apache-arrow
> Summary : Libraries and header files for Apache Arrow C++
> URL : https://arrow.apache.org/
> License : Apache-2.0
> Description : Libraries and header files for Apache Arrow C++.
>
>
> yum info parquet-devel
> Installed Packages
> Name: parquet-devel
> Arch: x86_64
> Version : 0.14.1
> Release : 1.el7
> Size: 6.4 M
> Repo: installed
> From repo   : apache-arrow
> Summary : Libraries and header files for Apache Parquet C++
> URL : https://arrow.apache.org/
> License : Apache-2.0
> Description : Libraries and header files for Apache Parquet C++.
>
>
> 4) R arrow installation:
> --
> install.packages("arrow")
>
> and also
>
> install.packages("sparklyr")
>
> Thanks for taking an interest.
>
> Daniel Feenberg
>
>
>


Re: Plasma scenarios

2019-09-09 Thread Sutou Kouhei
Hi,

> I know Plasma today is not supported on Windows, but I think support could be 
> added since Windows supports memory mapped files (through a different API 
> than mmap) and it now supports Unix Domain Sockets [1].
> ...
> [1] https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/

Thanks for the information. I read the document.

It seems that Unix domain socket on Windows doesn't support
file descriptor passing:

> Ancillary data: Linux's unix socket implementation supports passing ancillary 
> data such as passing file descriptors

Plasma uses this feature:
  https://github.com/apache/arrow/blob/master/cpp/src/plasma/fling.cc#L33
  https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L940
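
The descriptor-passing feature in question can be sketched in a few lines of Python (a hypothetical illustration of SCM_RIGHTS ancillary data over a Unix domain socket, not Plasma's actual C++ code in fling.cc):

```python
import array
import os
import socket
import tempfile

def send_fd(sock, fd):
    # Ship one file descriptor as SCM_RIGHTS ancillary data over a Unix
    # domain socket -- the feature Plasma relies on and that Windows'
    # AF_UNIX implementation reportedly lacks.
    sock.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                           array.array("i", [fd]))])

def recv_fd(sock):
    # Receive the one-byte message plus the ancillary data carrying the fd.
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    level, ctype, data = ancdata[0]
    assert level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS
    fds.frombytes(data[:fds.itemsize])
    return fds[0]

# Demo: one end opens a file, the other reads it through the passed descriptor.
left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
with tempfile.NamedTemporaryFile() as f:
    f.write(b"plasma object payload")
    f.flush()
    fd = os.open(f.name, os.O_RDONLY)
    send_fd(left, fd)
    received = recv_fd(right)
    payload = os.read(received, 64)
    os.close(fd)
    os.close(received)
left.close()
right.close()
print(payload)  # b'plasma object payload'
```

The receiving process ends up with its own descriptor for the same open file, which is how a Plasma client can map the store's shared memory without re-opening it by name.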


Thanks,
--
kou

In 
  "Plasma scenarios" on Fri, 6 Sep 2019 22:09:38 +,
  Eric Erhardt  wrote:

> I was looking for the high level scenarios for the Plasma In-Memory Object 
> Store. A colleague of mine suggested we could use it to pass data between a 
> C# process and a Python process.
> 
> I've read the intro blog [0] on Plasma, which describes using the same data 
> set from multiple processes - which sounds like the same scenario as above.
> 
> I am trying to prioritize creating C# bindings for the Plasma client. So I'd 
> like to know all the scenarios that could be enabled with Plasma.
> 
> For example:
> - could using Plasma speed up Pandas UDFs in PySpark? Because the data 
> wouldn't have to go across the socket between Java and Python, but instead 
> would be memory-mapped. We have similar functionality in .NET for Apache 
> Spark.
> - Is Plasma being used by Nvidia RAPIDS?
> 
> I know Plasma today is not supported on Windows, but I think support could be 
> added since Windows supports memory mapped files (through a different API 
> than mmap) and it now supports Unix Domain Sockets [1].
> 
> Also - side question about the c_glib bindings. I assume those will only ever 
> work on Windows with something like Cygwin or MSYS2, right? Would people be 
> opposed to adding pure "C" exports to the plasma library so the C# bindings 
> could use it? (similar to the JNI support today).
> 
> Eric
> 
> [0] 
> https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html
> [1] https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/


Re: Can the R interface to write_parquet accept strings?

2019-09-09 Thread Daniel Feenberg




On Mon, 9 Sep 2019, Neal Richardson wrote:


> Hi Daniel,
> This works on my machine:
>
>> library(arrow)
>> write_parquet(data.frame(y = c("a", "b", "c"), stringsAsFactors=FALSE), file=
>> "string.parquet")
>> read_parquet("string.parquet")
>  y
> 1 a
> 2 b
> 3 c
>
> (The function masking warnings are all from library(tidyverse) and
> aren't relevant here.)
>
> What OS are you on, and how did you install the arrow package? I'm on
> macOS and installed arrow from CRAN, but if that's not the case for
> you, then your C++ library may have different capabilities.


Here are the details of our installation:

1) OS:
--
Scientific Linux 7
uname: Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Mon Nov 26 12:36:06 CST 2018 
x86_64 x86_64 x86_64 GNU/Linux

2) gcc version:

# gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)


3) arrow and parquet library installation:
--
yum install arrow-devel parquet-devel

versions:
arrow-devel: yum info arrow-devel
Installed Packages
Name: arrow-devel
Arch: x86_64
Version : 0.14.1
Release : 1.el7
Size: 20 M
Repo: installed
From repo   : apache-arrow
Summary : Libraries and header files for Apache Arrow C++
URL : https://arrow.apache.org/
License : Apache-2.0
Description : Libraries and header files for Apache Arrow C++.


yum info parquet-devel
Installed Packages
Name: parquet-devel
Arch: x86_64
Version : 0.14.1
Release : 1.el7
Size: 6.4 M
Repo: installed
From repo   : apache-arrow
Summary : Libraries and header files for Apache Parquet C++
URL : https://arrow.apache.org/
License : Apache-2.0
Description : Libraries and header files for Apache Parquet C++.


4) R arrow installation:
--
install.packages("arrow")

and also

install.packages("sparklyr")

Thanks for taking an interest.

Daniel Feenberg





Re: [Discuss][FlightRPC] Extensions to Flight: middleware and DoPut tickets

2019-09-09 Thread Wes McKinney
Ah, I think I'm referring to the format change around DoPut, for which
there is not a PR yet. Sorry for my confusion

Do we want to start a separate discussion thread about that?

https://docs.google.com/document/d/1hrwxNwPU1aOD_1ciRUOaGeUCyXYOmu6IxxCfY6Stj6w/edit?usp=sharing

On Mon, Sep 9, 2019 at 3:51 PM Antoine Pitrou  wrote:
>
>
> Isn't a middleware an implementation-specific concern?  Does it need a
> formal vote?
>
> Regards
>
> Antoine.
>
>
> Le 09/09/2019 à 22:49, Wes McKinney a écrit :
> > It seems like there is positive feedback on the PR. Do we want to have
> > a vote about this?
> >
> > On Mon, Aug 12, 2019 at 7:54 AM David Li  wrote:
> >>
> >> I've (finally) put up a draft implementation of middleware for Java:
> >> https://github.com/apache/arrow/pull/5068
> >>
> >> Hopefully this helps clarify how the proposal works.
> >>
> >> Best,
> >> David
> >>
> >> On 7/25/19, David Li  wrote:
> >>> Thanks for the feedback, Antoine. That would be a natural method to
> >>> have - then the server could deny uploads (as you mention) or note
> >>> that the stream already exists. I've updated the proposal to reflect
> >>> that, leaving more detailed semantics (e.g. append vs overwrite)
> >>> application-defined.
> >>>
> >>> Best,
> >>> David
> >>>
> >>> On 7/25/19, Antoine Pitrou  wrote:
> 
>  Le 08/07/2019 à 16:33, David Li a écrit :
> > Hi all,
> >
> > I've put together two more proposals for Flight, motivated by projects
> > we've been working on. I'd appreciate any comments on the
> > design/reasoning; I'm already working on the implementation, alongside
> > some other improvements to Flight.
> >
> > The first is to modify the DoPut call to follow the same request
> > pattern as DoGet. This is a format change and would require a vote.
> >
> > https://docs.google.com/document/d/1hrwxNwPU1aOD_1ciRUOaGeUCyXYOmu6IxxCfY6Stj6w/edit?usp=sharing
> 
>  It seems it would be useful to introduce a GetPutInfo (or GetUploadInfo)
>  so as to allow differential behaviour between getting and putting.
> 
>  (one trivial case would be to disallow uploading altogether :-)))
> 
>  Regards
> 
>  Antoine.
> 
> >>>


Re: [Discuss][FlightRPC] Extensions to Flight: middleware and DoPut tickets

2019-09-09 Thread Antoine Pitrou


Isn't a middleware an implementation-specific concern?  Does it need a
formal vote?

Regards

Antoine.


Le 09/09/2019 à 22:49, Wes McKinney a écrit :
> It seems like there is positive feedback on the PR. Do we want to have
> a vote about this?
> 
> On Mon, Aug 12, 2019 at 7:54 AM David Li  wrote:
>>
>> I've (finally) put up a draft implementation of middleware for Java:
>> https://github.com/apache/arrow/pull/5068
>>
>> Hopefully this helps clarify how the proposal works.
>>
>> Best,
>> David
>>
>> On 7/25/19, David Li  wrote:
>>> Thanks for the feedback, Antoine. That would be a natural method to
>>> have - then the server could deny uploads (as you mention) or note
>>> that the stream already exists. I've updated the proposal to reflect
>>> that, leaving more detailed semantics (e.g. append vs overwrite)
>>> application-defined.
>>>
>>> Best,
>>> David
>>>
>>> On 7/25/19, Antoine Pitrou  wrote:

 Le 08/07/2019 à 16:33, David Li a écrit :
> Hi all,
>
> I've put together two more proposals for Flight, motivated by projects
> we've been working on. I'd appreciate any comments on the
> design/reasoning; I'm already working on the implementation, alongside
> some other improvements to Flight.
>
> The first is to modify the DoPut call to follow the same request
> pattern as DoGet. This is a format change and would require a vote.
>
> https://docs.google.com/document/d/1hrwxNwPU1aOD_1ciRUOaGeUCyXYOmu6IxxCfY6Stj6w/edit?usp=sharing

 It seems it would be useful to introduce a GetPutInfo (or GetUploadInfo)
 so as to allow differential behaviour between getting and putting.

 (one trivial case would be to disallow uploading altogether :-)))

 Regards

 Antoine.

>>>


Re: [Discuss][FlightRPC] Extensions to Flight: middleware and DoPut tickets

2019-09-09 Thread Wes McKinney
It seems like there is positive feedback on the PR. Do we want to have
a vote about this?

On Mon, Aug 12, 2019 at 7:54 AM David Li  wrote:
>
> I've (finally) put up a draft implementation of middleware for Java:
> https://github.com/apache/arrow/pull/5068
>
> Hopefully this helps clarify how the proposal works.
>
> Best,
> David
>
> On 7/25/19, David Li  wrote:
> > Thanks for the feedback, Antoine. That would be a natural method to
> > have - then the server could deny uploads (as you mention) or note
> > that the stream already exists. I've updated the proposal to reflect
> > that, leaving more detailed semantics (e.g. append vs overwrite)
> > application-defined.
> >
> > Best,
> > David
> >
> > On 7/25/19, Antoine Pitrou  wrote:
> >>
> >> Le 08/07/2019 à 16:33, David Li a écrit :
> >>> Hi all,
> >>>
> >>> I've put together two more proposals for Flight, motivated by projects
> >>> we've been working on. I'd appreciate any comments on the
> >>> design/reasoning; I'm already working on the implementation, alongside
> >>> some other improvements to Flight.
> >>>
> >>> The first is to modify the DoPut call to follow the same request
> >>> pattern as DoGet. This is a format change and would require a vote.
> >>>
> >>> https://docs.google.com/document/d/1hrwxNwPU1aOD_1ciRUOaGeUCyXYOmu6IxxCfY6Stj6w/edit?usp=sharing
> >>
> >> It seems it would be useful to introduce a GetPutInfo (or GetUploadInfo)
> >> so as to allow differential behaviour between getting and putting.
> >>
> >> (one trivial case would be to disallow uploading altogether :-)))
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >


[jira] [Created] (ARROW-6498) [C++][CI] Download googletest tarball and use for EP build to avoid occasional flakiness

2019-09-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6498:
---

 Summary: [C++][CI] Download googletest tarball and use for EP 
build to avoid occasional flakiness
 Key: ARROW-6498
 URL: https://issues.apache.org/jira/browse/ARROW-6498
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Continuous Integration
Reporter: Wes McKinney


Failures such as 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27281370/job/dn0ji349v8popkd9
 seem to be happening a fair amount.

We might try to avoid this by wget-ing a tarball and setting 
{{$ARROW_GTEST_URL}}. Open to other ideas about how to reduce flakiness



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [Format] Semantics for dictionary batches in streams

2019-09-09 Thread Wes McKinney
hi Micah,

I think we should formulate changes to format/Columnar.rst and have a
vote, what do you think?

On Thu, Aug 29, 2019 at 2:23 AM Micah Kornfield  wrote:
>>
>>
>> > I was thinking the file format must satisfy one of two conditions:
>> > 1.  Exactly one dictionarybatch per encoded column
>> > 2.  DictionaryBatches are interleaved correctly.
>>
>> Could you clarify?
>
> I think you clarified it very well :) My motivation for suggesting the 
> additional complexity is I see two use-cases for the file format.  These 
> roughly correspond with the two options I suggested:
> 1.  We are encoding data from scratch.  In this case, it seems like all 
> dictionaries would be built incrementally, not need replacement and we write 
> them at the end of the file [1]
>
> 2.  The data being written out is essentially a "tee" off of some stream that 
> is generating new dictionaries requiring replacement on the fly (i.e. reading 
> back two parquet files).
>
>>  It might be better to disallow replacements
>> in the file format (which does introduce semantic slippage between the
>> file and stream formats as Antoine was saying).
>
> It is certainly possible to accept the slippage from the stream format 
> for now and later add this capability, since it should be forwards compatible.
>
> Thanks,
> Micah
>
> [1] There is also a medium-complexity option where we require one non-delta 
> dictionary and as many delta dictionaries as the user wants.
>
> On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney  wrote:
>>
>> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield  
>> wrote:
>> >
>> > I was thinking the file format must satisfy one of two conditions:
>> > 1.  Exactly one dictionarybatch per encoded column
>> > 2.  DictionaryBatches are interleaved correctly.
>>
>> Could you clarify? In the first case, there is no issue with
>> dictionary replacements. I'm not sure about the second case -- if a
>> dictionary id appears twice, then you'll see it twice in the file
>> footer. I suppose you could look at the file offsets to determine
>> whether a dictionary batch precedes a particular record batch block
>> (to know which dictionary you should be using), but that's rather
>> complicated to implement. It might be better to disallow replacements
>> in the file format (which does introduce semantic slippage between the
>> file and stream formats as Antoine was saying).
>>
>> >
>> > On Tuesday, August 27, 2019, Wes McKinney  wrote:
>> >
>> > > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou  
>> > > wrote:
>> > > >
>> > > >
>> > > > Le 27/08/2019 à 22:31, Wes McKinney a écrit :
>> > > > > So the current situation we have right now in C++ is that if we tried
>> > > > > to create an IPC stream from a sequence of record batches that don't
>> > > > > all have the same dictionary, we'd run into two scenarios:
>> > > > >
>> > > > > * Batches that either have a prefix of a prior-observed dictionary, 
>> > > > > or
>> > > > > the prior dictionary is a prefix of their dictionary. For example,
>> > > > > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
>> > > > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
>> > > > > such case we could compute and send a delta batch
>> > > > >
>> > > > > * Batches with a dictionary that is a permutation of values, and
>> > > > > possibly new unique values.
>> > > > >
>> > > > > In this latter case, without the option of replacing an existing ID 
>> > > > > in
>> > > > > the stream, we would have to do a unification / permutation of 
>> > > > > indices
>> > > > > and then also possibly send a delta batch. We should probably have
>> > > > > code at some point that deals with both cases, but in the meantime I
>> > > > > would like to allow dictionaries to be redefined in this case. Seems
>> > > > > like we might need a vote to formalize this?
>> > > >
>> > > > Isn't the stream format deviating from the file format then?  In the
>> > > > file format, IIUC, dictionaries can appear after the respective record
>> > > > batches, so there's no way to tell whether the original or redefined
>> > > > version of a dictionary is being referred to.
>> > >
>> > > You make a good point -- we can consider changes to the file format to
>> > > allow for record batches to have different dictionaries. Even handling
>> > > delta dictionaries with the current file format would be a bit tedious
>> > > (though not indeterminate)
>> > >
>> > > > Regards
>> > > >
>> > > > Antoine.
>> > >
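
The prefix case discussed in this thread can be sketched in plain Python (a hypothetical illustration, not Arrow's actual IPC writer: when the previously sent dictionary is a prefix of the new one, only the suffix needs to be shipped as a delta DictionaryBatch):

```python
def dictionary_delta(old, new):
    """Return the values a delta batch would carry if `old` is a prefix
    of `new`; return None when the new dictionary is a permutation or
    replacement, i.e. a delta batch alone is not enough."""
    if new[:len(old)] == old:
        return new[len(old):]
    return None

# Prefix case: only ['D', 'E'] needs to be sent as a delta.
print(dictionary_delta(["A", "B", "C"], ["A", "B", "C", "D", "E"]))  # ['D', 'E']
# Permutation case: indices would have to be remapped or the dictionary replaced.
print(dictionary_delta(["A", "B", "C"], ["B", "A", "C"]))            # None
```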


[jira] [Created] (ARROW-6497) [Website] On change to master branch, automatically make PR to asf-site

2019-09-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6497:
--

 Summary: [Website] On change to master branch, automatically make 
PR to asf-site
 Key: ARROW-6497
 URL: https://issues.apache.org/jira/browse/ARROW-6497
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Neal Richardson
Assignee: Neal Richardson


I added a build/deploy script to arrow-site that would enable automatically 
publishing to asf-site when there is a commit to the master branch. However, 
ASF won't let us add a deploy key to enable this publishing (INFRA-18924). 

I have a workaround that's not automatic but as close as we can get. On commits 
to apache/arrow-site's master branch, Travis builds the site and pushes it to a 
fork of arrow-site (where there is no restriction on deploy keys), and then it 
makes a PR from there back to the asf-site branch of apache/arrow-site using 
[hub|https://hub.github.com/hub-pull-request.1.html]. So it's "semiautomatic": 
the asf-site PR is made automatically, but a committer will need to merge it. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6496) [Python] Fix ARROW_ORC=ON build in Python wheels on macOS

2019-09-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6496:
---

 Summary: [Python] Fix ARROW_ORC=ON build in Python wheels on macOS
 Key: ARROW-6496
 URL: https://issues.apache.org/jira/browse/ARROW-6496
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney


This was disabled in ARROW-6446 https://github.com/apache/arrow/pull/5291 as it 
was failing 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: Plasma scenarios

2019-09-09 Thread Wes McKinney
hi Eric,

On Fri, Sep 6, 2019 at 5:09 PM Eric Erhardt
 wrote:
>
> I was looking for the high level scenarios for the Plasma In-Memory Object 
> Store. A colleague of mine suggested we could use it to pass data between a 
> C# process and a Python process.
>
> I've read the intro blog [0] on Plasma, which describes using the same data 
> set from multiple processes - which sounds like the same scenario as above.
>
> I am trying to prioritize creating C# bindings for the Plasma client. So I'd 
> like to know all the scenarios that could be enabled with Plasma.
>
> For example:
> - could using Plasma speed up Pandas UDFs in PySpark? Because the data 
> wouldn't have to go across the socket between Java and Python, but instead 
> would be memory-mapped. We have similar functionality in .NET for Apache 
> Spark.

Memory still would need to be copied into the memory-mappable file, so
it's unclear whether this would be faster than passing the data
through a socket as it's being done now.

> - Is Plasma being used by Nvidia RAPIDS?

AFAIK it is not. It doesn't seem out of the question, though, given
that we have some level of CUDA support in Plasma now.

>
> I know Plasma today is not supported on Windows, but I think support could be 
> added since Windows supports memory mapped files (through a different API 
> than mmap) and it now supports Unix Domain Sockets [1].
>
> Also - side question about the c_glib bindings. I assume those will only ever 
> work on Windows with something like Cygwin or MSYS2, right? Would people be 
> opposed to adding pure "C" exports to the plasma library so the C# bindings 
> could use it? (similar to the JNI support today).
>

In theory you could use the GLib-based library with MSVC, the main
requirement is gobject-introspection

https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst

Note that GLib itself is LGPL-licensed -- since it is an optional
component in Apache Arrow, it is OK for optional components to have an
LGPL dependency (though ASF projects aren't allowed to have
mandatory/hard dependencies on LGPL). So if you do go that route just
beware the possible issues you might have down the road.

I have no objection to adding a "plasma/plasma-c.h" with C exports.

> Eric
>
> [0] 
> https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html
> [1] https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/


Re: Can the R interface to write_parquet accept strings?

2019-09-09 Thread Neal Richardson
Hi Daniel,
This works on my machine:

> library(arrow)
> write_parquet(data.frame(y = c("a", "b", "c"), stringsAsFactors=FALSE), file= 
> "string.parquet")
> read_parquet("string.parquet")
  y
1 a
2 b
3 c
>

(The function masking warnings are all from library(tidyverse) and
aren't relevant here.)

What OS are you on, and how did you install the arrow package? I'm on
macOS and installed arrow from CRAN, but if that's not the case for
you, then your C++ library may have different capabilities.

Neal

On Sun, Sep 8, 2019 at 3:41 AM Daniel Feenberg  wrote:
>
> Can the R interface to Arrow Parquet write string data? Take the
> following script:
>
>library(arrow)
>library(tidyverse)
>write_parquet(table = tibble(y = c("a", "b", "c")), file = 
> "string.parquet")
>
> I get the error message:
>
>Error in write_parquet_file(to_arrow(table), file) :
>Arrow error: IOError: Metadata contains Thrift LogicalType that is
>not recognized.
>
> after warnings that stats::filter(), stats::lag() and
> arrow::read_table() are masked, but I assume that isn't the problem.
> This is with R 3.5.1 and arrow_0.14.1.1
>
>
> Daniel Feenberg


[jira] [Created] (ARROW-6495) [Plasma] Use xxh3 for object hashing

2019-09-09 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6495:
-

 Summary: [Plasma] Use xxh3 for object hashing
 Key: ARROW-6495
 URL: https://issues.apache.org/jira/browse/ARROW-6495
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma
Reporter: Antoine Pitrou


We recently vendored xxh3 in Arrow. Plasma may want to use it for object 
hashing, since it's supposed to be even faster than XXH64.
See https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html for 
performance numbers. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6494) [C++] Implement basic PartitionScheme

2019-09-09 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6494:


 Summary: [C++] Implement basic PartitionScheme
 Key: ARROW-6494
 URL: https://issues.apache.org/jira/browse/ARROW-6494
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


The PartitionScheme interface parses paths and yields the partition expressions 
which are encoded in those paths. For example, the Hive partition scheme would 
yield {{"a"_=2 and "b"_=3}} from "a=2/b=3/*.parquet".
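
The path-parsing idea can be sketched as follows (a hypothetical Python illustration of the Hive-style scheme; the real C++ PartitionScheme yields Arrow partition expressions rather than a dict):

```python
import re

def parse_hive_path(path):
    """Extract Hive-style key=value directory segments from a file path."""
    partitions = {}
    for segment in path.split("/"):
        m = re.fullmatch(r"([^=/]+)=([^=/]+)", segment)
        if m:
            key, value = m.groups()
            # Values stay as strings here; type inference is left to the caller.
            partitions[key] = value
    return partitions

print(parse_hive_path("a=2/b=3/part-0.parquet"))  # {'a': '2', 'b': '3'}
```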



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6492) [Python] file written with latest fastparquet cannot be read with latest pyarrow

2019-09-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6492:


 Summary: [Python] file written with latest fastparquet cannot be 
read with latest pyarrow
 Key: ARROW-6492
 URL: https://issues.apache.org/jira/browse/ARROW-6492
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


From report on the pandas issue tracker: 
https://github.com/pandas-dev/pandas/issues/28252

With the latest released versions of fastparquet (0.3.2) and pyarrow (0.14.1), 
writing a file with pandas using the fastparquet engine cannot be read with the 
pyarrow engine:

{code}
df = pd.DataFrame({'A': [1, 2, 3]})
df.to_parquet("test.parquet", engine="fastparquet", compression=None)   

  
pd.read_parquet("test.parquet", engine="pyarrow")   
{code}

gives the following error when reading:

{code}
> 1 pd.read_parquet("test.parquet", engine="pyarrow")

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in 
read_parquet(path, engine, columns, **kwargs)
292 
293 impl = get_engine(engine)
--> 294 return impl.read(path, columns=columns, **kwargs)

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, 
path, columns, **kwargs)
123 kwargs["use_pandas_metadata"] = True
124 result = self.api.parquet.read_table(
--> 125 path, columns=columns, **kwargs
126 ).to_pandas()
127 if should_close:

~/miniconda3/lib/python3.7/site-packages/pyarrow/array.pxi in 
pyarrow.lib._PandasConvertible.to_pandas()

~/miniconda3/lib/python3.7/site-packages/pyarrow/table.pxi in 
pyarrow.lib.Table._to_pandas()

~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in 
table_to_blockmanager(options, table, categories, ignore_metadata)
642 column_indexes = pandas_metadata.get('column_indexes', [])
643 index_descriptors = pandas_metadata['index_columns']
--> 644 table = _add_any_metadata(table, pandas_metadata)
645 table, index = _reconstruct_index(table, index_descriptors,
646   all_columns)

~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in 
_add_any_metadata(table, pandas_metadata)
965 raw_name = 'None'
966 
--> 967 idx = schema.get_field_index(raw_name)
968 if idx != -1:
969 if col_meta['pandas_type'] == 'datetimetz':

~/miniconda3/lib/python3.7/site-packages/pyarrow/types.pxi in 
pyarrow.lib.Schema.get_field_index()

~/miniconda3/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so
 in string.from_py.__pyx_convert_string_from_py_std__in_string()

TypeError: expected bytes, dict found
{code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6491) [Java] fix master build failure caused by ErrorProne

2019-09-09 Thread Pindikura Ravindra (Jira)
Pindikura Ravindra created ARROW-6491:
-

 Summary: [Java] fix master build failure caused by ErrorProne
 Key: ARROW-6491
 URL: https://issues.apache.org/jira/browse/ARROW-6491
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Pindikura Ravindra
Assignee: Ji Liu






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6490) [Java] log error for leak in allocator close

2019-09-09 Thread Pindikura Ravindra (Jira)
Pindikura Ravindra created ARROW-6490:
-

 Summary: [Java] log error for leak in allocator close
 Key: ARROW-6490
 URL: https://issues.apache.org/jira/browse/ARROW-6490
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra


Currently, the allocator close throws an exception that includes some details 
in case of memory leaks. However, if there is a hierarchy of allocators and 
they are all closed at different times, it's hard to find the cause of the 
original leak.

 

If we also log a message when the leak occurs, it will be easier to correlate 
these.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and Neal Richardson

2019-09-09 Thread Joris Van den Bossche
Congratulations!

On Sat, 7 Sep 2019 at 20:54, Rok Mihevc  wrote:

> Congrats all!
>
> On Sat, Sep 7, 2019 at 5:02 AM Bryan Cutler  wrote:
>
> > Congrats Ben, Kenta and Neal!
> >
> > On Fri, Sep 6, 2019, 12:15 PM Krisztián Szűcs  >
> > wrote:
> >
> > > Congratulations!
> > >
> > > On Fri, Sep 6, 2019 at 8:12 PM Ben Kietzman 
> > > wrote:
> > >
> > > > Thanks!
> > > >
> > > > On Fri, Sep 6, 2019 at 1:09 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > > > wrote:
> > > >
> > > > > Congrats everyone! (apologies if I double sent this).
> > > > >
> > > > > On Fri, Sep 6, 2019 at 10:06 AM Neal Richardson <
> > > > > neal.p.richard...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks, y'all!
> > > > > >
> > > > > > On Fri, Sep 6, 2019 at 5:44 AM David Li 
> > > wrote:
> > > > > > >
> > > > > > > Congrats all! :)
> > > > > > >
> > > > > > > Best,
> > > > > > > David
> > > > > > >
> > > > > > > On 9/6/19, Francois Saint-Jacques 
> > wrote:
> > > > > > > > Congrats to everyone!
> > > > > > > >
> > > > > > > > François
> > > > > > > >
> > > > > > > > On Fri, Sep 6, 2019 at 4:34 AM Kenta Murata 
> > > wrote:
> > > > > > > >>
> > > > > > > >> Thank you very much everyone!
> > > > > > > >> I'm very happy to join this community.
> > > > > > > >>
> > > > > > > >> Sep 6, 2019 (Fri) 12:39 Micah Kornfield :
> > > > > > > >>
> > > > > > > >> >
> > > > > > > >> > Congrats everyone.
> > > > > > > >> >
> > > > > > > >> > On Thu, Sep 5, 2019 at 7:06 PM Ji Liu
> > > >  > > > > >
> > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> > > Congratulations!
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Ji Liu
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > --
> > > > > > > >> > > From:Fan Liya 
> > > > > > > >> > > Send Time: Fri, Sep 6, 2019 09:28
> > > > > > > >> > > To:dev 
> > > > > > > >> > > Subject:Re: [ANNOUNCE] New committers: Ben Kietzman,
> Kenta
> > > > > Murata,
> > > > > > > >> > > and
> > > > > > > >> > > Neal Richardson
> > > > > > > >> > >
> > > > > > > >> > > Big congratulations to Ben, Kenta and Neal!
> > > > > > > >> > >
> > > > > > > >> > > Best,
> > > > > > > >> > > Liya Fan
> > > > > > > >> > >
> > > > > > > >> > > On Fri, Sep 6, 2019 at 5:33 AM Wes McKinney <
> > > > > wesmck...@gmail.com>
> > > > > > > >> > > wrote:
> > > > > > > >> > >
> > > > > > > >> > > > hi all,
> > > > > > > >> > > >
> > > > > > > >> > > > on behalf of the Arrow PMC, I'm pleased to announce
> that
> > > > Ben,
> > > > > > > >> > > > Kenta,
> > > > > > > >> > > > and Neal have accepted invitations to become Arrow
> > > > committers.
> > > > > > > >> > > > Welcome
> > > > > > > >> > > > and thank you for all your contributions!
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> Kenta Murata
> > > > > > > >> OpenPGP FP = 1D69 ADDE 081C 9CC2 2E54  98C1 CEFE 8AFB 6081
> > B062
> > > > > > > >>
> > > > > > > >> I wrote a book!!
> > > > > > > >> 『Ruby 逆引きレシピ』 http://www.amazon.co.jp/dp/4798119881/mrkn-22
> > > > > > > >>
> > > > > > > >> E-mail: m...@mrkn.jp
> > > > > > > >> twitter: http://twitter.com/mrkn/
> > > > > > > >> blog: http://d.hatena.ne.jp/mrkn/
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>