[jira] [Created] (ARROW-6393) [C++] Add EqualOptions support in SparseTensor::Equals

2019-08-29 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-6393:
---

 Summary: [C++] Add EqualOptions support in SparseTensor::Equals
 Key: ARROW-6393
 URL: https://issues.apache.org/jira/browse/ARROW-6393
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


SparseTensor::Equals should take an EqualOptions argument, as Tensor::Equals does.
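
As background, a minimal NumPy illustration (not the Arrow API; the exact knobs on EqualOptions, such as NaN handling, are assumptions here) of why element-wise tensor equality benefits from an options argument:

{code:python}
import numpy as np

a = np.array([1.0, float("nan")])
b = np.array([1.0, float("nan")])

# Plain element-wise equality treats NaN != NaN, so two identical tensors
# containing NaNs compare as unequal ...
print(np.array_equal(a, b))               # False
# ... unless the comparison is told to treat NaNs as equal -- the kind of
# choice an EqualOptions argument lets callers make.
print(np.allclose(a, b, equal_nan=True))  # True
{code}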



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6392) [Python][Flight] list_actions Server RPC is not tested in test_flight.py, nor is return value validated

2019-08-29 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6392:
---

 Summary: [Python][Flight] list_actions Server RPC is not tested in 
test_flight.py, nor is return value validated
 Key: ARROW-6392
 URL: https://issues.apache.org/jira/browse/ARROW-6392
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Python
Reporter: Wes McKinney
 Fix For: 0.15.0


This server method is implemented and part of the Python server vtable, but it
is not tested. If you mistakenly return a "string" action type, it will pass
silently. We might want to constrain the output to be ActionType or a tuple.
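
A rough sketch of the kind of check test_flight.py could add (assuming a Flight server is already running on the given port; flight.connect, list_actions and ActionType are used as exposed by pyarrow.flight, but treat exact signatures as assumptions):

{code:python}
import pyarrow.flight as flight

def check_list_actions(port):
    # Connect to a locally running test server and validate that every
    # advertised action is an ActionType (type, description) pair rather
    # than a bare string -- the gap described in this issue.
    client = flight.connect("grpc://localhost:{}".format(port))
    actions = list(client.list_actions())
    assert all(isinstance(a, flight.ActionType) for a in actions)
{code}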



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6391) [Python][Flight] Add built-in methods on FlightServerBase to start server and wait for it to be available

2019-08-29 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6391:
---

 Summary: [Python][Flight] Add built-in methods on FlightServerBase 
to start server and wait for it to be available
 Key: ARROW-6391
 URL: https://issues.apache.org/jira/browse/ARROW-6391
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC, Python
Reporter: Wes McKinney
 Fix For: 0.15.0


It seems like this logic could be a part of the library / made general purpose
to make it more convenient to spawn servers in Python:

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_flight.py#L414
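
A sketch of what such a built-in helper might look like (the `serve`/`shutdown` method names are placeholders for whatever FlightServerBase actually exposes; the point is starting the server in a background thread and blocking until the port accepts connections):

{code:python}
import contextlib
import socket
import threading
import time

@contextlib.contextmanager
def running_server(server, port, timeout=5.0):
    # Start the server in a background thread.
    thread = threading.Thread(target=server.serve, daemon=True)
    thread.start()
    # Block until the port accepts connections (or the timeout expires).
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            socket.create_connection(("localhost", port), timeout=0.1).close()
            break
        except OSError:
            time.sleep(0.05)
    try:
        yield server
    finally:
        server.shutdown()
        thread.join()
{code}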



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight

2019-08-29 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6390:
---

 Summary: [Python][Flight] Add Python documentation / tutorial for 
Flight
 Key: ARROW-6390
 URL: https://issues.apache.org/jira/browse/ARROW-6390
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC, Python
Reporter: Wes McKinney
 Fix For: 0.15.0


There is no Sphinx documentation for using Flight from Python. I have found
that writing documentation is an effective way to uncover usability problems --
I would suggest we write comprehensive documentation for using Flight from
Python as a way to refine the public Python API.
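
For reference, the kind of snippet such a tutorial would likely cover, retrieving a dataset from a hypothetical Flight service (method names follow pyarrow.flight; exact signatures and the service location are assumptions):

{code:python}
import pyarrow.flight as flight

# Connect to a (hypothetical) Flight service and fetch a dataset by path.
client = flight.connect("grpc://localhost:8815")
descriptor = flight.FlightDescriptor.for_path("example.parquet")
info = client.get_flight_info(descriptor)
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all()
print(table.num_rows)
{code}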



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6389) java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]

2019-08-29 Thread Ben Schreck (Jira)
Ben Schreck created ARROW-6389:
--

 Summary: java.io.IOException: No FileSystem for scheme: hdfs [On 
AWS EMR]
 Key: ARROW-6389
 URL: https://issues.apache.org/jira/browse/ARROW-6389
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java, Python
Affects Versions: 0.14.1
 Environment: Hadoop 2.85
EMR 5.24.1
python version: 3.7.4
skein version: 0.8.0
Reporter: Ben Schreck


I can't access HDFS through pyarrow (from inside a YARN container created by
skein).

This code works in a Jupyter notebook running on the master node, or in an
IPython terminal on a worker when given the ARROW_LIBHDFS_DIR env var:

{code:python}
import pyarrow
pyarrow.hdfs.connect()
{code}

 

However, when running on YARN by submitting the following skein application, I
get a Java error.

 

{code:yaml}
name: test_conn
queue: default

master:
  env:
    ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native
    JAVA_HOME: /etc/alternatives/jre
  resources:
    vcores: 1
    memory: 10 GiB
  files:
    conda_env: /home/hadoop/environment.tar.gz
  script: |
    echo $HADOOP_HOME
    echo $JAVA_HOME
    echo $HADOOP_CLASSPATH
    echo $ARROW_LIBHDFS_DIR
    source conda_env/bin/activate
    python -c "import pyarrow; fs = pyarrow.hdfs.connect(); print(fs.open('test.txt').read())"
    echo "Hello World!"
{code}

FYI I tried with/without all those extra env vars, to no effect. I also tried
modifying the EMR cluster configuration with each of the following:

 

{{"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs"
"fs.AbstractFileSystem.hdfs.impl": 
"org.apache.hadoop.hdfs.DistributedFileSystem"
"fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"}}

The {{fs.AbstractFileSystem.hdfs.impl}} one gave a slightly different error: it
was able to determine by name which class to use for the "hdfs://" prefix,
namely {{org.apache.hadoop.hdfs.DistributedFileSystem}}, but it was not able to
find that class.
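
For reference, a hedged guess at a workaround (not verified on EMR): libhdfs needs the Hadoop jars on the JVM classpath, which can be populated inside the container from `hadoop classpath --glob` before connecting:

{code:python}
import os
import subprocess

import pyarrow

# Assumption: the `hadoop` CLI is available inside the YARN container.
os.environ["CLASSPATH"] = subprocess.check_output(
    ["hadoop", "classpath", "--glob"]).decode().strip()

fs = pyarrow.hdfs.connect()
print(fs.open("test.txt").read())
{code}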

Logs:

 

=
LogType:application.driver.log
Log Upload Time:Thu Aug 29 20:51:59 + 2019
LogLength:2635
Log Contents:
/usr/lib/hadoop
/usr/lib/jvm/java-openjdk
:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*

hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, 
kerbTicketCachePath=(NULL), userName=(NULL)) error:
java.io.IOException: No FileSystem for scheme: hdfs
at 
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439)
at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414)
at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411)
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_01/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py",
 line 215, in connect
extra_conf=extra_conf)
  File 
"/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_01/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py",
 line 40, in __init__
self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
Hello World!
End of LogType:application.driver.log

LogType:application.master.log
Log Upload Time:Thu Aug 29 20:51:59 

Re: [DISCUSS] Ternary logic

2019-08-29 Thread Ben Kietzman
Indeed it's not about sanitizing nulls; it's about how nulls should
interact with boolean (and other) expressions.

For purposes of discussion, I'm naming the current approach of propagating
null "NaN logic" (since all expressions involving NaN evaluate to NaN).

To give some context for this discussion, I'm currently working on support
for filter expressions (ARROW-6243).

As an example of when this would come into play, let there be a dataset
spanning several files, where the older files have an IPV4 column while the
newer files also have an IPV6 column.
With NaN logic the expression (IPV4=="127.0.0.1" or IPV6=="::1") yields null
for all of the older files, since they lack an IPV6 column (regardless of
their IPV4 column), which seems undesirable.

Could you explain what you mean by "safest"?
Under NaN logic, the Kleene result can be recovered with
(coalesce(IPV4=="127.0.0.1", false) or coalesce(IPV6=="::1", false)).
Under Kleene logic, the NaN result can be recovered with (case IPV4 is null
or IPV6 is null when 1 then null else IPV4=="127.0.0.1" or IPV6=="::1" end).
I don't think we're losing information either way.

I'm not attached to either system but I'd like to understand and document
the rationale behind our choice.
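
For concreteness, a small truth-table sketch in plain Python (None standing in
for a null slot; this is not the Arrow kernels) contrasting Kleene AND with the
null-propagating "NaN logic" described above:

    def kleene_and(x, y):
        # Kleene/SQL logic: null means "unknown", so (null AND false) is false.
        if x is False or y is False:
            return False
        if x is None or y is None:
            return None
        return True

    def nan_style_and(x, y):
        # "NaN logic": any null operand propagates to the result.
        if x is None or y is None:
            return None
        return x and y

    for x in (True, False, None):
        for y in (True, False, None):
            print(x, y, kleene_and(x, y), nan_style_and(x, y))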

On Thu, Aug 29, 2019 at 1:14 PM Antoine Pitrou  wrote:

>
> IIUC it's not about sanitizing to false.  Ben explained it in more
> detail in private to me, perhaps he wants to copy that explanation here ;-)
>
> Regards
>
> Antoine.
>
>
> Le 29/08/2019 à 19:05, Wes McKinney a écrit :
> > hi Ben,
> >
> > My instinct is that always propagating null (at least by default) is
> > the safest choice. Applications can choose to sanitize null to false
> > if that's what they want semantically.
> >
> > - Wes
> >
> > On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman 
> wrote:
> >>
> >> To my knowledge, there isn't explicit documentation on how null slots
> in an
> >> array should be interpreted. SQL uses Kleene logic, wherein a null is
> >> explicitly an unknown rather than a special value. This yields for
> example
> >> `(null AND false) -> false`, since `(x AND false) -> false` for all
> >> possible values of x. This is also the behavior of Gandiva's boolean
> >> expressions.
> >>
> >> By contrast the boolean kernels implement something closer to the
> behavior
> >> of NaN: `(null AND false) -> null`. I think this is simply an error in
> the
> >> boolean kernels but in any case I think explicit documentation should be
> >> added to prevent future confusion.
> >>
> >> https://issues.apache.org/jira/browse/ARROW-6386
>


Re: [DISCUSS] Ternary logic

2019-08-29 Thread Antoine Pitrou


IIUC it's not about sanitizing to false.  Ben explained it in more
detail in private to me, perhaps he wants to copy that explanation here ;-)

Regards

Antoine.


Le 29/08/2019 à 19:05, Wes McKinney a écrit :
> hi Ben,
> 
> My instinct is that always propagating null (at least by default) is
> the safest choice. Applications can choose to sanitize null to false
> if that's what they want semantically.
> 
> - Wes
> 
> On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman  wrote:
>>
>> To my knowledge, there isn't explicit documentation on how null slots in an
>> array should be interpreted. SQL uses Kleene logic, wherein a null is
>> explicitly an unknown rather than a special value. This yields for example
>> `(null AND false) -> false`, since `(x AND false) -> false` for all
>> possible values of x. This is also the behavior of Gandiva's boolean
>> expressions.
>>
>> By contrast the boolean kernels implement something closer to the behavior
>> of NaN: `(null AND false) -> null`. I think this is simply an error in the
>> boolean kernels but in any case I think explicit documentation should be
>> added to prevent future confusion.
>>
>> https://issues.apache.org/jira/browse/ARROW-6386


Re: [DISCUSS] Ternary logic

2019-08-29 Thread Wes McKinney
hi Ben,

My instinct is that always propagating null (at least by default) is
the safest choice. Applications can choose to sanitize null to false
if that's what they want semantically.

- Wes

On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman  wrote:
>
> To my knowledge, there isn't explicit documentation on how null slots in an
> array should be interpreted. SQL uses Kleene logic, wherein a null is
> explicitly an unknown rather than a special value. This yields for example
> `(null AND false) -> false`, since `(x AND false) -> false` for all
> possible values of x. This is also the behavior of Gandiva's boolean
> expressions.
>
> By contrast the boolean kernels implement something closer to the behavior
> of NaN: `(null AND false) -> null`. I think this is simply an error in the
> boolean kernels but in any case I think explicit documentation should be
> added to prevent future confusion.
>
> https://issues.apache.org/jira/browse/ARROW-6386


[jira] [Created] (ARROW-6388) [C++] Consider implementing BufferOutputStream using BufferBuilder internally

2019-08-29 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6388:
---

 Summary: [C++] Consider implementing BufferOutputStream using 
BufferBuilder internally
 Key: ARROW-6388
 URL: https://issues.apache.org/jira/browse/ARROW-6388
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


See discussion in ARROW-6381 https://github.com/apache/arrow/pull/5222



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6387) [Archery] Errors with make

2019-08-29 Thread Omer Ozarslan (Jira)
Omer Ozarslan created ARROW-6387:


 Summary: [Archery] Errors with make
 Key: ARROW-6387
 URL: https://issues.apache.org/jira/browse/ARROW-6387
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Omer Ozarslan


{{archery --debug benchmark run}} gives an error on Debian 10 (CMake 3.13.4, GNU
Make 4.2.1):
{code}
(.venv) omer@omer ~/src/ext/arrow/cpp/build master ● archery --debug benchmark run
DEBUG:archery:Running benchmark WORKSPACE
DEBUG:archery:Executing `['/usr/bin/cmake', '-GMake',
'-DCMAKE_EXPORT_COMPILE_COMMANDS=ON', '-DCMAKE_BUILD_TYPE=release',
'-DBUILD_WARNING_LEVEL=production', '-DARROW_BUILD_TESTS=ON',
'-DARROW_BUILD_BENCHMARKS=ON', '-DARROW_PYTHON=OFF', '-DARROW_PARQUET=OFF',
'-DARROW_GANDIVA=OFF', '-DARROW_PLASMA=OFF', '-DARROW_FLIGHT=OFF',
'/home/omer/src/ext/arrow/cpp']`
CMake Error: Could not create named generator Make

Generators
  Unix Makefiles                   = Generates standard UNIX makefiles.
  Ninja                            = Generates build.ninja files.
  Watcom WMake                     = Generates Watcom WMake makefiles.
  CodeBlocks - Ninja               = Generates CodeBlocks project files.
  CodeBlocks - Unix Makefiles      = Generates CodeBlocks project files.
  CodeLite - Ninja                 = Generates CodeLite project files.
  CodeLite - Unix Makefiles        = Generates CodeLite project files.
  Sublime Text 2 - Ninja           = Generates Sublime Text 2 project files.
  Sublime Text 2 - Unix Makefiles  = Generates Sublime Text 2 project files.
  Kate - Ninja                     = Generates Kate project files.
  Kate - Unix Makefiles            = Generates Kate project files.
  Eclipse CDT4 - Ninja             = Generates Eclipse CDT 4.0 project files.
  Eclipse CDT4 - Unix Makefiles    = Generates Eclipse CDT 4.0 project files.
Traceback (most recent call last):
[[[cropped]]]
{code}
After a trivial fix:
{code:java}
diff --git a/dev/archery/archery/utils/cmake.py b/dev/archery/archery/utils/cmake.py
index 38aedab2d..3150ea9a6 100644
--- a/dev/archery/archery/utils/cmake.py
+++ b/dev/archery/archery/utils/cmake.py
@@ -34,7 +34,7 @@ class CMake(Command):
 in the search path.
 """
 found_ninja = which("ninja")
-return "Ninja" if found_ninja else "Make"
+return "Ninja" if found_ninja else "Unix Makefiles"{code}
I get another error:
{code:java}
[[[cropped]]
-- Generating done
-- Build files have been written to: /tmp/arrow-bench-48x_yleb/WORKSPACE/build
DEBUG:archery:Executing `[None]`
Traceback (most recent call last):
  File "/home/omer/src/ext/arrow/.venv/bin/archery", line 11, in 
load_entry_point('archery', 'console_scripts', 'archery')()
  File 
"/home/omer/src/ext/arrow/.venv/lib/python3.7/site-packages/click/core.py", 
line 764, in __call__
return self.main(*args, **kwargs)
  File 
"/home/omer/src/ext/arrow/.venv/lib/python3.7/site-packages/click/core.py", 
line 717, in main
rv = self.invoke(ctx)
  File 
"/home/omer/src/ext/arrow/.venv/lib/python3.7/site-packages/click/core.py", 
line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
  File 
"/home/omer/src/ext/arrow/.venv/lib/python3.7/site-packages/click/core.py", 
line 1137, in invoke
return 

[jira] [Created] (ARROW-6386) [C++][Documentation] Explicit documentation of null slot interpretation

2019-08-29 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6386:


 Summary: [C++][Documentation] Explicit documentation of null slot 
interpretation
 Key: ARROW-6386
 URL: https://issues.apache.org/jira/browse/ARROW-6386
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


To my knowledge, there isn't explicit documentation on how null slots in an 
array should be interpreted. SQL uses Kleene logic, wherein a null is 
explicitly an unknown rather than a special value. This yields for example 
`(null AND false) -> false`, since `(x AND false) -> false` for all possible 
values of x. This is also the behavior of Gandiva's boolean expressions.

By contrast the boolean kernels implement something closer to the behavior of 
NaN: `(null AND false) -> null`. I think this is simply an error in the boolean 
kernels but in any case I think explicit documentation should be added to 
prevent future confusion.





--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[DISCUSS] Ternary logic

2019-08-29 Thread Ben Kietzman
To my knowledge, there isn't explicit documentation on how null slots in an
array should be interpreted. SQL uses Kleene logic, wherein a null is
explicitly an unknown rather than a special value. This yields for example
`(null AND false) -> false`, since `(x AND false) -> false` for all
possible values of x. This is also the behavior of Gandiva's boolean
expressions.

By contrast the boolean kernels implement something closer to the behavior
of NaN: `(null AND false) -> null`. I think this is simply an error in the
boolean kernels but in any case I think explicit documentation should be
added to prevent future confusion.

https://issues.apache.org/jira/browse/ARROW-6386


Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)

2019-08-29 Thread Ji Liu
Here is the Java implementation
https://github.com/apache/arrow/pull/5229

cc @Wes McKinney @emkornfield

Thanks,
Ji Liu


--
From:Ji Liu 
Send Time: Wednesday, August 28, 2019 17:34
To:emkornfield ; dev 
Cc:Paul Taylor 
Subject:Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte 
Flatbuffer alignment requirements (2nd vote)

I could take the Java implementation and will take a close watch on this issue 
in the next few days.

Thanks,
Ji Liu


--
From:Micah Kornfield 
Send Time: Wednesday, August 28, 2019 17:14
To:dev 
Cc:Paul Taylor 
Subject:Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte 
Flatbuffer alignment requirements (2nd vote)

I should have integration tests with 0.14.1-generated binaries in the next
few days. I think the one remaining unassigned piece of work is the Java
implementation; I can take that up next if no one else gets to it.

On Tue, Aug 27, 2019 at 7:19 PM Wes McKinney  wrote:

> Here's the C++ changes
>
> https://github.com/apache/arrow/pull/5211
>
> I'm going to create an integration branch where we can merge each patch
> before merging to master
>
> On Fri, Aug 23, 2019 at 9:03 AM Wes McKinney  wrote:
> >
> > It isn't implemented in C++ yet but I will try to get a patch up for
> > that soon (today maybe). I think we should create a branch where we
> > can stack the patches that implement this for each language.
> >
> > On Fri, Aug 23, 2019 at 4:04 AM Paul Taylor 
> wrote:
> > >
> > > I'll do the JS updates. Is it safe to validate against the Arrow C++
> > > integration tests?
> > >
> > >
> > > On 8/22/19 7:28 PM, Micah Kornfield wrote:
> > > > I created https://issues.apache.org/jira/browse/ARROW-6313 as a
> tracking
> > > > issue with sub-issues on the development work.  So far no-one has
> claimed
> > > > Java and Javascript tasks.
> > > >
> > > > Would it make sense to have a separate dev branch for this work?
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Thu, Aug 22, 2019 at 3:24 PM Wes McKinney 
> wrote:
> > > >
> > > >> The vote carries with 4 binding +1 votes and 1 non-binding +1
> > > >>
> > > >> I'll merge the specification patch later today and we can begin
> > > >> working on implementations so we can get this done for 0.15.0
> > > >>
> > > >> On Tue, Aug 20, 2019 at 12:30 PM Bryan Cutler 
> wrote:
> > > >>> +1 (non-binding)
> > > >>>
> > > >>> On Tue, Aug 20, 2019, 7:43 AM Antoine Pitrou 
> > > >> wrote:
> > >  Sorry, had forgotten to send my vote on this.
> > > 
> > >  +1 from me.
> > > 
> > >  Regards
> > > 
> > >  Antoine.
> > > 
> > > 
> > >  On Wed, 14 Aug 2019 17:42:33 -0500
> > >  Wes McKinney  wrote:
> > > > hi all,
> > > >
> > > > As we've been discussing [1], there is a need to introduce 4
> bytes of
> > > > padding into the preamble of the "encapsulated IPC message"
> format to
> > > > ensure that the Flatbuffers metadata payload begins on an 8-byte
> > > > aligned memory offset. The alternative to this would be for Arrow
> > > > implementations where alignment is important (e.g. C or C++) to
> copy
> > > > the metadata (which is not always small) into memory when it is
> > > > unaligned.
> > > >
> > > > Micah has proposed to address this by adding a
> > > > 4-byte "continuation" value at the beginning of the payload
> > > > having the value 0xFFFFFFFF. The reason to do it this way is that
> > > > old clients will see an invalid length (what is currently the
> > > > first 4 bytes of the message -- a 32-bit little endian signed
> > > > integer indicating the metadata length) rather than potentially
> > > > crashing on a valid length. We also propose to expand the "end of
> > > > stream" marker used in the stream and file format from 4 to 8
> > > > bytes. This has the additional effect of aligning the file footer
> > > > defined in File.fbs.
> > > >
> > > > This would be a backwards incompatible protocol change, so older
> > > >> Arrow
> > > > libraries would not be able to read these new messages.
> Maintaining
> > > > forward compatibility (reading data produced by older libraries)
> > > >> would
> > > > be possible as we can reason that a value other than the
> continuation
> > > > value was produced by an older library (and then validate the
> > > > Flatbuffer message of course). Arrow implementations could offer
> a
> > > > backward compatibility mode for the sake of old readers if they
> > > >> desire
> > > > (this may also assist with testing).
> > > >
> > > > Additionally with this vote, we want to formally approve the
> change
> > > >> to
> > > > the Arrow "file" format to always write the (new 8-byte)
> > > >> end-of-stream
> > > > marker, which enables code that processes Arrow streams to safely
> > > >> read
> > > > the file's 
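
For reference, a rough sketch (plain Python; padding details are an assumption
beyond what the proposal above states) of writing the proposed message
preamble -- a 4-byte 0xFFFFFFFF continuation marker followed by the
little-endian int32 metadata length -- and the expanded 8-byte end-of-stream
marker:

    import struct

    CONTINUATION = 0xFFFFFFFF

    def write_message(sink, metadata: bytes, body: bytes = b""):
        # Continuation marker plus int32 length puts the Flatbuffer metadata
        # on an 8-byte boundary relative to the start of the message.
        padded_len = (len(metadata) + 7) & ~7  # assumed 8-byte padding rule
        sink.write(struct.pack("<I", CONTINUATION))
        sink.write(struct.pack("<i", padded_len))
        sink.write(metadata.ljust(padded_len, b"\x00"))
        sink.write(body)

    def write_eos(sink):
        # 8-byte end-of-stream marker: continuation value, then zero length.
        sink.write(struct.pack("<I", CONTINUATION))
        sink.write(struct.pack("<i", 0))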

[PROPOSAL] Consolidate Arrow's CI configuration

2019-08-29 Thread Krisztián Szűcs
Hi,

Arrow's current continuous integration setup utilizes multiple CI providers,
tools, and scripts:

 - Unit tests are running on Travis and Appveyor
 - Binary packaging builds are running on crossbow, an abstraction over
   multiple CI providers driven through a GitHub repository
 - For local tests and tasks, there is a docker-compose setup, or of course
   you can maintain your own environment

This setup has run into some limitations:
 - It’s slow: the CI parallelism of Travis has degraded over the last couple
   of months. Testing a PR takes more than an hour, which is a long time for
   both the maintainers and the contributors, and it has a negative effect on
   the development throughput.
 - Build configurations are not portable; they are tied to specific services.
   You can’t just take a Travis script and run it somewhere else.
 - Because they’re not portable, build configurations are duplicated in
   several places.
 - The Travis, Appveyor and crossbow builds are not reproducible locally, so
   developing them requires slow git push cycles.
 - Public CI has limited platform support; for example, ARM machines are not
   available.
 - Public CI also has limited hardware support; no GPUs are available.

Resolving all of the issues above is complicated, but it is a must for the
long-term sustainability of Arrow.

For some time, we’ve been working on a tool called Ursabot [1], a library on
top of the CI framework Buildbot [2]. Buildbot is well maintained and widely
used for complex projects, including CPython, Webkit, LLVM, MariaDB, etc.
Buildbot is not another hosted CI service like Travis or Appveyor: it is an
extensible framework to implement various automations like continuous
integration tasks.

You’ve probably noticed additional “Ursabot” builds appearing on pull
requests, in addition to the Travis and Appveyor builds. We’ve been testing
the framework with a fully featured CI server at ci.ursalabs.org. This
service runs build configurations we can’t run on Travis, does it faster
than Travis, and has the GitHub comment bot integration for ad hoc build
triggering.

While we’re not prepared to propose moving all CI to a self-hosted setup, our
work has demonstrated the potential of using buildbot to resolve Arrow’s
continuous integration challenges:
 - The docker-based builders reuse docker images, which eliminates slow
   dependency installation steps. Some builds on this setup, run on Ursa
   Labs’s infrastructure, run 20 minutes faster than the comparable
   Travis-CI jobs.
 - It’s scalable. We can deploy buildbot wherever we like and add more
   masters and workers, which we can’t do with public CI.
 - It’s platform and CI-provider independent. Builds can be run on arbitrary
   architectures, operating systems, and hardware: Python is the only
   requirement. Additionally, builds specified in buildbot/ursabot can be run
   anywhere: not only on custom buildbot infrastructure but also on Travis,
   or even on your own machine.
 - It improves reproducibility and encourages consolidation of configuration.
   You can run the exact job locally that ran on Travis, and you can even get
   an interactive shell in the build so you can debug a test failure. And
   because you can run the same job anywhere, we wouldn’t need to keep
   duplicated, Travis-specific or docker-compose build configurations stored
   separately.
 - It’s extensible. More exotic features like a comment bot, a benchmark
   database, a benchmark dashboard, an artifact store, or integration with
   other systems are easily implementable within the same system.

I’m proposing to donate the build configuration we’ve been iterating on in
Ursabot to the Arrow codebase. Here [3] is a patch that adds the
configuration. This will enable us to explore consolidating build
configuration using the buildbot framework. A next step after that would be
to port a Travis build to Ursabot and, in the Travis configuration, execute
the build with the shell command `$ ursabot project build `. This is the
same way we would be able to execute the build locally--something we can’t
currently do with the Travis builds.

I am not proposing here that we stop using Travis-CI and Appveyor to run CI
for apache/arrow, though that may well be a direction we choose to go in the
future. Moving build configuration into something like buildbot would be a
necessary first step to do that; that said, there are other immediate
benefits to be had by porting build configuration into buildbot: local
reproducibility, consolidation of build logic, independence from a particular
CI provider, and ease of using and maintaining faster, Docker-based jobs.
Self-hosting CI brings a number of other challenges, which we will
concurrently continue to explore, but we believe that there are benefits to
adopting buildbot build configuration regardless.
Regards, Krisztian

[1]: https://github.com/ursa-labs/ursabot
[2]: https://buildbot.net
 https://docs.buildbot.net
 

[jira] [Created] (ARROW-6385) [C++] Investigate xxh3

2019-08-29 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6385:
-

 Summary: [C++] Investigate xxh3
 Key: ARROW-6385
 URL: https://issues.apache.org/jira/browse/ARROW-6385
 Project: Apache Arrow
  Issue Type: Task
  Components: Benchmarking, C++
Reporter: Antoine Pitrou


xxh3 is a new hash algorithm by Yann Collet that claims excellent speed on both
tiny and large keys. It has accelerated paths for x86 SSE2, AVX and ARM NEON.
It also has excellent hash quality.
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html

Perhaps this can replace our current complex strategy involving a custom tiny 
string hashing implementation, a HW CRC32-based path where available for large 
strings, and a murmurhash2 fallback.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6384) [C++] Bump dependencies

2019-08-29 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6384:
-

 Summary: [C++] Bump dependencies
 Key: ARROW-6384
 URL: https://issues.apache.org/jira/browse/ARROW-6384
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6383) [Java] report outstanding child allocators on parent allocator close

2019-08-29 Thread Pindikura Ravindra (Jira)
Pindikura Ravindra created ARROW-6383:
-

 Summary: [Java] report outstanding child allocators on parent 
allocator close
 Key: ARROW-6383
 URL: https://issues.apache.org/jira/browse/ARROW-6383
 Project: Apache Arrow
  Issue Type: Task
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra


When a parent allocator is closed, we should report any child allocators that
are still outstanding. This helps in debugging memory leaks, as it will tell
whether the leak happened in the parent or a child.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6382) Unable to catch Python UDF exceptions when using PyArrow

2019-08-29 Thread Jan (Jira)
Jan created ARROW-6382:
--

 Summary: Unable to catch Python UDF exceptions when using PyArrow
 Key: ARROW-6382
 URL: https://issues.apache.org/jira/browse/ARROW-6382
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
 Environment: Ubuntu 18.04
Reporter: Jan


When PyArrow is enabled, Pandas UDF exceptions raised by the Executor become 
impossible to catch: see example below. Is this expected behavior?

If so, what is the rationale? If not, how do I fix this?

Confirmed behavior in PyArrow 0.11 and 0.14.1 (latest) and PySpark 2.4.0 and 
2.4.3. Python 3.6.5.

To reproduce:

{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
# setting this to false will allow the exception to be caught
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

@udf
def disrupt(x):
    raise Exception("Test EXCEPTION")

data = spark.createDataFrame(pd.DataFrame({"A": [1, 2, 3]}))
try:
    test = data.withColumn("test", disrupt("A")).toPandas()
except:
    print("exception caught")
print('end')
{code}

I would hope there's a way to catch the exception with the general except 
clause.

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [DISCUSS] Release cadence and release vote conventions

2019-08-29 Thread Micah Kornfield
Hi Andy,
Just curious, with the next release coming up, whether you have had a chance
to look into the Maven issues yet.

Thanks,
Micah

On Thu, Aug 1, 2019 at 2:04 PM Wes McKinney  wrote:

> I agree. In my experiences as RM I have found the involvement of Maven
> in the release process to be a nuisance. I think it makes more sense
> in Java-only projects
>
> On Thu, Aug 1, 2019 at 2:54 PM Andy Grove  wrote:
> >
> > I'll start taking a look at the maven issue. We might not want to use
> maven
> > release plugin given that we control the version number already across
> this
> > repository via other means.
> >
> > On Wed, Jul 31, 2019 at 4:26 PM Sutou Kouhei  wrote:
> >
> > > Hi,
> > >
> > > Sorry for not replying to this thread.
> > >
> > > I think that the biggest problem is related to our Java
> > > package.
> > >
> > >
> > > We'll be able to resolve the GPG key problem by creating a
> > > GPG key only for nightly release test. We can share the test
> > > GPG key publicly because it's just for testing.
> > >
> > > It'll work for our binary artifacts and APT/Yum repositories
> > > but not work for our Java package. I don't know where the GPG
> > > key is used in our Java package...
> > >
> > >
> > > We'll be able to resolve the Git commit problem by creating
> > > a cloned Git repository for test. It's done in our
> > > dev/release/00-prepare-test.rb[1].
> > >
> > > [1]
> > >
> https://github.com/apache/arrow/blob/master/dev/release/00-prepare-test.rb#L30
> > >
> > > The biggest problem for the Git commit is our Java package
> > > requires "apache-arrow-${VERSION}" tag on
> > > https://github.com/apache/arrow . (Right?)
> > > I think that "mvm release:perform" in
> > > dev/release/01-perform.sh does so but I don't know the
> > > details of "mvm release:perform"...
> > >
> > >
> > > More details:
> > >
> > > dev/release/00-prepare.sh:
> > >
> > > We'll be able to run this automatically when we can resolve
> > > the above GPG key problem in our Java package. We can
> > > resolve the Git commit problem by creating a cloned Git
> > > repository.
> > >
> > > dev/release/01-prepare.sh:
> > >
> > > We'll be able to run this automatically when we can resolve
> > > the above Git commit ("apche-arrow-${VERSION}" tag) problem
> > > in our Java package.
> > >
> > > dev/release/02-source.sh:
> > >
> > > We'll be able to run this automatically by creating a GPG
> > > key for nightly release test. We'll use Bintray to upload RC
> > > source archive instead of dist.apache.org. Ah, we need a
> > > Bintray API key for this. It must be secret.
> > >
> > > dev/release/03-binary.sh:
> > >
> > > We'll be able to run this automatically by creating a GPG
> > > key for nightly release test. We need a Bintray API key too.
> > >
> > > We need to improve this to support nightly release test. It
> > > uses "XXX-rc" such as "debian-rc" for the Bintray "package" name.
> > > It should use "XXX-nightly" such as "debian-nightly" for
> > > nightly release test instead.
> > >
> > > dev/release/post-00-release.sh:
> > >
> > > We'll be able to skip this.
> > >
> > > dev/release/post-01-upload.sh:
> > >
> > > We'll be able to skip this.
> > >
> > > dev/release/post-02-binary.sh:
> > >
> > > We'll be able to run this automatically by creating Bintray
> > > "packages" for nightly release and use them. We can create
> > > "XXX-nightly-release" ("debian-nightly-release") Bintray
> > > "packages" and use them instead of "XXX" ("debian") Bintray
> > > "packages".
> > >
> > > "debian" Bintray "package": https://bintray.com/apache/debian/
> > >
> > > We need to improve this to support nightly release.
> > >
> > > dev/release/post-03-website.sh:
> > >
> > > We'll be able to run this automatically by creating a cloned
> > > Git repository for test.
> > >
> > > It's better that we have a Web site to show generated pages.
> > > We can create
> > > https://github.com/apache/arrow-site/tree/asf-site/nightly
> > > and use it but I don't like it, because arrow-site would gain
> > > a commit day by day.
> > > Can we prepare a Web site for this? (arrow-nightly.ursalabs.org?)
> > >
> > > dev/release/post-04-rubygems.sh:
> > >
> > > We may be able to use GitHub Package Registry[2] to upload
> > > RubyGems. We can use "pre-release" package feature of
> > > https://rubygems.org/ but it's not suitable for
> > > nightly. It's for RC or beta release.
> > >
> > > [2]
> https://github.blog/2019-05-10-introducing-github-package-registry/
> > >
> > > dev/release/post-05-js.sh:
> > >
> > > We may be able to use GitHub Package Registry[2] to upload
> > > npm packages.
> > >
> > > dev/release/post-06-csharp.sh:
> > >
> > > We may be able to use GitHub Package Registry[2] to upload
> > > NuGet packages.
> > >
> > > dev/release/post-07-rust.sh:
> > >
> > > I don't have any idea, but it must be run
> > > automatically. It always fails; I needed to run each
> > > command manually.
> > >
> > > dev/release/post-08-remove-rc.sh:
> > >
> > > We'll be able to skip this.
> > >
> > >
> > > Thanks,
> > > 

Re: [Format] Semantics for dictionary batches in streams

2019-08-29 Thread Micah Kornfield
>
>
> > I was thinking the file format must satisfy one of two conditions:
> > 1.  Exactly one dictionarybatch per encoded column
> > 2.  DictionaryBatches are interleaved correctly.

> Could you clarify?

I think you clarified it very well :) My motivation for suggesting the
additional complexity is that I see two use cases for the file format. These
roughly correspond to the two options I suggested:
1.  We are encoding data from scratch. In this case, it seems like all
dictionaries would be built incrementally, would not need replacement, and
we could write them at the end of the file [1].

2.  The data being written out is essentially a "tee" off of some stream
that is generating new dictionaries requiring replacement on the fly (e.g.
reading back two Parquet files).

> It might be better to disallow replacements
> in the file format (which does introduce semantic slippage between the
> file and stream formats as Antoine was saying).

It is certainly possible to accept the slippage from the stream format for
now and add this capability later, since it should be forwards compatible.

Thanks,
Micah

[1] There is also a medium-complexity option where we require one non-delta
dictionary and as many delta dictionaries as the user wants.
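
For concreteness, a small plain-Python sketch (hypothetical helper, not an
Arrow API) of deciding between a delta and a replacement for a given
dictionary id:

    def classify_dictionary_update(old, new):
        # If the prior dictionary is a prefix of the new one, only the new
        # values need to be sent as a delta DictionaryBatch; otherwise the
        # dictionary must be replaced (or indices unified/permuted) first.
        if len(new) >= len(old) and list(new[:len(old)]) == list(old):
            return "delta", list(new[len(old):])
        return "replace", list(new)

    # Prior dictionary ['A', 'B', 'C'] followed by ['A', 'B', 'C', 'D', 'E']
    # can be expressed as the delta ['D', 'E'] ...
    assert classify_dictionary_update(list("ABC"), list("ABCDE")) == ("delta", ["D", "E"])
    # ... while a permutation such as ['C', 'A', 'B'] cannot.
    assert classify_dictionary_update(list("ABC"), list("CAB")) == ("replace", ["C", "A", "B"])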

On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney  wrote:

> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield 
> wrote:
> >
> > I was thinking the file format must satisfy one of two conditions:
> > 1.  Exactly one dictionarybatch per encoded column
> > 2.  DictionaryBatches are interleaved correctly.
>
> Could you clarify? In the first case, there is no issue with
> dictionary replacements. I'm not sure about the second case -- if a
> dictionary id appears twice, then you'll see it twice in the file
> footer. I suppose you could look at the file offsets to determine
> whether a dictionary batch precedes a particular record batch block
> (to know which dictionary you should be using), but that's rather
> complicated to implement. It might be better to disallow replacements
> in the file format (which does introduce semantic slippage between the
> file and stream formats as Antoine was saying).
>
> >
> > On Tuesday, August 27, 2019, Wes McKinney  wrote:
> >
> > > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou 
> wrote:
> > > >
> > > >
> > > > Le 27/08/2019 à 22:31, Wes McKinney a écrit :
> > > > > So the current situation we have right now in C++ is that if we
> tried
> > > > > to create an IPC stream from a sequence of record batches that
> don't
> > > > > all have the same dictionary, we'd run into two scenarios:
> > > > >
> > > > > * Batches that either have a prefix of a prior-observed
> dictionary, or
> > > > > the prior dictionary is a prefix of their dictionary. For example,
> > > > > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
> > > > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> > > > > such case we could compute and send a delta batch
> > > > >
> > > > > * Batches with a dictionary that is a permutation of values, and
> > > > > possibly new unique values.
> > > > >
> > > > > In this latter case, without the option of replacing an existing
> ID in
> > > > > the stream, we would have to do a unification / permutation of
> indices
> > > > > and then also possibly send a delta batch. We should probably have
> > > > > code at some point that deals with both cases, but in the meantime
> I
> > > > > would like to allow dictionaries to be redefined in this case.
> Seems
> > > > > like we might need a vote to formalize this?
> > > >
> > > > Isn't the stream format deviating from the file format then?  In the
> > > > file format, IIUC, dictionaries can appear after the respective
> record
> > > > batches, so there's no way to tell whether the original or redefined
> > > > version of a dictionary is being referred to.
> > >
> > > You make a good point -- we can consider changes to the file format to
> > > allow for record batches to have different dictionaries. Even handling
> > > delta dictionaries with the current file format would be a bit tedious
> > > (though not indeterminate)
> > >
> > > > Regards
> > > >
> > > > Antoine.
> > >
>