Re: Human-readable version of Arrow Schema?

2019-12-23 Thread Micah Kornfield
>
> If we were to make the same kinds of forward/backward compatibility
> guarantees as with Flatbuffers it could create a lot of work for
> maintainers.

Does it pay to follow up with the Flatbuffers project to understand whether the
forward/backward compatibility guarantees that Flatbuffers provides extend to
its JSON format?

On Sun, Dec 15, 2019 at 11:17 AM Wes McKinney wrote:

> I'd be open to looking at a proposal for a human-readable text
> representation, but I'm definitely wary about making any kind of
> cross-version compatibility guarantees (beyond "we will do our best").
> If we were to make the same kinds of forward/backward compatibility
> guarantees as with Flatbuffers it could create a lot of work for
> maintainers.
>
> On Thu, Dec 12, 2019 at 12:43 AM Micah Kornfield wrote:
> >
> > >
> > > With these two together, it would seem not too difficult to create a text
> > > representation for Arrow schemas that (at some point) has some
> > > compatibility guarantees, but maybe I'm missing something?
> >
> >
> > I think the main risk is if somehow flatbuffers JSON parsing doesn't handle
> > backward compatible changes to the arrow schema message.  Given the way the
> > documentation is describing the JSON functionality I think this would be
> > considered a bug.
> >
> > The one downside to calling the "schema" canonical is the flatbuffers JSON
> > functionality only appears to be available in C++ and Java via JNI, so it
> > wouldn't have cross language support.  I think this issue is more one of
> > semantics though (i.e. does the JSON description become part of the "Arrow
> > spec" or does it live as a C++/Python only feature).
> >
> > -Micah
> >
> >
> > On Tue, Dec 10, 2019 at 10:51 AM Christian Hudon wrote:
> >
> > > Micah: I didn't know that Flatbuffers supported serialization to/from JSON,
> > > thanks. That seems like a very good start, at least. I'll aim to create a
> > > draft pull request that at least wires everything up in Arrow so we can
> > > load/save a Schema.fbs instance from/to JSON. At least it'll make it easier
> > > for me to see how Arrow schemas would look in JSON with that.
> > >
> > > Otherwise, I'm still gathering requirements internally here. For example,
> > > one thing that would be nice would be to be able to output a JSON Schema
> > > from at least a subset of the Arrow schema. (That way our users could start
> > > by passing around JSON with a given schema, and transition pieces of a
> > > workflow to Arrow as they're ready.) But that part can also be done outside
> > > of the Arrow code, if deemed not relevant to have in the Arrow codebase
> > > itself.
> > >
> > > One core requirement for us, however, would be eventual compatibility
> > > between Arrow versions for a given text representation of a schema.
> > > Meaning, if you have a text description of a given Arrow schema, you can
> > > load it into different versions of Arrow and it creates a valid Schema
> > > Flatbuffer description, that Arrow can use. Wes, were you thinking of that,
> > > or of something else, when you wrote "only makes sense if it is offered
> > > without any backward/forward compatibility guarantees"?
> > >
> > > For now, to me, assuming the JSON serialization done by the Flatbuffers
> > > libraries is usable, it seems we have all the pieces to make this happen:
> > > 1) The binary Schema.fbs data structures have to be compatible between
> > > different versions of Arrow, otherwise two processes with different Arrow
> > > versions won't be able to interoperate, no?
> > > 2) The Flatbuffer <-> JSON serialization supplied by the Flatbuffers
> > > library also has to be compatible between different versions of the
> > > Flatbuffers library, since the main use case seems to be storing
> > > Flatbuffers assets into version control. Breaking changes there will also
> > > be painful to their users.
> > >
> > > With these two together, it would seem not too difficult to create a text
> > > representation for Arrow schemas that (at some point) has some
> > > compatibility guarantees, but maybe I'm missing something?
> > >
> > > Thanks,
> > >
> > >   Christian
> > >
> > > On Mon, Dec 9, 2019 at 7:00 AM Wes McKinney wrote:
> > >
> > > > The only "canonical" representation of schemas at the moment is the
> > > > Flatbuffers data structure [1]
> > > >
> > > > Having a human-readable/parseable text representation I think only
> > > > makes sense if it is offered without any backward/forward
> > > > compatibility guarantees.
> > > >
> > > > Note I had previously opened
> > > > https://issues.apache.org/jira/browse/ARROW-3730 where I noted that
> > > > there's no way (aside from generating the Flatbuffers messages) to
> > > > generate a schema representation that can be used later to reconstruct
> > > > a schema in a program. If such a representation were human
> > > > readable/editable that seems beneficial.
> > > >
> > > >
> > > >
> > > > [1]: http

Re: [DISCUSS][Java] Enhance code style checking for Java code

2019-12-23 Thread Micah Kornfield
Hi Liya Fan,
Thank you for the PR and starting the discussion. Sorry if I missed it, but
is there something in the PR to prevent these problems from reappearing?

My personal preference would be to find a mechanism to enforce code style.
Run it once (and accept it might be a large change) and from that point
forward validate the format as part of CI.

Cheers,
Micah

On Wed, Dec 18, 2019 at 1:44 AM Fan Liya wrote:

> Dear all,
>
> We want to enhance the Java code style checking.
>
> This is due to a discussion in [1]. In the discussion, we found the current
> style checking for Java code is not sufficient. So we want to enhance it in
> a series of "small" steps, in order to avoid having to change too many
> files at once.
>
> Currently, we have opened a JIRA [2] to track the problem of consecutive
> spaces between tokens. An initial PR [3] is prepared. Please see if it
> looks good to you.
>
>
> Thank you in advance.
>
> Best regards,
>
> Liya Fan
>
>
> [1] https://github.com/apache/arrow/pull/5861#discussion_r348917065
> [2] https://issues.apache.org/jira/browse/ARROW-7429
> [3] https://github.com/apache/arrow/pull/6060
>


Re: [C++][Compute] RFC: add SIMD support to C++ kernel

2019-12-23 Thread Micah Kornfield
I would lean against adding another library dependency.  My main concerns
with adding another library dependency are:
1.  Supporting it across all of the build tool-chains (using a GCC specific
option would be my least favorite approach).
2.  Distributed binary size (for wheels at least people seem to care).

I would lean more towards yes if there were some real-world benchmarks
showing a substantial performance gain.

I don't think it is unreasonable to package our binaries targeting a common
instruction set (e.g. AVX 1 or 2).  For those who want to make full use of
their latest hardware, compiling from source doesn't seem unreasonable,
especially given the recent effort to trim dependencies.

Cheers,
Micah



On Fri, Dec 20, 2019 at 2:13 AM Antoine Pitrou wrote:

>
> Hi,
>
> I would recommend against reinventing the wheel.  It would be possible
> to reuse an existing C++ SIMD library.  There are several of them (Vc,
> xsimd, libsimdpp...).  Of course, "just use Gandiva" is another possible
> answer.
>
> Regards
>
> Antoine.
>
>
> On 20/12/2019 at 08:32, Yibo Cai wrote:
> > Hi,
> >
> > I'm investigating adding SIMD support to the C++ compute kernels (not Gandiva).
> >
> > A typical case is the sum kernel [1]. The tight loop below can be easily
> > optimized with SIMD.
> >
> > for (int64_t i = 0; i < length; i++) {
> >   local.sum += values[i];
> > }
> >
> > The compiler already does loop vectorization, but it's done at compile time
> > without knowledge of the target CPU.
> > Binaries compiled with AVX-512 cannot run on older CPUs, while binaries
> > compiled with only SSE4 enabled are suboptimal on new hardware.
> >
> > I have some proposals and would like to hear comments from the community.
> >
> > - Based on our experience with the ISA-L [2] project (an optimized storage
> > acceleration library for x86 and Arm), a runtime dispatcher is a good
> > approach. Basically, it links in code optimized for different CPU
> > features (sse4, avx2, neon, ...) and selects the best one that fits the
> > target CPU at first invocation. This is similar to the GCC indirect
> > function [3], but doesn't depend on the compiler.
> >
> > - Use GCC FMV [4] to generate multiple binaries for one function. See the
> > sample source and compiled code [5].
> >   Though it looks simple, it has many limitations: it's a GCC-specific
> > feature with no support from clang and MSVC, and it only works on x86, with
> > no Arm support.
> >   I think this approach is a no-go.
> >
> > - Don't do it.
> >   Gandiva leverages LLVM JIT for runtime code optimization. Is it
> > duplicated effort to do it in the C++ kernels? Will these vectorizable
> > computations move to Gandiva in the future?
> >
> > [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L104-L106
> > [2] https://github.com/intel/isa-l
> > [3] https://willnewton.name/2013/07/02/using-gnu-indirect-functions/
> > [4] https://lwn.net/Articles/691932/
> > [5] https://godbolt.org/z/ajpuq_
> >
>
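
The runtime-dispatcher proposal in the quoted message can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `SumScalar`, `SumAvx2`, `ResolveSum`, `Sum` and the macro `SKETCH_HAVE_AVX2` are hypothetical and are not Arrow or ISA-L code. For brevity, CPU detection uses the GCC/clang builtin `__builtin_cpu_supports`; a dispatcher that must not depend on the compiler, as the proposal intends, would query CPUID (or the Arm equivalent) itself.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__GNUC__) && defined(__x86_64__)
#include <immintrin.h>
#define SKETCH_HAVE_AVX2 1
#else
#define SKETCH_HAVE_AVX2 0
#endif

// Portable fallback. The compiler may still auto-vectorize it for the
// baseline instruction set chosen at build time.
int64_t SumScalar(const int64_t* values, int64_t length) {
  int64_t sum = 0;
  for (int64_t i = 0; i < length; i++) sum += values[i];
  return sum;
}

#if SKETCH_HAVE_AVX2
// Only this function is compiled for AVX2, so the rest of the binary
// still runs on CPUs that lack it.
__attribute__((target("avx2")))
int64_t SumAvx2(const int64_t* values, int64_t length) {
  __m256i acc = _mm256_setzero_si256();
  int64_t i = 0;
  for (; i + 4 <= length; i += 4) {
    const __m256i v =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(values + i));
    acc = _mm256_add_epi64(acc, v);  // four 64-bit lanes per iteration
  }
  int64_t lanes[4];
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(lanes), acc);
  int64_t sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
  for (; i < length; i++) sum += values[i];  // leftover tail
  return sum;
}
#endif

using SumFn = int64_t (*)(const int64_t*, int64_t);

// Picks the best implementation the running CPU supports.
SumFn ResolveSum() {
#if SKETCH_HAVE_AVX2
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return SumAvx2;
#endif
  return SumScalar;
}

// Public entry point: the implementation is resolved once, at first
// invocation (function-local statics are initialized thread-safely
// since C++11).
int64_t Sum(const int64_t* values, int64_t length) {
  assert(length >= 0);
  static const SumFn impl = ResolveSum();
  return impl(values, length);
}
```

Callers only ever see `Sum`; adding an AVX-512 or NEON variant would mean one more function and one more branch in `ResolveSum`, at the cost of the extra code in the binary that Micah's reply is concerned about.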


[jira] [Created] (ARROW-7469) [C++] Improve division related bit operations

2019-12-23 Thread Liya Fan (Jira)
Liya Fan created ARROW-7469:
---

 Summary: [C++] Improve division related bit operations
 Key: ARROW-7469
 URL: https://issues.apache.org/jira/browse/ARROW-7469
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Liya Fan
Assignee: Liya Fan


Improve some operations in bit_util:

1. Eliminate one division for CeilDiv
2. Avoid overflow for RoundUp
3. Add a utility for CeilDiv(value, 8)
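
The three items above can be sketched as follows. This is an illustration under stated assumptions, not the code from the issue or from Arrow's bit_util: it assumes non-negative values and positive divisors, and `CeilDivBy8` is a hypothetical name for the CeilDiv(value, 8) utility.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling of value / divisor using a single division. Unlike
// (value + divisor - 1) / divisor, no intermediate value can overflow.
inline int64_t CeilDiv(int64_t value, int64_t divisor) {
  assert(value >= 0 && divisor > 0);
  return value == 0 ? 0 : 1 + (value - 1) / divisor;
}

// Smallest multiple of factor that is >= value. It overflows only if the
// rounded result itself does not fit in int64_t.
inline int64_t RoundUp(int64_t value, int64_t factor) {
  return CeilDiv(value, factor) * factor;
}

// Hypothetical helper: CeilDiv(value, 8) with a shift and a mask instead
// of a division.
inline int64_t CeilDivBy8(int64_t value) {
  assert(value >= 0);
  return (value >> 3) + ((value & 7) != 0 ? 1 : 0);
}
```

For example, `RoundUp(INT64_MAX - 7, 10)` returns `INT64_MAX - 7` here, whereas a formulation that first computes `value + factor - 1` would overflow on the same input.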



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7468) [Python] Fix typos

2019-12-23 Thread Kazuaki Ishizaki (Jira)
Kazuaki Ishizaki created ARROW-7468:
---

 Summary: [Python] Fix typos
 Key: ARROW-7468
 URL: https://issues.apache.org/jira/browse/ARROW-7468
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 1.0.0
Reporter: Kazuaki Ishizaki


Fix typos in files under `python` directory





Re: [FlightRPC] Flight fallback mechanism.

2019-12-23 Thread Wes McKinney
hi Vinay,

I think these are application-level concerns -- Flight provides an
efficient means to transport datasets on a network. The result of a
DoGet or DoPut provides an iterator of record batches which generally
may not be that big. How you handle the chunks of data in memory (or
put them on disk, etc.) seems out of scope for Flight itself.

- Wes

On Mon, Dec 23, 2019 at 3:27 AM Vinay Kesarwani wrote:
>
> Flight is a great way to share data among processes and to hold shared data
> in memory.
>
> I couldn't find any API or design for a fallback mechanism in case the
> data does not fit in memory.
>
> Cases:
> 1- Once the memory buffer is nearly full, data should spill over to disk.
> 2- Spilling over to disk or a memory-mapped file.
> 3- Should it be an .arrow file or Feather format on disk?
> 4- Should it be compressed? Any design suggestions?
>
> Can we have a feasibility check and an assessment of the performance impact
> in terms of benchmark numbers?
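
To make the "application-level" point concrete, here is a rough sketch of how a consumer of a stream of chunks might keep data in memory up to a budget and spill the rest to disk. It deliberately uses no Flight or Arrow API: `Chunk` is a hypothetical stand-in for the serialized bytes of one record batch, and `SpillingBuffer`, its methods and the length-prefixed on-disk layout are assumptions made for illustration, not a proposed design.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the serialized bytes of one record batch.
using Chunk = std::vector<uint8_t>;

// Holds chunks in memory until memory_budget bytes are in use, then
// appends all later chunks to spill_path as length-prefixed raw bytes.
class SpillingBuffer {
 public:
  SpillingBuffer(std::size_t memory_budget, std::string spill_path)
      : budget_(memory_budget), spill_path_(std::move(spill_path)) {}

  void Append(Chunk chunk) {
    // Once spilling has started, keep spilling so chunk order is preserved.
    if (spilled_ == 0 && bytes_in_memory_ + chunk.size() <= budget_) {
      bytes_in_memory_ += chunk.size();
      in_memory_.push_back(std::move(chunk));
      return;
    }
    if (!spill_.is_open()) {
      spill_.open(spill_path_, std::ios::binary | std::ios::trunc);
    }
    assert(spill_.good());
    const uint64_t size = chunk.size();
    spill_.write(reinterpret_cast<const char*>(&size), sizeof(size));
    spill_.write(reinterpret_cast<const char*>(chunk.data()),
                 static_cast<std::streamsize>(size));
    ++spilled_;
  }

  std::size_t chunks_in_memory() const { return in_memory_.size(); }
  std::size_t chunks_spilled() const { return spilled_; }

 private:
  std::size_t budget_;
  std::string spill_path_;
  std::size_t bytes_in_memory_ = 0;
  std::size_t spilled_ = 0;
  std::vector<Chunk> in_memory_;
  std::ofstream spill_;
};
```

Whether the spill target should be a plain file, a memory-mapped file, or a compressed format is exactly the kind of choice this leaves to the application.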


Re: [DISCUSS][C++] Pointer name aliasing

2019-12-23 Thread Wes McKinney
Hi Liya,

In general I think when a development guideline promotes readability
it is a good thing. Issues like these may be addressed on a case by
case basis, though.

best
Wes

On Sun, Dec 22, 2019 at 8:54 PM Fan Liya wrote:
>
> IMO, this question relates to something general and fundamental.
>
> Generally, name aliasing leads to two results:
> 1) It makes writing code easier
> 2) It makes reading code more difficult
>
> Personally, I prefer readability to writability.
> However, I am wondering if we have some general principles regarding this?
>
> Best,
> Liya Fan
>
>
> On Fri, Dec 20, 2019 at 12:17 AM Francois Saint-Jacques <
> fsaintjacq...@gmail.com> wrote:
>
> > I created the following ticket (and sub-tasks) [1] to track this.
> >
> > François
> >
> > [1] https://jira.apache.org/jira/browse/ARROW-7438
> >
> > On Tue, Nov 26, 2019 at 12:09 AM Micah Kornfield wrote:
> > >
> > > I would need to look at the other instances as well.  I will try to do so by
> > > next week, but I think we can probably take an incremental approach of:
> > > 1.  Eliminate *Ptr in src/arrow code (discuss similar changes in
> > > parquet/gandiva).
> > > 2.  Decide on the Iterator/Vector.
> > >
> > > On Fri, Nov 22, 2019 at 10:47 AM Wes McKinney wrote:
> > >
> > > > hi Francois
> > > >
> > > > On Fri, Nov 22, 2019 at 11:17 AM Francois Saint-Jacques wrote:
> > > > >
> > > > > I'll revert, some questions:
> > > > >
> > > > > 1. Should we revert only the pointer aliases, or also the
> > > > Vector/Iterator.
> > > >
> > > > Could you clarify what Vector/Iterator aliases you are referring to,
> > > > like RecordBatchIterator?
> > > >
> > > > I think we may need to distinguish between endogenous aliases versus
> > > > aliases involving STL types.
> > > >
> > > > IMHO
> > > >
> > > > using RecordBatchIterator = Iterator<shared_ptr<RecordBatch>>;
> > > >
> > > > is less problematic from a readability standpoint than
> > > >
> > > > using RecordBatchPtr = shared_ptr<RecordBatch>;
> > > >
> > > > > 2. Should we revert all modules, i.e. gandiva and compute.
> > > >
> > > > I would say one step at a time -- in the case of Gandiva I would say
> > > > that we should open a JIRA issue and discuss further to ensure that we
> > > > do not cause disruption for public API consumers. Since this software
> > > > has already been released multiple times, a different approach may be
> > > > needed
> > > >
> > > > >
> > > > > François
> > > >
> >
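
The distinction drawn in this thread between aliases over project types and aliases over STL types can be seen in a small sketch. `RecordBatch` and `Iterator` below are simplified, hypothetical stand-ins rather than the Arrow classes of the same name, and `TotalRows` is an illustrative helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins, not the Arrow classes of the same name.
struct RecordBatch {
  int64_t num_rows;
};

template <typename T>
class Iterator {
 public:
  explicit Iterator(std::vector<T> items) : items_(std::move(items)) {}

  // Moves the next item into *out; returns false once exhausted.
  bool Next(T* out) {
    assert(out != nullptr);
    if (pos_ == items_.size()) return false;
    *out = std::move(items_[pos_++]);
    return true;
  }

 private:
  std::vector<T> items_;
  std::size_t pos_ = 0;
};

// Alias over a project-defined template: it names a concept and the
// reader still has to know what Iterator is either way.
using RecordBatchIterator = Iterator<std::shared_ptr<RecordBatch>>;

// Alias over an STL type: at the call site it is no longer visible that
// this is a shared_ptr with shared-ownership semantics.
using RecordBatchPtr = std::shared_ptr<RecordBatch>;

int64_t TotalRows(RecordBatchIterator it) {
  int64_t total = 0;
  std::shared_ptr<RecordBatch> batch;  // spelled out: ownership is explicit
  while (it.Next(&batch)) total += batch->num_rows;
  return total;
}
```

Both spellings name the same type, so `RecordBatchPtr` and `std::shared_ptr<RecordBatch>` are interchangeable to the compiler; the question in the thread is only what the reader can see.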


[jira] [Created] (ARROW-7467) [Java] ComplexCopier does incorrect copy for Map nullable info

2019-12-23 Thread Ji Liu (Jira)
Ji Liu created ARROW-7467:
-

 Summary: [Java] ComplexCopier does incorrect copy for Map nullable info
 Key: ARROW-7467
 URL: https://issues.apache.org/jira/browse/ARROW-7467
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The {{MapVector}} and its 'value' vector are nullable, and its {{structVector}}
and 'key' vector are non-nullable.

However, the {{MapVector}} generated by ComplexCopier has all nullable fields,
which is not correct.





[FlightRPC] Flight fallback mechanism.

2019-12-23 Thread Vinay Kesarwani
Flight is a great way to share data among processes and to hold shared data
in memory.

I couldn't find any API or design for a fallback mechanism in case the
data does not fit in memory.

Cases:
1- Once the memory buffer is nearly full, data should spill over to disk.
2- Spilling over to disk or a memory-mapped file.
3- Should it be an .arrow file or Feather format on disk?
4- Should it be compressed? Any design suggestions?

Can we have a feasibility check and an assessment of the performance impact
in terms of benchmark numbers?