Re: Contribute "RowSet" mechanism from Apache Drill?

2018-08-27 Thread Li Jin
Hi Paul,

Thank you for the email. I think this is interesting.

Arrow (Java API) currently doesn't have the capability of automatically
limiting the memory size of record batches. In Spark we have similar needs
to limit the size of record batches and have talked about implementing some
kind of size estimator for record batches but haven't started to work on it.

I personally think it makes sense for Arrow to incorporate such
capabilities.
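
For illustration, here is a minimal sketch of such a cap from Python using
pyarrow (the rebatch_with_budget helper is hypothetical, and
RecordBatch.nbytes approximates the buffers' footprint rather than exact
allocations):

    import pyarrow as pa

    def rebatch_with_budget(table, budget_bytes):
        # Hypothetical helper: yield record batches whose approximate
        # buffer footprint (RecordBatch.nbytes) stays under budget_bytes.
        total = sum(b.nbytes for b in table.to_batches())
        bytes_per_row = max(1, total // max(table.num_rows, 1))
        rows_per_batch = max(1, budget_bytes // bytes_per_row)
        for start in range(0, table.num_rows, rows_per_batch):
            for batch in table.slice(start, rows_per_batch).to_batches():
                yield batch

    table = pa.Table.from_pydict({"x": list(range(1_000_000))})
    batches = list(rebatch_with_budget(table, budget_bytes=1 << 20))  # ~1 MiB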



On Mon, Aug 27, 2018 at 1:33 AM Paul Rogers 
wrote:

> Hi All,
>
> Over in the Apache Drill project, we developed some handy vector
> reader/writer abstractions. I wonder if they might be of interest to Apache
> Arrow. Key contributions of the "RowSet" abstractions:
>
> * Control row batch size: the aggregate memory taken by a set of vectors
> (and all their sub-vectors for structured types.)
> * Control the maximum per-vector size.
> * Simple, highly optimized read/write interface that handles vector offset
> accounting, even for deeply nested types.
> * Minimize vector internal fragmentation (wasted space.)
>
> More information is available in [1]. Arrow improved and simplified
> Drill's original vector and metadata abstractions. As a result, work would
> be required to port the RowSet code from Drill's version of these classes
> to the Arrow versions.
>
> Does Arrow already have a similar solution? If not, would the above be
> useful for Arrow?
>
> Thanks,
> - Paul
>
>
> Apache Drill PMC member
> Co-author of the upcoming O'Reilly book "Learning Apache Drill"
> [1]
> https://github.com/paul-rogers/drill/wiki/RowSet-Abstractions-for-Arrow
>
>
>


[DISCUSS] Standardize Java style

2018-08-27 Thread Li Jin
Hi All,

Bryan Cutler has started a PR to fix Java checkstyle warnings (thank you,
Bryan!). In my experience, style is hard to get consensus on due to
personal preference, so I wonder if we can pick a well-known style guide
(say, Google style: https://google.github.io/styleguide/javaguide.html) to
minimize the discussion on style?

What do other people think?

Li


Re: Progress on Arrow RPC a.k.a. Arrow Flight

2018-08-27 Thread Li Jin
Thank you both for the explanation, it makes sense.

Another piece of feedback I have is around flight.proto - some of the
messages (such as FlightDescriptor and FlightPutInstruction) are not very
clear to me - it would be helpful to get some more explanation of those
here or on the PR.

Thanks!
Li

On Sun, Aug 26, 2018 at 6:14 PM Jacques Nadeau  wrote:

> Wes nailed my thinking. There are auto-generated bindings in every
> language for the envelope if you use protobuf, meaning someone can
> send/receive an Arrow stream without having to know how to read the Arrow
> stream.
>
> On Sat, Aug 25, 2018 at 6:00 PM Wes McKinney  wrote:
>
> > Hi Li -- Protobuf is the "native" wire format for GRPC [1]. You can use
> > Flatbuffers with it, too [2], but if we are aiming for fairly broad
> support
> > at the RPC level then using Protobuf is probably a safer bet.
> >
> > One question might be "Well, Arrow already uses Flatbuffers". That's
> true,
> > but a system could make Flight RPCs and delegate handling of the messages
> > to third party code -- so the RPC handler does not need to know anything
> > about Flatbuffers or Arrow columnar format for that matter.
> >
> > The main thing we need to be concerned about re: zero copy is the
> > FlightData.
> >
> > As an aside: I still believe that Flatbuffers was the right choice for
> > Arrow's metadata serialization. We've suffered a bit from weakness in
> > implementation for languages like Rust, but to have the option to
> > selectively read only a small part of a potentially very large message
> is a
> > big benefit (vs. having to do an all-or-nothing parse of the proto). It
> > would be useful to quantify this benefit at some point by creating some
> > benchmarks vs. a protobuf-based version of Arrow's metadata
> >
> > - Wes
> >
> > [1]: https://grpc.io/docs/guides/concepts.html#overview
> > [2]: https://grpc.io/blog/flatbuffers
> >
> > On Fri, Aug 24, 2018, 5:05 PM Li Jin  wrote:
> >
> > > One question I have is around the choice of using protobufs - It seems
> > that
> > > flatbuffers has better support for zero-copy and works with grpc as
> well.
> > > What's the rationale behind picking protobuf over flatbuffers?
> > >
> > > On Thu, Aug 16, 2018 at 7:41 PM Wes McKinney 
> > wrote:
> > >
> > > > hi Julian,
> > > >
> > > > Thanks for chiming in.
> > > >
> > > > On Thu, Aug 16, 2018 at 1:16 PM, Julian Hyde 
> wrote:
> > > > > If your use case is SQL RPC, then you are getting close to
> Avatica's
> > > > > territory. Avatica[1] is a protocol for implementing
> > > > > language-independent JDBC and ODBC stacks.
> > > >
> > > > I'm not proposing to develop a SQL RPC system inside Apache Arrow.
> But
> > > > Arrow Flight could be used to build one
> > > >
> > > > >
> > > > > Now, I agree that many ODBC implementations are inefficient. Some
> > ODBC
> > > > > stacks make more round trips than necessary, and do more copying
> than
> > > > > necessary. In Avatica we are trying to squeeze out those
> > > > > inefficiencies, for example minimizing the number of RPCs. We would
> > > > > also love to use Arrow as the data format and reduce copying on the
> > > > > server side and client side.
> > > >
> > > > Indeed -- what I would like to see instead is for Avatica to _use_
> > > > Arrow Flight to provide an alternative platform to offer Arrow-native
> > > > connectivity in addition to the slower JDBC and ODBC standards.
> > > >
> > > > >
> > > > > But conversely, people who start with a simple RPC use case - send
> > > > > SQL, get the results - may soon find themselves needing a more
> > complex
> > > > > protocol - authentication, sessions, prepared statements, bind
> > > > > variables, getting metadata before executing, cursors, skipping
> over
> > > > > rows. In other words, find themselves wanting substantial portions
> of
> > > > > an ODBC or JDBC driver.
> > > > >
> > > > > You could find yourselves building Avatica all over again. We saw
> all
> > > > > of this happen in XML-RPC, and it was sad.
> > > >
> > > > Agreed. I don't think this is in the cards, and what's being proposed
> > > > now is orthogonal.
> > > >
> > > > >
> > > > > I suggest to keep flight for the truly simple use case, and for the
> > > > > more complex use case, invest effort putting Arrow into Avatica. We
> > > > > are always happy to welcome new contributors.
> > > >
> > > > +1
> > > >
> > > > >
> > > > > Julian
> > > > >
> > > > > [1] https://calcite.apache.org/avatica/docs/
> > > > > On Thu, Aug 16, 2018 at 7:56 AM Wes McKinney 
> > > > wrote:
> > > > >>
> > > > >> To give some extra color on my personal motivation for interest in
> > > > Arrow Flight:
> > > > >>
> > > > >> Systems that expose databases on a network frequently send data
> very
> > > > >> slowly. For example, ODBC is in general extremely slow. What I
> would
> > > > >> like to see is servers that can expose a "sql" action type.
> > > > >>
> > > > >> So, in consideration of the protocol as it stands now [1], example
> > > > >> session goes like this:
> > 

[jira] [Created] (ARROW-3125) [Python] Update ASV instructions

2018-08-27 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-3125:
-

 Summary: [Python] Update ASV instructions
 Key: ARROW-3125
 URL: https://issues.apache.org/jira/browse/ARROW-3125
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.10.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


The ability to define custom install / build / uninstall commands was added in 
mainline ASV in https://github.com/airspeed-velocity/asv/pull/699
We don't need to use our own fork / PR anymore.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Standardize Java style

2018-08-27 Thread Bryan Cutler
Thanks for bringing this discussion up, Li. I think we can use an existing
style guide as a starting point, but ultimately we as a community should
decide how best to fit it to the project. I believe we already use the
Google checkstyle configuration as our Java rules file, but right off the
bat the import rules seem insufficient. For example, it now has a single
import group that would encompass all third-party and Arrow imports, and
the section about imports in the guide you linked is empty. I know a lot
of style is a matter of preference, so I would encourage anyone to follow
and comment on the JIRA at https://issues.apache.org/jira/browse/ARROW-1688
where I will be making style fixes incrementally in sub-tasks. Whatever
style we choose does not have to be set in stone and can always be changed
in the future; however, I don't think we should point people to a
third-party guide as the official reference.

Thanks,
Bryan

On Mon, Aug 27, 2018 at 8:09 AM Li Jin  wrote:

> Hi All,
>
> Bryan Cutler has started a PR to fix Java checkstyle warnings (thank you,
> Bryan!). In my experience, style is hard to get consensus on due to
> personal preference, so I wonder if we can pick a well-known style guide
> (say, Google style: https://google.github.io/styleguide/javaguide.html) to
> minimize the discussion on style?
>
> What do other people think?
>
> Li
>


Fwd: How to concatenate RecordBatches into a single RecordBatch?

2018-08-27 Thread Jacob Quinn Shenker
Hi all,

Question: If I have a set of small (10-1000 rows) RecordBatches on
disk or in memory, how can I (efficiently) concatenate/rechunk them
into larger RecordBatches (so that each column is output as a
contiguous array when written to a new Arrow buffer)?

Context: With such small RecordBatches, I'm finding that reading Arrow
into a pandas table is very slow (~100x slower than local disk) from
my cluster's Lustre distributed file system (plenty of bandwidth but
each IO op has very high latency); I'm assuming this has to do with
needing many seek() calls for each RecordBatch. I'm hoping it'll help
if I rechunk my data into larger RecordBatches before writing to disk.
(The input RecordBatches are small because they are the individual
results returned by millions of tasks on a dask cluster, as part of a
streaming analysis pipeline.)

While I'm here I also wanted to thank everyone on this list for all
their work on Arrow! I'm a PhD student in biology at Harvard Medical
School. We take images of about 1 billion individual bacteria every
day with our microscopes, generating about ~1PB/yr in raw data. We're
using this data to search for new kinds of antibiotic drugs. Using way
more data allows us to precisely measure how the bacteria's growth is
affected by the drug candidates, which allows us to find new drugs
that previous screens have missed—and that's why I'm really excited
about Arrow, it's making dealing with these data volumes a lot easier
for us!

~ J


[jira] [Created] (ARROW-3126) [Python] Add buffering option to pyarrow.open_stream to enable larger read ahead window for high latency file systems

2018-08-27 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3126:
---

 Summary: [Python] Add buffering option to pyarrow.open_stream to 
enable larger read ahead window for high latency file systems
 Key: ARROW-3126
 URL: https://issues.apache.org/jira/browse/ARROW-3126
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.12.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: How to concatenate RecordBatches into a single RecordBatch?

2018-08-27 Thread Wes McKinney
hi Jacob,

We have https://issues.apache.org/jira/browse/ARROW-549 about
concatenating arrays. Someone needs to write the code and tests, and
then we can easily add an API to "consolidate" table columns.
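
(For reference, the consolidation API that eventually landed looks like the
sketch below; Table.combine_chunks did not exist at the time of this
message, and the file name is illustrative.)

    import pyarrow as pa

    # small_batches: many 10-1000 row RecordBatches sharing one schema.
    small_batches = [
        pa.RecordBatch.from_arrays([pa.array(list(range(100)))], ["x"])
        for _ in range(50)
    ]

    # Stack them into a Table, then concatenate each column's chunks into
    # one contiguous array before rewriting the stream as one large batch.
    big = pa.Table.from_batches(small_batches).combine_chunks()
    with pa.OSFile("rechunked.arrow", "wb") as sink:
        writer = pa.ipc.RecordBatchStreamWriter(sink, big.schema)
        writer.write_table(big)
        writer.close()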

If you have small record batches, could you read the entire file into
memory before parsing it with pyarrow.open_file/open_stream? That might
improve I/O performance by reducing seeks. We don't support any
buffering in open_stream yet, so I'm going to open a JIRA about that:

https://issues.apache.org/jira/browse/ARROW-3126
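
A sketch of the suggested workaround (written against the pa.ipc namespace
of current releases; at the time of this message the reader was exposed as
plain pa.open_stream, and the path is illustrative):

    import pyarrow as pa

    # Read the whole file into memory first, so every seek the stream
    # reader makes hits a local buffer instead of the high-latency
    # file system.
    with open("batches.arrow", "rb") as f:
        buf = pa.py_buffer(f.read())
    reader = pa.ipc.open_stream(pa.BufferReader(buf))
    table = reader.read_all()   # one Table holding the many small batches
    df = table.to_pandas()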

Building a big development platform like this is a lot of work, but we
are making progress!

- Wes

On Mon, Aug 27, 2018 at 8:22 PM, Jacob Quinn Shenker
 wrote:
> Hi all,
>
> Question: If I have a set of small (10-1000 rows) RecordBatches on
> disk or in memory, how can I (efficiently) concatenate/rechunk them
> into larger RecordBatches (so that each column is output as a
> contiguous array when written to a new Arrow buffer)?
>
> Context: With such small RecordBatches, I'm finding that reading Arrow
> into a pandas table is very slow (~100x slower than local disk) from
> my cluster's Lustre distributed file system (plenty of bandwidth but
> each IO op has very high latency); I'm assuming this has to do with
> needing many seek() calls for each RecordBatch. I'm hoping it'll help
> if I rechunk my data into larger RecordBatches before writing to disk.
> (The input RecordBatches are small because they are the individual
> results returned by millions of tasks on a dask cluster, as part of a
> streaming analysis pipeline.)
>
> While I'm here I also wanted to thank everyone on this list for all
> their work on Arrow! I'm a PhD student in biology at Harvard Medical
> School. We take images of about 1 billion individual bacteria every
> day with our microscopes, generating about ~1PB/yr in raw data. We're
> using this data to search for new kinds of antibiotic drugs. Using way
> more data allows us to precisely measure how the bacteria's growth is
> affected by the drug candidates, which allows us to find new drugs
> that previous screens have missed—and that's why I'm really excited
> about Arrow, it's making dealing with these data volumes a lot easier
> for us!
>
> ~ J


Re: Contribute "RowSet" mechanism from Apache Drill?

2018-08-27 Thread Jacques Nadeau
This seems like it could be a useful addition. In general, our experience
with writing Arrow structures is that the optimal path is columnar
interaction rather than row-wise. That being said, most people start out
by interacting with Arrow row-wise, and an interface like this could help
people start writing Arrow datasets with less effort and fewer mistakes.

In terms of record batch sizing/estimation, I think that should probably
be decoupled from writing/reading vectors.



On Mon, Aug 27, 2018 at 7:00 AM Li Jin  wrote:

> Hi Paul,
>
> Thank you for the email. I think this is interesting.
>
> Arrow (Java API) currently doesn't have the capability of automatically
> limiting the memory size of record batches. In Spark we have similar needs
> to limit the size of record batches and have talked about implementing some
> kind of size estimator for record batches but haven't started to work on
> it.
>
> I personally think it makes sense for Arrow to incorporate such
> capabilities.
>
>
>
> On Mon, Aug 27, 2018 at 1:33 AM Paul Rogers 
> wrote:
>
> > Hi All,
> >
> > Over in the Apache Drill project, we developed some handy vector
> > reader/writer abstractions. I wonder if they might be of interest to
> Apache
> > Arrow. Key contributions of the "RowSet" abstractions:
> >
> > * Control row batch size: the aggregate memory taken by a set of vectors
> > (and all their sub-vectors for structured types.)
> > * Control the maximum per-vector size.
> > * Simple, highly optimized read/write interface that handles vector
> offset
> > accounting, even for deeply nested types.
> > * Minimize vector internal fragmentation (wasted space.)
> >
> > More information is available in [1]. Arrow improved and simplified
> > Drill's original vector and metadata abstractions. As a result, work
> would
> > be required to port the RowSet code from Drill's version of these classes
> > to the Arrow versions.
> >
> > Does Arrow already have a similar solution? If not, would the above be
> > useful for Arrow?
> >
> > Thanks,
> > - Paul
> >
> >
> > Apache Drill PMC member
> > Co-author of the upcoming O'Reilly book "Learning Apache Drill"
> > [1]
> > https://github.com/paul-rogers/drill/wiki/RowSet-Abstractions-for-Arrow
> >
> >
> >
>


[jira] [Created] (ARROW-3127) [C++] Add Tutorial about Sending Tensor from C++ to Python

2018-08-27 Thread Simon Mo (JIRA)
Simon Mo created ARROW-3127:
---

 Summary: [C++] Add Tutorial about Sending Tensor from C++ to Python
 Key: ARROW-3127
 URL: https://issues.apache.org/jira/browse/ARROW-3127
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Simon Mo


I can add a short tutorial showing how to:
1. Serialize a floating-point array in C++ into a Tensor
2. Save the Tensor to Plasma
3. Access the Tensor in Python

c.f. [https://github.com/apache/arrow/pull/2481]

cc @[pcmoritz|https://github.com/pcmoritz]
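
The Python side of such a tutorial might look like the sketch below (based
on the Plasma API as documented around this time; plasma.connect's exact
signature varied across releases, and the socket path is illustrative):

    import numpy as np
    import pyarrow as pa
    import pyarrow.plasma as plasma

    client = plasma.connect("/tmp/plasma")
    object_id = plasma.ObjectID(np.random.bytes(20))

    # Store a tensor: create a fixed-size buffer, stream the tensor into
    # it, then seal the object so other processes can read it.
    tensor = pa.Tensor.from_numpy(np.random.randn(10, 4))
    buf = client.create(object_id, pa.ipc.get_tensor_size(tensor))
    pa.ipc.write_tensor(tensor, pa.FixedSizeBufferWriter(buf))
    client.seal(object_id)

    # Read it back -- the Python end of the C++-to-Python hand-off.
    [data] = client.get_buffers([object_id])
    tensor2 = pa.ipc.read_tensor(pa.BufferReader(data))
    array = tensor2.to_numpy()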



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Contribute "RowSet" mechanism from Apache Drill?

2018-08-27 Thread Paul Rogers
Hi Jacques,

Thanks much for the note. I wonder, when reading data into, or out of, Arrow, 
are not the interfaces often row-wise? For example, it is somewhat difficult to 
read a CSV file column-wise. Similarly, when serving a BI tool (for tables or 
charts), data must be presented row-wise. (JDBC, for example, is a row-wise 
interface.) The abstractions help with these cases.

Perhaps much of the emphasis in Arrow is on cross-tool compatibility, in which
data is passed column-wise as a set of vectors? The abstractions wouldn't be
needed in that data-transfer case.

The batch size component is an essential part of row-wise loading. When reading 
data into vectors, even from Parquet, we found it necessary to 1) control the 
overall amount of memory used by the batch, and 2) read the same number of rows 
for every column. The RowSet abstractions encapsulate this coordinated 
cross-column work.

The memory limits in the "RowSet" abstraction are not estimates. (There was a 
separate Drill project for that, which is why it might be confusing.) Instead, 
the memory limits are based on knowing the current write offset into each
vector. In Drill, when a vector becomes full, we automatically resize the
vector by doubling its memory. The RowSet abstraction tracks when doubling
the vector would exceed the "budget" set for that vector or batch. When the
limit is reached, the abstraction marks the batch complete. (The
"overflow" row is saved for later to avoid exceeding the limit, and to keep the 
details of overflow hidden from the client.) The same logic can be applied, I 
would assume, to whatever memory allocation technique is used in Arrow, if 
Arrow has evolved beyond Drill's technique.

A size estimate (when available) helps by allowing the client code to 
pre-allocate vectors to their final size. Doing so avoids growing vectors 
during data loads. In this case, the abstractions simply pack data into those 
pre-allocated vectors until one of them becomes full.
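
For concreteness, a sketch of that budget check in Python pseudocode (not
Drill's actual Java; the class and method names are illustrative):

    class BudgetedVectorWriter:
        # Illustrative sketch: track the write offset of one vector and
        # refuse to double its buffer past a per-vector byte budget.
        def __init__(self, value_size, budget_bytes, initial_capacity=256):
            self.value_size = value_size      # bytes per value
            self.budget = budget_bytes        # cap for this vector
            self.capacity = initial_capacity  # values the buffer can hold
            self.offset = 0                   # current write position

        def write(self, value):
            # Returns False when the batch should be marked complete; the
            # caller saves this "overflow" row for the next batch.
            if self.offset == self.capacity:
                if 2 * self.capacity * self.value_size > self.budget:
                    return False
                self.capacity *= 2            # double, as Drill does
            self.offset += 1                  # real code copies bytes here
            return True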

The idea of separating memory management from reading/writing is sound. In fact, that's
how the code is structured. The memory-unaware version is heavily used in unit 
tests where we know how much memory is used. The memory-aware version is used 
in production to handle whatever strange data sets present themselves.

Of course, none of this was clear from my terse description. I'll go ahead and 
create a JIRA ticket to provide additional context and to gather detailed 
comments so we can figure out the best way to proceed.

Thanks,

- Paul

 

On Monday, August 27, 2018, 5:52:19 PM PDT, Jacques Nadeau
 wrote:

This seems like it could be a useful addition. In general, our experience
with writing Arrow structures is that the optimal path is columnar
interaction rather than row-wise. That being said, most people start out
by interacting with Arrow row-wise, and an interface like this could help
people start writing Arrow datasets with less effort and fewer mistakes.

In terms of record batch sizing/estimation, I think that should probably
be decoupled from writing/reading vectors.



On Mon, Aug 27, 2018 at 7:00 AM Li Jin  wrote:

> [earlier messages in this thread quoted in full above]
  

[jira] [Created] (ARROW-3128) [C++] Support system shared zlib

2018-08-27 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-3128:
---

 Summary: [C++] Support system shared zlib
 Key: ARROW-3128
 URL: https://issues.apache.org/jira/browse/ARROW-3128
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.11.0
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


Debian packaging guidelines recommend not statically linking zlib, for security
reasons. If we use zlib as a static library, the Debian package lint reports
the following error:

{noformat}
E: libarrow10: embedded-library usr/lib/x86_64-linux-gnu/libarrow.so.10.0.0: 
zlib
{noformat}

Details on the embedded-library error:
https://lintian.debian.org/tags/embedded-library.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3129) [Packaging] Stop using deprecated BuildRoot and Group in .rpm

2018-08-27 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-3129:
---

 Summary: [Packaging] Stop using deprecated BuildRoot and Group in
.rpm
 Key: ARROW-3129
 URL: https://issues.apache.org/jira/browse/ARROW-3129
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Affects Versions: 0.10.0
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


They are deprecated.
https://fedoraproject.org/wiki/Packaging:Guidelines says the following:

{quote}
* The BuildRoot: tag, Group: tag, and %clean section SHOULD NOT be used.
{quote}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)