[jira] [Created] (ARROW-5207) [Java] add APIs to support vector

2019-04-23 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5207:
-

 Summary: [Java] add APIs to support vector 
 Key: ARROW-5207
 URL: https://issues.apache.org/jira/browse/ARROW-5207
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu


In some scenarios we hope that a ValueVector can be reused, to reduce creation 
overhead. This is very common in the shuffle stage: there is no need to create 
a new ValueVector or reallocate its buffers every time. Suppose the recordCount 
of the ValueVector and the capacity of its buffers are written into the stream; 
when we deserialize, we can simply decide from the dataLength whether a realloc 
is needed.

My proposal is to add APIs in ValueVector to handle this logic; otherwise, 
users who want to reuse vectors have to implement it themselves, which is not 
user-friendly.
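The intended check could look roughly like the following sketch (Python for brevity; the actual API would live in Java's ValueVector, and the class and method names here are hypothetical):

```python
class ReusableVector:
    """Hypothetical stand-in for a reusable ValueVector."""

    def __init__(self, capacity=0):
        self.buffer = bytearray(capacity)

    def prepare_for(self, data_length):
        """Reallocate only when the incoming data exceeds current capacity.

        Returns True if a realloc happened, False if the buffer was reused.
        """
        if data_length > len(self.buffer):
            self.buffer = bytearray(data_length)  # realloc path
            return True
        return False

v = ReusableVector(capacity=64)
assert v.prepare_for(32) is False   # fits: buffer reused
assert v.prepare_for(128) is True   # too large: reallocated
assert v.prepare_for(100) is False  # fits in the grown buffer
```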

If you agree with this, I would like to take this ticket. Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


contributor permission

2019-04-23 Thread niki.lj

Hi,
Could you please give me contributor permission? I want to contribute to 
Arrow, thanks!
My Apache account is tianchen92.


Ji Liu

[jira] [Created] (ARROW-5206) [Java] Add APIs in MessageSerializer to directly serialize/deserialize ArrowBuf

2019-04-23 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5206:
-

 Summary: [Java] Add APIs in MessageSerializer to directly 
serialize/deserialize ArrowBuf
 Key: ARROW-5206
 URL: https://issues.apache.org/jira/browse/ARROW-5206
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu


It seems there are no APIs to directly write an ArrowBuf to an OutputStream or 
read an ArrowBuf from an InputStream. These APIs may be helpful when users use 
vectors directly instead of RecordBatch; in that case, providing APIs to 
serialize/deserialize the dataBuffer/validityBuffer/offsetBuffer is necessary.

I would like to work on this and make it my first contribution to Arrow. What 
do you think?
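As an illustration of the shape such APIs might take (a Python sketch with hypothetical helper names; the real implementation would be Java methods operating on ArrowBuf), a buffer can be written with a length prefix so the reader knows how many bytes to consume:

```python
import io
import struct

def write_buf(stream, buf):
    # Length-prefix the buffer so the reader knows how many bytes to consume.
    stream.write(struct.pack("<q", len(buf)))
    stream.write(buf)

def read_buf(stream):
    (n,) = struct.unpack("<q", stream.read(8))
    return stream.read(n)

out = io.BytesIO()
write_buf(out, b"\x01\x02\x03")
out.seek(0)
assert read_buf(out) == b"\x01\x02\x03"
```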





RE: Benchmarking mailing list thread [was Fwd: [Discuss] Benchmarking infrastructure]

2019-04-23 Thread Melik-Adamyan, Areg
Because we are using Google Benchmark, which has a specific output format, 
there is a tool called benchcmp that compares two runs:

$ benchcmp old.txt new.txt
benchmark         old ns/op   new ns/op   delta
BenchmarkConcat   523         68.6        -86.88%

So the comparison part is done and there is no need to create infra for that.

What we need is to change the `ctest -L Benchmarks` output to stdout into the 
standard Google Benchmark output:

-----------------------------------------------------------
Benchmark                     Time       CPU  Iterations
-----------------------------------------------------------
BM_UserCounter/threads:1   9504 ns   9504 ns       73787
BM_UserCounter/threads:2   4775 ns   9550 ns       72606
BM_UserCounter/threads:4   2508 ns   9951 ns       70332
BM_UserCounter/threads:8   2055 ns   9933 ns       70344
BM_UserCounter/threads:16  1610 ns   9946 ns       70720
BM_UserCounter/threads:32  1192 ns   9948 ns       70496

The script on the build machine will parse this and, along with the machine 
info, send it to the DB.
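A parser for the console format above could be sketched like this (Python, illustrative only; not an existing Arrow script):

```python
import re

# One result row: name, wall time and CPU time (both in ns), iteration count.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<time>[\d.]+) ns\s+(?P<cpu>[\d.]+) ns\s+(?P<iters>\d+)$")

def parse_benchmark_output(text):
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:  # header and separator lines simply don't match
            rows.append({"name": m["name"],
                         "time_ns": float(m["time"]),
                         "cpu_ns": float(m["cpu"]),
                         "iterations": int(m["iters"])})
    return rows

sample = """\
Benchmark                     Time       CPU  Iterations
BM_UserCounter/threads:1   9504 ns   9504 ns       73787
BM_UserCounter/threads:2   4775 ns   9550 ns       72606
"""
rows = parse_benchmark_output(sample)
assert rows[0]["name"] == "BM_UserCounter/threads:1"
assert rows[1]["cpu_ns"] == 9550
```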

A subset of benchmarks is selected by passing --benchmark_filter=<...>:

$ ./run_benchmarks.x --benchmark_filter=BM_memcpy/32
Run on (1 X 2300 MHz CPU)
2016-06-25 19:34:24
Benchmark         Time       CPU  Iterations
--------------------------------------------
BM_memcpy/32     11 ns     11 ns    79545455
BM_memcpy/32k  2181 ns   2185 ns      324074
BM_memcpy/32     12 ns     12 ns    54687500
BM_memcpy/32k  1834 ns   1837 ns      357143

Or we can create a buildbot mode and produce output in JSON format:

{
  "context": {
    "date": "2019/03/17-18:40:25",
    "num_cpus": 40,
    "mhz_per_cpu": 2801,
    "cpu_scaling_enabled": false,
    "build_type": "debug"
  },
  "benchmarks": [
    {
      "name": "BM_SetInsert/1024/1",
      "iterations": 94877,
      "real_time": 29275,
      "cpu_time": 29836,
      "bytes_per_second": 134066,
      "items_per_second": 33516
    }
  ]
}
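In JSON mode, the script's job reduces to flattening the machine context into each benchmark row before sending it off (a sketch; the endpoint and schema would be whatever the DB expects):

```python
import json

report = json.loads("""
{
  "context": {"date": "2019/03/17-18:40:25", "num_cpus": 40,
              "mhz_per_cpu": 2801, "cpu_scaling_enabled": false,
              "build_type": "debug"},
  "benchmarks": [{"name": "BM_SetInsert/1024/1", "iterations": 94877,
                  "real_time": 29275, "cpu_time": 29836,
                  "bytes_per_second": 134066, "items_per_second": 33516}]
}
""")

# One DB row per benchmark, with the machine context merged in.
rows = [{**report["context"], **bench} for bench in report["benchmarks"]]
assert rows[0]["name"] == "BM_SetInsert/1024/1"
assert rows[0]["num_cpus"] == 40
```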

So we have all the ingredients and do not need to reinvent anything; we just 
need to agree on the process: what is done, when, where results are put, and 
in which format.


---------- Forwarded message ---------
From: Francois Saint-Jacques <fsaintjacq...@gmail.com>
Date: Tue, Apr 16, 2019 at 11:44 AM
Subject: Re: [Discuss] Benchmarking infrastructure
To: <dev@arrow.apache.org>


Hello,

A small status update: I recently implemented archery [1], a tool for Arrow 
benchmark comparison [2]. The documentation ([3] and [4]) is in the 
pull request. The primary goal is to compare two commits (and/or build 
directories) for performance regressions. For now, it supports C++ benchmarks.
This is accessible via the command `archery benchmark diff`. The end result is 
one comparison per line, with a regression indicator.

Currently, there is no facility to perform a single "run", e.g. run benchmarks 
in the current workspace without comparing to a previous version. This was 
initially implemented in [5] but depended heavily on ctest (with no control 
over execution). Once [1] is merged, I'll re-implement the single run 
(ARROW-5071) in terms of archery, since it already executes and parses C++ 
benchmarks.

The next goal is to be able to push the results into an upstream database, be 
it the one defined in dev/benchmarking, or codespeed as Areg proposed. The 
steps required for this:
- ARROW-5071: Run and format benchmark results for upstream consumption
  (ideally under the `archery benchmark run` sub-command)
- ARROW-5175: Make a list of benchmarks to include in regression checks
- ARROW-4716: Collect machine and benchmarks context
- ARROW-TBD: Push benchmark results to upstream database

In parallel, with ARROW-4827, Krisztian and I are working on 2 related buildbot 
sub-projects enabling some regression detection:
- Triggering on-demand benchmark comparison via comments in PR
   (as proposed by Wes)
- Regression check on master merge (without database support)

François

P.S.
A side benefit of this PR is that archery is a modular Python library and can 
be used for other purposes, e.g. it could centralize orphaned scripts in dev/ 
(linting, release, merge), since it offers utilities to handle Arrow sources, 
git, and cmake, and exposes a usable CLI interface (with documentation).

[1] https://github.com/apache/arrow/pull/4141
[2] https://jira.apache.org/jira/browse/ARROW-4827
[3]
https://github.com/apache/arrow/blob/512ae64bc074a0b620966131f9338d4a1eed2356/docs/source/developers/benchmarks.rst
[4]
https://github.com/apache/arrow/pull/4141/files#diff-7a8805436a6884ddf74fe3eaec697e71R216
[5] https://github.com/apache/arrow/pull/4077

On Fri, Mar 29, 2019 at 3:21 PM Melik-Adamyan, Areg < 
areg.melik-adam...@intel.com> wrote:

> >When you say "output is parsed", how is that exactly? We don't have
> >any
> scripts in the repository to do this yet (I have some comments on this
> 

Re: [Discuss] Benchmarking infrastructure

2019-04-23 Thread Wes McKinney
hi Francois,

This sounds like good progress.

For any tool consumable through a CLI/command-line interface, my
recommendation is to ensure that the software is usable as a library
just as well as via the CLI.

In this patch I see

https://github.com/apache/arrow/pull/4141/files#diff-7a8805436a6884ddf74fe3eaec697e71R212

Please be wary of making business logic exclusively available through
a CLI; it makes composability and reuse harder (= requiring
refactoring that might have been avoidable). AFAICT this is still a
concern with Crossbow; there is task business logic that can only be
accessed through the command-line interface.

- Wes

On Tue, Apr 16, 2019 at 11:44 AM Francois Saint-Jacques
 wrote:
>
> [...]
>
> On Fri, Mar 29, 2019 at 3:21 PM Melik-Adamyan, Areg <
> areg.melik-adam...@intel.com> wrote:
>
> > >When you say "output is parsed", how is that exactly? We don't have any
> > scripts in the repository to do this yet (I have some comments on this
> > below). We also have to collect machine information and insert that into
> > the database. From my >perspective we have quite a bit of engineering work
> > on this topic ("benchmark execution and data collection") to do.
> > Yes, I wrote one as a test. Then it can POST the JSON structure to the
> > needed endpoint. Everything else will be done in the
> >
> > >My team and I have some physical hardware (including an Aarch64 Jetson
> > TX2 machine, might be interesting to see what the ARM64 results look like)
> > where we'd like to run benchmarks and upload the results also, so we need
> > to write some documentation about how to add a new machine and set up a
> > cron job of some kind.
> > If it can run Linux, then we can set it up.
> >
> > >I'd like to eventually have a bot that we can ask to run a benchmark
> > comparison versus master. Reporting on all PRs automatically might be quite
> > a bit of work (and load on the machines)
> > You should be able to choose the comparison between any two points:
> > master-PR, master now - master yesterday, etc.
> >
> > >I thought the idea (based on our past e-mail discussions) was that we
> > would implement benchmark collectors (as programs in the Arrow git
> > repository) for each benchmarking framework, starting with gbenchmark and
> > expanding to include ASV (for Python) and then others
> > I'll open a PR and will be happy to put it into Arrow.
> >
> > >It seems like writing the benchmark collector script that runs the
> > benchmarks, collects machine information, and inserts data into an instance
> > of the database is the next milestone. Until that's done it seems difficult
> > to do 

[jira] [Created] (ARROW-5205) [Python] Improved error messages when user erroneously uses a non-local resource URI to open a file

2019-04-23 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5205:
---

 Summary: [Python] Improved error messages when user erroneously 
uses a non-local resource URI to open a file
 Key: ARROW-5205
 URL: https://issues.apache.org/jira/browse/ARROW-5205
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


In a number of places, if a string filepath is passed, it is assumed to be a 
local file. Since we are developing better support for file URIs, we may be 
able to detect that the user has passed an unsupported URI (e.g. something 
starting with "s3:" or "hdfs:") and return a better error message than "local 
file not found".

see

https://stackoverflow.com/questions/55704943/what-could-be-the-explanation-of-this-pyarrow-lib-arrowioerror/55707311#55707311
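A minimal sketch of the detection (the scheme list and function name are illustrative, not pyarrow's actual behavior):

```python
from urllib.parse import urlparse

# Illustrative list only; which schemes to flag would be a pyarrow decision.
NON_LOCAL_SCHEMES = {"s3", "hdfs", "gs", "http", "https"}

def check_local_path(path):
    scheme = urlparse(path).scheme
    # Single-letter "schemes" are Windows drive letters like "C:", not URIs.
    if len(scheme) > 1 and scheme in NON_LOCAL_SCHEMES:
        raise ValueError(
            "Path '%s' uses non-local scheme '%s:'; pass the appropriate "
            "filesystem object instead of a bare path" % (path, scheme))
    return path

assert check_local_path("/tmp/data.parquet") == "/tmp/data.parquet"
```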





Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-23 Thread Matei Zaharia
Just as a note here: if the goal is for the format not to change, why not make 
that explicit in a versioning policy? You can always include a format version 
number and say that future versions may increment the number, but this specific 
version will always be readable in some specific way. You could also put a 
timeline on how long old version numbers will be recognized in the official 
libraries (e.g. 3 years).
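Such a policy can be enforced mechanically; a sketch (hypothetical format with a single version byte up front):

```python
# Versions still readable under the policy; dropping one is an explicit,
# documented decision rather than an accidental breakage.
SUPPORTED_VERSIONS = {1, 2}

def read_payload(blob):
    version, body = blob[0], blob[1:]
    if version not in SUPPORTED_VERSIONS:
        raise ValueError("format version %d is no longer supported" % version)
    return body

assert read_payload(bytes([2]) + b"payload") == b"payload"
```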

Matei

> On Apr 22, 2019, at 6:36 AM, Bobby Evans  wrote:
> 
> Yes, it is technically possible for the layout to change.  No, it is not 
> going to happen.  It is already baked into several different official 
> libraries which are widely used, not just for holding and processing the 
> data, but also for transfer of the data between the various implementations.  
> There would have to be a really serious reason to force an incompatible 
> change at this point.  So in the worst case, we can version the layout and 
> bake that into the API that exposes the internal layout of the data.  That 
> way code that wants to program against a Java API can do so using the API 
> that Spark provides, those who want to interface with something that expects 
> the data in arrow format will already have to know what version of the format 
> it was programmed against and in the worst case if the layout does change we 
> can support the new layout if needed.
> 
> On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler  wrote:
> The Arrow data format is not yet stable, meaning there are no guarantees on 
> backwards/forwards compatibility. Once version 1.0 is released, it will have 
> those guarantees but it's hard to say when that will be. The remaining work 
> to get there can be seen at 
> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
>  So yes, it is a risk that exposing Spark data as Arrow could cause an issue 
> if handled by a different version that is not compatible. That being said, 
> changes to format are not taken lightly and are backwards compatible when 
> possible. I think it would be fair to mark the APIs exposing Arrow data as 
> experimental for the time being, and clearly state the version that must be 
> used to be compatible in the docs. Also, adding features like this and 
> SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0 
> release. Adding the Arrow dev list to CC.
> 
> Bryan
> 
> On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia  wrote:
> Okay, that makes sense, but is the Arrow data format stable? If not, we risk 
> breakage when Arrow changes in the future and some libraries using this 
> feature begin to use the new Arrow code.
> 
> Matei
> 
> > On Apr 20, 2019, at 1:39 PM, Bobby Evans  wrote:
> > 
> > I want to be clear that this SPIP is not proposing exposing Arrow 
> > APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and 
> > because of the overlap between the two SPIPs I scaled this one back to 
> > concentrate just on the columnar processing aspects. Sorry for the 
> > confusion as I didn't update the JIRA description clearly enough when we 
> > adjusted it during the discussion on the JIRA.  As part of the columnar 
> > processing, we plan on providing arrow formatted data, but that will be 
> > exposed through a Spark owned API.
> > 
> > On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia  
> > wrote:
> > FYI, I’d also be concerned about exposing the Arrow API or format as a 
> > public API if it’s not yet stable. Is stabilization of the API and format 
> > coming soon on the roadmap there? Maybe someone can work with the Arrow 
> > community to make that happen.
> > 
> > We’ve been bitten lots of times by API changes forced by external libraries 
> > even when those were widely popular. For example, we used Guava’s Optional 
> > for a while, which changed at some point, and we also had issues with 
> > Protobuf and Scala itself (especially how Scala’s APIs appear in Java). API 
> > breakage might not be as serious in dynamic languages like Python, where 
> > you can often keep compatibility with old behaviors, but it really hurts in 
> > Java and Scala.
> > 
> > The problem is especially bad for us because of two aspects of how Spark is 
> > used:
> > 
> > 1) Spark is used for production data transformation jobs that people need 
> > to keep running for a long time. Nobody wants to make changes to a job 
> > that’s been working fine and computing something correctly for years just 
> > to get a bug fix from the latest Spark release or whatever. It’s much 
> > better if they can upgrade Spark without editing every job.
> > 
> > 2) Spark is often used as “glue” to combine data processing code in other 
> > libraries, and these might start to require different versions of our 
> > dependencies. For example, the Guava class exposed in Spark became a 
> > problem when third-party libraries started requiring a new version of 
> > Guava: those new libraries just couldn’t work with Spark. Protobuf was 
> > especially bad because some 

[jira] [Created] (ARROW-5204) [C++] Improve BufferBuilder performance

2019-04-23 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5204:
-

 Summary: [C++] Improve BufferBuilder performance
 Key: ARROW-5204
 URL: https://issues.apache.org/jira/browse/ARROW-5204
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.13.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


BufferBuilder makes a spurious memset() when extending the buffer size.

We could also tweak the overallocation strategy in Reserve().





[jira] [Created] (ARROW-5203) [GLib] Add support for Compare filter

2019-04-23 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-5203:
---

 Summary: [GLib] Add support for Compare filter
 Key: ARROW-5203
 URL: https://issues.apache.org/jira/browse/ARROW-5203
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro
 Fix For: 0.14.0








[jira] [Created] (ARROW-5202) [C++

2019-04-23 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5202:
-

 Summary: [C++
 Key: ARROW-5202
 URL: https://issues.apache.org/jira/browse/ARROW-5202
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques








[jira] [Created] (ARROW-5201) [Python] Import ABCs from collections is deprecated in Python 3.7

2019-04-23 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5201:


 Summary: [Python] Import ABCs from collections is deprecated in 
Python 3.7
 Key: ARROW-5201
 URL: https://issues.apache.org/jira/browse/ARROW-5201
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


From running the tests, I see a few deprecation warnings related to the fact 
that on Python 3, abstract base classes should be imported from 
``collections.abc`` instead of ``collections``:

{code:none}
pyarrow/tests/test_array.py:808
  /home/joris/scipy/repos/arrow/python/pyarrow/tests/test_array.py:808: 
DeprecationWarning: Using or importing the ABCs from 'collections' instead of 
from 'collections.abc' is deprecated, and in 3.8 it will stop working
    pa.struct([pa.field('a', pa.int64()), pa.field('b', pa.string())]))

pyarrow/tests/test_table.py:18
  /home/joris/scipy/repos/arrow/python/pyarrow/tests/test_table.py:18: 
DeprecationWarning: Using or importing the ABCs from 'collections' instead of 
from 'collections.abc' is deprecated, and in 3.8 it will stop working
    from collections import OrderedDict, Iterable

pyarrow/tests/test_feather.py::TestFeatherReader::test_non_string_columns
  /home/joris/scipy/repos/arrow/python/pyarrow/pandas_compat.py:294: 
DeprecationWarning: Using or importing the ABCs from 'collections' instead of 
from 'collections.abc' is deprecated, and in 3.8 it will stop working
    elif isinstance(name, collections.Sequence):
{code}

These could be imported conditionally on Python 2/3 in the ``pyarrow.compat`` 
module.
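The usual shim looks like this (a sketch of the pattern, not ``pyarrow.compat``'s actual contents):

```python
import sys

if sys.version_info[0] >= 3:
    from collections.abc import Iterable, Sequence
else:  # Python 2: collections.abc does not exist
    from collections import Iterable, Sequence

# Downstream code imports the names from the compat module and works on both.
assert isinstance([1, 2], Sequence)
assert isinstance(iter([1, 2]), Iterable)
```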





[jira] [Created] (ARROW-5200) Provide light-weight arrow APIs

2019-04-23 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5200:
---

 Summary: Provide light-weight arrow APIs
 Key: ARROW-5200
 URL: https://issues.apache.org/jira/browse/ARROW-5200
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan
 Attachments: image-2019-04-23-15-19-34-187.png

We are trying to incorporate Apache Arrow into the Apache Flink runtime. We 
find Arrow an amazing library, which greatly simplifies support for columnar 
data formats.

However, for many scenarios we find the performance unacceptable. Our 
investigation shows the reason is that there are too many redundant checks and 
computations in the Arrow APIs.

For example, the following figure shows that a single call to the 
Float8Vector.get(int) method (one of the most frequently used APIs in Flink 
computation) involves 20+ method invocations.

!image-2019-04-23-15-19-34-187.png!

 

There are many other APIs with similar problems. We believe these checks are 
meant to ensure the integrity of the program, but they also impact performance 
severely: in our evaluation, performance may degrade by two or three orders of 
magnitude compared to accessing data on the heap.

We think that, at least for some scenarios, we can shift the responsibility 
for integrity checks to the application owners. If they can be sure all the 
checks would pass, we can offer them light-weight APIs and the inherent high 
performance.

In the light-weight APIs, we only perform minimal checks, or avoid checks 
altogether. Application owners can still develop and debug their code using 
the original heavy-weight APIs; once all bugs have been fixed, they can switch 
to the light-weight APIs in production and enjoy the resulting high 
performance.
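To illustrate the trade-off (a Python sketch with made-up names; the proposal itself concerns Java APIs such as Float8Vector.get):

```python
class Float8Column:
    """Toy stand-in for a float64 vector with a validity (null) bitmap."""

    def __init__(self, values, validity):
        self.values = values      # the data buffer
        self.validity = validity  # True where the slot is non-null

    def get(self, i):
        # Heavy-weight path: bounds and null checks on every access.
        if not 0 <= i < len(self.values):
            raise IndexError(i)
        if not self.validity[i]:
            raise ValueError("null at index %d" % i)
        return self.values[i]

    def get_unsafe(self, i):
        # Light-weight path: the caller guarantees i is in range and non-null.
        return self.values[i]

col = Float8Column([1.0, 2.5], [True, True])
assert col.get(1) == 2.5
assert col.get_unsafe(1) == 2.5
```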

 


