[jira] [Created] (ARROW-6849) Can not read a list of items type

2019-10-10 Thread Yevgeni Litvin (Jira)
Yevgeni Litvin created ARROW-6849:
-

 Summary: Can not read a list of items type 
 Key: ARROW-6849
 URL: https://issues.apache.org/jira/browse/ARROW-6849
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.0
Reporter: Yevgeni Litvin
 Attachments: test_bad_parquet.tgz

A field having a type of list-of-ints cannot be read using the 
{{pyarrow.parquet.read_table}} function.

This happens only in pyarrow 0.15.0. When downgrading to 0.14.1, the issue is 
not observed.

pyspark version: 2.4.4

Minimal snippet to reproduce the issue:

 
{code:python}
import pyarrow.parquet as pq
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, Row

output_url = '/tmp/test_bad_parquet'
spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField('int_fixed_size_list',
                                 ArrayType(IntegerType(), False), False)])
rows = [Row(int_fixed_size_list=[1, 2, 3])]
spark.createDataFrame(rows, schema).write.mode('overwrite').parquet(output_url)

pq.read_table(output_url)
{code}
I get an error:
{code}
Traceback (most recent call last):
  File "/home/yevgeni/uatc/dataset-toolkit/repro_failure.py", line 13, in <module>
    pq.read_table(output_url)
  File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1281, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1137, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 605, in read
    table = reader.read(**options)
  File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 253, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column data for field 0 with type list is inconsistent with schema list

Process finished with exit code 1
{code}
 

Column data for field 0 with type list is inconsistent with schema list

A parquet store, as generated by the snippet, is attached.
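
A possible stopgap until a fixed release is out (a sketch; it assumes, per the report above, that only the 0.15.0 wheel is affected):

{code:python}
import pyarrow

# ARROW-6849 is observed on pyarrow 0.15.0 but not on 0.14.1, so guard
# (or pin "pyarrow!=0.15.0") until a fixed release is available.
if pyarrow.__version__ == "0.15.0":
    raise RuntimeError(
        "pyarrow 0.15.0 cannot read this list-of-int column (ARROW-6849); "
        "downgrade to 0.14.1 or upgrade once a fix ships")
{code}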





Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-10 Thread Renjie Liu
Thanks, Wes. Sure, I'll fix it.

Wes McKinney wrote on Fri, Oct 11, 2019 at 6:10 AM:

> I just merged the PR https://github.com/apache/arrow-testing/pull/11
>
> Various aspects of this make me uncomfortable so I hope they can be
> addressed in follow up work
>
> On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu 
> wrote:
> >
> > I've created a ticket to track this here:
> > https://issues.apache.org/jira/browse/ARROW-6845
> >
> > For the moment, can we check in those pregenerated data to unblock the
> > Rust version's Arrow reader?
> >
> > On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu 
> wrote:
> >
> > > It would be fine in that case.
> > >
> > > Wes McKinney wrote on Thu, Oct 10, 2019 at 12:58 PM:
> > >
> > >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu 
> > >> wrote:
> > >> >
> > >> > 1. There already exists a low-level Parquet writer which can produce
> > >> > Parquet files, so unit tests should be fine. But a writer from Arrow
> > >> > to Parquet doesn't exist yet, and it may take some time to finish it.
> > >> > 2. In fact my data are randomly generated and it's definitely
> > >> > reproducible. However, I don't think it would be a good idea to
> > >> > randomly generate data every time we run CI, because it would be
> > >> > difficult to debug. For example, if PR A introduced a bug that is
> > >> > triggered in another PR's build, it would be confusing for
> > >> > contributors.
> > >>
> > >> Presumably any random data generation would use a fixed seed precisely
> > >> to be reproducible.
> > >>
> > >> > 3. I think it would be a good idea to spend effort on integration
> > >> > tests with Parquet, because it's an important use case of Arrow.
> > >> > Also, a similar approach could be extended to other languages and
> > >> > other file formats (Avro, ORC).
> > >> >
> > >> >
> > >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney 
> > >> wrote:
> > >> >
> > >> > > There are a number of issues worth discussion.
> > >> > >
> > >> > > 1. What is the timeline/plan for Rust implementing a Parquet
> _writer_?
> > >> > > It's OK to be reliant on other libraries in the short term to
> produce
> > >> > > files to test against, but does not strike me as a sustainable
> > >> > > long-term plan. Fixing bugs can be a lot more difficult than it
> needs
> > >> > > to be if you can't write targeted "endogenous" unit tests
> > >> > >
> > >> > > 2. Reproducible data generation
> > >> > >
> > >> > > I think if you're going to test against a pre-generated corpus,
> you
> > >> > > should make sure that generating the corpus is reproducible for
> other
> > >> > > developers (i.e. with a Dockerfile), and can be extended by
> adding new
> > >> > > files or random data generation.
> > >> > >
> > >> > > I additionally would prefer generating the test corpus at test
> time
> > >> > > rather than checking in binary files. If this isn't viable right
> now
> > >> > > we can create an "arrow-rust-crutch" git repository for you to
> stash
> > >> > > binary files until some of these testing scalability issues are
> > >> > > addressed.
> > >> > >
> > >> > > If we're going to spend energy on Parquet integration testing with
> > >> > > Java, this would be a good opportunity to do the work in a way
> where
> > >> > > the C++ Parquet library can also participate (since we ought to be
> > >> > > doing integration tests with Java, and we can also read JSON
> files to
> > >> > > Arrow).
> > >> > >
> > >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <
> liurenjie2...@gmail.com>
> > >> > > wrote:
> > >> > > >
> > >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <
> andygrov...@gmail.com>
> > >> > > wrote:
> > >> > > >
> > >> > > > > I'm very interested in helping to find a solution to this
> because
> > >> we
> > >> > > really
> > >> > > > > do need integration tests for Rust to make sure we're
> compatible
> > >> with
> > >> > > other
> > >> > > > > implementations... there is also the ongoing CI dockerization
> work
> > >> > > that I
> > >> > > > > feel is related.
> > >> > > > >
> > >> > > > > I haven't looked at the current integration tests yet and
> would
> > >> > > appreciate
> > >> > > > > some pointers on how all of this works (do we have docs?) or
> > >> where to
> > >> > > start
> > >> > > > > looking.
> > >> > > > >
> > >> > > > I have a test in my latest PR:
> > >> https://github.com/apache/arrow/pull/5523
> > >> > > > And here is the generated data:
> > >> > > > https://github.com/apache/arrow-testing/pull/11
> > >> > > > As for the program to generate these data, it's just a simple
> > >> > > > Java program. I'm not sure whether we need to integrate it into
> > >> > > > Arrow.
> > >> > > >
> > >> > > > >
> > >> > > > > I imagine the integration test could follow the approach that
> > >> Renjie is
> > >> > > > > outlining where we call Java to generate some files and then
> call
> > >> Rust
> > >> > > to
> > >> > > > > parse them?
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > >
> > >> > > > > Andy.
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> 

Re: [DRAFT] Apache Arrow Board Report - October 2019

2019-10-10 Thread Jacques Nadeau
Hey there, I meant to remove the issues section at top and replace with the
one in the community health section but forgot to remove the top part. I
just submitted with the removed top part. Let me know if people want me to
further edit.

Thanks

On Thu, Oct 10, 2019 at 1:54 PM Antoine Pitrou  wrote:

>
> It's good with me.
>
> Regards
>
> Antoine.
>
>
> Le 10/10/2019 à 22:51, Jacques Nadeau a écrit :
> > Antoine, is my synopsis fair?
> >
> > On Thu, Oct 10, 2019 at 12:53 PM Wes McKinney 
> wrote:
> >
> >> +1
> >>
> >> On Thu, Oct 10, 2019, 2:12 PM Jacques Nadeau 
> wrote:
> >>
> >>> Proposed report update below. LMK your thoughts.
> >>>
> >>> ## Description:
> >>> The mission of Apache Arrow is the creation and maintenance of software
> >>> related to columnar in-memory processing and data interchange
> >>>
> >>> ## Issues:
> >>>
> >>> * We are struggling with Continuous Integration scalability as the
> >> project
> >>> has
> >>>   definitely outgrown what Travis CI and Appveyor can do for us. Some
> >>>   contributors have shown reluctance to submit patches they aren't sure
> >>> about
> >>>   because they don't want to pile on the build queue. We are exploring
> >>>   alternative solutions such as Buildbot, Buildkite, and GitHub Actions
> >> to
> >>>   provide a path to migrate away from Travis CI / Appveyor. In our
> >> request
> >>> to
> >>>   Infrastructure INFRA-19217, some of us were alarmed to find that a
> >> CI/CD
> >>>   service like Buildkite may not be able to be connected to the @apache
> >>> GitHub
> >>>   account on account of requiring admin access to repository webhooks,
> >> but
> >>> no
> >>>   ability to modify source code. There are workarounds (building custom
> >>> OAuth
> >>>   bots) that could enable us to use Buildkite, but it would require
> extra
> >>>   development and result in a less refined experience for community
> >>> members.
> >>>
> >>>
> >>>
> >>> ## Membership Data:
> >>> * Apache Arrow was founded 2016-01-19 (4 years ago)
> >>> * There are currently 48 committers and 28 PMC members in this project.
> >>> * The Committer-to-PMC ratio is roughly 3:2.
> >>>
> >>> Community changes, past quarter:
> >>> - Micah Kornfield was added to the PMC on 2019-08-21
> >>> - Sebastien Binet was added to the PMC on 2019-08-21
> >>> - Ben Kietzman was added as committer on 2019-09-07
> >>> - David Li was added as committer on 2019-08-30
> >>> - Kenta Murata was added as committer on 2019-09-05
> >>> - Neal Richardson was added as committer on 2019-09-05
> >>> - Praveen Kumar was added as committer on 2019-07-14
> >>>
> >>> ## Project Activity:
> >>>
> >>> * The project has just made a 0.15.0 release.
> >>> * We are discussing ways to make the Arrow libraries as accessible as
> >>> possible
> >>>   to downstream projects for minimal use cases while allowing the
> >>> development
> >>>   of more comprehensive "standard libraries" with larger dependency
> >> stacks
> >>> in
> >>>   the project
> >>> * We plan to make a 1.0.0 release as our next major release, at which
> >> time
> >>> we
> >>>   will declare that the Arrow binary protocol is stable with forward
> and
> >>>   backward compatibility guarantees
> >>>
> >>> ## Community Health:
> >>>
> >>> * The community is continuing to grow at a great rate. We see good
> >> adoption
> >>>   among many other projects and fast growth of key metrics.
> >>> * Many contributors are struggling with the slowness of pre-commit CI.
> >>> Arrow
> >>>   has a large number of different platforms and components and a
> complex
> >>> build
> >>>   matrix. As new commits come in, they frequently take a long time to
> >>>   complete. The community is trying several ways to solve this. There
> is
> >>>   bubbling frustration in the community around the GitHub repo rules
> for
> >>> using
> >>>   third party services. This is especially challenging when there are
> >> free
> >>>   solutions to relieve the community pressure but the community is
> unable
> >>> to
> >>>   access these resources. This frustration is greatest among people who
> >>> work
> >>>   on many non-asf OSS projects which don't have such restrictive rules
> >>>   around GitHub.  Some examples of ways the community has tried to
> >> resolve
> >>>   these have included:
> >>>   * Try to use CircleCI, rejected in INFRA-15964
> >>>   * Try to use Azure Pipelines, rejected in INFRA-17030
> >>>   * Try to resolve issues with Travis CI capacity: INFRA-18533 &
> >>> https://s.apache.org/ci-capacity (no resolution beyond "find
> >>> donations")
> >>>   * The creation of new infrastructure design (in progress but a huge
> >>> amount of
> >>> thankless work)
> >>> * While the community has seen great growth in contribution (more than
> >> 300
> >>>   unique contributors at this point), the vast majority are casual
> >>>   contributors. The daily active committers (the workhorses of the
> >> project
> >>>   that bear the load committing the constant PRs, more than 5000 closed
> >> at
> 

Re: Field metadata not retrievable from parquet file

2019-10-10 Thread Isaac Myers
Thanks for the quick response. When I use pyspark to read a parquet file 
written by Arrow, I can't even see file-level metadata. Is that also a known 
issue? (Note: I searched the JIRA issues and couldn't find any info.)


Sent with ProtonMail Secure Email.

‐‐‐ Original Message ‐‐‐
On Thursday, October 10, 2019 12:44 PM, Wes McKinney  
wrote:

> We haven't implemented storing field-level metadata in Parquet files
> yet. It's somewhat tricky. See
> https://issues.apache.org/jira/browse/ARROW-4359
>
> On Thu, Oct 10, 2019 at 11:51 AM Isaac Myers
> isaacmy...@protonmail.com.invalid wrote:
>
> > I can write both field- and schema-level metadata and read the values back 
> > from schema or relevant field. I write the schema and table described by 
> > the schema to a local parquet file. Upon reading the table or schema from 
> > the parquet file, only schema metadata are present and field metadata are 
> > not present. Am I doing something wrong? Please view the minimum working 
> > example below:
> > 
> > #include <cstdio>
> > #include <memory>
> > #include <string>
> > #include <unordered_map>
> > #include <vector>
> > #include <arrow/api.h>
> > #include <arrow/io/api.h>
> > #include <arrow/util/key_value_metadata.h>
> > #include <parquet/arrow/reader.h>
> > #include <parquet/arrow/writer.h>
> >
> > int main(int argc, char* argv[])
> > {
> > /*
> > Create Parquet File
> > */
> > arrow::Status st;
> > arrow::MemoryPool* pool = arrow::default_memory_pool();
> >
> > // Create Schema and fields with metadata
> > std::vector<std::shared_ptr<arrow::Field>> fields;
> >
> > std::unordered_map<std::string, std::string> a_keyval;
> > a_keyval["unit"] = "sec";
> > a_keyval["note"] = "not the standard millisecond unit";
> > arrow::KeyValueMetadata a_md(a_keyval);
> > std::shared_ptr<arrow::Field> a_field = arrow::field("a", arrow::int16(),
> > false, a_md.Copy());
> > fields.push_back(a_field);
> >
> > std::unordered_map<std::string, std::string> b_keyval;
> > b_keyval["unit"] = "ft";
> > arrow::KeyValueMetadata b_md(b_keyval);
> > std::shared_ptr<arrow::Field> b_field = arrow::field("b", arrow::int16(),
> > false, b_md.Copy());
> > fields.push_back(b_field);
> >
> > std::shared_ptr<arrow::Schema> schema = arrow::schema(fields);
> >
> > // Add metadata to schema.
> > std::unordered_map<std::string, std::string> schema_keyval;
> > schema_keyval["classification"] = "Type 0";
> > arrow::KeyValueMetadata schema_md(schema_keyval);
> > schema = schema->AddMetadata(schema_md.Copy());
> >
> > // Build arrays of data and add to Table.
> > const int64_t rowgroup_size = 100;
> > std::vector<int16_t> a_data(rowgroup_size, 0);
> > std::vector<int16_t> b_data(rowgroup_size, 0);
> >
> > for (int16_t i = 0; i < rowgroup_size; i++)
> > {
> > a_data[i] = i;
> > b_data[i] = rowgroup_size - i;
> > }
> >
> > arrow::Int16Builder a_bldr(pool);
> > arrow::Int16Builder b_bldr(pool);
> > st = a_bldr.Resize(rowgroup_size);
> > if (!st.ok()) return 1;
> > st = b_bldr.Resize(rowgroup_size);
> > if (!st.ok()) return 1;
> > st = a_bldr.AppendValues(a_data);
> > if (!st.ok()) return 1;
> > st = b_bldr.AppendValues(b_data);
> > if (!st.ok()) return 1;
> >
> > std::shared_ptr<arrow::Array> a_arr_ptr;
> > std::shared_ptr<arrow::Array> b_arr_ptr;
> >
> > arrow::ArrayVector arr_vec;
> > st = a_bldr.Finish(&a_arr_ptr);
> > if (!st.ok()) return 1;
> > arr_vec.push_back(a_arr_ptr);
> > st = b_bldr.Finish(&b_arr_ptr);
> > if (!st.ok()) return 1;
> > arr_vec.push_back(b_arr_ptr);
> >
> > std::shared_ptr<arrow::Table> table = arrow::Table::Make(schema, arr_vec);
> >
> > // Test metadata
> > printf("\nMetadata from original schema:\n");
> > printf("%s\n", schema->metadata()->ToString().c_str());
> > printf("%s\n", schema->field(0)->metadata()->ToString().c_str());
> > printf("%s\n", schema->field(1)->metadata()->ToString().c_str());
> >
> > std::shared_ptr<arrow::Schema> table_schema = table->schema();
> > printf("\nMetadata from schema retrieved from table (should be the
> > same):\n");
> > printf("%s\n", table_schema->metadata()->ToString().c_str());
> > printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str());
> > printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str());
> >
> > // Open file and write table.
> > std::string file_name = "test.parquet";
> > std::shared_ptr<arrow::io::FileOutputStream> ostream;
> > st = arrow::io::FileOutputStream::Open(file_name, &ostream);
> > if (!st.ok()) return 1;
> >
> > std::unique_ptr<parquet::arrow::FileWriter> writer;
> > std::shared_ptr<parquet::WriterProperties> props =
> > parquet::default_writer_properties();
> > st = parquet::arrow::FileWriter::Open(*schema, pool, ostream, props,
> > &writer);
> > if (!st.ok()) return 1;
> > st = writer->WriteTable(*table, rowgroup_size);
> > if (!st.ok()) return 1;
> >
> > // Close file and stream.
> > st = writer->Close();
> > if (!st.ok()) return 1;
> > st = ostream->Close();
> > if (!st.ok()) return 1;
> >
> > /*
> > Read Parquet File
> > **/
> >
> > // Create new memory pool. Not sure if this is necessary.
> > //arrow::MemoryPool* pool2 = arrow::default_memory_pool();
> >
> > // Open file reader.
> > std::shared_ptr<arrow::io::ReadableFile> input_file;
> > st = arrow::io::ReadableFile::Open(file_name, pool, &input_file);
> > 

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-10 Thread Wes McKinney
I just merged the PR https://github.com/apache/arrow-testing/pull/11

Various aspects of this make me uncomfortable so I hope they can be
addressed in follow up work

On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu  wrote:
>
> I've created a ticket to track this here:
> https://issues.apache.org/jira/browse/ARROW-6845
>
> For the moment, can we check in those pregenerated data to unblock the
> Rust version's Arrow reader?
>
> On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu  wrote:
>
> > It would be fine in that case.
> >
> > Wes McKinney wrote on Thu, Oct 10, 2019 at 12:58 PM:
> >
> >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu 
> >> wrote:
> >> >
> >> > 1. There already exists a low-level Parquet writer which can produce
> >> > Parquet files, so unit tests should be fine. But a writer from Arrow
> >> > to Parquet doesn't exist yet, and it may take some time to finish it.
> >> > 2. In fact my data are randomly generated and it's definitely
> >> > reproducible. However, I don't think it would be a good idea to
> >> > randomly generate data every time we run CI, because it would be
> >> > difficult to debug. For example, if PR A introduced a bug that is
> >> > triggered in another PR's build, it would be confusing for
> >> > contributors.
> >>
> >> Presumably any random data generation would use a fixed seed precisely
> >> to be reproducible.
> >>
> >> > 3. I think it would be a good idea to spend effort on integration
> >> > tests with Parquet, because it's an important use case of Arrow.
> >> > Also, a similar approach could be extended to other languages and
> >> > other file formats (Avro, ORC).
> >> >
> >> >
> >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney 
> >> wrote:
> >> >
> >> > > There are a number of issues worth discussion.
> >> > >
> >> > > 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
> >> > > It's OK to be reliant on other libraries in the short term to produce
> >> > > files to test against, but does not strike me as a sustainable
> >> > > long-term plan. Fixing bugs can be a lot more difficult than it needs
> >> > > to be if you can't write targeted "endogenous" unit tests
> >> > >
> >> > > 2. Reproducible data generation
> >> > >
> >> > > I think if you're going to test against a pre-generated corpus, you
> >> > > should make sure that generating the corpus is reproducible for other
> >> > > developers (i.e. with a Dockerfile), and can be extended by adding new
> >> > > files or random data generation.
> >> > >
> >> > > I additionally would prefer generating the test corpus at test time
> >> > > rather than checking in binary files. If this isn't viable right now
> >> > > we can create an "arrow-rust-crutch" git repository for you to stash
> >> > > binary files until some of these testing scalability issues are
> >> > > addressed.
> >> > >
> >> > > If we're going to spend energy on Parquet integration testing with
> >> > > Java, this would be a good opportunity to do the work in a way where
> >> > > the C++ Parquet library can also participate (since we ought to be
> >> > > doing integration tests with Java, and we can also read JSON files to
> >> > > Arrow).
> >> > >
> >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu 
> >> > > wrote:
> >> > > >
> >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove 
> >> > > wrote:
> >> > > >
> >> > > > > I'm very interested in helping to find a solution to this because
> >> we
> >> > > really
> >> > > > > do need integration tests for Rust to make sure we're compatible
> >> with
> >> > > other
> >> > > > > implementations... there is also the ongoing CI dockerization work
> >> > > that I
> >> > > > > feel is related.
> >> > > > >
> >> > > > > I haven't looked at the current integration tests yet and would
> >> > > appreciate
> >> > > > > some pointers on how all of this works (do we have docs?) or
> >> where to
> >> > > start
> >> > > > > looking.
> >> > > > >
> >> > > > I have a test in my latest PR:
> >> https://github.com/apache/arrow/pull/5523
> >> > > > And here is the generated data:
> >> > > > https://github.com/apache/arrow-testing/pull/11
> >> > > > As for the program to generate these data, it's just a simple
> >> > > > Java program. I'm not sure whether we need to integrate it into
> >> > > > Arrow.
> >> > > >
> >> > > > >
> >> > > > > I imagine the integration test could follow the approach that
> >> Renjie is
> >> > > > > outlining where we call Java to generate some files and then call
> >> Rust
> >> > > to
> >> > > > > parse them?
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > Andy.
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <
> >> liurenjie2...@gmail.com>
> >> > > wrote:
> >> > > > >
> >> > > > > > Hi:
> >> > > > > >
> >> > > > > > I'm developing the Rust version of a reader which reads
> >> > > > > > Parquet into Arrow arrays. To verify the correctness of this
> >> > > > > > reader, I use the following approach:
> >> > > > > >
> >> > > > > >

Re: [DRAFT] Apache Arrow Board Report - October 2019

2019-10-10 Thread Jacques Nadeau
Antoine, is my synopsis fair?

On Thu, Oct 10, 2019 at 12:53 PM Wes McKinney  wrote:

> +1
>
> On Thu, Oct 10, 2019, 2:12 PM Jacques Nadeau  wrote:
>
> > Proposed report update below. LMK your thoughts.
> >
> > ## Description:
> > The mission of Apache Arrow is the creation and maintenance of software
> > related to columnar in-memory processing and data interchange
> >
> > ## Issues:
> >
> > * We are struggling with Continuous Integration scalability as the
> project
> > has
> >   definitely outgrown what Travis CI and Appveyor can do for us. Some
> >   contributors have shown reluctance to submit patches they aren't sure
> > about
> >   because they don't want to pile on the build queue. We are exploring
> >   alternative solutions such as Buildbot, Buildkite, and GitHub Actions
> to
> >   provide a path to migrate away from Travis CI / Appveyor. In our
> request
> > to
>   Infrastructure INFRA-19217, some of us were alarmed to find that a
> CI/CD
> >   service like Buildkite may not be able to be connected to the @apache
> > GitHub
> >   account on account of requiring admin access to repository webhooks,
> but
> > no
> >   ability to modify source code. There are workarounds (building custom
> > OAuth
> >   bots) that could enable us to use Buildkite, but it would require extra
> >   development and result in a less refined experience for community
> > members.
> >
> >
> >
> > ## Membership Data:
> > * Apache Arrow was founded 2016-01-19 (4 years ago)
> > * There are currently 48 committers and 28 PMC members in this project.
> > * The Committer-to-PMC ratio is roughly 3:2.
> >
> > Community changes, past quarter:
> > - Micah Kornfield was added to the PMC on 2019-08-21
> > - Sebastien Binet was added to the PMC on 2019-08-21
> > - Ben Kietzman was added as committer on 2019-09-07
> > - David Li was added as committer on 2019-08-30
> > - Kenta Murata was added as committer on 2019-09-05
> > - Neal Richardson was added as committer on 2019-09-05
> > - Praveen Kumar was added as committer on 2019-07-14
> >
> > ## Project Activity:
> >
> > * The project has just made a 0.15.0 release.
> > * We are discussing ways to make the Arrow libraries as accessible as
> > possible
> >   to downstream projects for minimal use cases while allowing the
> > development
> >   of more comprehensive "standard libraries" with larger dependency
> stacks
> > in
> >   the project
> > * We plan to make a 1.0.0 release as our next major release, at which
> time
> > we
> >   will declare that the Arrow binary protocol is stable with forward and
> >   backward compatibility guarantees
> >
> > ## Community Health:
> >
> > * The community is continuing to grow at a great rate. We see good
> adoption
> >   among many other projects and fast growth of key metrics.
> > * Many contributors are struggling with the slowness of pre-commit CI.
> > Arrow
> >   has a large number of different platforms and components and a complex
> > build
> >   matrix. As new commits come in, they frequently take a long time to
> >   complete. The community is trying several ways to solve this. There is
> >   bubbling frustration in the community around the GitHub repo rules for
> > using
> >   third party services. This is especially challenging when there are
> free
> >   solutions to relieve the community pressure but the community is unable
> > to
> >   access these resources. This frustration is greatest among people who
> > work
> >   on many non-asf OSS projects which don't have such restrictive rules
> >   around GitHub.  Some examples of ways the community has tried to
> resolve
> >   these have included:
> >   * Try to use CircleCI, rejected in INFRA-15964
> >   * Try to use Azure Pipelines, rejected in INFRA-17030
>   * Try to resolve issues with Travis CI capacity: INFRA-18533 &
> > https://s.apache.org/ci-capacity (no resolution beyond "find
> > donations")
> >   * The creation of new infrastructure design (in progress but a huge
> > amount of
> > thankless work)
> > * While the community has seen great growth in contribution (more than
> 300
> >   unique contributors at this point), the vast majority are casual
> >   contributors. The daily active committers (the workhorses of the
> project
> >   that bear the load committing the constant PRs, more than 5000 closed
> at
> >   this point) have been growing slower than adoption. This is despite the
> > fact
> >   that the community has been very aggressive at being inclusive of new
> >   committers (with likelihood to have more than 50 in the next week). The
> >   community is still continuing to try to brainstorm ways to improve
> this.
> >
>


pyarrow and macOS 10.15

2019-10-10 Thread Brian Hulette
In Beam we've had a few users report issues importing Beam Python after
upgrading to macOS 10.15 Catalina, and it seems like our pyarrow import is
the root cause [1]. Given that I don't see any reports of this on the Arrow
side, I suspect that this is an issue just with pyarrow 0.14 (in Beam we've
restricted to <0.15 [2]). Can anyone confirm that the PyPI release of
pyarrow 0.15 is working on macOS 10.15?

Thanks,
Brian

[1] https://issues.apache.org/jira/browse/BEAM-8368
[2] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L122
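
(A minimal smoke test of the kind being asked for might look like the
following; this is a hypothetical check, not code from the thread.)

import pyarrow as pa

# On an affected Catalina install the import above already fails; building
# an array additionally exercises the compiled C++ extension modules.
print(pa.__version__)
print(pa.array([1, 2, 3]))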


[jira] [Created] (ARROW-6848) [C++] Specify -std=c++11 instead of -std=gnu++11 when building

2019-10-10 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-6848:


 Summary: [C++] Specify -std=c++11 instead of -std=gnu++11 when 
building
 Key: ARROW-6848
 URL: https://issues.apache.org/jira/browse/ARROW-6848
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng


Relevant discussion:

[https://lists.apache.org/thread.html/5807e65d865c1736b3a7a32653ca8bb405d719eb13b8a10b6fe0e904@%3Cdev.arrow.apache.org%3E]

In addition to {{set(CMAKE_CXX_STANDARD 11)}}, we also need
{{set(CMAKE_CXX_EXTENSIONS OFF)}} in order to turn off compiler-specific
extensions (with GCC, the default is -std=gnu++11).

 

This is supposed to be a no-op, because Arrow builds fine with other compilers 
(Clang/LLVM, MSVC). But opening this bug to track any issues with flipping the 
switch.

 





Re: [VOTE] Release Apache Arrow 0.15.0 - RC2

2019-10-10 Thread Joris Van den Bossche
Wes, if you don't get to it today, I can try to update the docs tomorrow.

Joris

On Thu, 10 Oct 2019 at 21:51, Neal Richardson 
wrote:

> I updated the R docs because I had everything I needed to do that
> locally: https://github.com/apache/arrow-site/pull/30 Doing the others
> wasn't feasible for me on my computer (I don't have CUDA, and the case
> insensitivity of the macOS file system always bites me with the
> pyarrow docs anyway).
>
> IMO improving our CI/CD around documentation should be a priority for 1.0.
>
> Neal
>
> On Thu, Oct 10, 2019 at 12:03 PM Wes McKinney  wrote:
> >
> > The docs on http://arrow.apache.org/docs/ haven't been updated yet.
> > This happened the last release, too -- I ended up updating the docs
> > manually after a week or two. Is this included in the release
> > management guide? If no one beats me to it, I can update the docs by
> > hand again later today
> >
> > On Mon, Oct 7, 2019 at 6:20 PM Wes McKinney  wrote:
> > >
> > > I think we might be a little aggressive at removing artifacts from the
> > > dist system
> > >
> > > Can we change our process to only remove old dist artifacts when we
> > > are about to upload a new RC? Otherwise it's harder to make
> > > improvements to the release verification scripts without any old RC to
> > > test against
> > >
> > > On Mon, Oct 7, 2019 at 5:17 PM Neal Richardson
> > >  wrote:
> > > >
> > > > The R package has been accepted by CRAN. Binaries for macOS and
> > > > Windows should become available in the next few days.
> > > >
> > > > Neal
> > > >
> > > > On Mon, Oct 7, 2019 at 1:41 AM Krisztián Szűcs
> > > >  wrote:
> > > > >
> > > > > Thanks Andy!
> > > > >
> > > > > I've just removed the RC source artefacts from SVN.
> > > > >
> > > > > We have two remaining post release tasks:
> > > > > - homebrew
> > > > > - apidocs
> > > > >
> > > > > On Mon, Oct 7, 2019 at 1:47 AM Andy Grove 
> wrote:
> > > > >
> > > > > > I released the Rust crates from the RC2 source tarball. I had to
> comment
> > > > > > out the benchmark references in the Cargo.toml first since the
> tarball does
> > > > > > not include the benchmark source code. I filed
> > > > > > https://issues.apache.org/jira/browse/ARROW-6801 for this bug
> and will
> > > > > > fix the packaging before the 1.0.0 release.
> > > > > >
> > > > > > On Sun, Oct 6, 2019 at 2:01 AM Krisztián Szűcs <
> szucs.kriszt...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> The rust publishing script fails because it cannot find the
> benchmarks.
> > > > > >> Seems to be related to cargo changes.
> > > > > >> I cannot investigate it right now, @Andy could you take a look?
> > > > > >>
> > > > > >> On Sun, Oct 6, 2019, 9:11 AM Krisztián Szűcs <
> szucs.kriszt...@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> - published js packages to npm, please check that they are
> working
> > > > > >>> properly
> > > > > >>>
> > > > > >>> On Sat, Oct 5, 2019 at 10:14 PM Neal Richardson <
> > > > > >>> neal.p.richard...@gmail.com> wrote:
> > > > > >>>
> > > > >  R release steps per
> > > > > 
> > > > > 
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRpackages
> > > > >  are underway.
> > > > > 
> > > > >  Neal
> > > > > 
> > > > >  On Sat, Oct 5, 2019 at 8:40 AM Krisztián Szűcs
> > > > >   wrote:
> > > > >  >
> > > > >  > - website updated with the release notes
> > > > >  > - conda-forge updates are merged
> > > > >  >
> > > > >  > Remaining:
> > > > >  > - Javascript
> > > > >  > - Rust
> > > > >  > - R
> > > > >  > - Homebrew
> > > > >  > - Apidocs
> > > > >  >
> > > > >  > On Sat, Oct 5, 2019 at 2:58 PM Sutou Kouhei <
> k...@clear-code.com>
> > > > >  wrote:
> > > > >  >
> > > > >  > > - uploaded C# packages
> > > > >  > >
> > > > >  > > In  > > > >  t1k_vz68rcb3m...@mail.gmail.com>
> > > > >  > >   "Re: [VOTE] Release Apache Arrow 0.15.0 - RC2" on Sat,
> 5 Oct 2019
> > > > >  > > 14:50:51 +0200,
> > > > >  > >   Krisztián Szűcs  wrote:
> > > > >  > >
> > > > >  > > > - uploaded python wheels to pypi
> > > > >  > > > - uploaded java artifacts to maven central
> > > > >  > > >
> > > > >  > > > I'm going to update the conda recipes.
> > > > >  > > >
> > > > >  > > > Remaining:
> > > > >  > > > - Javascript
> > > > >  > > > - Rust
> > > > >  > > > - C#
> > > > >  > > > - R
> > > > >  > > > - Homebrew
> > > > >  > > > - Site
> > > > >  > > >
> > > > >  > > >
> > > > >  > > >
> > > > >  > > > On Sat, Oct 5, 2019 at 2:29 PM Krisztián Szűcs <
> > > > >  > > szucs.kriszt...@gmail.com>
> > > > >  > > > wrote:
> > > > >  > > >
> > > > >  > > >> - rebased master
> > > > >  > > >> - rebased the pull requests
> > > > >  > > >> - released the jira version
> > > > >  > > >> - started the new jira version
> > > > >  > > >> - 

Re: [DRAFT] Apache Arrow Board Report - October 2019

2019-10-10 Thread Wes McKinney
+1

On Thu, Oct 10, 2019, 2:12 PM Jacques Nadeau  wrote:

> Proposed report update below. LMK your thoughts.
>
> ## Description:
> The mission of Apache Arrow is the creation and maintenance of software
> related to columnar in-memory processing and data interchange
>
> ## Issues:
>
> * We are struggling with Continuous Integration scalability as the project
> has
>   definitely outgrown what Travis CI and Appveyor can do for us. Some
>   contributors have shown reluctance to submit patches they aren't sure
> about
>   because they don't want to pile on the build queue. We are exploring
>   alternative solutions such as Buildbot, Buildkite, and GitHub Actions to
>   provide a path to migrate away from Travis CI / Appveyor. In our request
> to
>   Infrastructure INFRA-19217, some of us were alarmed to find that a CI/CD
>   service like Buildkite may not be able to be connected to the @apache
> GitHub
>   account on account of requiring admin access to repository webhooks, but
> no
>   ability to modify source code. There are workarounds (building custom
> OAuth
>   bots) that could enable us to use Buildkite, but it would require extra
>   development and result in a less refined experience for community
> members.
>
>
>
> ## Membership Data:
> * Apache Arrow was founded 2016-01-19 (4 years ago)
> * There are currently 48 committers and 28 PMC members in this project.
> * The Committer-to-PMC ratio is roughly 3:2.
>
> Community changes, past quarter:
> - Micah Kornfield was added to the PMC on 2019-08-21
> - Sebastien Binet was added to the PMC on 2019-08-21
> - Ben Kietzman was added as committer on 2019-09-07
> - David Li was added as committer on 2019-08-30
> - Kenta Murata was added as committer on 2019-09-05
> - Neal Richardson was added as committer on 2019-09-05
> - Praveen Kumar was added as committer on 2019-07-14
>
> ## Project Activity:
>
> * The project has just made a 0.15.0 release.
> * We are discussing ways to make the Arrow libraries as accessible as
> possible
>   to downstream projects for minimal use cases while allowing the
> development
>   of more comprehensive "standard libraries" with larger dependency stacks
> in
>   the project
> * We plan to make a 1.0.0 release as our next major release, at which time
> we
>   will declare that the Arrow binary protocol is stable with forward and
>   backward compatibility guarantees
>
> ## Community Health:
>
> * The community is continuing to grow at a great rate. We see good adoption
>   among many other projects and fast growth of key metrics.
> * Many contributors are struggling with the slowness of pre-commit CI.
> Arrow
>   has a large number of different platforms and components and a complex
> build
>   matrix. As new commits come in, they frequently take a long time to
>   complete. The community is trying several ways to solve this. There is
>   bubbling frustration in the community around the GitHub repo rules for
> using
>   third party services. This is especially challenging when there are free
>   solutions to relieve the community pressure but the community is unable
> to
>   access these resources. This frustration is greatest among people who
> work
>   on many non-asf OSS projects which don't have such restrictive rules
>   around GitHub.  Some examples of ways the community has tried to resolve
>   these have included:
>   * Try to use CircleCI, rejected in INFRA-15964
>   * Try to use Azure Pipelines, rejected in INFRA-17030
>   * Try to resolve issues with Travis CI capacity: INFRA-18533 &
> https://s.apache.org/ci-capacity (no resolution beyond "find
> donations")
>   * The creation of new infrastructure design (in progress but a huge
> amount of
> thankless work)
> * While the community has seen great growth in contribution (more than 300
>   unique contributors at this point), the vast majority are casual
>   contributors. The daily active committers (the workhorses of the project
>   that bear the load committing the constant PRs, more than 5000 closed at
>   this point) have been growing slower than adoption. This is despite the
> fact
>   that the community has been very aggressive at being inclusive of new
>   committers (with likelihood to have more than 50 in the next week). The
>   community is still continuing to try to brainstorm ways to improve this.
>


Re: [VOTE] Release Apache Arrow 0.15.0 - RC2

2019-10-10 Thread Neal Richardson
I updated the R docs because I had everything I needed to do that
locally: https://github.com/apache/arrow-site/pull/30 Doing the others
wasn't feasible for me on my computer (I don't have CUDA, and the case
insensitivity of the macOS file system always bites me with the
pyarrow docs anyway).

IMO improving our CI/CD around documentation should be a priority for 1.0.

Neal

On Thu, Oct 10, 2019 at 12:03 PM Wes McKinney  wrote:
>
> The docs on http://arrow.apache.org/docs/ haven't been updated yet.
> This happened the last release, too -- I ended up updating the docs
> manually after a week or two. Is this included in the release
> management guide? If no one beats me to it, I can update the docs by
> hand again later today
>
> On Mon, Oct 7, 2019 at 6:20 PM Wes McKinney  wrote:
> >
> > I think we might be a little aggressive at removing artifacts from the
> > dist system
> >
> > Can we change our process to only remove old dist artifacts when we
> > are about to upload a new RC? Otherwise it's harder to make
> > improvements to the release verification scripts without any old RC to
> > test against
> >
> > On Mon, Oct 7, 2019 at 5:17 PM Neal Richardson
> >  wrote:
> > >
> > > The R package has been accepted by CRAN. Binaries for macOS and
> > > Windows should become available in the next few days.
> > >
> > > Neal
> > >
> > > On Mon, Oct 7, 2019 at 1:41 AM Krisztián Szűcs
> > >  wrote:
> > > >
> > > > Thanks Andy!
> > > >
> > > > I've just removed the RC source artefacts from SVN.
> > > >
> > > > We have two remaining post release tasks:
> > > > - homebrew
> > > > - apidocs
> > > >
> > > > On Mon, Oct 7, 2019 at 1:47 AM Andy Grove  wrote:
> > > >
> > > > > I released the Rust crates from the RC2 source tarball. I had to 
> > > > > comment
> > > > > out the benchmark references in the Cargo.toml first since the 
> > > > > tarball does
> > > > > not include the benchmark source code. I filed
> > > > > https://issues.apache.org/jira/browse/ARROW-6801 for this bug and will
> > > > > fix the packaging before the 1.0.0 release.
> > > > >
> > > > > On Sun, Oct 6, 2019 at 2:01 AM Krisztián Szűcs 
> > > > > 
> > > > > wrote:
> > > > >
> > > > >> The rust publishing script fails because it cannot find the 
> > > > >> benchmarks.
> > > > >> Seems to be related to cargo changes.
> > > > >> I cannot investigate it right now, @Andy could you take a look?
> > > > >>
> > > > >> On Sun, Oct 6, 2019, 9:11 AM Krisztián Szűcs 
> > > > >> 
> > > > >> wrote:
> > > > >>
> > > > >>> - published js packages to npm, please check that they are working
> > > > >>> properly
> > > > >>>
> > > > >>> On Sat, Oct 5, 2019 at 10:14 PM Neal Richardson <
> > > > >>> neal.p.richard...@gmail.com> wrote:
> > > > >>>
> > > >  R release steps per
> > > > 
> > > >  https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRpackages
> > > >  are underway.
> > > > 
> > > >  Neal
> > > > 
> > > >  On Sat, Oct 5, 2019 at 8:40 AM Krisztián Szűcs
> > > >   wrote:
> > > >  >
> > > >  > - website updated with the release notes
> > > >  > - conda-forge updates are merged
> > > >  >
> > > >  > Remaining:
> > > >  > - Javascript
> > > >  > - Rust
> > > >  > - R
> > > >  > - Homebrew
> > > >  > - Apidocs
> > > >  >
> > > >  > On Sat, Oct 5, 2019 at 2:58 PM Sutou Kouhei 
> > > >  wrote:
> > > >  >
> > > >  > > - uploaded C# packages
> > > >  > >
> > > >  > > In  > > >  t1k_vz68rcb3m...@mail.gmail.com>
> > > >  > >   "Re: [VOTE] Release Apache Arrow 0.15.0 - RC2" on Sat, 5 Oct 
> > > >  > > 2019
> > > >  > > 14:50:51 +0200,
> > > >  > >   Krisztián Szűcs  wrote:
> > > >  > >
> > > >  > > > - uploaded python wheels to pypi
> > > >  > > > - uploaded java artifacts to maven central
> > > >  > > >
> > > >  > > > I'm going to update the conda recipes.
> > > >  > > >
> > > >  > > > Remaining:
> > > >  > > > - Javascript
> > > >  > > > - Rust
> > > >  > > > - C#
> > > >  > > > - R
> > > >  > > > - Homebrew
> > > >  > > > - Site
> > > >  > > >
> > > >  > > >
> > > >  > > >
> > > >  > > > On Sat, Oct 5, 2019 at 2:29 PM Krisztián Szűcs <
> > > >  > > szucs.kriszt...@gmail.com>
> > > >  > > > wrote:
> > > >  > > >
> > > >  > > >> - rebased master
> > > >  > > >> - rebased the pull requests
> > > >  > > >> - released the jira version
> > > >  > > >> - started the new jira version
> > > >  > > >> - uploaded source artifacts to svn
> > > >  > > >> - uploaded binary artifacts to bintray
> > > >  > > >> - currently uploading python wheels to pypi
> > > >  > > >>
> > > >  > > >>
> > > >  > > >> On Sat, Oct 5, 2019 at 2:04 PM Sutou Kouhei 
> > > >  > > >> 
> > > >  wrote:
> > > >  > > >>
> > > >  > > >>> I'll release RubyGems.
> > > >  > > >>>
> > > >  > > >>> In <
> > 

Re: [DRAFT] Apache Arrow Board Report - October 2019

2019-10-10 Thread Jacques Nadeau
Proposed report update below. LMK your thoughts.

## Description:
The mission of Apache Arrow is the creation and maintenance of software
related to columnar in-memory processing and data interchange

## Issues:

* We are struggling with Continuous Integration scalability as the project
has
  definitely outgrown what Travis CI and Appveyor can do for us. Some
  contributors have shown reluctance to submit patches they aren't sure
about
  because they don't want to pile on the build queue. We are exploring
  alternative solutions such as Buildbot, Buildkite, and GitHub Actions to
  provide a path to migrate away from Travis CI / Appveyor. In our request
to
  Infrastructure INFRA-19217, some of us were alarmed to find that a CI/CD
  service like Buildkite may not be able to be connected to the @apache
GitHub
  account on account of requiring admin access to repository webhooks, but
no
  ability to modify source code. There are workarounds (building custom
OAuth
  bots) that could enable us to use Buildkite, but it would require extra
  development and result in a less refined experience for community members.



## Membership Data:
* Apache Arrow was founded 2016-01-19 (4 years ago)
* There are currently 48 committers and 28 PMC members in this project.
* The Committer-to-PMC ratio is roughly 3:2.

Community changes, past quarter:
- Micah Kornfield was added to the PMC on 2019-08-21
- Sebastien Binet was added to the PMC on 2019-08-21
- Ben Kietzman was added as committer on 2019-09-07
- David Li was added as committer on 2019-08-30
- Kenta Murata was added as committer on 2019-09-05
- Neal Richardson was added as committer on 2019-09-05
- Praveen Kumar was added as committer on 2019-07-14

## Project Activity:

* The project has just made a 0.15.0 release.
* We are discussing ways to make the Arrow libraries as accessible as
possible
  to downstream projects for minimal use cases while allowing the
development
  of more comprehensive "standard libraries" with larger dependency stacks
in
  the project
* We plan to make a 1.0.0 release as our next major release, at which time
we
  will declare that the Arrow binary protocol is stable with forward and
  backward compatibility guarantees

## Community Health:

* The community is continuing to grow at a great rate. We see good adoption
  among many other projects and fast growth of key metrics.
* Many contributors are struggling with the slowness of pre-commit CI. Arrow
  has a large number of different platforms and components and a complex
build
  matrix. As new commits come in, they frequently take a long time to
  complete. The community is trying several ways to solve this. There is
  bubbling frustration in the community around the GitHub repo rules for
using
  third party services. This is especially challenging when there are free
  solutions to relieve the community pressure but the community is unable to
  access these resources. This frustration is greatest among people who work
  on many non-asf OSS projects which don't have such restrictive rules
  around GitHub.  Some examples of ways the community has tried to resolve
  these have included:
  * Try to use CircleCI, rejected in INFRA-15964
  * Try to use Azure Pipelines, rejected in INFRA-17030
  * Try to resolve issues with Travis CI capacity: INFRA-18533 &
https://s.apache.org/ci-capacity (no resolution beyond "find donations")
  * The creation of new infrastructure design (in progress but a huge
amount of
thankless work)
* While the community has seen great growth in contribution (more than 300
  unique contributors at this point), the vast majority are casual
  contributors. The daily active committers (the workhorses of the project
  that bear the load committing the constant PRs, more than 5000 closed at
  this point) have been growing slower than adoption. This is despite the
fact
  that the community has been very aggressive at being inclusive of new
  committers (with likelihood to have more than 50 in the next week). The
  community is still continuing to try to brainstorm ways to improve this.


Re: [VOTE] Release Apache Arrow 0.15.0 - RC2

2019-10-10 Thread Wes McKinney
The docs on http://arrow.apache.org/docs/ haven't been updated yet.
This happened the last release, too -- I ended up updating the docs
manually after a week or two. Is this included in the release
management guide? If no one beats me to it, I can update the docs by
hand again later today

On Mon, Oct 7, 2019 at 6:20 PM Wes McKinney  wrote:
>
> I think we might be a little aggressive at removing artifacts from the
> dist system
>
> Can we change our process to only remove old dist artifacts when we
> are about to upload a new RC? Otherwise it's harder to make
> improvements to the release verification scripts without any old RC to
> test against
>
> On Mon, Oct 7, 2019 at 5:17 PM Neal Richardson
>  wrote:
> >
> > The R package has been accepted by CRAN. Binaries for macOS and
> > Windows should become available in the next few days.
> >
> > Neal
> >
> > On Mon, Oct 7, 2019 at 1:41 AM Krisztián Szűcs
> >  wrote:
> > >
> > > Thanks Andy!
> > >
> > > I've just removed the RC source artefacts from SVN.
> > >
> > > We have two remaining post release tasks:
> > > - homebrew
> > > - apidocs
> > >
> > > On Mon, Oct 7, 2019 at 1:47 AM Andy Grove  wrote:
> > >
> > > > I released the Rust crates from the RC2 source tarball. I had to comment
> > > > out the benchmark references in the Cargo.toml first since the tarball 
> > > > does
> > > > not include the benchmark source code. I filed
> > > > https://issues.apache.org/jira/browse/ARROW-6801 for this bug and will
> > > > fix the packaging before the 1.0.0 release.
> > > >
> > > > On Sun, Oct 6, 2019 at 2:01 AM Krisztián Szűcs 
> > > > 
> > > > wrote:
> > > >
> > > >> The rust publishing script fails because it cannot find the benchmarks.
> > > >> Seems to be related to cargo changes.
> > > >> I cannot investigate it right now, @Andy could you take a look?
> > > >>
> > > >> On Sun, Oct 6, 2019, 9:11 AM Krisztián Szűcs 
> > > >> 
> > > >> wrote:
> > > >>
> > > >>> - published js packages to npm, please check that they are working
> > > >>> properly
> > > >>>
> > > >>> On Sat, Oct 5, 2019 at 10:14 PM Neal Richardson <
> > > >>> neal.p.richard...@gmail.com> wrote:
> > > >>>
> > >  R release steps per
> > > 
> > >  https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRpackages
> > >  are underway.
> > > 
> > >  Neal
> > > 
> > >  On Sat, Oct 5, 2019 at 8:40 AM Krisztián Szűcs
> > >   wrote:
> > >  >
> > >  > - website updated with the release notes
> > >  > - conda-forge updates are merged
> > >  >
> > >  > Remaining:
> > >  > - Javascript
> > >  > - Rust
> > >  > - R
> > >  > - Homebrew
> > >  > - Apidocs
> > >  >
> > >  > On Sat, Oct 5, 2019 at 2:58 PM Sutou Kouhei 
> > >  wrote:
> > >  >
> > >  > > - uploaded C# packages
> > >  > >
> > >  > > In  > >  t1k_vz68rcb3m...@mail.gmail.com>
> > >  > >   "Re: [VOTE] Release Apache Arrow 0.15.0 - RC2" on Sat, 5 Oct 
> > >  > > 2019
> > >  > > 14:50:51 +0200,
> > >  > >   Krisztián Szűcs  wrote:
> > >  > >
> > >  > > > - uploaded python wheels to pypi
> > >  > > > - uploaded java artifacts to maven central
> > >  > > >
> > >  > > > I'm going to update the conda recipes.
> > >  > > >
> > >  > > > Remaining:
> > >  > > > - Javascript
> > >  > > > - Rust
> > >  > > > - C#
> > >  > > > - R
> > >  > > > - Homebrew
> > >  > > > - Site
> > >  > > >
> > >  > > >
> > >  > > >
> > >  > > > On Sat, Oct 5, 2019 at 2:29 PM Krisztián Szűcs <
> > >  > > szucs.kriszt...@gmail.com>
> > >  > > > wrote:
> > >  > > >
> > >  > > >> - rebased master
> > >  > > >> - rebased the pull requests
> > >  > > >> - released the jira version
> > >  > > >> - started the new jira version
> > >  > > >> - uploaded source artifacts to svn
> > >  > > >> - uploaded binary artifacts to bintray
> > >  > > >> - currently uploading python wheels to pypi
> > >  > > >>
> > >  > > >>
> > >  > > >> On Sat, Oct 5, 2019 at 2:04 PM Sutou Kouhei 
> > >  > > >> 
> > >  wrote:
> > >  > > >>
> > >  > > >>> I'll release RubyGems.
> > >  > > >>>
> > >  > > >>> In <
> > >  cahm19a5pxw5mq5zgb7pxoerg9rkxrhmadcrpmkw12jkjokw...@mail.gmail.com
> > >  > > >
> > >  > > >>>   "Re: [VOTE] Release Apache Arrow 0.15.0 - RC2" on Sat, 5 
> > >  > > >>> Oct
> > >  2019
> > >  > > >>> 11:46:16 +0200,
> > >  > > >>>   Krisztián Szűcs  wrote:
> > >  > > >>>
> > >  > > >>> > On Sat, Oct 5, 2019 at 11:40 AM Krisztián Szűcs <
> > >  > > >>> szucs.kriszt...@gmail.com>
> > >  > > >>> > wrote:
> > >  > > >>> >
> > >  > > >>> >> The VOTE carries with 5 binding +1 votes and 2 non-binding
> > >  +1 votes.
> > >  > > >>> >>
> > >  > > >>> >> On Fri, Oct 4, 2019 at 10:04 PM Wes McKinney <
> > >  wesmck...@gmail.com>
> > >  > 

Re: Field metadata not retrievable from parquet file

2019-10-10 Thread Wes McKinney
We haven't implemented storing field-level metadata in Parquet files
yet. It's somewhat tricky.  See
https://issues.apache.org/jira/browse/ARROW-4359

On Thu, Oct 10, 2019 at 11:51 AM Isaac Myers
 wrote:
>
> I can write both field- and schema-level metadata and read the values back 
> from schema or relevant field. I write the schema and table described by the 
> schema to a local parquet file. Upon reading the table or schema from the 
> parquet file, only schema metadata are present and field metadata are not 
> present. Am I doing something wrong? Please view the minimum working example 
> below:
>
> 
> #include <cstdio>
> #include <memory>
> #include <string>
> #include <unordered_map>
> #include <vector>
> #include <arrow/api.h>
> #include <arrow/io/api.h>
> #include <arrow/util/key_value_metadata.h>
> #include <parquet/arrow/reader.h>
> #include <parquet/arrow/writer.h>
>
> int main(int argc, char* argv[])
> {
> /*
> Create Parquet File
> **/
> arrow::Status st;
> arrow::MemoryPool* pool = arrow::default_memory_pool();
>
> // Create Schema and fields with metadata
> std::vector<std::shared_ptr<arrow::Field>> fields;
>
> std::unordered_map<std::string, std::string> a_keyval;
> a_keyval["unit"] = "sec";
> a_keyval["note"] = "not the standard millisecond unit";
> arrow::KeyValueMetadata a_md(a_keyval);
> std::shared_ptr<arrow::Field> a_field = arrow::field("a", arrow::int16(),
> false, a_md.Copy());
> fields.push_back(a_field);
>
> std::unordered_map<std::string, std::string> b_keyval;
> b_keyval["unit"] = "ft";
> arrow::KeyValueMetadata b_md(b_keyval);
> std::shared_ptr<arrow::Field> b_field = arrow::field("b", arrow::int16(),
> false, b_md.Copy());
> fields.push_back(b_field);
>
> std::shared_ptr<arrow::Schema> schema = arrow::schema(fields);
>
> // Add metadata to schema.
> std::unordered_map<std::string, std::string> schema_keyval;
> schema_keyval["classification"] = "Type 0";
> arrow::KeyValueMetadata schema_md(schema_keyval);
> schema = schema->AddMetadata(schema_md.Copy());
>
> // Build arrays of data and add to Table.
> const int64_t rowgroup_size = 100;
> std::vector<int16_t> a_data(rowgroup_size, 0);
> std::vector<int16_t> b_data(rowgroup_size, 0);
>
> for (int16_t i = 0; i < rowgroup_size; i++)
> {
> a_data[i] = i;
> b_data[i] = rowgroup_size - i;
> }
>
> arrow::Int16Builder a_bldr(pool);
> arrow::Int16Builder b_bldr(pool);
> st = a_bldr.Resize(rowgroup_size);
> if (!st.ok()) return 1;
> st = b_bldr.Resize(rowgroup_size);
> if (!st.ok()) return 1;
>
> st = a_bldr.AppendValues(a_data);
> if (!st.ok()) return 1;
>
> st = b_bldr.AppendValues(b_data);
> if (!st.ok()) return 1;
>
> std::shared_ptr<arrow::Array> a_arr_ptr;
> std::shared_ptr<arrow::Array> b_arr_ptr;
>
> arrow::ArrayVector arr_vec;
> st = a_bldr.Finish(&a_arr_ptr);
> if (!st.ok()) return 1;
> arr_vec.push_back(a_arr_ptr);
> st = b_bldr.Finish(&b_arr_ptr);
> if (!st.ok()) return 1;
> arr_vec.push_back(b_arr_ptr);
>
> std::shared_ptr<arrow::Table> table = arrow::Table::Make(schema, arr_vec);
>
> // Test metadata
> printf("\nMetadata from original schema:\n");
> printf("%s\n", schema->metadata()->ToString().c_str());
> printf("%s\n", schema->field(0)->metadata()->ToString().c_str());
> printf("%s\n", schema->field(1)->metadata()->ToString().c_str());
>
> std::shared_ptr<arrow::Schema> table_schema = table->schema();
> printf("\nMetadata from schema retrieved from table (should be the same):\n");
> printf("%s\n", table_schema->metadata()->ToString().c_str());
> printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str());
> printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str());
>
> // Open file and write table.
> std::string file_name = "test.parquet";
> std::shared_ptr<arrow::io::FileOutputStream> ostream;
> st = arrow::io::FileOutputStream::Open(file_name, &ostream);
> if (!st.ok()) return 1;
>
> std::unique_ptr<parquet::arrow::FileWriter> writer;
> std::shared_ptr<parquet::WriterProperties> props =
> parquet::default_writer_properties();
> st = parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, &writer);
> if (!st.ok()) return 1;
> st = writer->WriteTable(*table, rowgroup_size);
> if (!st.ok()) return 1;
>
> // Close file and stream.
> st = writer->Close();
> if (!st.ok()) return 1;
> st = ostream->Close();
> if (!st.ok()) return 1;
>
> /*
> Read Parquet File
> **/
>
> // Create new memory pool. Not sure if this is necessary.
> //arrow::MemoryPool* pool2 = arrow::default_memory_pool();
>
> // Open file reader.
> std::shared_ptr<arrow::io::ReadableFile> input_file;
> st = arrow::io::ReadableFile::Open(file_name, pool, &input_file);
> if (!st.ok()) return 1;
> std::unique_ptr<parquet::arrow::FileReader> reader;
> st = parquet::arrow::OpenFile(input_file, pool, &reader);
> if (!st.ok()) return 1;
>
> // Get schema and read metadata.
> std::shared_ptr<arrow::Schema> new_schema;
> st = reader->GetSchema(&new_schema);
> if (!st.ok()) return 1;
> printf("\nMetadata from schema read from file:\n");
> printf("%s\n", new_schema->metadata()->ToString().c_str());
>
> // Crashes because there are no metadata.
> /*printf("%s\n", new_schema->field(0)->metadata()->ToString().c_str());
> printf("%s\n", new_schema->field(1)->metadata()->ToString().c_str());*/
>
> printf("field name %s metadata exists: %d\n",
> new_schema->field(0)->name().c_str(),
> new_schema->field(0)->HasMetadata());
> printf("field name %s metadata exists: %d\n",
>

[jira] [Created] (ARROW-6847) [C++] Add a range_expression interface to Iterator<>

2019-10-10 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-6847:
---

 Summary: [C++] Add a range_expression interface to Iterator<>
 Key: ARROW-6847
 URL: https://issues.apache.org/jira/browse/ARROW-6847
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ben Kietzman
Assignee: Ben Kietzman


Iterator provides the Visit() method for visiting each element, but idiomatic
C++ uses a range-based for loop.
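
A minimal sketch of what such an adaptor could look like (illustrative only: the names {{IteratorRange}} and {{RangeIterator}} are hypothetical rather than a proposed API, and it assumes the current {{Status Next(T* out)}} contract in which an exhausted iterator yields a default/"null" value):

{code}
// Hypothetical adaptor (sketch): wraps an arrow::Iterator<T> so it can be
// consumed with a range-based for loop. Error handling is elided; a real
// implementation would need to surface non-OK Status to the caller.
template <typename T>
class IteratorRange {
 public:
  explicit IteratorRange(Iterator<T>* it) : it_(it) {}

  class RangeIterator {
   public:
    RangeIterator() : it_(nullptr) {}  // end sentinel
    explicit RangeIterator(Iterator<T>* it) : it_(it) { Advance(); }

    const T& operator*() const { return value_; }
    RangeIterator& operator++() { Advance(); return *this; }
    bool operator!=(const RangeIterator& other) const { return it_ != other.it_; }

   private:
    void Advance() {
      // Assumes a default-constructed T marks the end of iteration,
      // e.g. a null pointer for Iterator<std::shared_ptr<RecordBatch>>.
      if (!it_->Next(&value_).ok() || value_ == T()) {
        it_ = nullptr;  // now compares equal to the end sentinel
      }
    }
    Iterator<T>* it_;
    T value_;
  };

  RangeIterator begin() { return RangeIterator(it_); }
  RangeIterator end() { return RangeIterator(); }

 private:
  Iterator<T>* it_;
};

// Usage:
//   Iterator<std::shared_ptr<RecordBatch>>* it = ...;
//   for (const auto& batch : IteratorRange<std::shared_ptr<RecordBatch>>(it)) {
//     // process batch
//   }
{code}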



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Question about timestamps ...

2019-10-10 Thread David Boles
Joris,

Thank you for the response. There's such a trail of stale information
online with respect to the overall situation that it wasn't clear what the status was. For
example, simple searches take you into the "INT96 is deprecated, therefore
support for nanoseconds is as well" cul-de-sac. Absent that confusing
context, the existing error message is fine.

It's worth noting that accurate and precise timestamps down to ~0.1
nanosecond are widely available, with 0.02ns being available for just a few
thousand $US.

I'll stick with usec resolution for absolute time and just use an int64
field for my nanosecond data.

Thanks again.

 - db

On Thu, Oct 10, 2019 at 5:11 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Hi David,
>
> This is intentional, see
> https://arrow.apache.org/docs/python/parquet.html#storing-timestamps for
> some explanation in the documentation. Basically, the parquet format only
> supports ms and us resolution, and so nanosecond timestamps (which are
> supported by Arrow) are converted to one of those resolutions.
>
> We could maybe clarify that better in the error message (something like
> "only 'ms' and 'us' are supported") ?
>
> In the latest version of the parquet format specification, there is
> actually support for nanosecond resolution as well (
>
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-timestamp-convertedtype
> ).
> You can obtain this by specifying version="2.0", but the implementation is
> not yet fully ready (see https://issues.apache.org/jira/browse/PARQUET-458
> ),
> and also not all frameworks support this version (so if compatibility
> across processing frameworks is important, it is recommended to stick with
> version 1).
>
> Joris
>
> On Wed, 9 Oct 2019 at 21:27, David Boles  wrote:
>
> > The following code dies with pyarrow 0.14.2:
> >
> > import pyarrow as pa
> > import pyarrow.parquet as pq
> >
> > schema = pa.schema([('timestamp', pa.timestamp('ns', tz='UTC')),])
> > writer = pq.ParquetWriter('foo.parquet', schema, coerce_timestamps='ns')
> >
> > ts_array = pa.array([ int(1234567893141) ], type=pa.timestamp('ns',
> > tz='UTC'))
> > table = pa.Table.from_arrays([ ts_array ], names=['timestamp'])
> >
> > writer.write_table(table)
> > writer.close()
> >
> > with the message:
> >
> > ValueError: Invalid value for coerce_timestamps: ns
> >
> > That appears to be because of this code in _parquet.pxi:
> >
> > cdef int _set_coerce_timestamps(
> >         self, ArrowWriterProperties.Builder* props) except -1:
> >     if self.coerce_timestamps == 'ms':
> >         props.coerce_timestamps(TimeUnit_MILLI)
> >     elif self.coerce_timestamps == 'us':
> >         props.coerce_timestamps(TimeUnit_MICRO)
> >     elif self.coerce_timestamps is not None:
> >         raise ValueError('Invalid value for coerce_timestamps: {0}'
> >                          .format(self.coerce_timestamps))
> >
> > which restricts the choice to 'ms' or 'us', even though AFAICT everywhere
> > else also allows 'ns' (and there is a TimeUnit_NANO defined). Is this
> > intentional, or a bug?
> >
> > Thanks,
> >
> >  - db
> >
>


Re: [DRAFT] Apache Arrow Board Report - October 2019

2019-10-10 Thread Jacques Nadeau
Arg... accidental send before ready.

What do think about the statement below for community health? Does it
fairly capture the concerns/perspective?

On Thu, Oct 10, 2019 at 10:24 AM Jacques Nadeau  wrote:

> Many contributors are struggling with the slowness of pre-commit CI. Arrow
> has a large number of different platforms and components and a complex
> build matrix. As new commits come in, they frequently take a long time to
> complete. The community is trying several ways to solve this. Some of those
> have been:
>
>- Try to use CircleCI, rejected in INFRA-15964
>
>- Try to use Azure Pipelines, rejected in INFRA-17030
>    - Try to resolve issues with Travis CI capacity: INFRA-18533,
>https://s.apache.org/ci-capacity (no resolution beyond "find
>donations")
>- The creation of new infrastructure design (in progress but a huge
>amount of thankless work)
>
>
> There is bubbling frustration in the community around the GitHub repo
> rules for using third party services. This is especially challenging when
> there are free solutions to relieve the community pressure but the
> community is unable to access these resources. This frustration is greatest
> among people who work on many OSS projects which don't
> such restrictive rules around GitHub.
>
> On Thu, Oct 10, 2019 at 5:36 AM Wes McKinney  wrote:
>
>> Here is a rejection of CircleCI more than 18 months ago
>>
>> https://issues.apache.org/jira/browse/INFRA-15964
>>
>> On Thu, Oct 10, 2019 at 4:33 AM Antoine Pitrou 
>> wrote:
>> >
>> >
>> > For the record, here is the ticket for Azure Pipelines integration:
>> > https://issues.apache.org/jira/browse/INFRA-17030
>> >
>> > I opened an issue back in May about the Travis-CI capacity situation:
>> > https://issues.apache.org/jira/browse/INFRA-18533
>> >
>> > Apparently CI capacity has been a "hot topic as of late":
>> >
>> https://lists.apache.org/thread.html/af52e2a3e865c01596d46374e8b294f2740587dbd59d85e132429b6c@%3Cbuilds.apache.org%3E
>> >
>> > (I didn't know this list -- bui...@apache.org -- existed, by the way)
>> >
>> > Regards
>> >
>> > Antoine.
>> >
>> >
>> > On 10/10/2019 at 07:34, Wes McKinney wrote:
>> > > On Thu, Oct 10, 2019 at 12:22 AM Jacques Nadeau 
>> wrote:
>> > >>
>> > >> I'm not dismissing that there are issues but I also don't feel like
>> there
>> > >> has been constant discussion for months on the list that INFRA is
>> not being
>> > >> responsive to Arrow community requests. It seems like you might be
>> > >> saying one of two things (or both?)?
>> > >>
>> > >> 1) The Arrow infrastructure requirements are vastly different than
>> other
>> > >> projects. Because of Arrow's specialized requirements, we need
>> things that
>> > >> no other project needs.
>> > >> 2) There are many projects that want CircleCI, Buildkite and Azure
>> > >> pipelines but Infrastructure is not responsive. This is putting a big
>> > >> damper on the success of the Arrow project.
>> > >
>> > > Yes, I'm saying both of these things.
>> > >
>> > > 1. Yes, Arrow is special -- validating the project requires running a
>> > > dozen or more different builds (with dozens more nightly builds) that
>> > > test different parts of the project. Different language components, a
>> > > large and diverse packaging matrix, and interproject integration tests
>> > > and integration with external projects (e.g. Apache Spark and others)
>> > >
>> > > 2. Yes, the limited GitHub App availability is hurting us.
>> > >
>> > > I'm OK to place this concern in the "Community Health" section and
>> > > spend more time building a comprehensive case about how Infra's
>> > > conservatism around Apps is causing us to work with one hand tied
>> > > behind our back. I know that I'm not the only one who is unhappy, but
>> > > I'll let the others speak for themselves.
>> > >
>> > >> For each of these, if we're asking the board to do something, we
>> should say
>> > >> more, and more clearly. Sure, CI is a pain in the Arrow project's
>> a**. I
>> > >> also agree that community health is impacted by the challenge to
>> merge
>> > >> things. I also share the perspective that the foundation has been
>> slow to
>> > >> adopt new technologies and has been way too religious about svn.
>> However, if
>> > >> we're asking the board to do something, what is it?
>> > >
>> > > Allow GitHub Apps that do not require write access to the code itself,
>> > > set up appropriate checks and balances to ensure that the Foundation's
>> > > IP provenance webhooks are preserved.
>> > >
>> > >> Looking at the two things you might be saying...
>> > >> If 1, are we confident in that? Many other projects have pretty
>> complex
>> > >> build matrices I think. (I haven't thought about this and evaluated
>> the
>> > >> other projects...maybe it is true.) If 1, we should clarify why we
>> think
>> > 

Re: [DRAFT] Apache Arrow Board Report - October 2019

2019-10-10 Thread Jacques Nadeau
Many contributors are struggling with the slowness of pre-commit CI. Arrow
has a large number of different platforms and components and a complex
build matrix. As new commits come in, they frequently take a long time to
complete. The community is trying several ways to solve this. Some of those
have been:

   - Try to use CircleCI, rejected in INFRA-15964
   
   - Try to use Azure Pipelines, rejected in INFRA-17030
   - Try to resolve issues with Travis CI capacity: INFRA-18533,
   https://s.apache.org/ci-capacity (no resolution beyond "find donations")
   - The creation of new infrastructure design (in progress but a huge
   amount of thankless work)


There is bubbling frustration in the community around the GitHub repo rules
for using third party services. This is especially challenging when there
are free solutions to relieve the community pressure but the community is
unable to access these resources. This frustration is greatest among people
who work on many OSS projects which don't have such restrictive
rules around GitHub.

On Thu, Oct 10, 2019 at 5:36 AM Wes McKinney  wrote:

> Here is a rejection of CircleCI more than 18 months ago
>
> https://issues.apache.org/jira/browse/INFRA-15964
>
> On Thu, Oct 10, 2019 at 4:33 AM Antoine Pitrou  wrote:
> >
> >
> > For the record, here is the ticket for Azure Pipelines integration:
> > https://issues.apache.org/jira/browse/INFRA-17030
> >
> > I opened an issue back in May about the Travis-CI capacity situation:
> > https://issues.apache.org/jira/browse/INFRA-18533
> >
> > Apparently CI capacity has been a "hot topic as of late":
> >
> https://lists.apache.org/thread.html/af52e2a3e865c01596d46374e8b294f2740587dbd59d85e132429b6c@%3Cbuilds.apache.org%3E
> >
> > (I didn't know this list -- bui...@apache.org -- existed, by the way)
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 10/10/2019 at 07:34, Wes McKinney wrote:
> > > On Thu, Oct 10, 2019 at 12:22 AM Jacques Nadeau 
> wrote:
> > >>
> > >> I'm not dismissing that there are issues but I also don't feel like
> there
> > >> has been constant discussion for months on the list that INFRA is not
> being
> > >> responsive to Arrow community requests. It seems like you might be
> > >> saying one of two things (or both?)?
> > >>
> > >> 1) The Arrow infrastructure requirements are vastly different than
> other
> > >> projects. Because of Arrow's specialized requirements, we need things
> that
> > >> no other project needs.
> > >> 2) There are many projects that want CircleCI, Buildkite and Azure
> > >> pipelines but Infrastructure is not responsive. This is putting a big
> > >> damper on the success of the Arrow project.
> > >
> > > Yes, I'm saying both of these things.
> > >
> > > 1. Yes, Arrow is special -- validating the project requires running a
> > > dozen or more different builds (with dozens more nightly builds) that
> > > test different parts of the project. Different language components, a
> > > large and diverse packaging matrix, and interproject integration tests
> > > and integration with external projects (e.g. Apache Spark and others)
> > >
> > > 2. Yes, the limited GitHub App availability is hurting us.
> > >
> > > I'm OK to place this concern in the "Community Health" section and
> > > spend more time building a comprehensive case about how Infra's
> > > conservatism around Apps is causing us to work with one hand tied
> > > behind our back. I know that I'm not the only one who is unhappy, but
> > > I'll let the others speak for themselves.
> > >
> > >> For each of these, if we're asking the board to do something, we
> should say
> > >> more, and more clearly. Sure, CI is a pain in the Arrow project's a**.
> I
> > >> also agree that community health is impacted by the challenge to merge
> > >> things. I also share the perspective that the foundation has been
> slow to
> > >> adopt new technologies and has been way too religious about svn.
> However, if
> > >> we're asking the board to do something, what is it?
> > >
> > > Allow GitHub Apps that do not require write access to the code itself,
> > > set up appropriate checks and balances to ensure that the Foundation's
> > > IP provenance webhooks are preserved.
> > >
> > >> Looking at the two things you might be saying...
> > >> If 1, are we confident in that? Many other projects have pretty
> complex
> > >> build matrices I think. (I haven't thought about this and evaluated
> the
> > >> other projects...maybe it is true.) If 1, we should clarify why we
> think
> > >> we're different. If that is the case, what are asking for from the
> board.
> > >>
> > >> If 2, and you are proposing throwing stones at INFRA, we should back
> it up
> > >> with INFRA tickets and numbers (e.g. how many projects have wanted
> these
> > >> things and for how long). We should reference multiple threads on the
> INFRA
> > >> 

[jira] [Created] (ARROW-6846) [C++] Build failures with glog enabled

2019-10-10 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6846:
-

 Summary: [C++] Build failures with glog enabled
 Key: ARROW-6846
 URL: https://issues.apache.org/jira/browse/ARROW-6846
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


This has started appearing on Travis, e.g.:
https://travis-ci.org/apache/arrow/jobs/596181386#L3663
{code}
In file included from /home/travis/build/apache/arrow/cpp/src/arrow/util/logging.cc:29:0:
/home/travis/build/apache/arrow/pyarrow-test-3.6/include/glog/logging.h:994:0: error: "DCHECK" redefined [-Werror]
 #define DCHECK(condition) CHECK(condition)

In file included from /home/travis/build/apache/arrow/cpp/src/arrow/util/logging.cc:18:0:
/home/travis/build/apache/arrow/cpp/src/arrow/util/logging.h:130:0: note: this is the location of the previous definition
 #define DCHECK ARROW_CHECK
{code}
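
For reference, the usual shape of a fix for this kind of macro clash (a sketch only, not necessarily the change that will land in Arrow) is to guard or undefine the conflicting definition before redefining it:

{code}
// Sketch: in arrow/util/logging.h, avoid the -Werror redefinition failure
// when glog's logging.h has already defined DCHECK.
#ifdef DCHECK
#undef DCHECK
#endif
#define DCHECK ARROW_CHECK
{code}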



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Field metadata not retrievable from parquet file

2019-10-10 Thread Isaac Myers
I can write both field- and schema-level metadata and read the values back from
the schema or the relevant field. I write the schema, and the table described by
the schema, to a local parquet file. Upon reading the table or schema back from
the parquet file, only the schema metadata are present; the field metadata are
not. Am I doing something wrong? Please view the minimal working example below:


#include <cstdio>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

int main(int argc, char* argv[])
{
  /* Create Parquet File */
  arrow::Status st;
  arrow::MemoryPool* pool = arrow::default_memory_pool();

  // Create Schema and fields with metadata
  std::vector<std::shared_ptr<arrow::Field>> fields;

  std::unordered_map<std::string, std::string> a_keyval;
  a_keyval["unit"] = "sec";
  a_keyval["note"] = "not the standard millisecond unit";
  arrow::KeyValueMetadata a_md(a_keyval);
  std::shared_ptr<arrow::Field> a_field =
      arrow::field("a", arrow::int16(), false, a_md.Copy());
  fields.push_back(a_field);

  std::unordered_map<std::string, std::string> b_keyval;
  b_keyval["unit"] = "ft";
  arrow::KeyValueMetadata b_md(b_keyval);
  std::shared_ptr<arrow::Field> b_field =
      arrow::field("b", arrow::int16(), false, b_md.Copy());
  fields.push_back(b_field);

  std::shared_ptr<arrow::Schema> schema = arrow::schema(fields);

  // Add metadata to schema.
  std::unordered_map<std::string, std::string> schema_keyval;
  schema_keyval["classification"] = "Type 0";
  arrow::KeyValueMetadata schema_md(schema_keyval);
  schema = schema->AddMetadata(schema_md.Copy());

  // Build arrays of data and add to Table.
  const int64_t rowgroup_size = 100;
  std::vector<int16_t> a_data(rowgroup_size, 0);
  std::vector<int16_t> b_data(rowgroup_size, 0);

  for (int16_t i = 0; i < rowgroup_size; i++)
  {
    a_data[i] = i;
    b_data[i] = rowgroup_size - i;
  }

  arrow::Int16Builder a_bldr(pool);
  arrow::Int16Builder b_bldr(pool);
  st = a_bldr.Resize(rowgroup_size);
  if (!st.ok()) return 1;
  st = b_bldr.Resize(rowgroup_size);
  if (!st.ok()) return 1;

  st = a_bldr.AppendValues(a_data);
  if (!st.ok()) return 1;

  st = b_bldr.AppendValues(b_data);
  if (!st.ok()) return 1;

  std::shared_ptr<arrow::Array> a_arr_ptr;
  std::shared_ptr<arrow::Array> b_arr_ptr;

  arrow::ArrayVector arr_vec;
  st = a_bldr.Finish(&a_arr_ptr);
  if (!st.ok()) return 1;
  arr_vec.push_back(a_arr_ptr);
  st = b_bldr.Finish(&b_arr_ptr);
  if (!st.ok()) return 1;
  arr_vec.push_back(b_arr_ptr);

  std::shared_ptr<arrow::Table> table = arrow::Table::Make(schema, arr_vec);

  // Test metadata
  printf("\nMetadata from original schema:\n");
  printf("%s\n", schema->metadata()->ToString().c_str());
  printf("%s\n", schema->field(0)->metadata()->ToString().c_str());
  printf("%s\n", schema->field(1)->metadata()->ToString().c_str());

  std::shared_ptr<arrow::Schema> table_schema = table->schema();
  printf("\nMetadata from schema retrieved from table (should be the same):\n");
  printf("%s\n", table_schema->metadata()->ToString().c_str());
  printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str());
  printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str());

  // Open file and write table.
  std::string file_name = "test.parquet";
  std::shared_ptr<arrow::io::FileOutputStream> ostream;
  st = arrow::io::FileOutputStream::Open(file_name, &ostream);
  if (!st.ok()) return 1;

  std::unique_ptr<parquet::arrow::FileWriter> writer;
  std::shared_ptr<parquet::WriterProperties> props =
      parquet::default_writer_properties();
  st = parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, &writer);
  if (!st.ok()) return 1;
  st = writer->WriteTable(*table, rowgroup_size);
  if (!st.ok()) return 1;

  // Close file and stream.
  st = writer->Close();
  if (!st.ok()) return 1;
  st = ostream->Close();
  if (!st.ok()) return 1;

  /* Read Parquet File */

  // Create new memory pool. Not sure if this is necessary.
  //arrow::MemoryPool* pool2 = arrow::default_memory_pool();

  // Open file reader.
  std::shared_ptr<arrow::io::ReadableFile> input_file;
  st = arrow::io::ReadableFile::Open(file_name, pool, &input_file);
  if (!st.ok()) return 1;
  std::unique_ptr<parquet::arrow::FileReader> reader;
  st = parquet::arrow::OpenFile(input_file, pool, &reader);
  if (!st.ok()) return 1;

  // Get schema and read metadata.
  std::shared_ptr<arrow::Schema> new_schema;
  st = reader->GetSchema(&new_schema);
  if (!st.ok()) return 1;
  printf("\nMetadata from schema read from file:\n");
  printf("%s\n", new_schema->metadata()->ToString().c_str());

  // Crashes because there are no metadata.
  /*printf("%s\n", new_schema->field(0)->metadata()->ToString().c_str());
  printf("%s\n", new_schema->field(1)->metadata()->ToString().c_str());*/

  printf("field name %s metadata exists: %d\n",
         new_schema->field(0)->name().c_str(),
         new_schema->field(0)->HasMetadata());
  printf("field name %s metadata exists: %d\n",
         new_schema->field(1)->name().c_str(),
         new_schema->field(1)->HasMetadata());

  // What if I read the whole table and get the schema from it.
  std::shared_ptr<arrow::Table> new_table;
  st = reader->ReadTable(&new_table);
  if (!st.ok()) return 1;
  std::shared_ptr<arrow::Schema> schema_from_table = new_table->schema();
  printf("\nMetadata from schema that is retrieved through table that is read from file:\n");
  printf("%s\n", schema_from_table->metadata()->ToString().c_str());

  // Crashes because there are no metadata.
  /*printf("%s\n", schema_from_table->field(0)->metadata()->ToString().c_str());
  printf("%s\n", schema_from_table->field(1)->metadata()->ToString().c_str());*/

  return 0;
}

Re: Simple Join Implementation Questions

2019-10-10 Thread Antoine Pitrou


Hi David,

You should look into the visitor facilities provided by Arrow C++, in
arrow/visitor_inline.h.

I would especially look at two of them:

- VisitArrayInline() will call the visitor's overloaded Visit() method
with the right array concrete type (for example Int16Array, ListArray...)

- Once you know the concrete type (for example Int16Type, which is
Int16Array::TypeClass), you can use ArrayDataVisitor<Int16Type> to
iterate over each array element
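
For example, a rough sketch of the first pattern (the visitor and helper below
are illustrative names, not an existing API; this assumes the
arrow/visitor_inline.h facilities as of 0.15, and null handling is elided):

#include <arrow/api.h>
#include <arrow/visitor_inline.h>

// VisitArrayInline() dispatches to the Visit() overload that matches the
// array's concrete type, so no hand-written switch on type_id is needed.
struct EqualsAtVisitor {
  int64_t i;
  int64_t j;
  bool result;

  arrow::Status Visit(const arrow::Int16Array& arr) {
    result = arr.Value(i) == arr.Value(j);
    return arrow::Status::OK();
  }
  // Variable-length types work the same way, e.g. strings:
  arrow::Status Visit(const arrow::StringArray& arr) {
    result = arr.GetView(i) == arr.GetView(j);
    return arrow::Status::OK();
  }
  // Fallback for array types this sketch does not handle.
  arrow::Status Visit(const arrow::Array& arr) {
    return arrow::Status::NotImplemented("comparison for " + arr.type()->ToString());
  }
};

// Compare the values at row i and row j of a column:
inline bool EqualsAt(const arrow::Array& column, int64_t i, int64_t j) {
  EqualsAtVisitor visitor{i, j, false};
  return arrow::VisitArrayInline(column, &visitor).ok() && visitor.result;
}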

Regards

Antoine.


On 10/10/2019 at 17:31, david sherrier wrote:
> Hey all,
> 
> I'm working on a simple serial join implementation and need to be able to
> compare data across two columns of the same type.  Right now the only way I
> have found to do this is to use ArrayData::GetValues<T>(1) and then
> iterate over the returned buffer comparing the values.  The problem I am
> having with this approach is that I need the type in the template, meaning
> that when I need to add a row to the result table I need to know the type
> of each column, which would seem to require a needlessly large switch
> statement comparing on the type id and then returning the type. This also
> seems to only work for fixed-length types; it would appear to be even more
> complicated to read string data, but I have not tried that yet. Is there an
> easier way to do this that I am missing?  The second issue I am having is
> that comparisons between types that do not inherit from ctypes seem to not
> be implemented yet, in particular for this use case the String type.  I would
> have expected that since tables have a defined schema with the type known
> there would be some sort of iterator to read over column data?
> 
> Thanks,
> David Sherrier
> 


Simple Join Implementation Questions

2019-10-10 Thread david sherrier
Hey all,

I'm working on a simple serial join implementation and need to be able to
compare data across two columns of the same type.  Right now the only way I
have found to do this is to use ArrayData::GetValues<T>(1) and then
iterate over the returned buffer comparing the values.  The problem I am
having with this approach is that I need the type in the template, meaning
that when I need to add a row to the result table I need to know the type
of each column, which would seem to require a needlessly large switch
statement comparing on the type id and then returning the type. This also
seems to only work for fixed-length types; it would appear to be even more
complicated to read string data, but I have not tried that yet. Is there an
easier way to do this that I am missing?  The second issue I am having is
that comparisons between types that do not inherit from ctypes seem to not
be implemented yet, in particular for this use case the String type.  I would
have expected that since tables have a defined schema with the type known
there would be some sort of iterator to read over column data?

Thanks,
David Sherrier


Re: Looking ahead to 1.0

2019-10-10 Thread John Muehlhausen
The format change is ARROW-6836 ... add a custom_metadata:[KeyValue] field
to the Footer table in File.fbs

The other change (slicing a recordbatch to honor RecordBatch.length rather
than array length if the former is smaller) will hopefully not affect the
format.


On Wed, Oct 9, 2019 at 11:55 PM Wes McKinney  wrote:

> Hi John,
>
> Since the 1.0.0 release is focused on Format stability, probably the
> only real "blockers" will be ensuring that we have hardened multiple
> implementations (in particular C++ and Java) of the columnar format as
> specified with integration tests to prove it. The issues you listed
> sound more like C++ library changes to me?
>
> If you want to propose Format-related changes, that would need to
> happen right away otherwise the ship will sail on that.
>
> - Wes
>
> On Wed, Oct 9, 2019 at 9:08 PM John Muehlhausen  wrote:
> >
> > ARROW-5916
> > ARROW-6836/6837
> >
> > These are of particular interest to me because they enable recordbatch
> > "incrementalism" which is useful for streaming applications:
> >
> > ARROW-5916 allows a recordbatch to pre-allocate space for future records
> > that have not yet been populated, making it safe for readers to consume
> the
> > partial batch.
> >
> > ARROW-6836/6837 allows a file of record batches to be extended at the
> end,
> > without re-writing the beginning, while including the idea that the
> > custom_metadata may change with each update.  (custom_metadata in the
> > Schema is not a good candidate because Schema also appears at the
> beginning
> > of the file.)
> >
> > While these are not blockers for me quite yet, they soon will be!  If I
> > wanted to ensure that these are in 1.0, what is my deadline for
> > implementation and test cases?  Can such a note be made on the wiki?
> > Should I change the priority in Jira?
> >
> > Thanks,
> > John
> >
> > On Wed, Oct 9, 2019 at 2:57 PM Neal Richardson <
> neal.p.richard...@gmail.com>
> > wrote:
> >
> > > Congratulations everyone on 0.15! I know a lot of hard work went into
> > > it, not only in the software itself but also in the build and release
> > > process.
> > >
> > > Once you've caught your breath from the release, we should start
> > > thinking about what's in scope for our next release, the big 1.0. To
> > > get us started (or restarted, since we did discuss 1.0 before the
> > > flatbuffer alignment issue came up), I've created
> > > https://cwiki.apache.org/confluence/display/ARROW/Arrow+1.0.0+Release
> > > based on our past release wiki pages.
> > >
> > > A good place to begin would be to list, either in "blocker" Jiras or
> > > bullet points on the document, the key features and tasks we must
> > > resolve before 1.0. For example, I get the sense that we need to
> > > overhaul the documentation, but that should be expressed in a more
> > > concrete, actionable way.
> > >
> > > Neal
> > >
>


Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-10-10-0

2019-10-10 Thread Krisztián Szűcs
Disabled it in https://github.com/apache/arrow/pull/5617

On Thu, Oct 10, 2019 at 3:12 PM Wes McKinney  wrote:

> Seems like CircleCI might be paywalling some features now
>
> "
> #!/bin/sh -eo pipefail
> # Blocked due to free-plan-docker-layer-caching-unavailable
> #
> # ---
> # Warning: This configuration was auto-generated to show you the message
> above.
> # Don't rerun this job. Rerunning will have no effect.
> false
> "
>
> I can look into how much it would cost to pay for CircleCI on
> github.com/ursa-labs
>
> On Thu, Oct 10, 2019 at 7:01 AM Crossbow  wrote:
> >
> >
> > Arrow Build Report for Job nightly-2019-10-10-0
> >
> > All tasks:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0
> >
> > Failed Tasks:
> > - docker-go:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-go
> > - docker-cpp-cmake32:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-cpp-cmake32
> > - docker-turbodbc-integration:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-turbodbc-integration
> > - docker-python-2.7:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-python-2.7
> > - wheel-osx-cp37m:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-wheel-osx-cp37m
> > - wheel-win-cp37m:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-appveyor-wheel-win-cp37m
> > - docker-hdfs-integration:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-hdfs-integration
> > - docker-r-conda:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-r-conda
> > - docker-lint:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-lint
> > - docker-cpp-release:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-cpp-release
> > - docker-js:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-js
> > - docker-docs:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-docs
> > - docker-clang-format:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-clang-format
> > - docker-dask-integration:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-dask-integration
> > - docker-python-3.7:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-python-3.7
> > - wheel-osx-cp36m:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-wheel-osx-cp36m
> > - docker-cpp:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-cpp
> > - wheel-win-cp36m:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-appveyor-wheel-win-cp36m
> > - gandiva-jar-trusty:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-gandiva-jar-trusty
> > - docker-python-2.7-nopandas:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-python-2.7-nopandas
> > - docker-python-3.6:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-python-3.6
> > - docker-c_glib:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-c_glib
> > - docker-pandas-master:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-pandas-master
> > - gandiva-jar-osx:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-gandiva-jar-osx
> > - docker-iwyu:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-iwyu
> > - docker-java:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-java
> > - wheel-osx-cp27m:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-wheel-osx-cp27m
> > - docker-spark-integration:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-spark-integration
> > - docker-r:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-r
> > - docker-cpp-static-only:
> >   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-cpp-static-only
> > - docker-r-sanitizer:
> >   URL:
> 

Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-10-10-0

2019-10-10 Thread Wes McKinney
Seems like CircleCI might be paywalling some features now

"
#!/bin/sh -eo pipefail
# Blocked due to free-plan-docker-layer-caching-unavailable
#
# ---
# Warning: This configuration was auto-generated to show you the message above.
# Don't rerun this job. Rerunning will have no effect.
false
"

I can look into how much it would cost to pay for CircleCI on
github.com/ursa-labs

On Thu, Oct 10, 2019 at 7:01 AM Crossbow  wrote:
>
>
> Arrow Build Report for Job nightly-2019-10-10-0
>
> All tasks: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0
>
> Failed Tasks:
> - docker-go:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-go
> - docker-cpp-cmake32:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-cpp-cmake32
> - docker-turbodbc-integration:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-turbodbc-integration
> - docker-python-2.7:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-python-2.7
> - wheel-osx-cp37m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-wheel-osx-cp37m
> - wheel-win-cp37m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-appveyor-wheel-win-cp37m
> - docker-hdfs-integration:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-hdfs-integration
> - docker-r-conda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-r-conda
> - docker-lint:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-lint
> - docker-cpp-release:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-cpp-release
> - docker-js:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-js
> - docker-docs:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-docs
> - docker-clang-format:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-clang-format
> - docker-dask-integration:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-dask-integration
> - docker-python-3.7:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-python-3.7
> - wheel-osx-cp36m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-wheel-osx-cp36m
> - docker-cpp:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-cpp
> - wheel-win-cp36m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-appveyor-wheel-win-cp36m
> - gandiva-jar-trusty:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-gandiva-jar-trusty
> - docker-python-2.7-nopandas:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-python-2.7-nopandas
> - docker-python-3.6:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-python-3.6
> - docker-c_glib:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-c_glib
> - docker-pandas-master:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-pandas-master
> - gandiva-jar-osx:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-gandiva-jar-osx
> - docker-iwyu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-iwyu
> - docker-java:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-java
> - wheel-osx-cp27m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-wheel-osx-cp27m
> - docker-spark-integration:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-spark-integration
> - docker-r:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-r
> - docker-cpp-static-only:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-cpp-static-only
> - docker-r-sanitizer:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-r-sanitizer
> - docker-rust:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-rust
> - wheel-osx-cp35m:
>   URL: 
> 

Re: [C++] The quest for zero-dependency builds

2019-10-10 Thread Antoine Pitrou


Yes, we could express dependencies in a Python script and have it
generate a CMake module of if/else chains in cmake_modules (which we
would check in git to avoid having people depend on a Python install,
perhaps).

Still, that is an additional maintenance burden.

Regards

Antoine.


On 10/10/2019 at 14:50, Wes McKinney wrote:
> I guess one question we should first discuss is: who is the C++ build
> system for?
> 
> The users who are most sensitive to benchmark-driven decision making
> will generally be consuming the project through pre-built binaries,
> like our Python or R packages. If C++ developers build the project
> from source and don't do a minimal read of the documentation to see
> what a "recommended configuration" looks like, I would say that is
> more their fault than ours. In the case of the ARROW_JEMALLOC option,
> I think it's important for C++ system integrators to be aware of the
> impact of the choice of memory allocator.
> 
> The concern I have with the current "out of the box" experience is
> that people are getting the impression that "I have to build $X, $Y,
> and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
> They can, of course, read the documentation and learn that those
> things can be toggled off, but I think the user that reaches for a
> self-built source install is much different in general than someone
> who uses the project through the Linux binary packages, for example.
> 
> On the subject of managing intraproject dependencies and
> relationships, I think we should develop a better way to express
> relationships between components than we have now.
> 
> As an example, building the Python library assumes that various
> components are enabled
> 
> - ARROW_COMPUTE=ON
> - ARROW_FILESYSTEM=ON
> - ARROW_IPC=ON
> 
> Somewhere in the code we might have some code like
> 
> if (ARROW_PYTHON)
>   set(ARROW_COMPUTE ON)
>   ...
> endif()
> 
> This doesn't strike me as that scalable. I would rather see a
> dependency file like
> 
> component_dependencies = {
> ...
> 'python': ['compute', 'filesystem', 'ipc'],
> ...
> }
> 
> A helper Python script as part of the build could be used to give
> CMake (because CMake is a bit poor as a programming language) the list
> of required components based on what the user has indicated to CMake.
> 
> On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques
>  wrote:
>>
>> There's always the route of vendoring some library and not exposing
>> external CMake options. This would achieve the goal of
>> compile-out-of-the-box and enable important features in the basic
>> build. We also simplify dependency requirements (which benefits CI and
>> developers). The downside is following security patches and grumpy
>> reactions from package maintainers. I think we should explore this
>> route for dependencies that match the following criteria:
>>
>> - libarrow*.so doesn't export any of the symbols of the dependency, and
>> the dependency is not referenced in any public headers
>> - dependency is lightweight, e.g. excludes boost, openssl, grpc, llvm,
>> thrift, protobuf
>> - dependency is not ubiquitous on major platforms and has a stable
>> API, e.g. excludes libz and openssl
>>
>> A small list of candidates:
>> - RapidJSON (enables JSON)
>> - DoubleConversion (enables CSV)
>>
>> There's a precedent: Arrow already vendors small C++ libraries
>> (datetime, utf8cpp, variant, xxhash).
>>
>> François
>>
>>
>> On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou  wrote:
>>>
>>>
>>> Hi all,
>>>
>>> I'm a bit concerned that we're planning to add many additional build
>>> options in the quest to have a core zero-dependency build in C++.
>>> See for example https://issues.apache.org/jira/browse/ARROW-6633 or
>>> https://issues.apache.org/jira/browse/ARROW-6612.
>>>
>>> The problem is that this is creating many possible configurations and we
>>> will only be testing a tiny subset of them.  Inevitably, users will try
>>> other option combinations and they'll fail building for some random
>>> reason.  It will not be a very good user experience.
>>>
>>> Another related issue is user perception when doing a default build.
>>> For example https://issues.apache.org/jira/browse/ARROW-6638 proposes to
>>> build with jemalloc disabled by default.  Inevitably, people will be
>>> doing benchmarks with this (publicly or not) and they'll conclude Arrow
>>> is not as performant as it claims to be.
>>>
>>> Perhaps we should look for another approach instead?
>>>
>>> For example we could have a single ARROW_BARE_CORE (whatever the name)
>>> option that when enabled (not by default) builds the tiniest minimal
>>> subset of Arrow.  It's more inflexible, but at least it's something that
>>> we can reasonably test.
>>>
>>> Regards
>>>
>>> Antoine.


Re: [C++] The quest for zero-dependency builds

2019-10-10 Thread Wes McKinney
I guess one question we should first discuss is: who is the C++ build
system for?

The users who are most sensitive to benchmark-driven decision making
will generally be consuming the project through pre-built binaries,
like our Python or R packages. If C++ developers build the project
from source and don't do a minimal read of the documentation to see
what a "recommended configuration" looks like, I would say that is
more their fault than ours. In the case of the ARROW_JEMALLOC option,
I think it's important for C++ system integrators to be aware of the
impact of the choice of memory allocator.

The concern I have with the current "out of the box" experience is
that people are getting the impression that "I have to build $X, $Y,
and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
They can, of course, read the documentation and learn that those
things can be toggled off, but I think the user that reaches for a
self-built source install is much different in general than someone
who uses the project through the Linux binary packages, for example.

On the subject of managing intraproject dependencies and
relationships, I think we should develop a better way to express
relationships between components than we have now.

As an example, building the Python library assumes that various
components are enabled

- ARROW_COMPUTE=ON
- ARROW_FILESYSTEM=ON
- ARROW_IPC=ON

Somewhere in the code we might have some code like

if (ARROW_PYTHON)
  set(ARROW_COMPUTE ON)
  ...
endif()

This doesn't strike me as that scalable. I would rather see a
dependency file like

component_dependencies = {
...
'python': ['compute', 'filesystem', 'ipc'],
...
}

A helper Python script as part of the build could be used to give
CMake (because CMake is a bit poor as a programming language) the list
of required components based on what the user has indicated to CMake.

On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques
 wrote:
>
> There's always the route of vendoring some library and not exposing
> external CMake options. This would achieve the goal of
> compile-out-of-the-box and enable important features in the basic
> build. We also simplify dependency requirements (which benefits CI and
> developers). The downside is following security patches and grumpy
> reactions from package maintainers. I think we should explore this
> route for dependencies that match the following criteria:
>
> - libarrow*.so doesn't export any of the symbols of the dependency, and
> the dependency is not referenced in any public headers
> - dependency is lightweight, e.g. excludes boost, openssl, grpc, llvm,
> thrift, protobuf
> - dependency is not ubiquitous on major platforms and has a stable
> API, e.g. excludes libz and openssl
>
> A small list of candidates:
> - RapidJSON (enables JSON)
> - DoubleConversion (enables CSV)
>
> There's a precedent: Arrow already vendors small C++ libraries
> (datetime, utf8cpp, variant, xxhash).
>
> François
>
>
> On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou  wrote:
> >
> >
> > Hi all,
> >
> > I'm a bit concerned that we're planning to add many additional build
> > options in the quest to have a core zero-dependency build in C++.
> > See for example https://issues.apache.org/jira/browse/ARROW-6633 or
> > https://issues.apache.org/jira/browse/ARROW-6612.
> >
> > The problem is that this is creating many possible configurations and we
> > will only be testing a tiny subset of them.  Inevitably, users will try
> > other option combinations and they'll fail building for some random
> > reason.  It will not be a very good user experience.
> >
> > Another related issue is user perception when doing a default build.
> > For example https://issues.apache.org/jira/browse/ARROW-6638 proposes to
> > build with jemalloc disabled by default.  Inevitably, people will be
> > doing benchmarks with this (publicly or not) and they'll conclude Arrow
> > is not as performant as it claims to be.
> >
> > Perhaps we should look for another approach instead?
> >
> > For example we could have a single ARROW_BARE_CORE (whatever the name)
> > option that when enabled (not by default) builds the tiniest minimal
> > subset of Arrow.  It's more inflexible, but at least it's something that
> > we can reasonably test.
> >
> > Regards
> >
> > Antoine.


Re: [DISCUSS][Java] Design of the algorithm module

2019-10-10 Thread Fan Liya
Dear all,

I have added the draft for the fourth part of the document.
This part contains discussion of more algorithms, some of which are already
in progress.

Please pay special attention to Section 4.2.1, as it contains a
general discussion about the representation of integer vectors.
Please take a look, and give your valuable feedback:

https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing

Thanks a lot for your attention.

Best,
Liya Fan


On Sat, Oct 5, 2019 at 2:59 PM fan_li_ya  wrote:

> Hi Micah and Praveen,
>
> Thanks a lot for your valuable feedback.
>
> My thoughts on the problems:
>
> 1. About audiance of the algorithms:
>
> I think the algorithms should be better termed "micro-algorithms". They
> are termed "micro" in the sense that they do not directly compose a query
> engine, because they only provide primitive functionalities (e.g. vector
> sort).
> Instead, they can be used as building blocks for query engines.  The major
> benefit of the micro-algorithms is their generality: they can be used in
> wide ranges of common scenarios. They can be used in more than one query
> engine. In addition, there are other common scenarios, like vector data
> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
> have already supported/discussed), IPC communication, data analysis, data
> mining, etc.
>
> 2. About performance improvments:
>
> Code generation and template types are powerful tools. In addition, JIT is
> also a powerful tool, as it can inline megamorphic virtual functions for
> many scenarios, if the algorithm is implemented appropriately.
> IMO, code generation is applicable to almost all scenarios to achieve good
> performance, if we are willing to pay the price of code readability.
> I will try to detail the principles for choosing these tools for
> performance improvements later.
>
> Best,
> Liya Fan
>
> --
> From: Praveen Kumar 
> Sent: Friday, October 4, 2019, 19:20
> To: Micah Kornfield 
> Cc: Fan Liya ; dev 
> Subject: Re: [DISCUSS][Java] Design of the algorithm module
>
> Hi Micah,
>
> I agree with 1.; I think what end users would really want is a
> query/data processing engine. I am not sure how easy/relevant the
> algorithms will be in the absence of the engine. For example, most of these
> operators would need to be pipelined, handle memory, distribution, etc. So
> bundling this along with the engine makes a lot more sense, and the
> interfaces required might be a bit different too for that.
>
> Thx.
>
>
>
> On Thu, Oct 3, 2019 at 10:27 AM Micah Kornfield 
> wrote:
>
> > Hi Liya Fan,
> > Thanks again for writing this up.  I think it provides a road-map for
>
> > intended features.  I commented on the document but I wanted to raise a few
> > high-level concerns here as well to get more feedback from the community.
> >
>
> > 1.  It isn't clear to me who the users will of this will be.  My perception
> > is that in the Java ecosystem there aren't use-cases for the algorithms
>
> > outside of specific compute engines.  I'm not super involved in open-source
>
> > > Java these days so I would love to hear others' opinions. For instance, I'm
> > not sure if Dremio would switch to using these algorithms instead of the
> > ones they've already open-sourced  [1] and Apache Spark I believe is only
> > > using Arrow for interfacing with Python (they similarly have their own
>
> > compute pipeline).  I think you mentioned in the past that these are being
> > used internally on an engine that your company is working on, but if that
>
> > is the only consumer it makes me wonder if the algorithm development might
> > be better served as part of that engine.
> >
> > 2.  If we do move forward with this, we also need a plan for how to
> > optimize the algorithms to avoid virtual calls.  There are two high-level
> > > approaches: template-based and (byte)code-generation based.  Neither is
>
> > > applicable in all situations, but it would be good to come to consensus on when
> > (and when not to) use each.
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> >
> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external
> >
> > On Tue, Sep 24, 2019 at 6:48 AM Fan Liya  wrote:
> >
> > > Hi Micah,
> > >
> > > Thanks for your effort and precious time.
> > > Looking forward to receiving more valuable feedback from you.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield  >
> > > wrote:
> > >
> > >> Hi Liya Fan,
> > >> I started reviewing but haven't gotten all the way through it. I will
> > try
> > >> to leave more comments over the next few days.
> > >>
> > >> Thanks again for the write-up I think it will help frame a productive
> > >> conversation.
> > >>
> > >> -Micah
> > >>
> > >> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya  > wrote:
> > >>
> > >>> Hi Micah,
> > >>>
> > >>> Thanks for your kind reminder. Comments are enabled 

Re: [C++] The quest for zero-dependency builds

2019-10-10 Thread Francois Saint-Jacques
There's always the route of vendoring some library and not exposing
external CMake options. This would achieve the goal of
compile-out-of-the-box and enable important features in the basic
build. We also simplify dependency requirements (which benefits CI and
developers). The downside is following security patches and grumpy
reactions from package maintainers. I think we should explore this
route for dependencies that match the following criteria:

- libarrow*.so doesn't export any of the symbols of the dependency, and
the dependency is not referenced in any public headers
- dependency is lightweight, e.g. excludes boost, openssl, grpc, llvm,
thrift, protobuf
- dependency is not ubiquitous on major platforms and has a stable
API, e.g. excludes libz and openssl

A small list of candidates:
- RapidJSON (enables JSON)
- DoubleConversion (enables CSV)

There's a precedent: Arrow already vendors small C++ libraries
(datetime, utf8cpp, variant, xxhash).

François


On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou  wrote:
>
>
> Hi all,
>
> I'm a bit concerned that we're planning to add many additional build
> options in the quest to have a core zero-dependency build in C++.
> See for example https://issues.apache.org/jira/browse/ARROW-6633 or
> https://issues.apache.org/jira/browse/ARROW-6612.
>
> The problem is that this is creating many possible configurations and we
> will only be testing a tiny subset of them.  Inevitably, users will try
> other option combinations and they'll fail building for some random
> reason.  It will not be a very good user experience.
>
> Another related issue is user perception when doing a default build.
> For example https://issues.apache.org/jira/browse/ARROW-6638 proposes to
> build with jemalloc disabled by default.  Inevitably, people will be
> doing benchmarks with this (publicly or not) and they'll conclude Arrow
> is not as performant as it claims to be.
>
> Perhaps we should look for another approach instead?
>
> For example we could have a single ARROW_BARE_CORE (whatever the name)
> option that when enabled (not by default) builds the tiniest minimal
> subset of Arrow.  It's more inflexible, but at least it's something that
> we can reasonably test.
>
> Regards
>
> Antoine.


Re: [DRAFT] Apache Arrow Board Report - October 2019

2019-10-10 Thread Wes McKinney
Here is a rejection of CircleCI more than 18 months ago

https://issues.apache.org/jira/browse/INFRA-15964

On Thu, Oct 10, 2019 at 4:33 AM Antoine Pitrou  wrote:
>
>
> For the record, here is the ticket for Azure Pipelines integration:
> https://issues.apache.org/jira/browse/INFRA-17030
>
> I opened an issue back in May about the Travis-CI capacity situation:
> https://issues.apache.org/jira/browse/INFRA-18533
>
> Apparently CI capacity has been a "hot topic as of late":
> https://lists.apache.org/thread.html/af52e2a3e865c01596d46374e8b294f2740587dbd59d85e132429b6c@%3Cbuilds.apache.org%3E
>
> (I didn't know this list -- bui...@apache.org -- existed, by the way)
>
> Regards
>
> Antoine.
>
>
> On 10/10/2019 at 07:34, Wes McKinney wrote:
> > On Thu, Oct 10, 2019 at 12:22 AM Jacques Nadeau  wrote:
> >>
> >> I'm not dismissing that there are issues but I also don't feel like there
> >> has been constant discussion for months on the list that INFRA is not being
> >> responsive to Arrow community requests. It seems like you might be saying
> >> one of two things (or both?)?
> >>
> >> 1) The Arrow infrastructure requirements are vastly different than other
> >> projects. Because of Arrow's specialized requirements, we need things that
> >> no other project needs.
> >> 2) There are many projects that want CircleCI, Buildkite and Azure
> >> pipelines but Infrastructure is not responsive. This is putting a big
> >> damper on the success of the Arrow project.
> >
> > Yes, I'm saying both of these things.
> >
> > 1. Yes, Arrow is special -- validating the project requires running a
> > dozen or more different builds (with dozens more nightly builds) that
> > test different parts of the project. Different language components, a
> > large and diverse packaging matrix, and interproject integration tests
> > and integration with external projects (e.g. Apache Spark and others)
> >
> > 2. Yes, the limited GitHub App availability is hurting us.
> >
> > I'm OK to place this concern in the "Community Health" section and
> > spend more time building a comprehensive case about how Infra's
> > conservatism around Apps is causing us to work with one hand tied
> > behind our back. I know that I'm not the only one who is unhappy, but
> > I'll let the others speak for themselves.
> >
> >> For each of these, if we're asking the board to do something, we should say
> >> more, and more clearly. Sure, CI is a pain in the Arrow project's a**. I
> >> also agree that community health is impacted by the challenge to merge
> >> things. I also share the perspective that the foundation has been slow to
> >> adopt new technologies and has been way too religious about svn. However, if
> >> we're asking the board to do something, what is it?
> >
> > Allow GitHub Apps that do not require write access to the code itself,
> > set up appropriate checks and balances to ensure that the Foundation's
> > IP provenance webhooks are preserved.
> >
> >> Looking at the two things you might be saying...
> >> If 1, are we confident in that? Many other projects have pretty complex
> >> build matrices I think. (I haven't thought about this and evaluated the
> >> other projects...maybe it is true.) If 1, we should clarify why we think
> >> we're different. If that is the case, what are asking for from the board.
> >>
> >> If 2, and you are proposing throwing stones at INFRA, we should back it up
> >> with INFRA tickets and numbers (e.g. how many projects have wanted these
> >> things and for how long). We should reference multiple threads on the INFRA
> >> mailing list where we voiced certain concerns and many other people voiced
> >> similar concerns and INFRA turned a deaf ear or blind eye (maybe these
> >> exist, I haven't spent much time on the INFRA list lately). As it stands,
> >> the one ticket referenced in this thread is a ticket that has only one
> >> project asking for a new integration that has been open for less than a
> >> week. That may be annoying but it doesn't seem like something that has
> >> gotten to the level that we need to get the boards help.
> >>
> >> In a nutshell, I agree that this is impacting the health and growth of the
> >> project but think we should cover that in the community health section of
> >> the report. I'm less a fan of saying this is an issue the board needs to
> >> help us solve unless it has been a constant point of pain that we've
> >> attempted to elevate multiple times in infra forums and experienced
> >> unreasonable responses. The board is a blunt instrument and should only be
> >> used when we have depleted every other avenue for resolution.
> >>
> >
> > Yes, I'm happy to spend more time building a comprehensive case before
> > escalating it to the board level. However, Apache Arrow is a high
> > profile project and it is not a good luck to have a PMC in a
> > fast-growing project growing disgruntled with the Foundation's
> > policies in this way. We've been struggling visibly for a 

[NIGHTLY] Arrow Build Report for Job nightly-2019-10-10-0

2019-10-10 Thread Crossbow


Arrow Build Report for Job nightly-2019-10-10-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0

Failed Tasks:
- docker-go:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-go
- docker-cpp-cmake32:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-cpp-cmake32
- docker-turbodbc-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-turbodbc-integration
- docker-python-2.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-python-2.7
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-wheel-osx-cp37m
- wheel-win-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-appveyor-wheel-win-cp37m
- docker-hdfs-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-hdfs-integration
- docker-r-conda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-r-conda
- docker-lint:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-lint
- docker-cpp-release:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-cpp-release
- docker-js:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-js
- docker-docs:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-docs
- docker-clang-format:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-clang-format
- docker-dask-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-dask-integration
- docker-python-3.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-python-3.7
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-wheel-osx-cp36m
- docker-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-cpp
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-appveyor-wheel-win-cp36m
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-gandiva-jar-trusty
- docker-python-2.7-nopandas:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-python-2.7-nopandas
- docker-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-python-3.6
- docker-c_glib:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-c_glib
- docker-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-pandas-master
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-gandiva-jar-osx
- docker-iwyu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-iwyu
- docker-java:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-java
- wheel-osx-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-wheel-osx-cp27m
- docker-spark-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-spark-integration
- docker-r:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-r
- docker-cpp-static-only:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-cpp-static-only
- docker-r-sanitizer:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-r-sanitizer
- docker-rust:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-rust
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-wheel-osx-cp35m
- docker-python-3.6-nopandas:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-circle-docker-python-3.6-nopandas

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-azure-centos-6
- wheel-manylinux1-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-10-0-travis-wheel-manylinux1-cp27mu
- wheel-manylinux2010-cp27mu:
  URL: 

Re: [C++] The quest for zero-dependency builds

2019-10-10 Thread Tim Paine
FWIW, for perspective: we ended up just using our own CMake file to build
Arrow. We needed a minimal subset of functionality on a tight size budget,
and it was easier doing that than configuring all the flags.

https://github.com/finos/perspective/blob/master/cmake/arrow/CMakeLists.txt



Tim Paine
tim.paine.nyc
908-721-1185

> On Oct 10, 2019, at 06:02, Antoine Pitrou  wrote:
> 
> 
> Hi all,
> 
> I'm a bit concerned that we're planning to add many additional build
> options in the quest to have a core zero-dependency build in C++.
> See for example https://issues.apache.org/jira/browse/ARROW-6633 or
> https://issues.apache.org/jira/browse/ARROW-6612.
> 
> The problem is that this is creating many possible configurations, and we
> will only be testing a tiny subset of them.  Inevitably, users will try
> other option combinations and their builds will fail for some random
> reason.  It will not be a very good user experience.
> 
> Another related issue is user perception when doing a default build.
> For example https://issues.apache.org/jira/browse/ARROW-6638 proposes to
> build with jemalloc disabled by default.  Inevitably, people will be
> doing benchmarks with this (publicly or not) and they'll conclude Arrow
> is not as performant as it claims to be.
> 
> Perhaps we should look for another approach instead?
> 
> For example we could have a single ARROW_BARE_CORE (whatever the name)
> option that, when enabled (not by default), builds a minimal subset of
> Arrow.  It's more inflexible, but at least it's something that we can
> reasonably test.
> 
> Regards
> 
> Antoine.


Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-10 Thread Renjie Liu
I've created a ticket to track this here:
https://issues.apache.org/jira/browse/ARROW-6845

For the moment, can we check in the pregenerated data to unblock the Rust
version's Arrow reader?

On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu  wrote:

> It would be fine in that case.
>
> Wes McKinney wrote on Thu, Oct 10, 2019 at 12:58 PM:
>
>> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu 
>> wrote:
>> >
>> > 1. There already exists a low-level Parquet writer which can produce
>> > Parquet files, so unit tests should be fine. But a writer from Arrow to
>> > Parquet doesn't exist yet, and it may take some time to finish it.
>> > 2. In fact my data are randomly generated and it's definitely
>> > reproducible. However, I don't think it would be a good idea to randomly
>> > generate data every time we run CI, because it would be difficult to
>> > debug. For example, if PR A introduced a bug that is triggered in
>> > another PR's build, it would be confusing for contributors.
>>
>> Presumably any random data generation would use a fixed seed precisely
>> to be reproducible.
>>
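For example, a minimal sketch of fixed-seed generation in Python (the
helper and column names below are illustrative, not from the Arrow
codebase):

import random

import pyarrow as pa

# A fixed seed makes every run produce identical data, so a failure seen
# in one PR's build can be reproduced exactly elsewhere.
SEED = 42

def random_int32_column(num_rows, seed=SEED):
    rng = random.Random(seed)  # local RNG, independent of global state
    values = [rng.randint(-2**31, 2**31 - 1) for _ in range(num_rows)]
    return pa.array(values, type=pa.int32())

table = pa.Table.from_arrays([random_int32_column(1000)], names=['int_col'])
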
>> > 3. I think it would be a good idea to spend effort on integration tests
>> > with Parquet, because it's an important use case of Arrow. Also, a
>> > similar approach could be extended to other languages and other file
>> > formats (Avro, ORC).
>> >
>> >
>> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney 
>> wrote:
>> >
>> > > There are a number of issues worth discussion.
>> > >
>> > > 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
>> > > It's OK to be reliant on other libraries in the short term to produce
>> > > files to test against, but does not strike me as a sustainable
>> > > long-term plan. Fixing bugs can be a lot more difficult than it needs
>> > > to be if you can't write targeted "endogenous" unit tests
>> > >
>> > > 2. Reproducible data generation
>> > >
>> > > I think if you're going to test against a pre-generated corpus, you
>> > > should make sure that generating the corpus is reproducible for other
>> > > developers (i.e. with a Dockerfile), and can be extended by adding new
>> > > files or random data generation.
>> > >
>> > > I additionally would prefer generating the test corpus at test time
>> > > rather than checking in binary files. If this isn't viable right now
>> > > we can create an "arrow-rust-crutch" git repository for you to stash
>> > > binary files until some of these testing scalability issues are
>> > > addressed.
>> > >
>> > > If we're going to spend energy on Parquet integration testing with
>> > > Java, this would be a good opportunity to do the work in a way where
>> > > the C++ Parquet library can also participate (since we ought to be
>> > > doing integration tests with Java, and we can also read JSON files to
>> > > Arrow).
>> > >
>> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu 
>> > > wrote:
>> > > >
>> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove 
>> > > wrote:
>> > > >
>> > > > > I'm very interested in helping to find a solution to this because
>> we
>> > > really
>> > > > > do need integration tests for Rust to make sure we're compatible
>> with
>> > > other
>> > > > > implementations... there is also the ongoing CI dockerization work
>> > > that I
>> > > > > feel is related.
>> > > > >
>> > > > > I haven't looked at the current integration tests yet and would
>> > > appreciate
>> > > > > some pointers on how all of this works (do we have docs?) or
>> where to
>> > > start
>> > > > > looking.
>> > > > >
>> > > > I have a test in my latest PR:
>> > > > https://github.com/apache/arrow/pull/5523
>> > > > And here is the generated data:
>> > > > https://github.com/apache/arrow-testing/pull/11
>> > > > As for the program to generate these data, it's just a simple Java
>> > > > program. I'm not sure whether we need to integrate it into Arrow.
>> > > >
>> > > > >
>> > > > > I imagine the integration test could follow the approach that
>> Renjie is
>> > > > > outlining where we call Java to generate some files and then call
>> Rust
>> > > to
>> > > > > parse them?
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Andy.
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <
>> liurenjie2...@gmail.com>
>> > > wrote:
>> > > > >
>> > > > > > Hi:
>> > > > > >
>> > > > > > I'm developing the Rust version of a reader which reads Parquet
>> > > > > > into Arrow arrays. To verify the correctness of this reader, I
>> > > > > > use the following approach:
>> > > > > >
>> > > > > >1. Define the schema with protobuf.
>> > > > > >2. Generate JSON data of this schema using another language with
>> > > > > >a more sophisticated implementation (e.g. Java).
>> > > > > >3. Generate Parquet data of this schema using another language
>> > > > > >with a more sophisticated implementation (e.g. Java).
>> > > > > >4. Write tests to read the JSON file and the Parquet file into
>> > > > > >memory (Arrow arrays), 
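
For concreteness, a sketch of what the comparison in step 4 could look
like on the reading side, shown here in Python with pyarrow rather than
Rust; the corpus paths are hypothetical and the JSON is assumed to be
line-delimited:

import json

import pyarrow.parquet as pq

# Hypothetical paths; the real corpus lives in apache/arrow-testing.
with open('case1.json') as f:
    reference_rows = [json.loads(line) for line in f]

actual = pq.read_table('case1.parquet')

# Compare each column of the Parquet data against the JSON reference.
for name in actual.column_names:
    expected = [row[name] for row in reference_rows]
    assert actual.column(name).to_pylist() == expected, name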

[jira] [Created] (ARROW-6845) Setup process to generate random data for integration tests

2019-10-10 Thread Renjie Liu (Jira)
Renjie Liu created ARROW-6845:
-

 Summary: Setup process to generate random data for integration 
tests
 Key: ARROW-6845
 URL: https://issues.apache.org/jira/browse/ARROW-6845
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Renjie Liu
Assignee: Renjie Liu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Question about timestamps ...

2019-10-10 Thread Joris Van den Bossche
Hi David,

This is intentional, see
https://arrow.apache.org/docs/python/parquet.html#storing-timestamps for
some explanation in the documentation. Basically, the Parquet format only
supports ms and us resolution, and so nanosecond timestamps (which are
supported by Arrow) are converted to one of those resolutions.

We could maybe clarify that better in the error message (something like
"only 'ms' and 'us' are supported")?

In the latest version of the parquet format specification, there is
actually support for nanosecond resolution as well (
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-timestamp-convertedtype).
You can obtain this by specifying version="2.0", but the implementation is
not yet fully ready (see https://issues.apache.org/jira/browse/PARQUET-458),
and also not all frameworks support this version (so if compatibility
across processing frameworks is important, it is recommended to stick with
version 1).
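
For instance, a sketch of both routes, assuming pyarrow 0.14 or later
(the file names are arbitrary):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays(
    [pa.array([1234567893141], type=pa.timestamp('ns', tz='UTC'))],
    names=['timestamp'])

# Route 1 (format version 1.0): coerce to microseconds. This sample value
# is not a whole number of microseconds, so truncation must be allowed.
pq.write_table(table, 'foo_us.parquet', coerce_timestamps='us',
               allow_truncated_timestamps=True)

# Route 2: keep nanosecond resolution via format version 2.0, at the cost
# of compatibility with frameworks that only read version 1 files.
pq.write_table(table, 'foo_ns.parquet', version='2.0')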

Joris

On Wed, 9 Oct 2019 at 21:27, David Boles  wrote:

> The following code dies with pyarrow 0.14.2:
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> schema = pa.schema([('timestamp', pa.timestamp('ns', tz='UTC')),])
> writer = pq.ParquetWriter('foo.parquet', schema, coerce_timestamps='ns')
>
> ts_array = pa.array([ int(1234567893141) ], type=pa.timestamp('ns',
> tz='UTC'))
> table = pa.Table.from_arrays([ ts_array ], names=['timestamp'])
>
> writer.write_table(table)
> writer.close()
>
> with the message:
>
> ValueError: Invalid value for coerce_timestamps: ns
>
> That appears to be because of this code in _parquet.pxi:
>
> cdef int _set_coerce_timestamps(
> self, ArrowWriterProperties.Builder* props) except -1:
> if self.coerce_timestamps == 'ms':
> props.coerce_timestamps(TimeUnit_MILLI)
> elif self.coerce_timestamps == 'us':
> props.coerce_timestamps(TimeUnit_MICRO)
> elif self.coerce_timestamps is not None:
> raise ValueError('Invalid value for coerce_timestamps: {0}'
>  .format(self.coerce_timestamps))
>
> which restricts the choice to 'ms' or 'us', even though AFAICT everywhere
> else also allows 'ns' (and there is a TimeUnit_NANO defined). Is this
> intentional, or a bug?
>
> Thanks,
>
>  - db
>


Re: [DRAFT] Apache Arrow Board Report - October 2019

2019-10-10 Thread Antoine Pitrou


For the record, here is the ticket for Azure Pipelines integration:
https://issues.apache.org/jira/browse/INFRA-17030

I opened an issue back in May about the Travis-CI capacity situation:
https://issues.apache.org/jira/browse/INFRA-18533

Apparently CI capacity has been a "hot topic as of late":
https://lists.apache.org/thread.html/af52e2a3e865c01596d46374e8b294f2740587dbd59d85e132429b6c@%3Cbuilds.apache.org%3E

(I didn't know this list -- bui...@apache.org -- existed, by the way)

Regards

Antoine.


On 10/10/2019 at 07:34, Wes McKinney wrote:
> On Thu, Oct 10, 2019 at 12:22 AM Jacques Nadeau  wrote:
>>
>> I'm not dismissing that there are issues, but I also don't feel like there
>> has been constant discussion for months on the list that INFRA is not being
>> responsive to Arrow community requests. It seems like you might be saying
>> one of two things (or both?):
>>
>> 1) The Arrow infrastructure requirements are vastly different from those of
>> other projects. Because of Arrow's specialized requirements, we need things
>> that no other project needs.
>> 2) There are many projects that want CircleCI, Buildkite and Azure
>> Pipelines, but Infrastructure is not responsive. This is putting a big
>> damper on the success of the Arrow project.
> 
> Yes, I'm saying both of these things.
> 
> 1. Yes, Arrow is special -- validating the project requires running a
> dozen or more different builds (with dozens more nightly builds) that
> test different parts of the project. Different language components, a
> large and diverse packaging matrix, and interproject integration tests
> and integration with external projects (e.g. Apache Spark and others).
> 
> 2. Yes, the limited GitHub App availability is hurting us.
> 
> I'm OK to place this concern in the "Community Health" section and
> spend more time building a comprehensive case about how Infra's
> conservatism around Apps is causing us to work with one hand tied
> behind our back. I know that I'm not the only one who is unhappy, but
> I'll let the others speak for themselves.
> 
>> For each of these, if we're asking the board to do something, we should say
>> more, and more clearly. Sure, CI is a pain in the Arrow project's a**. I
>> also agree that community health is impacted by the challenge to merge
>> things. I also share the perspective that the foundation has been slow to
>> adopt new technologies and has been way too religious about svn. However, if
>> we're asking the board to do something, what is it?
> 
> Allow GitHub Apps that do not require write access to the code itself,
> and set up appropriate checks and balances to ensure that the Foundation's
> IP provenance webhooks are preserved.
> 
>> Looking at the two things you might be saying...
>> If 1, are we confident in that? Many other projects have pretty complex
>> build matrices I think. (I haven't thought about this and evaluated the
>> other projects...maybe it is true.) If 1, we should clarify why we think
>> we're different. If that is the case, what are we asking for from the board.
>>
>> If 2, and you are proposing throwing stones at INFRA, we should back it up
>> with INFRA tickets and numbers (e.g. how many projects have wanted these
>> things and for how long). We should reference multiple threads on the INFRA
>> mailing list where we voiced certain concerns and many other people voiced
>> similar concerns and INFRA turned a deaf ear or blind eye (maybe these
>> exist, I haven't spent much time on the INFRA list lately). As it stands,
>> the one ticket referenced in this thread is a ticket that has only one
>> project asking for a new integration that has been open for less than a
>> week. That may be annoying but it doesn't seem like something that has
>> gotten to the level that we need to get the board's help.
>>
>> In a nutshell, I agree that this is impacting the health and growth of the
>> project but think we should cover that in the community health section of
>> the report. I'm less a fan of saying this is an issue the board needs to
>> help us solve unless it has been a constant point of pain that we've
>> attempted to elevate multiple times in infra forums and experienced
>> unreasonable responses. The board is a blunt instrument and should only be
>> used when we have depleted every other avenue for resolution.
>>
> 
> Yes, I'm happy to spend more time building a comprehensive case before
> escalating it to the board level. However, Apache Arrow is a high
> profile project and it is not a good luck to have a PMC in a
> fast-growing project growing disgruntled with the Foundation's
> policies in this way. We've been struggling visibly for a long time
> with our CI scalability, and I think we should have all the options on
> the table to utilize GitHub-integrated tools to help us find a way out
> of the mess that we are in.
> 
>>
>> On Wed, Oct 9, 2019 at 9:44 PM Wes McKinney  wrote:
>>
>>> hi Jacques,
>>>
>>> I think we need to share the concerns that many PMC