[jira] [Created] (ARROW-9108) [C++][Dataset] Add Parquet Statistics conversion for timestamp columns
Francois Saint-Jacques created ARROW-9108: - Summary: [C++][Dataset] Add Parquet Statistics conversion for timestamp columns Key: ARROW-9108 URL: https://issues.apache.org/jira/browse/ARROW-9108 Project: Apache Arrow Issue Type: Sub-task Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9107) [C++][Dataset] Time-based types support
Francois Saint-Jacques created ARROW-9107: - Summary: [C++][Dataset] Time-based types support Key: ARROW-9107 URL: https://issues.apache.org/jira/browse/ARROW-9107 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques We lack support for date/timestamp partitions and predicate pushdown rules. Timestamp columns are usually the most important predicate in OLAP-style queries, so we need to support them transparently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9068) [C++][Dataset] Simplify Partitioning interface
Francois Saint-Jacques created ARROW-9068: - Summary: [C++][Dataset] Simplify Partitioning interface Key: ARROW-9068 URL: https://issues.apache.org/jira/browse/ARROW-9068 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Francois Saint-Jacques The `int segment` of `Partitioning::Parse` should not be exposed to the user. KeyValuePartitioning should be a private Impl interface, not in public headers. The same applies to `Partitioning::Format`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9028) [R] Should be able to convert an empty table
Francois Saint-Jacques created ARROW-9028: - Summary: [R] Should be able to convert an empty table Key: ARROW-9028 URL: https://issues.apache.org/jira/browse/ARROW-9028 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8997) [Archery] Benchmark formatter should have friendly units
Francois Saint-Jacques created ARROW-8997: - Summary: [Archery] Benchmark formatter should have friendly units Key: ARROW-8997 URL: https://issues.apache.org/jira/browse/ARROW-8997 Project: Apache Arrow Issue Type: Bug Reporter: Francois Saint-Jacques The current output is not friendly to glance at. Usage of humanfriendly can help here. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8986) [Archery][ursabot] Fix benchmark diff checkout of origin/master
Francois Saint-Jacques created ARROW-8986: - Summary: [Archery][ursabot] Fix benchmark diff checkout of origin/master Key: ARROW-8986 URL: https://issues.apache.org/jira/browse/ARROW-8986 Project: Apache Arrow Issue Type: Bug Reporter: Francois Saint-Jacques https://github.com/apache/arrow/pull/7300#issuecomment-635967095 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8890) [R] Fix C++ lint issue
Francois Saint-Jacques created ARROW-8890: - Summary: [R] Fix C++ lint issue Key: ARROW-8890 URL: https://issues.apache.org/jira/browse/ARROW-8890 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Francois Saint-Jacques Assignee: Francois Saint-Jacques Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8884) [C++] Listing files with S3FileSystem is slow
Francois Saint-Jacques created ARROW-8884: - Summary: [C++] Listing files with S3FileSystem is slow Key: ARROW-8884 URL: https://issues.apache.org/jira/browse/ARROW-8884 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques Listing files on S3 is slow due to the recursive nature of the algorithm. The following change modifies the behavior of the S3Result to include all objects but no "grouping" (directories). This dramatically lowers the number of HTTP calls.
{code:c++}
diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 70c87f46ec..98a40b17a2 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -986,7 +986,7 @@ class S3FileSystem::Impl {
     if (!prefix.empty()) {
       req.SetPrefix(ToAwsString(prefix) + kSep);
     }
-    req.SetDelimiter(Aws::String() + kSep);
+    // req.SetDelimiter(Aws::String() + kSep);
     req.SetMaxKeys(kListObjectsMaxKeys);
     while (true) {
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
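To illustrate why dropping the delimiter in ARROW-8884 reduces the number of HTTP calls, here is a minimal sketch that only counts requests. It does not use the AWS SDK; `CountDelimitedListCalls` and `CountFlatListCalls` are hypothetical helpers, and the model assumes that a delimiter-based walk issues one request per "directory" and that a flat listing is limited only by pagination.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Requests issued by a recursive, delimiter-based walk in this simplified
// model: one per "directory" prefix, including the root, ignoring pagination.
std::size_t CountDelimitedListCalls(const std::vector<std::string>& keys) {
  std::set<std::string> directories = {""};  // the root prefix
  for (const std::string& key : keys) {
    // Every '/' in a key closes one parent "directory" prefix.
    for (std::size_t pos = key.find('/'); pos != std::string::npos;
         pos = key.find('/', pos + 1)) {
      directories.insert(key.substr(0, pos));
    }
  }
  return directories.size();
}

// Requests issued without a delimiter: every key under the prefix is
// returned directly, so only pagination forces additional requests.
std::size_t CountFlatListCalls(std::size_t n_keys, std::size_t max_keys_per_page) {
  assert(max_keys_per_page > 0);
  if (n_keys == 0) return 1;  // one request is still needed to learn it is empty
  return (n_keys + max_keys_per_page - 1) / max_keys_per_page;
}
```

With deeply nested partition directories the first count grows with the number of directories, while the second grows only with the number of pages.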
[jira] [Created] (ARROW-8874) [C++][Dataset] Scanner::ToTable race when ScanTask exit early with an error
Francois Saint-Jacques created ARROW-8874: - Summary: [C++][Dataset] Scanner::ToTable race when ScanTask exit early with an error Key: ARROW-8874 URL: https://issues.apache.org/jira/browse/ARROW-8874 Project: Apache Arrow Issue Type: Bug Reporter: Francois Saint-Jacques https://github.com/apache/arrow/pull/7180#issuecomment-631059751 The issue is that when [Finish|https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/scanner.cc#L184-L208] exits early due to a ScanTask error, in-flight tasks may try to lock a mutex that has already gone out of scope. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8720) [C++] Fix checked_pointer_cast
Francois Saint-Jacques created ARROW-8720: - Summary: [C++] Fix checked_pointer_cast Key: ARROW-8720 URL: https://issues.apache.org/jira/browse/ARROW-8720 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques While investigating performance, I noticed that dynamic_cast (and internal RTTI methods) showed up among the "hot" functions in release builds. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8604) [R] Windows compilation failure
Francois Saint-Jacques created ARROW-8604: - Summary: [R] Windows compilation failure Key: ARROW-8604 URL: https://issues.apache.org/jira/browse/ARROW-8604 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Francois Saint-Jacques Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8603) [Documentation] Fix Sphinx doxygen comment
Francois Saint-Jacques created ARROW-8603: - Summary: [Documentation] Fix Sphinx doxygen comment Key: ARROW-8603 URL: https://issues.apache.org/jira/browse/ARROW-8603 Project: Apache Arrow Issue Type: Bug Components: C++, Documentation Reporter: Francois Saint-Jacques See [https://github.com/apache/arrow/runs/622393532] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8602) [CMake] Fix ws2_32 link issue when cross-compiling on Linux
Francois Saint-Jacques created ARROW-8602: - Summary: [CMake] Fix ws2_32 link issue when cross-compiling on Linux Key: ARROW-8602 URL: https://issues.apache.org/jira/browse/ARROW-8602 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8601) [Go][Flight] Implement Flight Writer interface
Francois Saint-Jacques created ARROW-8601: - Summary: [Go][Flight] Implement Flight Writer interface Key: ARROW-8601 URL: https://issues.apache.org/jira/browse/ARROW-8601 Project: Apache Arrow Issue Type: Improvement Components: FlightRPC, Go Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8497) [Archery] Add missing component to builds
Francois Saint-Jacques created ARROW-8497: - Summary: [Archery] Add missing component to builds Key: ARROW-8497 URL: https://issues.apache.org/jira/browse/ARROW-8497 Project: Apache Arrow Issue Type: Improvement Components: Archery, Developer Tools Reporter: Francois Saint-Jacques Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8488) [R] Replace VALUE_OR_STOP with ValueOrStop
Francois Saint-Jacques created ARROW-8488: - Summary: [R] Replace VALUE_OR_STOP with ValueOrStop Key: ARROW-8488 URL: https://issues.apache.org/jira/browse/ARROW-8488 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques We should avoid macros as much as possible, as per the style guide. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8448) [Package] Can't build apt packages with ubuntu-focal
Francois Saint-Jacques created ARROW-8448: - Summary: [Package] Can't build apt packages with ubuntu-focal Key: ARROW-8448 URL: https://issues.apache.org/jira/browse/ARROW-8448 Project: Apache Arrow Issue Type: Bug Components: Packaging Reporter: Francois Saint-Jacques Assignee: Kouhei Sutou While trying to debug the failing nightly (due to disk space), I encountered the following error: the tar generated by the build script does not conform to what debuilder expects. It blocks
{code}
Successfully built ecdda7ea015d
Successfully tagged apache-arrow-ubuntu-focal:latest
docker run --rm --tty --volume /home/fsaintjacques/src/db/arrow/dev/tasks/linux-packages/apache-arrow/apt:/host:rw --env DEBUG=yes apache-arrow-ubuntu-focal /host/build.sh
This package has a Debian revision number but there does not seem to be an appropriate original tar file or .orig directory in the parent directory; (expected one of apache-arrow_0.16.0.orig.tar.gz, apache-arrow_0.16.0.orig.tar.bz2, apache-arrow_0.16.0.orig.tar.lzma, apache-arrow_0.16.0.orig.tar.xz or apache-arrow-1.0.0~dev20200414.orig)
continue anyway? (y/n)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering
Francois Saint-Jacques created ARROW-8447: - Summary: [C++][Dataset] Ensure Scanner::ToTable preserve ordering Key: ARROW-8447 URL: https://issues.apache.org/jira/browse/ARROW-8447 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques This can be refactored with a little effort in Scanner::ToTable:
# Change `batches` to `std::vector<RecordBatchVector>`
# When pushing the closure to the TaskGroup, also track an incrementing integer, e.g. scan_task_id
# In the closure, store the RecordBatches for this ScanTask in a local vector. When all batches are consumed, move the local vector into `batches` at the right index, resizing and emplacing under the mutex
# After waiting for the task group completion, either
* Concatenate into a single vector and call `Table::FromRecordBatch`, or
* Write a RecordBatchReader that supports vector and add a method `Table::FromRecordBatchReader`
The latter involves more work but is the clean way; the other FromRecordBatch method can be implemented on top of it and support "streaming".
-- This message was sent by Atlassian Jira (v8.3.4#803005)
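The steps listed in ARROW-8447 can be sketched with the standard library alone. This is a minimal illustration, not the Arrow implementation: `Batch` stands in for `std::shared_ptr<RecordBatch>`, and `OrderedBatchCollector` is a hypothetical name for the state that `Scanner::ToTable` would keep.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for std::shared_ptr<RecordBatch> and RecordBatchVector.
using Batch = std::string;
using BatchVector = std::vector<Batch>;

class OrderedBatchCollector {
 public:
  // Called from a ScanTask closure once all of its batches are consumed.
  // scan_task_id is the incrementing integer assigned when the closure was
  // pushed to the task group.
  void Insert(std::size_t scan_task_id, BatchVector task_batches) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (batches_.size() <= scan_task_id) {
      batches_.resize(scan_task_id + 1);
    }
    assert(batches_[scan_task_id].empty());  // each id is inserted only once
    batches_[scan_task_id] = std::move(task_batches);
  }

  // Called after the task group has completed; flattens in scan-task order,
  // independent of the order in which the tasks finished.
  BatchVector Concatenate() {
    std::lock_guard<std::mutex> lock(mutex_);
    BatchVector out;
    for (const BatchVector& task_batches : batches_) {
      out.insert(out.end(), task_batches.begin(), task_batches.end());
    }
    return out;
  }

 private:
  std::mutex mutex_;
  std::vector<BatchVector> batches_;
};
```

The mutex is held only while a finished task hands over its local vector, so tasks do not contend while they are still reading batches.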
[jira] [Created] (ARROW-8382) [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition classes
Francois Saint-Jacques created ARROW-8382: - Summary: [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition classes Key: ARROW-8382 URL: https://issues.apache.org/jira/browse/ARROW-8382 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques WritePlan should look like the following.
{code:c++}
class ARROW_DS_EXPORT WritePlan {
 public:
  /// Execute the WritePlan and return a FileSystemDataset as a result.
  Result<std::shared_ptr<FileSystemDataset>> Execute();

 protected:
  /// The schema of the Dataset which will be written
  std::shared_ptr<Schema> schema;
  /// The format into which fragments will be written
  std::shared_ptr<FileFormat> format;

  using SourceAndReader = std::pair<FileSource, std::shared_ptr<RecordBatchReader>>;
  ///
  std::vector<SourceAndReader> outputs;
};
{code}
* Refactor FileFormat::Write(FileSource destination, RecordBatchReader); not sure if it should take the output schema, or if the RecordBatchReader should already be of the right schema.
* Add a class/function that constructs SourceAndReader from Fragments, Partitioning and base path, and remove any Write/Fragment logic from partition.cc.
* Move Write() out of FileSystemDataset into WritePlan. It could take a FileSystemDatasetFactory to recreate the FileSystemDataset. This is a bonus, not a requirement.
* Simplify the writing routine to avoid the PathTree directory structure; it shouldn't be more complex than `for task in write_tasks: task()`. No path construction should happen there.
The effects are:
* Simplified WritePlan execution, abstracted away from path construction, which can write to multiple FileSystems and/or Buffers since it doesn't construct the FileSource.
* By virtue of using RecordBatchReader instead of Fragment, it isn't tied to writing from Fragments; it can take any construct that yields a RecordBatchReader. It also means that WritePlan doesn't have to know about any Scan-related classes.
* Writing can be done with or without partitioning; this logic is given to whoever generates the SourceAndReader list.
* Should be simpler to test.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
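The `for task in write_tasks: task()` routine described in ARROW-8382 could look roughly like the sketch below. It is only an illustration under stated assumptions: `Status` is a minimal stand-in for `arrow::Status`, and `WriteTask` and `ExecuteWriteTasks` are hypothetical names; stopping at the first failure is also an assumption, not something the issue specifies.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Minimal stand-in for arrow::Status (hypothetical): empty error means success.
struct Status {
  std::string error;
  bool ok() const { return error.empty(); }
};

// A write task already knows its destination and its reader; the executor
// below performs no path construction at all.
using WriteTask = std::function<Status()>;

// Runs the tasks in order and stops at the first failure (an assumption).
Status ExecuteWriteTasks(const std::vector<WriteTask>& write_tasks) {
  for (const WriteTask& task : write_tasks) {
    assert(static_cast<bool>(task));  // every task must be callable
    Status status = task();
    if (!status.ok()) return status;
  }
  return Status{};
}
```

Because the executor only sees opaque callables, the same loop works with or without partitioning and for any destination the task generator chooses.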
[jira] [Created] (ARROW-8381) [C++][Dataset] Dataset writing should require a writer schema
Francois Saint-Jacques created ARROW-8381: - Summary: [C++][Dataset] Dataset writing should require a writer schema Key: ARROW-8381 URL: https://issues.apache.org/jira/browse/ARROW-8381 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset Reporter: Francois Saint-Jacques # Dataset writing should always take an explicit writer schema instead of the first fragment's schema. # The MakeWritePlanImpl should not try to remove columns that are found in the partition; this is left to the caller, who passes an explicit schema. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8374) [R] Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array
Francois Saint-Jacques created ARROW-8374: - Summary: [R] Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array Key: ARROW-8374 URL: https://issues.apache.org/jira/browse/ARROW-8374 Project: Apache Arrow Issue Type: Bug Reporter: Francois Saint-Jacques The conversion should unify the dictionaries before converting; otherwise the indices are simply broken. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8354) [C++][R] Segfault in test-dataset.r
Francois Saint-Jacques created ARROW-8354: - Summary: [C++][R] Segfault in test-dataset.r Key: ARROW-8354 URL: https://issues.apache.org/jira/browse/ARROW-8354 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset, R Reporter: Francois Saint-Jacques See https://github.com/fsaintjacques/arrow/runs/564315427#step:6:2169 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8348) [C++] Support optional sentinel values in primitive Array for nulls
Francois Saint-Jacques created ARROW-8348: - Summary: [C++] Support optional sentinel values in primitive Array for nulls Key: ARROW-8348 URL: https://issues.apache.org/jira/browse/ARROW-8348 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques This is an optional feature where a sentinel value is stored in null cells and is exposed via an accessor method, e.g. `optional Array::HasSentinel() const;`. This would allow zero-copy bi-directional conversion with R. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8318) [C++][Dataset] Dataset should instantiate Fragment
Francois Saint-Jacques created ARROW-8318: - Summary: [C++][Dataset] Dataset should instantiate Fragment Key: ARROW-8318 URL: https://issues.apache.org/jira/browse/ARROW-8318 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Francois Saint-Jacques Fragments are created on the fly when invoking a Scan. This means that a lot of the auxiliary/ancillary data must be stored by the specialised Dataset, e.g. the FileSystemDataset must hold the path and partition expression. With the advent of more complex Fragments, e.g. ParquetFileFragment, more data must be stored. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8065) [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
Francois Saint-Jacques created ARROW-8065: - Summary: [C++][Dataset] Untangle Dataset, Fragment and ScanOptions Key: ARROW-8065 URL: https://issues.apache.org/jira/browse/ARROW-8065 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques We should be able to list fragments without going through the Scanner/ScanOptions hoops. This exposes a flaw in the current API, which requires a ScanOptions to create a Fragment. This is also a problem for ARROW-7824, i.e. why do we need a ScanOptions (read manifest) to write record batches to a given path?
# Remove {{ScanOptions}} from Fragment's properties and move it into {{Fragment::Scan}} parameters.
# Remove {{ScanOptions}} from {{Dataset::GetFragments}}. If required, we can still provide an alternate signature, e.g. {{Dataset::GetFragments(std::shared_ptr predicate)}}, for sub-tree pruning in FileSystemDataset.
# The Fragment constructor should take a schema (and store it as a property), usually extracted from the Dataset schema. Update the schema() method accordingly.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7964) [C++] Add short representation string to common classes
Francois Saint-Jacques created ARROW-7964: - Summary: [C++] Add short representation string to common classes Key: ARROW-7964 URL: https://issues.apache.org/jira/browse/ARROW-7964 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques This should apply primarily to DataType, Field, and Schema. It should not try to print things like metadata and nullability. This is not meant for serialization but for a quick glance. {code:java} i32 list dict struct<,>>> schema<>{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7917) [CMake] FindPythonInterp should check for python3
Francois Saint-Jacques created ARROW-7917: - Summary: [CMake] FindPythonInterp should check for python3 Key: ARROW-7917 URL: https://issues.apache.org/jira/browse/ARROW-7917 Project: Apache Arrow Issue Type: Improvement Affects Versions: 0.16.0 Reporter: Francois Saint-Jacques On ubuntu 18.04 it'll pick python2 by default. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7878) [C++] Implement LogicalPlan and LogicalPlanBuilder
Francois Saint-Jacques created ARROW-7878: - Summary: [C++] Implement LogicalPlan and LogicalPlanBuilder Key: ARROW-7878 URL: https://issues.apache.org/jira/browse/ARROW-7878 Project: Apache Arrow Issue Type: New Feature Components: C++ - Compute Affects Versions: 1.0.0 Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7861) [C++][Parquet] Add fuzz regression corpus for parquet reader
Francois Saint-Jacques created ARROW-7861: - Summary: [C++][Parquet] Add fuzz regression corpus for parquet reader Key: ARROW-7861 URL: https://issues.apache.org/jira/browse/ARROW-7861 Project: Apache Arrow Issue Type: Bug Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7821) [Gandiva] Add support for literal variables
Francois Saint-Jacques created ARROW-7821: - Summary: [Gandiva] Add support for literal variables Key: ARROW-7821 URL: https://issues.apache.org/jira/browse/ARROW-7821 Project: Apache Arrow Issue Type: Sub-task Components: C++ - Gandiva Reporter: Francois Saint-Jacques Fix For: 1.0.0 Gandiva supports static literal constants, but doesn't support runtime literal constants (or simply, variables). This means that queries like `x > 1` and `x > 2` are compiled into separate operators. The goal would be to provide something like a prepared statement for very simple expressions, e.g. `x > ?`. This way we can pre-generate operators for the most basic comparison filters on every type. I'm thinking that the variables should be stashed in the context pointer as opposed to a new function parameter. This would minimise the implementation impact. -- This message was sent by Atlassian Jira (v8.3.4#803005)
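The prepared-statement idea in ARROW-7821 can be illustrated without Gandiva or LLVM. In the sketch below a single operator for `x > ?` is written once and reads its literal from a context object at run time. `FilterContext` and `GreaterThanVariable` are hypothetical names, and plain C++ stands in for generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical execution context carrying the runtime literal (the `?`).
struct FilterContext {
  std::int64_t variable;
};

// One operator for `x > ?`: the literal is read from the context rather than
// baked into the code, so `x > 1` and `x > 2` share the same operator.
// Writes the indices of the matching rows into *selected.
void GreaterThanVariable(const FilterContext& context,
                         const std::vector<std::int64_t>& x,
                         std::vector<std::size_t>* selected) {
  assert(selected != nullptr);
  selected->clear();
  for (std::size_t i = 0; i < x.size(); ++i) {
    if (x[i] > context.variable) selected->push_back(i);
  }
}
```

Passing the value through the context keeps the operator's signature unchanged, which is the low-impact property the issue is after.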
[jira] [Created] (ARROW-7820) [C++][Gandiva] Add CMake support for compiling LLVM's IR into a library
Francois Saint-Jacques created ARROW-7820: - Summary: [C++][Gandiva] Add CMake support for compiling LLVM's IR into a library Key: ARROW-7820 URL: https://issues.apache.org/jira/browse/ARROW-7820 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Francois Saint-Jacques Fix For: 1.0.0 We should be able to inject LLVM IR into libraries, assuming that `llc` is found on the platform. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7819) [C++][Gandiva] Implement gandiva-dump-ir tool to output llvm IR to a file
Francois Saint-Jacques created ARROW-7819: - Summary: [C++][Gandiva] Implement gandiva-dump-ir tool to output llvm IR to a file Key: ARROW-7819 URL: https://issues.apache.org/jira/browse/ARROW-7819 Project: Apache Arrow Issue Type: Sub-task Components: C++ - Gandiva Reporter: Francois Saint-Jacques Fix For: 1.0.0 The tool should take a protobuf expression from stdin and dump the IR to stdout. This might require some thought as the schema is not always known. It could mean a refactor to support plain arrays, especially for the Filter kernel. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7818) [C++][Gandiva] Generate Filter kernels from gandiva code at compile time
Francois Saint-Jacques created ARROW-7818: - Summary: [C++][Gandiva] Generate Filter kernels from gandiva code at compile time Key: ARROW-7818 URL: https://issues.apache.org/jira/browse/ARROW-7818 Project: Apache Arrow Issue Type: New Feature Components: C++, C++ - Gandiva Reporter: Francois Saint-Jacques Assignee: Francois Saint-Jacques Fix For: 1.0.0 The goal of this feature is to support generating kernels at compile time (and possibly at runtime if gandiva is linked) to avoid rewriting C++ kernels that gandiva knows how to compile. The generated kernels would be linked into the compute module. This is an experimental task that will guide future development, notably implementing aggregate kernels once in gandiva instead of maintaining both C++ and gandiva implementations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7798) [R] Refactor vector to Array conversion
Francois Saint-Jacques created ARROW-7798: - Summary: [R] Refactor vector to Array conversion Key: ARROW-7798 URL: https://issues.apache.org/jira/browse/ARROW-7798 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Francois Saint-Jacques There's a bit of technical debt accumulated in this file: * Mix of conversion *and* casting, ideally we'd move casting out of there (at the cost of more memory copy). The rationale is that the conversion logic will differ from the CastKernels, e.g. when to raise errors, benefits from complex conversions like timezone... The current implementation is fast, e.g. it fuses the conversion and casting in a single loop at the cost of code clarity and divergence. * There should be 2 paths, zero-copy, non zero-copy. The non-zero copy should use the newly introduced VectorToArrayConverter which will work with complex nested types. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7767) [C++] Add a facility to create a Bitmap buffer from an data pointer with a specified sentinel
Francois Saint-Jacques created ARROW-7767: - Summary: [C++] Add a facility to create a Bitmap buffer from an data pointer with a specified sentinel Key: ARROW-7767 URL: https://issues.apache.org/jira/browse/ARROW-7767 Project: Apache Arrow Issue Type: Improvement Components: C++, R Reporter: Francois Saint-Jacques This is a special case for R and other cases where the null value is represented by a sentinel. This would read the data pointer and return a null bitmap buffer where bits are set for every row where the value is not the sentinel value. If no sentinel is encountered, return nullptr.
{code:c++}
template <typename CType>
Result<std::shared_ptr<Buffer>> NullBitmapFromSentinelData(
    MemoryPool* pool, const CType* data, size_t n_values, CType sentinel_value);
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
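The behaviour described in ARROW-7767 can be sketched with the standard library alone. This is not the Arrow implementation: a `std::vector<std::uint8_t>` stands in for the `Buffer`, there is no `MemoryPool`, and the function returns a null pointer when no sentinel is found. Bits are packed least-significant-bit first, as in Arrow validity bitmaps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <memory>
#include <vector>

// Returns nullptr if no value equals the sentinel (i.e. there are no nulls).
// Otherwise returns a bitmap in which bit i is set iff data[i] != sentinel.
// Equality comparison means a NaN sentinel would need special handling.
template <typename CType>
std::unique_ptr<std::vector<std::uint8_t>> NullBitmapFromSentinelData(
    const CType* data, std::size_t n_values, CType sentinel_value) {
  assert(data != nullptr || n_values == 0);
  std::unique_ptr<std::vector<std::uint8_t>> bitmap(
      new std::vector<std::uint8_t>((n_values + 7) / 8, 0));
  bool has_null = false;
  for (std::size_t i = 0; i < n_values; ++i) {
    if (data[i] == sentinel_value) {
      has_null = true;  // leave the bit cleared: this row is null
    } else {
      (*bitmap)[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    }
  }
  if (!has_null) return nullptr;
  return bitmap;
}
```

For R integer vectors the sentinel would be `NA_integer_`, which R represents as the minimum `int` value, i.e. `std::numeric_limits<int>::min()`.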
[jira] [Created] (ARROW-7765) [C++] Add Result to the Visitor pattern
Francois Saint-Jacques created ARROW-7765: - Summary: [C++] Add Result to the Visitor pattern Key: ARROW-7765 URL: https://issues.apache.org/jira/browse/ARROW-7765 Project: Apache Arrow Issue Type: Sub-task Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7764) [C++] Builders allocate a null bitmap buffer even if there is no nulls
Francois Saint-Jacques created ARROW-7764: - Summary: [C++] Builders allocate a null bitmap buffer even if there is no nulls Key: ARROW-7764 URL: https://issues.apache.org/jira/browse/ARROW-7764 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques This is an optimization where we can coalesce to nullptr if there's no null in the array. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7761) [C++] Add S3 support to fs::FileSystemFromUri
Francois Saint-Jacques created ARROW-7761: - Summary: [C++] Add S3 support to fs::FileSystemFromUri Key: ARROW-7761 URL: https://issues.apache.org/jira/browse/ARROW-7761 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques FileSystemFromUri doesn't support S3. This would give almost immediate support for S3 in python/R. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7759) [C++][Dataset] Add CsvFileFormat for CSV support
Francois Saint-Jacques created ARROW-7759: - Summary: [C++][Dataset] Add CsvFileFormat for CSV support Key: ARROW-7759 URL: https://issues.apache.org/jira/browse/ARROW-7759 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Francois Saint-Jacques This should be a minimal implementation that binds 1-1 file and ScanTask for now. Streaming optimizations can be done in ARROW-3410. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7673) [C++][Dataset] Revisit File discovery failure mode
Francois Saint-Jacques created ARROW-7673: - Summary: [C++][Dataset] Revisit File discovery failure mode Key: ARROW-7673 URL: https://issues.apache.org/jira/browse/ARROW-7673 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Francois Saint-Jacques Currently, the default `FileSystemFactoryOptions::exclude_invalid_files` will silently ignore unsupported files (either IO error, not of the valid format, corruption, missing compression codecs, etc...) when creating a `FileSystemSource`. We should change this behavior to propagate an error in the Inspect/Finish calls by default and allow the user to toggle `exclude_invalid_files`. The error should contain at least the file path and a decipherable error (if possible). -- This message was sent by Atlassian Jira (v8.3.4#803005)
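The failure mode proposed in ARROW-7673 can be sketched as follows. The types are hypothetical stand-ins rather than Arrow classes: `CandidateFile` records a path and the problem found while inspecting it, and `Discover` either propagates an error naming the file or, when `exclude_invalid_files` is set, skips it silently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical inspection outcome for one file; an empty problem means valid.
struct CandidateFile {
  std::string path;
  std::string problem;
};

// Hypothetical stand-in for a Result: accepted paths, or a non-empty error.
struct DiscoveryResult {
  std::vector<std::string> files;
  std::string error;
};

DiscoveryResult Discover(const std::vector<CandidateFile>& candidates,
                         bool exclude_invalid_files) {
  DiscoveryResult result;
  for (const CandidateFile& candidate : candidates) {
    assert(!candidate.path.empty());
    if (candidate.problem.empty()) {
      result.files.push_back(candidate.path);
      continue;
    }
    if (exclude_invalid_files) continue;  // opt-in: silently ignore the file
    // Default: fail, reporting at least the path and a readable reason.
    result.files.clear();
    result.error = "Invalid file '" + candidate.path + "': " + candidate.problem;
    return result;
  }
  return result;
}
```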
[jira] [Created] (ARROW-7653) [C++][Dataset] Handle DictType index mismatch better
Francois Saint-Jacques created ARROW-7653: - Summary: [C++][Dataset] Handle DictType index mismatch better Key: ARROW-7653 URL: https://issues.apache.org/jira/browse/ARROW-7653 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Francois Saint-Jacques There will be a schema incompatibility raised if the index width doesn't match for fragments/sources. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7602) [Archery] Add more build options
Francois Saint-Jacques created ARROW-7602: - Summary: [Archery] Add more build options Key: ARROW-7602 URL: https://issues.apache.org/jira/browse/ARROW-7602 Project: Apache Arrow Issue Type: Improvement Components: Archery Reporter: Francois Saint-Jacques Assignee: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7523) [Tools] Ignore modernize-use-trailing-return-type clang-tidy check
Francois Saint-Jacques created ARROW-7523: - Summary: [Tools] Ignore modernize-use-trailing-return-type clang-tidy check Key: ARROW-7523 URL: https://issues.apache.org/jira/browse/ARROW-7523 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Francois Saint-Jacques Fix For: 0.16.0 This is a very invasive check added in recent clang-tidy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7498) [C++][Dataset] Rename DataFragment/DataSource/PartitionScheme
Francois Saint-Jacques created ARROW-7498: - Summary: [C++][Dataset] Rename DataFragment/DataSource/PartitionScheme Key: ARROW-7498 URL: https://issues.apache.org/jira/browse/ARROW-7498 Project: Apache Arrow Issue Type: Wish Components: C++ - Dataset Reporter: Francois Saint-Jacques DataFragment -> Fragment DataSource -> Source PartitionSchema -> PartitionSchema *Discovery -> *Manifest -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7441) [C++] Remove compute pointer aliases
Francois Saint-Jacques created ARROW-7441: - Summary: [C++] Remove compute pointer aliases Key: ARROW-7441 URL: https://issues.apache.org/jira/browse/ARROW-7441 Project: Apache Arrow Issue Type: Sub-task Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7439) [C++][Dataset] Remove dataset pointer aliases
Francois Saint-Jacques created ARROW-7439: - Summary: [C++][Dataset] Remove dataset pointer aliases Key: ARROW-7439 URL: https://issues.apache.org/jira/browse/ARROW-7439 Project: Apache Arrow Issue Type: Sub-task Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7440) [C++][Gandiva] Remove gandiva pointer aliases
Francois Saint-Jacques created ARROW-7440: - Summary: [C++][Gandiva] Remove gandiva pointer aliases Key: ARROW-7440 URL: https://issues.apache.org/jira/browse/ARROW-7440 Project: Apache Arrow Issue Type: Sub-task Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7438) [C++] Remove pointer aliases
Francois Saint-Jacques created ARROW-7438: - Summary: [C++] Remove pointer aliases Key: ARROW-7438 URL: https://issues.apache.org/jira/browse/ARROW-7438 Project: Apache Arrow Issue Type: Wish Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7436) [Archery] Fix benchmark default configuration
Francois Saint-Jacques created ARROW-7436: - Summary: [Archery] Fix benchmark default configuration Key: ARROW-7436 URL: https://issues.apache.org/jira/browse/ARROW-7436 Project: Apache Arrow Issue Type: Bug Reporter: Francois Saint-Jacques The compute module has not been built since the switch to the slim default CMake configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7390) [C++][Dataset] Concurrency race in Projector::Project
Francois Saint-Jacques created ARROW-7390: - Summary: [C++][Dataset] Concurrency race in Projector::Project Key: ARROW-7390 URL: https://issues.apache.org/jira/browse/ARROW-7390 Project: Apache Arrow Issue Type: Bug Reporter: Francois Saint-Jacques When a DataFragment is invoked by 2 scan tasks of the same DataFragment, there's a race to invoke SetInputSchema. Note that ResizeMissingColumns also suffers from this race. The ideal goal is to make Project a const method. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7380) [C++][Dataset] Implement DatasetDiscovery
Francois Saint-Jacques created ARROW-7380: - Summary: [C++][Dataset] Implement DatasetDiscovery Key: ARROW-7380 URL: https://issues.apache.org/jira/browse/ARROW-7380 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Francois Saint-Jacques Assignee: Francois Saint-Jacques Takes a list of DataSourceDiscovery and yields a Dataset. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7379) [C++] Introduce Field::CompatiblesWith and Schema::CompatiblesWith
Francois Saint-Jacques created ARROW-7379: - Summary: [C++] Introduce Field::CompatiblesWith and Schema::CompatiblesWith Key: ARROW-7379 URL: https://issues.apache.org/jira/browse/ARROW-7379 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques The methods verify whether fields/schemas are compatible with regard to naming and type. This is partly extracted from `UnifySchemas`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
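A minimal sketch of what such a check could look like. It assumes that "compatible" means every field name shared by two schemas maps to the same type; the ticket does not spell out the rule. `Field`, `Schema`, `FieldCompatibleWith` and `SchemaCompatibleWith` are hypothetical stand-ins, not the Arrow classes or their methods.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for arrow::Field and arrow::Schema.
struct Field {
  std::string name;
  std::string type;  // the type is spelled as a string to keep the sketch small
};

using Schema = std::vector<Field>;

// Assumed rule: two fields are compatible when both name and type match.
inline bool FieldCompatibleWith(const Field& a, const Field& b) {
  return a.name == b.name && a.type == b.type;
}

// Assumed rule: two schemas are compatible when every field name they share
// refers to the same type. Fields present on one side only are ignored.
// If `a` repeats a name, the last occurrence wins in this sketch.
inline bool SchemaCompatibleWith(const Schema& a, const Schema& b) {
  std::unordered_map<std::string, Field> by_name;
  for (const Field& field : a) by_name[field.name] = field;
  for (const Field& field : b) {
    auto it = by_name.find(field.name);
    if (it != by_name.end() && !FieldCompatibleWith(it->second, field)) {
      return false;
    }
  }
  return true;
}
```

Under this rule, schemas that merely add columns stay compatible, while a shared column whose type changed does not.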
[jira] [Created] (ARROW-7377) [C++][Dataset] Simplify parquet column projection
Francois Saint-Jacques created ARROW-7377: - Summary: [C++][Dataset] Simplify parquet column projection Key: ARROW-7377 URL: https://issues.apache.org/jira/browse/ARROW-7377 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Francois Saint-Jacques Assignee: Francois Saint-Jacques This is a minor makeup. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7360) [R] Can't use dataset's filter with non-literal expression
Francois Saint-Jacques created ARROW-7360: - Summary: [R] Can't use dataset's filter with non-literal expression Key: ARROW-7360 URL: https://issues.apache.org/jira/browse/ARROW-7360 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Francois Saint-Jacques The following will generate an error
{code:r}
test_that("filtering with expression", {
  char_sym <- "b"
  expect_dplyr_equal(
    input %>%
      filter(chr == char_sym) %>%
      select(string = chr, int) %>%
      collect(),
    tbl
  )
})
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7339) [CMake] Thrift version not respected in CMake configuration version.txt
Francois Saint-Jacques created ARROW-7339: - Summary: [CMake] Thrift version not respected in CMake configuration version.txt Key: ARROW-7339 URL: https://issues.apache.org/jira/browse/ARROW-7339 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques If thrift is requested via BUNDLED, thrift 0.9.1 will be downloaded instead of the requested version. This is due to FindThrift.cmake overriding THRIFT_VERSION with the version of the locally installed thrift compiler (0.9.1 on Ubuntu 18.04). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7338) [C++] Rename SimpleDataSource to InMemoryDataSource
Francois Saint-Jacques created ARROW-7338: - Summary: [C++] Rename SimpleDataSource to InMemoryDataSource Key: ARROW-7338 URL: https://issues.apache.org/jira/browse/ARROW-7338 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Francois Saint-Jacques The constructor should take a generator
{code:c++}
// Some comments here
class InMemoryDataSource : public DataSource {
 public:
  using Generator = std::function>;
  InMemoryDataSource(Generator&& generator);
  // Convenience constructor to support a fixed list of RecordBatch
  InMemoryDataSource(std::shared_ptr);
  InMemoryDataSource(std::vector>);
 private:
  Generator generator;
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
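A self-contained sketch of the generator-based constructor idea. The template arguments in the snippet from the ticket were lost in the archive, so the `Generator` signature below is an assumption: a callable that returns a fresh vector of batches on each call. `RecordBatch` is a hypothetical stand-in, `GetBatches` is an illustrative accessor, and the `DataSource` base class is left out.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for arrow::RecordBatch.
struct RecordBatch {
  int num_rows;
};

using BatchVector = std::vector<std::shared_ptr<RecordBatch>>;

class InMemoryDataSource {
 public:
  // Assumed signature: every call yields a fresh sequence of batches.
  using Generator = std::function<BatchVector()>;

  explicit InMemoryDataSource(Generator generator)
      : generator_(std::move(generator)) {}

  // Convenience constructor: wraps a fixed list of batches in a generator.
  explicit InMemoryDataSource(BatchVector batches)
      : generator_([batches = std::move(batches)] { return batches; }) {}

  // Convenience constructor: wraps a single batch.
  explicit InMemoryDataSource(std::shared_ptr<RecordBatch> batch)
      : InMemoryDataSource(BatchVector{std::move(batch)}) {}

  // Illustrative accessor: invokes the generator on demand.
  BatchVector GetBatches() const { return generator_(); }

 private:
  Generator generator_;
};
```

Both convenience constructors reduce to the generator case, so the class keeps a single code path. The generator only runs when batches are requested, and a fixed list can be replayed any number of times.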
[jira] [Created] (ARROW-7272) [C++][Java] JNI bridge between RecordBatch and VectorSchemaRoot
Francois Saint-Jacques created ARROW-7272: - Summary: [C++][Java] JNI bridge between RecordBatch and VectorSchemaRoot Key: ARROW-7272 URL: https://issues.apache.org/jira/browse/ARROW-7272 Project: Apache Arrow Issue Type: Improvement Components: C++, Java Reporter: Francois Saint-Jacques Given a C++ std::shared_ptr<RecordBatch>, retrieve it in Java as a VectorSchemaRoot class. Gandiva already offers a similar facility, but with raw buffers. It would be convenient if users could call C++ code that yields a RecordBatch and retrieve it in a seamless fashion. This would remove one roadblock to using the C++ dataset facility in Java. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7265) [Format][C++] Clarify the usage of typeIds in Union type documentation
Francois Saint-Jacques created ARROW-7265: - Summary: [Format][C++] Clarify the usage of typeIds in Union type documentation Key: ARROW-7265 URL: https://issues.apache.org/jira/browse/ARROW-7265 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques The documentation is unclear. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7210) [C++] Scalar cast should support time-based types
Francois Saint-Jacques created ARROW-7210: - Summary: [C++] Scalar cast should support time-based types Key: ARROW-7210 URL: https://issues.apache.org/jira/browse/ARROW-7210 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques This would allow supporting a minimum of expression evaluation on time-based arrays. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7178) [C++] Vendor forward compatible std::optional
Francois Saint-Jacques created ARROW-7178: - Summary: [C++] Vendor forward compatible std::optional Key: ARROW-7178 URL: https://issues.apache.org/jira/browse/ARROW-7178 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques Having std::optional was mentioned a few times; [~emkornfi...@gmail.com] suggested https://github.com/martinmoene/optional-lite -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7148) [C++][Dataset] API cleanup
Francois Saint-Jacques created ARROW-7148: - Summary: [C++][Dataset] API cleanup Key: ARROW-7148 URL: https://issues.apache.org/jira/browse/ARROW-7148 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Francois Saint-Jacques Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7147) [C++][Dataset] Refactor dataset's API to use Result
Francois Saint-Jacques created ARROW-7147: - Summary: [C++][Dataset] Refactor dataset's API to use Result Key: ARROW-7147 URL: https://issues.apache.org/jira/browse/ARROW-7147 Project: Apache Arrow Issue Type: Bug Reporter: Francois Saint-Jacques We should make this switch before the API settles -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7079) [C++][Dataset] Implement ScalarAsStatisctics for non-primitive types
Francois Saint-Jacques created ARROW-7079: - Summary: [C++][Dataset] Implement ScalarAsStatisctics for non-primitive types Key: ARROW-7079 URL: https://issues.apache.org/jira/browse/ARROW-7079 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset Reporter: Francois Saint-Jacques Statistics are not extracted for the following (parquet) types
- BYTE_ARRAY
- FLBA
- Any logical timestamps/dates
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types
Francois Saint-Jacques created ARROW-7017: - Summary: [C++] Refactor AddKernel to support other operations and types Key: ARROW-7017 URL: https://issues.apache.org/jira/browse/ARROW-7017 Project: Apache Arrow Issue Type: Improvement Components: C++ - Compute Reporter: Francois Saint-Jacques
* Should avoid using builders (and/or NULLs) since the output shape is known at compute time.
* Should be refactored to support other operations, e.g. Subtraction, Multiplication.
* Should have an overflow/underflow detection mode.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
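The three points above can be sketched together as follows. This is an illustration under stated assumptions, not the Arrow kernel API: plain `std::vector<int32_t>` stands in for arrays, `Add`/`Subtract`/`Multiply` and `ArithmeticKernel` are hypothetical names, nulls are ignored, and `std::optional` requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Each operation computes in 64 bits, which cannot overflow for int32 inputs.
struct Add {
  static int64_t Call(int64_t a, int64_t b) { return a + b; }
};
struct Subtract {
  static int64_t Call(int64_t a, int64_t b) { return a - b; }
};
struct Multiply {
  static int64_t Call(int64_t a, int64_t b) { return a * b; }
};

// Applies Op element-wise. The output is allocated once because its length is
// known up front, so no builder is needed. Returns std::nullopt when the input
// lengths differ, or when check_overflow is set and a result does not fit in
// int32.
template <typename Op>
std::optional<std::vector<int32_t>> ArithmeticKernel(
    const std::vector<int32_t>& left, const std::vector<int32_t>& right,
    bool check_overflow) {
  if (left.size() != right.size()) return std::nullopt;
  constexpr int64_t kMin = std::numeric_limits<int32_t>::min();
  constexpr int64_t kMax = std::numeric_limits<int32_t>::max();
  std::vector<int32_t> out(left.size());
  for (std::size_t i = 0; i < left.size(); ++i) {
    const int64_t wide = Op::Call(left[i], right[i]);
    if (check_overflow && (wide < kMin || wide > kMax)) return std::nullopt;
    // In unchecked mode the value is truncated to 32 bits, which wraps around
    // on the usual two's-complement targets.
    out[i] = static_cast<int32_t>(static_cast<uint32_t>(wide));
  }
  return out;
}
```

Adding a new operation only requires another small struct with a `Call` function; the loop, the allocation and the detection mode are shared.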
[jira] [Created] (ARROW-7007) [C++] Enable mmap option for LocalFs
Francois Saint-Jacques created ARROW-7007: - Summary: [C++] Enable mmap option for LocalFs Key: ARROW-7007 URL: https://issues.apache.org/jira/browse/ARROW-7007 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6988) [CI][R] Buildbot's R Conda is failing
Francois Saint-Jacques created ARROW-6988: - Summary: [CI][R] Buildbot's R Conda is failing Key: ARROW-6988 URL: https://issues.apache.org/jira/browse/ARROW-6988 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques {code:java} Running ‘testthat.R’ ERROR Running the tests in ‘tests/testthat.R’ failed. Last 13 lines of output: 25: tryCatch(withCallingHandlers({eval(code, test_env)if (!handled && !is.null(test)) {skip_empty()}}, expectation = handle_expectation, skip = handle_skip, warning = handle_warning, message = handle_message, error = handle_error), error = handle_fatal, skip = function(e) {}) 26: test_code(NULL, exprs, env) 27: source_file(path, new.env(parent = env), chdir = TRUE, wrap = wrap) 28: force(code) 29: with_reporter(reporter = reporter, start_end_reporter = start_end_reporter, {reporter$start_file(basename(path)) lister$start_file(basename(path))source_file(path, new.env(parent = env), chdir = TRUE, wrap = wrap)reporter$.end_context() reporter$end_file()}) 30: FUN(X[[i]], ...) 31: lapply(paths, test_file, env = env, reporter = current_reporter, start_end_reporter = FALSE, load_helpers = FALSE, wrap = wrap) 32: force(code) 33: with_reporter(reporter = current_reporter, results <- lapply(paths, test_file, env = env, reporter = current_reporter, start_end_reporter = FALSE, load_helpers = FALSE, wrap = wrap)) 34: test_files(paths, reporter = reporter, env = env, stop_on_failure = stop_on_failure, stop_on_warning = stop_on_warning, wrap = wrap) 35: test_dir(path = test_path, reporter = reporter, env = env, filter = filter, ..., stop_on_failure = stop_on_failure, stop_on_warning = stop_on_warning, wrap = wrap) 36: test_package_dir(package = package, test_path = test_path, filter = filter, reporter = reporter, ..., stop_on_failure = stop_on_failure, stop_on_warning = stop_on_warning, wrap = wrap) 37: test_check("arrow") An irrecoverable exception occurred. R is aborting now ... 
Segmentation fault (core dumped) * checking for unstated dependencies in vignettes ... OK * checking package vignettes in ‘inst/doc’ ... OK * checking re-building of vignette outputs ... OK * DONE Status: 1 ERROR, 1 WARNING, 2 NOTEs See ‘/buildbot/AMD64_Conda_R/r/arrow.Rcheck/00check.log’ for details. {code} [|https://ci.ursalabs.org/#/builders/95] [https://ci.ursalabs.org/#/builders/95/builds/2386] [https://ci.ursalabs.org/#/builders/95] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6987) [CI] Travis OSX failing to install sdk headers
Francois Saint-Jacques created ARROW-6987: - Summary: [CI] Travis OSX failing to install sdk headers Key: ARROW-6987 URL: https://issues.apache.org/jira/browse/ARROW-6987 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Francois Saint-Jacques
{code:java}
sudo installer -pkg /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg -target /
installer: Package name is macOS_SDK_headers_for_macOS_10.14
installer: Certificate used to sign package is not trusted. Use -allowUntrusted to override.
The command "$TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh --only-library --homebrew" failed and exited with 1 during .
{code}
See [https://travis-ci.org/apache/arrow/jobs/602434884#L342-L345] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6969) [C++][Dataset] ParquetScanTask eagerly load file
Francois Saint-Jacques created ARROW-6969: - Summary: [C++][Dataset] ParquetScanTask eagerly load file Key: ARROW-6969 URL: https://issues.apache.org/jira/browse/ARROW-6969 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques The file content should only be read when invoking ParquetScanTask::Scan, not on construction. This blocks reading in a true streaming fashion with memory constraints. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6965) [C++][Dataset] Optionally expose partition keys as materialized columns
Francois Saint-Jacques created ARROW-6965: - Summary: [C++][Dataset] Optionally expose partition keys as materialized columns Key: ARROW-6965 URL: https://issues.apache.org/jira/browse/ARROW-6965 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6964) [C++][Dataset] Expose a nested parallel option for Scanner
Francois Saint-Jacques created ARROW-6964: - Summary: [C++][Dataset] Expose a nested parallel option for Scanner Key: ARROW-6964 URL: https://issues.apache.org/jira/browse/ARROW-6964 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6956) [C++] Status should use unique_ptr
Francois Saint-Jacques created ARROW-6956: - Summary: [C++] Status should use unique_ptr Key: ARROW-6956 URL: https://issues.apache.org/jira/browse/ARROW-6956 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques The logic of Status::State is _very_ similar to unique_ptr except the deep copy on copy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
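To illustrate the observation, here is a simplified sketch of a status type whose state lives in a `std::unique_ptr`, with the deep copy written out explicitly. This is not the actual `arrow::Status` implementation; the `State` fields and accessors are assumptions chosen for brevity, and `std::make_unique` requires C++14.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Simplified, hypothetical Status. A null state means OK, so the success path
// costs one pointer and no allocation.
class Status {
  struct State {
    int code;
    std::string msg;
  };

  // Deep copy of the state; this is the one behaviour unique_ptr lacks.
  static std::unique_ptr<State> Clone(const std::unique_ptr<State>& state) {
    if (!state) return nullptr;
    return std::make_unique<State>(*state);
  }

 public:
  Status() = default;  // OK status
  Status(int code, std::string msg)
      : state_(std::make_unique<State>(State{code, std::move(msg)})) {}

  Status(const Status& other) : state_(Clone(other.state_)) {}
  Status& operator=(const Status& other) {
    if (this != &other) state_ = Clone(other.state_);
    return *this;
  }

  // Moves come for free from unique_ptr and leave the source in the OK state.
  Status(Status&&) noexcept = default;
  Status& operator=(Status&&) noexcept = default;

  bool ok() const { return state_ == nullptr; }
  int code() const { return state_ ? state_->code : 0; }
  std::string message() const { return state_ ? state_->msg : std::string(); }

 private:
  std::unique_ptr<State> state_;
};
```

Destruction and moves need no hand-written code in this form; only the two copy operations remain custom.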
[jira] [Created] (ARROW-6953) [C++][Dataset] Implement Gandiva Filter/Projector in Scanner
Francois Saint-Jacques created ARROW-6953: - Summary: [C++][Dataset] Implement Gandiva Filter/Projector in Scanner Key: ARROW-6953 URL: https://issues.apache.org/jira/browse/ARROW-6953 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques Currently, we have `RecordBatchProjector` and `ExpressionEvaluator` to achieve this feature. This would implement a single class that fuses both and uses Gandiva. This would be exposed in the ScannerBuilder via an option. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6951) [C++][Dataset] Ensure column projection is passed to ParquetDataFragment
Francois Saint-Jacques created ARROW-6951: - Summary: [C++][Dataset] Ensure column projection is passed to ParquetDataFragment Key: ARROW-6951 URL: https://issues.apache.org/jira/browse/ARROW-6951 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6952) [C++][Dataset] Ensure expression filter is passed to ParquetDataFragment
Francois Saint-Jacques created ARROW-6952: - Summary: [C++][Dataset] Ensure expression filter is passed to ParquetDataFragment Key: ARROW-6952 URL: https://issues.apache.org/jira/browse/ARROW-6952 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques We should be able to prune RowGroups based on the expression and the statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
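A rough sketch of statistics-based pruning, assuming each row group exposes min/max statistics for a single int64 column and the filter is a single comparison against a literal. `RowGroupStats`, `Predicate`, `MayMatch` and `SelectRowGroups` are hypothetical names, not the Parquet or Dataset API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-row-group statistics for one int64 column.
struct RowGroupStats {
  int64_t min;
  int64_t max;
};

// Hypothetical predicate of the form `column <op> value`.
enum class CompareOp { kEqual, kLess, kGreater };

struct Predicate {
  CompareOp op;
  int64_t value;
};

// Returns true when some value in [min, max] could satisfy the predicate,
// i.e. when the row group cannot be skipped.
inline bool MayMatch(const RowGroupStats& stats, const Predicate& predicate) {
  switch (predicate.op) {
    case CompareOp::kEqual:
      return stats.min <= predicate.value && predicate.value <= stats.max;
    case CompareOp::kLess:
      return stats.min < predicate.value;
    case CompareOp::kGreater:
      return stats.max > predicate.value;
  }
  return true;  // unknown operator: keep the row group to stay correct
}

// Returns the indices of the row groups that still have to be read.
inline std::vector<std::size_t> SelectRowGroups(
    const std::vector<RowGroupStats>& groups, const Predicate& predicate) {
  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < groups.size(); ++i) {
    if (MayMatch(groups[i], predicate)) kept.push_back(i);
  }
  return kept;
}
```

Pruning is conservative: a kept row group may still contain no matching row, so the filter has to be applied again to the rows that are read.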
[jira] [Created] (ARROW-6950) [C++][Dataset] Add example/benchmark for reading parquet files with dataset
Francois Saint-Jacques created ARROW-6950: - Summary: [C++][Dataset] Add example/benchmark for reading parquet files with dataset Key: ARROW-6950 URL: https://issues.apache.org/jira/browse/ARROW-6950 Project: Apache Arrow Issue Type: Test Components: C++ Reporter: Francois Saint-Jacques Create an executable that loads a directory with a known partition scheme, applying a filter and a projection. This will be used as a baseline for future performance improvements, but also to show various features of the dataset API. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6902) [C++] Add String*/Binary* support for Compare kernels
Francois Saint-Jacques created ARROW-6902: - Summary: [C++] Add String*/Binary* support for Compare kernels Key: ARROW-6902 URL: https://issues.apache.org/jira/browse/ARROW-6902 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6854) [Dataset] RecordBatchProjector is not thread safe
Francois Saint-Jacques created ARROW-6854: - Summary: [Dataset] RecordBatchProjector is not thread safe Key: ARROW-6854 URL: https://issues.apache.org/jira/browse/ARROW-6854 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Francois Saint-Jacques While working on ARROW-6769 I noted that RecordBatchProjector is not thread safe. My goal is to use this class to wrap the ScanTaskIterator in another ScanTaskIterator that projects, so producers (fragments) don't have to know about this schema. The issue is that ScanTasks are expected to run on concurrent threads, so the projector will be invoked by multiple threads. The lack of concurrency safety is due to the adaptivity to input schemas: `SetInputSchema` stores its result in a local cache. I suggest we refactor into 2 classes.
# `RecordBatchProjector`, which will work with a static `from` schema, i.e. no adaptivity. The schema is defined at construction time. This class is thread safe to invoke after construction since no local modification is done.
# `AdaptiveRecordBatchProjector`, which will have a cache map[schema_hash, std::shared_ptr] protected with a mutex.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
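The two-class split suggested in this ticket could look roughly like the sketch below. It rests on several assumptions: a schema is reduced to a list of column names, a batch to one `int` per column, a missing column is filled with `0` instead of a null, and the cache is keyed by the joined column names rather than a schema hash. None of this is the Arrow implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Schema = std::vector<std::string>;  // hypothetical: column names only
using Batch = std::vector<int>;           // hypothetical: one value per column

// Static projector: the `from` schema is fixed at construction, so Project()
// is const and can be called from several threads without locking.
class RecordBatchProjector {
 public:
  RecordBatchProjector(const Schema& from, const Schema& to) {
    for (const std::string& name : to) {
      int index = -1;  // -1 marks a column missing from `from`
      for (std::size_t i = 0; i < from.size(); ++i) {
        if (from[i] == name) index = static_cast<int>(i);
      }
      indices_.push_back(index);
    }
  }

  // `in` is expected to hold one value per column of the `from` schema.
  Batch Project(const Batch& in) const {
    Batch out;
    out.reserve(indices_.size());
    for (int index : indices_) {
      out.push_back(index < 0 ? 0 : in[static_cast<std::size_t>(index)]);
    }
    return out;
  }

 private:
  std::vector<int> indices_;
};

// Adaptive projector: keeps one static projector per input schema in a cache
// guarded by a mutex. Only the cache lookup is serialized.
class AdaptiveRecordBatchProjector {
 public:
  explicit AdaptiveRecordBatchProjector(Schema to) : to_(std::move(to)) {}

  Batch Project(const Schema& from, const Batch& in) {
    return GetOrCreate(from)->Project(in);
  }

  std::size_t cache_size() {
    std::lock_guard<std::mutex> lock(mutex_);
    return cache_.size();
  }

 private:
  std::shared_ptr<const RecordBatchProjector> GetOrCreate(const Schema& from) {
    std::string key;  // joined names stand in for the schema hash
    for (const std::string& name : from) key += name + '\n';
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      it = cache_.emplace(key, std::make_shared<RecordBatchProjector>(from, to_))
               .first;
    }
    return it->second;
  }

  Schema to_;
  std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<const RecordBatchProjector>>
      cache_;
};
```

The lock is released before the projection itself runs, so concurrent scan tasks only contend on the lookup. Returning a `shared_ptr` keeps a projector alive even if the cache were later cleared.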
[jira] [Created] (ARROW-6835) [Archery][CMake] Restore ARROW_LINT_ONLY
Francois Saint-Jacques created ARROW-6835: - Summary: [Archery][CMake] Restore ARROW_LINT_ONLY Key: ARROW-6835 URL: https://issues.apache.org/jira/browse/ARROW-6835 Project: Apache Arrow Issue Type: Bug Components: Archery Reporter: Francois Saint-Jacques This is used by developers to speed up the cmake build creation and to loosen the required installed toolchains (notably libraries). This was yanked because ARROW_LINT_ONLY effectively exits early and doesn't generate `compile_commands.json`. Restore this option, but ensure that archery toggles it according to the usage of iwyu or clang-tidy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6827) [Archery] lint sub-command should provide a --fail-fast option
Francois Saint-Jacques created ARROW-6827: - Summary: [Archery] lint sub-command should provide a --fail-fast option Key: ARROW-6827 URL: https://issues.apache.org/jira/browse/ARROW-6827 Project: Apache Arrow Issue Type: New Feature Components: Archery Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6828) [Archery] Benchmark diff should provide a TUI friendly output
Francois Saint-Jacques created ARROW-6828: - Summary: [Archery] Benchmark diff should provide a TUI friendly output Key: ARROW-6828 URL: https://issues.apache.org/jira/browse/ARROW-6828 Project: Apache Arrow Issue Type: New Feature Components: Archery Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6826) [Archery] Default build should be minimal
Francois Saint-Jacques created ARROW-6826: - Summary: [Archery] Default build should be minimal Key: ARROW-6826 URL: https://issues.apache.org/jira/browse/ARROW-6826 Project: Apache Arrow Issue Type: New Feature Components: Archery Reporter: Francois Saint-Jacques Follow-up of https://github.com/apache/arrow/pull/5600/files#r332655141 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6816) [Archery] Cleanup integration module to use companion classes
Francois Saint-Jacques created ARROW-6816: - Summary: [Archery] Cleanup integration module to use companion classes Key: ARROW-6816 URL: https://issues.apache.org/jira/browse/ARROW-6816 Project: Apache Arrow Issue Type: New Feature Components: Archery Reporter: Francois Saint-Jacques This is a followup ticket to ARROW-6466. * Replace print calls with utils.logger * Use ArrowSources instead of ARROW_HOME * Use utils.Command and utils.CMakeBuild where possible -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6769) [C++][Dataset] End to End dataset integration test case
Francois Saint-Jacques created ARROW-6769: - Summary: [C++][Dataset] End to End dataset integration test case Key: ARROW-6769 URL: https://issues.apache.org/jira/browse/ARROW-6769 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Francois Saint-Jacques
1. Create a DataSource from a known directory and a PartitionScheme.
2. Create a Dataset from the previous DataSource.
3. Request a ScannerBuilder from previous Dataset.
4. Add filter expression to ScannerBuilder (and other options).
5. Finalize into a Scan operation.
6. Materialize into an arrow::Table.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6768) [C++][Dataset] Implement dataset::Scan to Table helper function
Francois Saint-Jacques created ARROW-6768: - Summary: [C++][Dataset] Implement dataset::Scan to Table helper function Key: ARROW-6768 URL: https://issues.apache.org/jira/browse/ARROW-6768 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Francois Saint-Jacques The Scan interface exposes classes (ScanTask/Iterator) which are not of interest to all callers. This would implement `Status Scan::Materialize(std::shared_ptr* out)` so consumers can call this function instead of consuming and dispatching the streaming interface. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6730) [CI] Use Github Actions for "C++ with clang 7" docker image
Francois Saint-Jacques created ARROW-6730: - Summary: [CI] Use Github Actions for "C++ with clang 7" docker image Key: ARROW-6730 URL: https://issues.apache.org/jira/browse/ARROW-6730 Project: Apache Arrow Issue Type: New Feature Components: Continuous Integration Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6615) [C++] Add filtering option to fs::Selector
Francois Saint-Jacques created ARROW-6615: - Summary: [C++] Add filtering option to fs::Selector Key: ARROW-6615 URL: https://issues.apache.org/jira/browse/ARROW-6615 Project: Apache Arrow Issue Type: New Feature Reporter: Francois Saint-Jacques It would be convenient if Selector could support file path filtering, either via a regex or globbing applied to the path. This is more or less required for filtering files in Dataset so that the file format is applied properly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
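As an illustration of both variants, the sketch below filters a list of paths with a `std::regex` and reduces globbing to the same mechanism by translating the glob first. `GlobToRegex` and `FilterPaths` are hypothetical helpers, not part of `fs::Selector`. The glob support is deliberately limited to `*` and `?`, and `*` also matches across `/`, which a real implementation may not want.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Translates a glob pattern into an ECMAScript regex. Only `*` (any run of
// characters, including '/') and `?` (any single character) are supported;
// regex metacharacters are escaped so they match literally.
inline std::string GlobToRegex(const std::string& glob) {
  std::string out;
  for (char c : glob) {
    switch (c) {
      case '*':
        out += ".*";
        break;
      case '?':
        out += '.';
        break;
      case '.': case '+': case '(': case ')': case '[': case ']':
      case '{': case '}': case '^': case '$': case '|': case '\\':
        out += '\\';
        out += c;
        break;
      default:
        out += c;
    }
  }
  return out;
}

// Keeps the paths that match the pattern in full, preserving input order.
inline std::vector<std::string> FilterPaths(
    const std::vector<std::string>& paths, const std::regex& pattern) {
  std::vector<std::string> kept;
  for (const std::string& path : paths) {
    if (std::regex_match(path, pattern)) kept.push_back(path);
  }
  return kept;
}
```

For example, `FilterPaths(paths, std::regex(GlobToRegex("*.parquet")))` keeps only Parquet files, so marker files such as `_SUCCESS` never reach the file format.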
[jira] [Created] (ARROW-6614) [C++][Dataset] Implement FileSystemDataSourceDiscovery
Francois Saint-Jacques created ARROW-6614: - Summary: [C++][Dataset] Implement FileSystemDataSourceDiscovery Key: ARROW-6614 URL: https://issues.apache.org/jira/browse/ARROW-6614 Project: Apache Arrow Issue Type: New Feature Reporter: Francois Saint-Jacques DataSourceDiscovery is what allows inferring the schema and constructing a DataSource with a PartitionScheme. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6606) [C++] Construct tree structure from std::vector
Francois Saint-Jacques created ARROW-6606: - Summary: [C++] Construct tree structure from std::vector Key: ARROW-6606 URL: https://issues.apache.org/jira/browse/ARROW-6606 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques This will be used by FileSystemDataSource for pushdown predicate pruning of branches. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6605) [C++] Add recursion depth control to fs::Selector
Francois Saint-Jacques created ARROW-6605: - Summary: [C++] Add recursion depth control to fs::Selector Key: ARROW-6605 URL: https://issues.apache.org/jira/browse/ARROW-6605 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques This is similar to the recursive option, but also controls the depth. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6493) [C++] Move MakeArrayFromScalar utilities to src/arrow/array/
Francois Saint-Jacques created ARROW-6493: - Summary: [C++] Move MakeArrayFromScalar utilities to src/arrow/array/ Key: ARROW-6493 URL: https://issues.apache.org/jira/browse/ARROW-6493 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques See https://github.com/apache/arrow/pull/5207#discussion_r321505582 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6476) [Java][CI] Travis java all-jdks job is broken
Francois Saint-Jacques created ARROW-6476: - Summary: [Java][CI] Travis java all-jdks job is broken Key: ARROW-6476 URL: https://issues.apache.org/jira/browse/ARROW-6476 Project: Apache Arrow Issue Type: Bug Reporter: Francois Saint-Jacques Introduced by ARROW-6433, fixing the shade check enabled evaluation of the incorrect body. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6448) [CI] Add crossbow notifications
Francois Saint-Jacques created ARROW-6448: - Summary: [CI] Add crossbow notifications Key: ARROW-6448 URL: https://issues.apache.org/jira/browse/ARROW-6448 Project: Apache Arrow Issue Type: New Feature Components: Continuous Integration Reporter: Francois Saint-Jacques Assignee: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6397) [C++][CI] Fix S3 minio failure
Francois Saint-Jacques created ARROW-6397: - Summary: [C++][CI] Fix S3 minio failure Key: ARROW-6397 URL: https://issues.apache.org/jira/browse/ARROW-6397 Project: Apache Arrow Issue Type: New Feature Components: C++, Continuous Integration Reporter: Francois Saint-Jacques See [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27065941/job/gwjmr2hudm7693ef] -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6396) [C++] Add CompareOptions to Compare kernels
Francois Saint-Jacques created ARROW-6396: - Summary: [C++] Add CompareOptions to Compare kernels Key: ARROW-6396 URL: https://issues.apache.org/jira/browse/ARROW-6396 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Francois Saint-Jacques This would add an enum ResolveNull \{ KLEENE_LOGIC, NULL_PROPAGATE }. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6378) [C++][Dataset] Implement TreeDataSource
Francois Saint-Jacques created ARROW-6378: - Summary: [C++][Dataset] Implement TreeDataSource Key: ARROW-6378 URL: https://issues.apache.org/jira/browse/ARROW-6378 Project: Apache Arrow Issue Type: New Feature Reporter: Francois Saint-Jacques The TreeDataSource is required to support partition pruning of sub-trees. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6341) [Python] Implements low-level bindings to Dataset classes
Francois Saint-Jacques created ARROW-6341: - Summary: [Python] Implements low-level bindings to Dataset classes Key: ARROW-6341 URL: https://issues.apache.org/jira/browse/ARROW-6341 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Francois Saint-Jacques The following classes should be accessible from Python:
* class DataSource
* class DataFragment
* function DiscoverySource
* class ScanContext, ScanOptions, ScanTask
* class Dataset
* class ScannerBuilder
* class Scanner
The end result is reading a directory of parquet files as a single stream. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6340) [R] Implements low-level bindings to Dataset classes
Francois Saint-Jacques created ARROW-6340: - Summary: [R] Implements low-level bindings to Dataset classes Key: ARROW-6340 URL: https://issues.apache.org/jira/browse/ARROW-6340 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Francois Saint-Jacques The following classes should be accessible from R:
* class DataSource
* class DataFragment
* function DiscoverySource
* class ScanContext, ScanOptions, ScanTask
* class Dataset
* class ScannerBuilder
* class Scanner
The end result is reading a directory of parquet files as a single stream. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6244) [C++] Implement Partition DataSource
Francois Saint-Jacques created ARROW-6244: - Summary: [C++] Implement Partition DataSource Key: ARROW-6244 URL: https://issues.apache.org/jira/browse/ARROW-6244 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Francois Saint-Jacques This is a DataSource that also has partition metadata. The end goal is to support filtering with a DataSelector/Filter expression. The initial implementation should not deal with PartitionScheme yet. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6243) [C++] Implement basic Filter expression classes
Francois Saint-Jacques created ARROW-6243: - Summary: [C++] Implement basic Filter expression classes Key: ARROW-6243 URL: https://issues.apache.org/jira/browse/ARROW-6243 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Francois Saint-Jacques Assignee: Benjamin Kietzman This will draft the basic classes for creating boolean expressions that are passed to the DataSources/DataFragments for predicate push-down. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6242) [C++] Implements basic Dataset/Scanner/ScannerBuilder
Francois Saint-Jacques created ARROW-6242: - Summary: [C++] Implements basic Dataset/Scanner/ScannerBuilder Key: ARROW-6242 URL: https://issues.apache.org/jira/browse/ARROW-6242 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Francois Saint-Jacques The goal of this would be to iterate over a Dataset and generate a "flattened" stream of RecordBatches from the union of data sources and data fragments. This should not bother with filtering yet. -- This message was sent by Atlassian JIRA (v7.6.14#76016)