[jira] [Assigned] (ARROW-14729) [C++][Documentation] Update overview of Arrow components/layers
[ https://issues.apache.org/jira/browse/ARROW-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Garigipati reassigned ARROW-14729: -- Assignee: Pradeep Garigipati > [C++][Documentation] Update overview of Arrow components/layers > --- > > Key: ARROW-14729 > URL: https://issues.apache.org/jira/browse/ARROW-14729 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Eduardo Ponce >Assignee: Pradeep Garigipati >Priority: Major > Labels: good-first-issue, good-second-issue, query-engine > Fix For: 8.0.0 > > > New components have been added/modified in Arrow (e.g., the query engine), so we > should update the documentation that describes them. The overview of Arrow layers > is described in > [overview.rst|https://github.com/apache/arrow/blob/master/docs/source/cpp/overview.rst]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14890) [C++][Dataset] Add support for filter pushdown in the ORC Scanner
[ https://issues.apache.org/jira/browse/ARROW-14890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498412#comment-17498412 ] Pradeep Garigipati commented on ARROW-14890: Is this issue open for assignment, or has someone already begun work on it without the status being updated? > [C++][Dataset] Add support for filter pushdown in the ORC Scanner > - > > Key: ARROW-14890 > URL: https://issues.apache.org/jira/browse/ARROW-14890 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: xiangxiang Shen >Priority: Major > Labels: dataset, good-second-issue, orc > > In the Arrow dataset layer, filter pushdown can greatly improve file reading > performance. We notice Parquet has implemented it: > https://github.com/apache/arrow/blob/35b3567e73423420a99dbe6116f000e3c77d2a4c/cpp/src/arrow/dataset/file_parquet.cc#L465-L484. > But the ORC file format does not support filter pushdown yet; it currently > ignores the "filter" of ScanOptions. -- This message was sent by Atlassian Jira (v8.20.1#820001)
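The pushdown idea behind ARROW-14890 can be sketched in plain Python (illustrative only; `RowGroup` and `scan_greater_than` are invented names, not Arrow's API): a scanner keeps per-row-group min/max statistics and skips any row group whose value range cannot satisfy the predicate, applying the residual filter only to the groups it actually reads.

```python
# Sketch of statistics-based filter pushdown (hypothetical, not Arrow's API).
# Each "row group" carries min/max stats; groups whose range cannot satisfy
# the predicate are pruned without reading their rows.

class RowGroup:
    def __init__(self, rows):
        self.rows = rows                  # values of one column in this group
        self.min = min(rows)
        self.max = max(rows)

def scan_greater_than(row_groups, threshold):
    """Return all values > threshold, skipping groups where max <= threshold."""
    out = []
    for rg in row_groups:
        if rg.max <= threshold:           # pushdown: prune the whole group
            continue
        out.extend(v for v in rg.rows if v > threshold)  # residual filter
    return out

groups = [RowGroup([1, 2, 3]), RowGroup([10, 20, 30]), RowGroup([4, 5, 6])]
print(scan_greater_than(groups, 9))       # [10, 20, 30]: only one group is read
```

The same idea underlies Parquet's implementation linked above: the statistics let the scanner discard whole row groups before decoding them.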
[jira] [Commented] (ARROW-9404) [C++] Add support for Decimal16, Decimal32 and Decimal64
[ https://issues.apache.org/jira/browse/ARROW-9404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498356#comment-17498356 ] Ben Baumgold commented on ARROW-9404: - Looks like https://github.com/apache/arrow/pull/8578 implemented this feature, but the PR seems abandoned. Would be nice to find a way to push it over the finish-line somehow so Arrow can support Decimal[16|32|64]. > [C++] Add support for Decimal16, Decimal32 and Decimal64 > > > Key: ARROW-9404 > URL: https://issues.apache.org/jira/browse/ARROW-9404 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Artem Alekseev >Priority: Major > > It looks like arrow lacks support for decimal16, decimal32 and decimal64 > types. Are there any reasons for that? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15455) [C++] Cast between fixed size list type and variable size list
[ https://issues.apache.org/jira/browse/ARROW-15455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15455: --- Labels: good-second-issue kernel pull-request-available (was: good-second-issue kernel) > [C++] Cast between fixed size list type and variable size list > --- > > Key: ARROW-15455 > URL: https://issues.apache.org/jira/browse/ARROW-15455 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Jabari Booker >Priority: Major > Labels: good-second-issue, kernel, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Casting from fixed size list to variable size list could be possible, I > think, but currently doesn't work: > {code:python} > >>> fixed_size = pa.array([[1, 2], [3, 4]], type=pa.list_(pa.int64(), 2)) > >>> fixed_size.cast(pa.list_(pa.int64())) > ... > ArrowNotImplementedError: Unsupported cast from fixed_size_list<item: int64>[2] to list using function cast_list > {code} > And in principle, the cast in the other direction could also be possible if it > is checked that each list has the correct length. -- This message was sent by Atlassian Jira (v8.20.1#820001)
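For intuition, the two casts discussed in ARROW-15455 can be sketched on plain Python lists (an illustration of the semantics only, with invented function names, not Arrow's implementation): fixed-size lists are just a flat buffer sliced into equal chunks, and the reverse cast is valid only when every list has exactly the declared length.

```python
# Sketch of the two list casts on plain Python lists (not Arrow's code).

def fixed_to_variable(flat_values, list_size):
    """A fixed-size list column is a flat buffer; slicing it into equal
    chunks yields the equivalent variable-size lists."""
    assert len(flat_values) % list_size == 0
    return [flat_values[i:i + list_size]
            for i in range(0, len(flat_values), list_size)]

def variable_to_fixed(lists, list_size):
    """The reverse cast is valid only if every list has exactly list_size
    items; otherwise it must fail, as the issue notes."""
    for lst in lists:
        if len(lst) != list_size:
            raise ValueError(f"list has length {len(lst)}, expected {list_size}")
    return [v for lst in lists for v in lst]

print(fixed_to_variable([1, 2, 3, 4], 2))      # [[1, 2], [3, 4]]
print(variable_to_fixed([[1, 2], [3, 4]], 2))  # [1, 2, 3, 4]
```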
[jira] [Closed] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads
[ https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace closed ARROW-15785. --- Resolution: Not A Problem > [Benchmarks] Add conbench benchmark for single-file parquet reads > - > > Key: ARROW-15785 > URL: https://issues.apache.org/jira/browse/ARROW-15785 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > > Release 7.0.0 introduced a regression in parquet single file reads. We > should add a macro-level benchmark that does single-file reads to help us > detect this in the future. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads
[ https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498355#comment-17498355 ] Weston Pace commented on ARROW-15785: - Ah, good point. Yes, that is the commit that introduced the regression, and it appears that our macro benchmarks can indeed catch this case. So no new benchmarks are needed. > [Benchmarks] Add conbench benchmark for single-file parquet reads > - > > Key: ARROW-15785 > URL: https://issues.apache.org/jira/browse/ARROW-15785 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > > Release 7.0.0 introduced a regression in parquet single file reads. We > should add a macro-level benchmark that does single-file reads to help us > detect this in the future. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15272) [Java] ArrowVectorIterator eats initialization exceptions when close fails
[ https://issues.apache.org/jira/browse/ARROW-15272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-15272. -- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12094 [https://github.com/apache/arrow/pull/12094] > [Java] ArrowVectorIterator eats initialization exceptions when close fails > -- > > Key: ARROW-15272 > URL: https://issues.apache.org/jira/browse/ARROW-15272 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 6.0.1 >Reporter: Andrew Higgins >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > In ArrowVectorIterator's create method exceptions thrown during initialize() > are eaten if there are further exceptions while closing the iterator. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15790) field's metadata is not written into Parquet file
Sifang Li created ARROW-15790: - Summary: field's metadata is not written into Parquet file Key: ARROW-15790 URL: https://issues.apache.org/jira/browse/ARROW-15790 Project: Apache Arrow Issue Type: Bug Environment: Ubuntu Reporter: Sifang Li I used this code to test the write-and-read-back behavior of field metadata in a Parquet file: [https://gist.github.com/dantrim/33f9f14d0b2d3ec45c022aa05f7a45ee] The generated file does not have the metadata when I read the file back in using the code below and print it out: {quote}std::shared_ptr<arrow::io::ReadableFile> infile; PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open("./test.parquet", arrow::default_memory_pool())); std::unique_ptr<parquet::arrow::FileReader> reader; PARQUET_THROW_NOT_OK( parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader)); std::shared_ptr<arrow::Table> table; PARQUET_THROW_NOT_OK(reader->ReadTable(&table)); EXPECT_EQ(frameCount, table->num_rows()); std::cout << "===" << table->schema()->ToString(true) << std::endl;{quote}
[jira] [Updated] (ARROW-15789) [C++] Update OpenTelemetry to v1.2.0
[ https://issues.apache.org/jira/browse/ARROW-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15789: --- Labels: pull-request-available (was: ) > [C++] Update OpenTelemetry to v1.2.0 > > > Key: ARROW-15789 > URL: https://issues.apache.org/jira/browse/ARROW-15789 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > We're currently on v1.1.0 and there were some minor API changes in v1.1.1. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15789) [C++] Update OpenTelemetry to v1.2.0
David Li created ARROW-15789: Summary: [C++] Update OpenTelemetry to v1.2.0 Key: ARROW-15789 URL: https://issues.apache.org/jira/browse/ARROW-15789 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: David Li Assignee: David Li We're currently on v1.1.0 and there were some minor API changes in v1.1.1. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-14549) VectorSchemaRoot is not refreshed when value is null
[ https://issues.apache.org/jira/browse/ARROW-14549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498249#comment-17498249 ] Bryan Cutler edited comment on ARROW-14549 at 2/25/22, 6:22 PM: [~hu6360567] Calling `allocateNew()` will create new buffers, which is one way to clear previous results. If you don't want to allocate any new memory, you would need to zero out all the vectors by calling `zeroVector()` and `setValueCount(0)`. If you don't do either of these, the incorrect data you see is expected. was (Author: bryanc): [~hu6360567] Calling `allocateNew()` will create new buffers, which is one way to clear previous results. If you don't want to allocate any new memory, you would need to zero out all the vectors by calling `zeroVector()` and `setValueCount(0)` > VectorSchemaRoot is not refreshed when value is null > > > Key: ARROW-14549 > URL: https://issues.apache.org/jira/browse/ARROW-14549 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 6.0.0 >Reporter: Wenbo Hu >Priority: Major > > I'm using `arrow-jdbc` to convert query results from JDBC to Arrow. > But with the following code, unexpected behavior happens. > Assuming a sqlite db, the 2nd row of col_2 and col_3 are null. > |col_1|col_2|col_3| > |1|abc|3.14| > |2|NULL|NULL| > As the documentation suggests, > {quote}populated data over and over into the same VectorSchemaRoot in a > stream of batches rather than creating a new VectorSchemaRoot instance each > time. > {quote} > *JdbcToArrowConfig* is set to reuse the root.
> {code:java} > public void querySql(String query, QueryOption option) throws Exception { > try (final java.sql.Connection conn = connectContainer.getConnection(); > final Statement stmt = conn.createStatement(); > final ResultSet rs = stmt.executeQuery(query) > ) { > // create config with reuse schema root and custom batch size from option > final JdbcToArrowConfig config = new > JdbcToArrowConfigBuilder().setAllocator(new > RootAllocator()).setCalendar(JdbcToArrowUtils.getUtcCalendar()) > > .setTargetBatchSize(option.getBatchSize()).setReuseVectorSchemaRoot(true).build(); > final ArrowVectorIterator iterator = > JdbcToArrow.sqlToArrowVectorIterator(rs, config); > while (iterator.hasNext()) { // retrieve result from iterator > final VectorSchemaRoot root = iterator.next(); > option.getCallback().handleBatchResult(root); > root.allocateNew(); // it has to allocate new > } > } catch (java.lang.Exception e) { throw new Exception(e.getMessage()); } > } > > .. > // batch_size is set to 1, then callback is called twice. > QueryOptions options = new QueryOption(1, > root -> { > // if printer is not set, get schema, write header > if (printer == null) { > final String[] headers = > root.getSchema().getFields().stream().map(Field::getName).toArray(String[]::new); > > printer = new CSVPrinter(writer, > CSVFormat.Builder.create(CSVFormat.DEFAULT).setHeader(headers).build()); > } > > final int rows = root.getRowCount(); > final List<FieldVector> fieldVectors = root.getFieldVectors(); > > // iterate over rows > for (int i = 0; i < rows; i++) { > final int rowId = i; > final List<String> row = fieldVectors.stream().map(v -> > v.getObject(rowId)).map(String::valueOf).collect(Collectors.toList()); > printer.printRecord(row); > } > }); > > connection.querySql("SELECT * FROM test_db", options); > ..
> {code} > If `root.allocateNew()` is called, the csv output is as expected: > ``` > column_1,column_2,column_3 > 1,abc,3.14 > 2,null,null > ``` > Otherwise, the null values of the 2nd row retain the values of the 1st row: > ``` > column_1,column_2,column_3 > 1,abc,3.14 > 2,abc,3.14 > ``` > **Question: Is it expected to call `allocateNew` every time the schema root is reused?** > Without reusing the schema root, the following code works as expected. > {code:java} > public void querySql(String query, QueryOption option) throws Exception { > try (final java.sql.Connection conn = connectContainer.getConnection(); > final Statement stmt = conn.createStatement(); > final ResultSet rs = stmt.executeQuery(query)) { > // create config without reuse schema root and custom batch size from option > final JdbcToArrowConfig config = new > JdbcToArrowConfigBuilder().setAllocator(new > RootAllocator()).setCalendar(JdbcToArrowUtils.getUtcCalendar()) > > .setTargetBatchSize(option.getBatchSize()).setReuseVectorSchemaRoot(false).build(); > > final ArrowVectorIterator iterator = > JdbcToArrow.sqlToArrowVectorIterator(rs, config); > while (iterator.hasNext()) { > // retrieve result from iterator > try (VectorSchemaRoot root = iterator.next()) { > option.getCallback().handleBatchResult(root); root.allocateNew(); > } > } > } catch (java.lang.Exception e) { throw new Exception(e.getMessage()); } > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
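The stale-value behavior reported above can be mimicked with a toy reused vector (a pure-Python sketch with invented names, not Arrow's Java API): when a reused buffer is not zeroed between batches, a reader that ignores the validity mask sees leftover values from the previous batch.

```python
# Toy model of a reused vector: a fixed backing buffer plus a validity mask.
class ReusedVector:
    def __init__(self, capacity):
        self.values = [0] * capacity
        self.valid = [False] * capacity

    def load_batch(self, batch, zero_first=False):
        if zero_first:                      # analogous to zeroVector()
            self.values = [0] * len(self.values)
            self.valid = [False] * len(self.valid)
        for i, v in enumerate(batch):
            if v is None:
                self.valid[i] = False       # null: value slot left untouched
            else:
                self.values[i] = v
                self.valid[i] = True

    def get(self, i):
        # Reading the raw slot without checking validity leaks stale data.
        return self.values[i]

vec = ReusedVector(2)
vec.load_batch(["abc", 3.14])
vec.load_batch([None, None])                # reuse without zeroing
print(vec.get(0))                           # stale "abc" from the first batch
vec.load_batch([None, None], zero_first=True)
print(vec.get(0))                           # 0 after zeroing
```

This mirrors the answer given in the comments: either reallocate/zero the buffers between batches, or make the reader honor the validity mask.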
[jira] [Resolved] (ARROW-14549) VectorSchemaRoot is not refreshed when value is null
[ https://issues.apache.org/jira/browse/ARROW-14549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-14549. -- Resolution: Not A Problem > VectorSchemaRoot is not refreshed when value is null > > > Key: ARROW-14549 > URL: https://issues.apache.org/jira/browse/ARROW-14549 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 6.0.0 >Reporter: Wenbo Hu >Priority: Major -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14549) VectorSchemaRoot is not refreshed when value is null
[ https://issues.apache.org/jira/browse/ARROW-14549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498249#comment-17498249 ] Bryan Cutler commented on ARROW-14549: -- [~hu6360567] Calling `allocateNew()` will create new buffers, which is one way to clear previous results. If you don't want to allocate any new memory, you would need to zero out all the vectors by calling `zeroVector()` and `setValueCount(0)` > VectorSchemaRoot is not refreshed when value is null > > > Key: ARROW-14549 > URL: https://issues.apache.org/jira/browse/ARROW-14549 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 6.0.0 >Reporter: Wenbo Hu >Priority: Major -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14665) [Java] JdbcToArrowUtils ResultSet iteration bug
[ https://issues.apache.org/jira/browse/ARROW-14665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-14665. -- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 11667 [https://github.com/apache/arrow/pull/11667] > [Java] JdbcToArrowUtils ResultSet iteration bug > --- > > Key: ARROW-14665 > URL: https://issues.apache.org/jira/browse/ARROW-14665 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 6.0.0 >Reporter: Zac >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > When specifying a target batch size, the [iteration > logic|https://github.com/apache/arrow/blob/ea42b9e0aa000238fff22fd48f06f3aa516b9f3f/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java#L266] > is currently broken: > {code:java} > while (rs.next() && readRowCount < config.getTargetBatchSize()) { > compositeConsumer.consume(rs); > readRowCount++; > } > {code} > calling next() on the result set will move the cursor forward to the next > row, even when we've reached the target batch size. > For example, consider setting target batch size to 1, and query a table that > has three rows. > On the first iteration, we'll successfully consume the first row. On the next > iteration, we'll move the cursor to row 2, but detect the read row count is > no longer < target batch size and return. > Upon calling into the method again with the same result set, rs.next will be > called again which will result in successfully consuming row 3. > *Problem:* row 2 is skipped! -- This message was sent by Atlassian Jira (v8.20.1#820001)
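The off-by-one described in ARROW-14665 is easy to reproduce with a toy cursor (a pure-Python sketch with invented names; the actual fix in the PR may be shaped differently): testing `next()` in the same condition as the batch limit advances the cursor one extra time per batch, while checking the limit first avoids the extra advance.

```python
# Toy JDBC-style cursor: next() advances and returns True while rows remain.
class Cursor:
    def __init__(self, rows):
        self.rows = rows
        self.pos = -1

    def next(self):
        self.pos += 1
        return self.pos < len(self.rows)

    def current(self):
        return self.rows[self.pos]

def read_batch_buggy(cur, batch_size):
    out, count = [], 0
    while cur.next() and count < batch_size:   # next() fires even when full
        out.append(cur.current())
        count += 1
    return out

def read_batch_fixed(cur, batch_size):
    out = []
    while len(out) < batch_size and cur.next():  # check the limit first
        out.append(cur.current())
    return out

cur = Cursor([1, 2, 3])
print(read_batch_buggy(cur, 1), read_batch_buggy(cur, 1))  # [1] [3]: row 2 lost
cur = Cursor([1, 2, 3])
print(read_batch_fixed(cur, 1), read_batch_fixed(cur, 1))  # [1] [2]
```

Short-circuit evaluation does the work in the fixed version: once the batch is full, `next()` is never called, so the cursor stays on the last consumed row between batches.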
[jira] [Resolved] (ARROW-15742) [Go] Implement 'bitmap_neon' with Arm64 GoLang Assembly
[ https://issues.apache.org/jira/browse/ARROW-15742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-15742. --- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12502 [https://github.com/apache/arrow/pull/12502] > [Go] Implement 'bitmap_neon' with Arm64 GoLang Assembly > > > Key: ARROW-15742 > URL: https://issues.apache.org/jira/browse/ARROW-15742 > Project: Apache Arrow > Issue Type: Task > Components: Go >Reporter: Yuqi Gu >Assignee: Yuqi Gu >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > 1. Implement 'extract_bits' with Arm64 GoLang Assembly. '_pext_u64' is the > x86 BMI intrinsic for extract_bits. > There is no equivalent of the '_pext_u64' instruction on Arm64. > The task is to implement an equivalent of '_pext_u64' in Arm64 assembly. > 2. Implement 'levels_to_bitmap' with Arm64 GoLang Assembly for > greaterThanBitmapNEON -- This message was sent by Atlassian Jira (v8.20.1#820001)
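For reference, `_pext_u64` (parallel bits extract) gathers the bits of a word selected by a mask into the low bits of the result; where no hardware instruction exists it can be emulated bit by bit. A pure-Python sketch of that scalar fallback (illustrative only, not the Arm64 assembly this issue asks for):

```python
def pext(value, mask, width=64):
    """Software parallel-bits-extract: collect value's bits where mask is 1,
    packing them contiguously into the low bits of the result."""
    result = 0
    out_pos = 0
    for bit in range(width):
        if (mask >> bit) & 1:
            result |= ((value >> bit) & 1) << out_pos
            out_pos += 1
    return result

# Extract the two nibbles selected by the mask and pack them together:
# bits 4-7 of 0x12345678 are 0x7, bits 12-15 are 0x5 -> packed 0x57.
print(hex(pext(0x12345678, 0x0000F0F0)))  # 0x57
```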
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498231#comment-17498231 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: -- Sure, I will give it a try and post what I find out. > [Python] Extracting Type information from Python Objects > > > Key: ARROW-15765 > URL: https://issues.apache.org/jira/browse/ARROW-15765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > When creating user-defined functions, or in similar exercises where we want to > extract Arrow data types from type hints, the existing Python API has some > limitations. > An example case is as follows: > {code:python} > def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array: > return pc.call_function("add", [array1, array2]) > {code} > We want to extract the fact that array1 is a `pa.Array` of `pa.Int64Type`. > At the moment there is no straightforward way to do this, so the idea is to > expose this capability to Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
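Until such an API exists, one rough workaround is to read a function's annotations with `typing.get_type_hints` and map them to Arrow types through a hand-written table. A pure-Python sketch (the annotation classes and the mapping are stand-ins for illustration, not pyarrow's actual classes):

```python
import typing

# Stand-in annotation classes (in real code these would be pyarrow classes).
class Int64Array: pass
class Float64Array: pass

# Hypothetical hand-written mapping from annotation class to an Arrow type name.
ANNOTATION_TO_TYPE = {Int64Array: "int64", Float64Array: "float64"}

def function(array1: Int64Array, array2: Int64Array) -> Int64Array:
    ...

def extract_arrow_types(func):
    """Map each parameter/return annotation of func to its Arrow type name."""
    hints = typing.get_type_hints(func)   # includes the 'return' annotation
    return {name: ANNOTATION_TO_TYPE[ann] for name, ann in hints.items()}

print(extract_arrow_types(function))
# {'array1': 'int64', 'array2': 'int64', 'return': 'int64'}
```

The limitation the issue describes remains: such a table has to be maintained by hand, which is why exposing the type information from Arrow itself would be preferable.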
[jira] [Created] (ARROW-15788) [C++][FlightRPC] Support alternative transports in the Flight benchmark
David Li created ARROW-15788: Summary: [C++][FlightRPC] Support alternative transports in the Flight benchmark Key: ARROW-15788 URL: https://issues.apache.org/jira/browse/ARROW-15788 Project: Apache Arrow Issue Type: Improvement Components: C++, FlightRPC Reporter: David Li Assignee: David Li A follow-up to ARROW-15282. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15781) [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL
[ https://issues.apache.org/jira/browse/ARROW-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-15781. --- Resolution: Fixed Issue resolved by pull request 12509 [https://github.com/apache/arrow/pull/12509] > [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL > - > > Key: ARROW-15781 > URL: https://issues.apache.org/jira/browse/ARROW-15781 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > See https://github.com/apache/arrow/issues/12501 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15781) [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL
[ https://issues.apache.org/jira/browse/ARROW-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15781: -- Fix Version/s: 8.0.0 > [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL > - > > Key: ARROW-15781 > URL: https://issues.apache.org/jira/browse/ARROW-15781 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > See https://github.com/apache/arrow/issues/12501 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15787) [C++] Temporal floor/ceil/round kernels could be optimised with templating
[ https://issues.apache.org/jira/browse/ARROW-15787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc updated ARROW-15787: --- Priority: Minor (was: Major) > [C++] Temporal floor/ceil/round kernels could be optimised with templating > -- > > Key: ARROW-15787 > URL: https://issues.apache.org/jira/browse/ARROW-15787 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Rok Mihevc >Priority: Minor > > [CeilTemporal, FloorTemporal, > RoundTemporal|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L728-L980] > kernels could probably be templated in a clean way. They also execute a > switch statement for every call instead of creating an operator at kernel > call time and only running that. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15787) [C++] Temporal floor/ceil/round kernels could be optimised with templating
[ https://issues.apache.org/jira/browse/ARROW-15787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc updated ARROW-15787: --- Labels: kernel (was: ) > [C++] Temporal floor/ceil/round kernels could be optimised with templating > -- > > Key: ARROW-15787 > URL: https://issues.apache.org/jira/browse/ARROW-15787 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Priority: Minor > Labels: kernel > > [CeilTemporal, FloorTemporal, > RoundTemporal|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L728-L980] > kernels could probably be templated in a clean way. They also execute a > switch statement for every call instead of creating an operator at kernel > call time and only running that. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15787) [C++] Temporal floor/ceil/round kernels could be optimised with templating
[ https://issues.apache.org/jira/browse/ARROW-15787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc updated ARROW-15787: --- Component/s: C++ > [C++] Temporal floor/ceil/round kernels could be optimised with templating > -- > > Key: ARROW-15787 > URL: https://issues.apache.org/jira/browse/ARROW-15787 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Priority: Minor > > [CeilTemporal, FloorTemporal, > RoundTemporal|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L728-L980] > kernels could probably be templated in a clean way. They also execute a > switch statement for every call instead of creating an operator at kernel > call time and only running that. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15787) [C++] Temporal floor/ceil/round kernels could be optimised with templating
Rok Mihevc created ARROW-15787: -- Summary: [C++] Temporal floor/ceil/round kernels could be optimised with templating Key: ARROW-15787 URL: https://issues.apache.org/jira/browse/ARROW-15787 Project: Apache Arrow Issue Type: Improvement Reporter: Rok Mihevc [CeilTemporal, FloorTemporal, RoundTemporal|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L728-L980] kernels could probably be templated in a clean way. They also execute a switch statement for every call instead of creating an operator at kernel call time and only running that. -- This message was sent by Atlassian Jira (v8.20.1#820001)
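The "switch statement for every call" pattern the ticket describes can be illustrated outside of C++. A minimal Python sketch (purely illustrative, not Arrow's actual kernel machinery; `make_floor_op` is a hypothetical name) of binding the rounding unit once, at operator-creation time, instead of dispatching on every call:

```python
# Hypothetical sketch, not Arrow's C++ implementation: resolve the rounding
# unit once when the operator is created, so each call does no dispatch.

def make_floor_op(unit_seconds):
    # The unit is bound here, once -- analogous to creating the operator at
    # kernel creation time rather than switching on the unit per call.
    def op(ts_seconds):
        return ts_seconds - (ts_seconds % unit_seconds)
    return op

floor_to_minute = make_floor_op(60)
floor_to_hour = make_floor_op(3600)

print(floor_to_minute(125))  # 120
print(floor_to_hour(7300))   # 7200
```

The C++ equivalent would move the unit (and multiple) into a template parameter or a captured functor, leaving only the precomputed operator on the hot path.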
[jira] [Updated] (ARROW-15680) [C++] Temporal floor/ceil/round should accept week_starts_monday when rounding to multiple of week
[ https://issues.apache.org/jira/browse/ARROW-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc updated ARROW-15680: --- Priority: Major (was: Minor) > [C++] Temporal floor/ceil/round should accept week_starts_monday when > rounding to multiple of week > --- > > Key: ARROW-15680 > URL: https://issues.apache.org/jira/browse/ARROW-15680 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Major > Labels: kernel, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > See ARROW-14821 and the [related > PR|https://github.com/apache/arrow/pull/12154]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15680) [C++] Temporal floor/ceil/round should accept week_starts_monday when rounding to multiple of week
[ https://issues.apache.org/jira/browse/ARROW-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc updated ARROW-15680: --- Priority: Minor (was: Major) > [C++] Temporal floor/ceil/round should accept week_starts_monday when > rounding to multiple of week > --- > > Key: ARROW-15680 > URL: https://issues.apache.org/jira/browse/ARROW-15680 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Minor > Labels: kernel, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > See ARROW-14821 and the [related > PR|https://github.com/apache/arrow/pull/12154]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15666) [C++][Python][R] Add format inference option to StrptimeOptions
[ https://issues.apache.org/jira/browse/ARROW-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498152#comment-17498152 ] Rok Mihevc commented on ARROW-15666: Thanks for the warning Matthew, much appreciated! Looking at the utility-to-complexity ratio, this does seem like something we'd better avoid. One idea would be to reuse the already existing pandas logic (if pandas is available at runtime) to do the format inference, then pass the inferred format to C++ and do the rest of the operation there. Same for lubridate in R. > [C++][Python][R] Add format inference option to StrptimeOptions > --- > > Key: ARROW-15666 > URL: https://issues.apache.org/jira/browse/ARROW-15666 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Rok Mihevc >Priority: Major > > We want to have an option to infer timestamp format. > See > [pandas.to_datetime|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html] > and lubridate > [parse_date_time|https://lubridate.tidyverse.org/reference/parse_date_time.html] > for examples. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads
[ https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498146#comment-17498146 ] Jonathan Keane edited comment on ARROW-15785 at 2/25/22, 2:16 PM: -- I think this is the PR that introduced the regression (though I might be totally off or it's a different regression...) https://github.com/apache/arrow/pull/11991#issuecomment-1009216946 And the conbench run: https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/ We should probably have the conbench bot alert more loudly that there are regressions of this magnitude. That 5% there is supposed to indicate that there's an issue, but we might have that set too low such that there's alarm fatigue, and/or we should alert louder when there are this many high-change benchmarks (e.g. the file-read benchmark z-scores range from -76 to -759, and we alert at -5) was (Author: jonkeane): I think this is the PR that introduced the regression (though I might be totally off or it's a different one...) https://github.com/apache/arrow/pull/11991#issuecomment-1009216946 And the conbench run: https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/ We should probably have the conbench bot alert more loudly that there are regressions of this magnitude. That 5% there is supposed to indicate that there's an issue, but we might have that set too low such that there's alarm fatigue, and/or we should alert louder when there are this many high-change benchmarks (e.g. the file-read benchmark z-scores range from -76 to -759, and we alert at -5) > [Benchmarks] Add conbench benchmark for single-file parquet reads > - > > Key: ARROW-15785 > URL: https://issues.apache.org/jira/browse/ARROW-15785 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > > Release 7.0.0 introduced a regression in parquet single file reads. We > should add a macro-level benchmark that does single-file reads to help us > detect this in the future. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads
[ https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498146#comment-17498146 ] Jonathan Keane commented on ARROW-15785: I think this is the PR that introduced the regression (though I might be totally off or it's a different one...) https://github.com/apache/arrow/pull/11991#issuecomment-1009216946 And the conbench run: https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/ We should probably have the conbench bot alert more loudly that there are regressions of this magnitude. That 5% there is supposed to indicate that there's an issue, but we might have that set too low such that there's alarm fatigue, and/or we should alert louder when there are this many high-change benchmarks (e.g. the file-read benchmark z-scores range from -76 to -759, and we alert at -5) > [Benchmarks] Add conbench benchmark for single-file parquet reads > - > > Key: ARROW-15785 > URL: https://issues.apache.org/jira/browse/ARROW-15785 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > > Release 7.0.0 introduced a regression in parquet single file reads. We > should add a macro-level benchmark that does single-file reads to help us > detect this in the future. -- This message was sent by Atlassian Jira (v8.20.1#820001)
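The alerting behaviour being discussed can be sketched as follows (the escalation count is a hypothetical illustration, not conbench's actual configuration; only the -5 z-score threshold comes from the comment above):

```python
ALERT_Z = -5.0       # the per-benchmark alert threshold mentioned above
ESCALATE_COUNT = 3   # hypothetical: "alert louder" past this many regressions

def classify(z_scores, alert_z=ALERT_Z, escalate_count=ESCALATE_COUNT):
    """Return 'ok', 'alert', or 'escalate' for a run's benchmark z-scores."""
    regressions = [z for z in z_scores if z <= alert_z]
    if len(regressions) >= escalate_count:
        return "escalate"
    return "alert" if regressions else "ok"

print(classify([-0.4, 1.2]))              # ok
print(classify([-76.0, -300.0, -759.0]))  # escalate
```

With z-scores between -76 and -759 against a -5 threshold, every file-read benchmark in the run would trip the per-benchmark alert, which is the "alarm fatigue" scenario the comment describes.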
[jira] [Commented] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads
[ https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498144#comment-17498144 ] Jonathan Keane commented on ARROW-15785: Do the Python [1] and R [2] benchmarks for single file reads do this? Oddly(?) The python benchmarks do show a jump around January: https://conbench.ursa.dev/benchmarks/8c5cc1a939d8485eb6c42af83f82c8c0/ https://conbench.ursa.dev/benchmarks/1b8d2dae6f664fd19579071a7cf7766b/ But the corresponding R ones do not: https://conbench.ursa.dev/benchmarks/ca493bf17af84ae5babd97f385b69afc/ [1] https://github.com/ursacomputing/benchmarks/blob/main/benchmarks/file_benchmark.py [2] https://github.com/ursacomputing/arrowbench/blob/main/R/bm-read-file.R > [Benchmarks] Add conbench benchmark for single-file parquet reads > - > > Key: ARROW-15785 > URL: https://issues.apache.org/jira/browse/ARROW-15785 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > > Release 7.0.0 introduced a regression in parquet single file reads. We > should add a macro-level benchmark that does single-file reads to help us > detect this in the future. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-14943) [R] Bindings for lubridate's ddays, dhours, dminutes, dmonths, dweeks, dyears
[ https://issues.apache.org/jira/browse/ARROW-14943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alenka Frim reassigned ARROW-14943: --- Assignee: Alenka Frim > [R] Bindings for lubridate's ddays, dhours, dminutes, dmonths, dweeks, dyears > - > > Key: ARROW-14943 > URL: https://issues.apache.org/jira/browse/ARROW-14943 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Alenka Frim >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-14942) [R] Bindings for lubridate's dpicoseconds, dnanoseconds, desconds, dmilliseconds, dmicroseconds
[ https://issues.apache.org/jira/browse/ARROW-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alenka Frim reassigned ARROW-14942: --- Assignee: Alenka Frim > [R] Bindings for lubridate's dpicoseconds, dnanoseconds, desconds, > dmilliseconds, dmicroseconds > --- > > Key: ARROW-14942 > URL: https://issues.apache.org/jira/browse/ARROW-14942 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Alenka Frim >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498093#comment-17498093 ] Antoine Pitrou commented on ARROW-15765: Someone could experiment with the typing generic approach indeed and see if it works. 
> [Python] Extracting Type information from Python Objects
>
> Key: ARROW-15765
> URL: https://issues.apache.org/jira/browse/ARROW-15765
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Vibhatha Lakmal Abeykoon
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
>
> When creating user defined functions or similar exercises where we want to
> extract the Arrow data types from the type hints, the existing Python API
> has some limitations. An example case is as follows:
> {code:python}
> def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array:
>     return pc.call_function("add", [array1, array2])
> {code}
> We want to extract the fact that array1 is a `pa.Array` of `pa.Int64Type`.
> At the moment there doesn't exist a straightforward manner to get this done.
> So the idea is to expose this feature to Python.
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15098) [R] Add binding for lubridate::duration() and/or as.difftime()
[ https://issues.apache.org/jira/browse/ARROW-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498021#comment-17498021 ] Dragoș Moldovan-Grünfeld commented on ARROW-15098: -- This ticket should probably add bindings for {{base::difftime()}} too. > [R] Add binding for lubridate::duration() and/or as.difftime() > -- > > Key: ARROW-15098 > URL: https://issues.apache.org/jira/browse/ARROW-15098 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Dewey Dunnington >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > After ARROW-14941 we have support for the duration type; however, there is no > binding for {{lubridate::duration()}} or {{as.difftime()}} available in dplyr > evaluation that could create these objects. I'm actually not sure if we > should bind {{lubridate::duration}} since it returns a custom S4 class that's > identical in function to base R's difftime. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-15098) [R] Add binding for lubridate::duration() and/or as.difftime()
[ https://issues.apache.org/jira/browse/ARROW-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498021#comment-17498021 ] Dragoș Moldovan-Grünfeld edited comment on ARROW-15098 at 2/25/22, 10:12 AM: - This ticket should probably cover adding bindings for {{base::difftime()}} too. was (Author: dragosmg): This ticket should probably add bindings for {{base::difftime()}} too. > [R] Add binding for lubridate::duration() and/or as.difftime() > -- > > Key: ARROW-15098 > URL: https://issues.apache.org/jira/browse/ARROW-15098 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Dewey Dunnington >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > After ARROW-14941 we have support for the duration type; however, there is no > binding for {{lubridate::duration()}} or {{as.difftime()}} available in dplyr > evaluation that could create these objects. I'm actually not sure if we > should bind {{lubridate::duration}} since it returns a custom S4 class that's > identical in function to base R's difftime. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-14820) [R] Implement bindings for lubridate calculation functions
[ https://issues.apache.org/jira/browse/ARROW-14820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498015#comment-17498015 ] Dragoș Moldovan-Grünfeld edited comment on ARROW-14820 at 2/25/22, 10:07 AM: - Hi [~eitsupi], Thanks for filing the ticket. Once [PR 12433|https://github.com/apache/arrow/pull/12433] is merged you will be able to use `date()` to extract the date component of a timestamp. We aim to cover {{as.Date()}} as part of the same pull request. The snippet of code below should work once we merge the PR. Please note {{base::as.Date()}} by default assumes you want the date in UTC:
{code:r}
library(dplyr)
library(lubridate)
library(arrow)

df <- tibble::tibble(
  col1 = lubridate::as_datetime("2010-08-03 00:50:50", tz = "Europe/London"),
)

# lubridate
df %>% mutate(x = as_date(col1), y = as.Date(col1))
#> # A tibble: 1 × 3
#>   col1                x          y
#>   <dttm>              <date>     <date>
#> 1 2010-08-03 00:50:50 2010-08-03 2010-08-02

# arrow
df %>% arrow_table() %>% mutate(x = date(col1), y = as.Date(col1)) %>% collect()
#> # A tibble: 1 × 3
#>   col1                x          y
#>   <dttm>              <date>     <date>
#> 1 2010-08-03 00:50:50 2010-08-03 2010-08-02
{code}
was (Author: dragosmg): Hi [~eitsupi], Thanks for filing the ticket. Once [PR 12433|https://github.com/apache/arrow/pull/12433] is merged you will be able to use `date()` to extract the date component of a timestamp. {{as.Date()}} is covered by the same pull request. The snippet of code below should work once we merge the PR. Please note {{base::as.Date()}} by default assumes you want the date in UTC:
{code:r}
library(dplyr)
library(lubridate)
library(arrow)

df <- tibble::tibble(
  col1 = lubridate::as_datetime("2010-08-03 00:50:50", tz = "Europe/London"),
)

# lubridate
df %>% mutate(x = as_date(col1), y = as.Date(col1))
#> # A tibble: 1 × 3
#>   col1                x          y
#>   <dttm>              <date>     <date>
#> 1 2010-08-03 00:50:50 2010-08-03 2010-08-02

# arrow
df %>% arrow_table() %>% mutate(x = date(col1), y = as.Date(col1)) %>% collect()
#> # A tibble: 1 × 3
#>   col1                x          y
#>   <dttm>              <date>     <date>
#> 1 2010-08-03 00:50:50 2010-08-03 2010-08-02
{code}
> [R] Implement bindings for lubridate calculation functions > -- > > Key: ARROW-14820 > URL: https://issues.apache.org/jira/browse/ARROW-14820 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Nicola Crane >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14820) [R] Implement bindings for lubridate calculation functions
[ https://issues.apache.org/jira/browse/ARROW-14820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498015#comment-17498015 ] Dragoș Moldovan-Grünfeld commented on ARROW-14820: -- Hi [~eitsupi], Thanks for filing the ticket. Once [PR 12433|https://github.com/apache/arrow/pull/12433] is merged you will be able to use `date()` to extract the date component of a timestamp. {{as.Date()}} is covered by the same pull request. The snippet of code below should work once we merge the PR. Please note {{base::as.Date()}} by default assumes you want the date in UTC:
{code:r}
library(dplyr)
library(lubridate)
library(arrow)

df <- tibble::tibble(
  col1 = lubridate::as_datetime("2010-08-03 00:50:50", tz = "Europe/London"),
)

# lubridate
df %>% mutate(x = as_date(col1), y = as.Date(col1))
#> # A tibble: 1 × 3
#>   col1                x          y
#>   <dttm>              <date>     <date>
#> 1 2010-08-03 00:50:50 2010-08-03 2010-08-02

# arrow
df %>% arrow_table() %>% mutate(x = date(col1), y = as.Date(col1)) %>% collect()
#> # A tibble: 1 × 3
#>   col1                x          y
#>   <dttm>              <date>     <date>
#> 1 2010-08-03 00:50:50 2010-08-03 2010-08-02
{code}
> [R] Implement bindings for lubridate calculation functions > -- > > Key: ARROW-14820 > URL: https://issues.apache.org/jira/browse/ARROW-14820 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Nicola Crane >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-15645) [Flight][Java][C++] Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498004#comment-17498004 ] Ravi Gummadi edited comment on ARROW-15645 at 2/25/22, 9:42 AM: Yes. Both server and client are on s390x. Thanks for the details [~apitrou] . I will watch [https://issues.apache.org/jira/projects/ARROW/issues/ARROW-15778] and test on my environment once a fix for 15778 is available. was (Author: ravidotg): Thanks for the details [~apitrou] . I will watch [https://issues.apache.org/jira/projects/ARROW/issues/ARROW-15778] and test on my environment once a fix for 15778 is available.
> [Flight][Java][C++] Data read through Flight is having endianness issue on
> s390x
>
> Key: ARROW-15645
> URL: https://issues.apache.org/jira/browse/ARROW-15645
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, FlightRPC, Java
> Affects Versions: 5.0.0
> Environment: Linux s390x (big endian)
> Reporter: Ravi Gummadi
> Priority: Major
>
> Am facing an endianness issue on s390x (big endian) when converting the data
> read through Flight to a pandas data frame.
> (1) table.validate() fails with error
> {code}
> Traceback (most recent call last):
>   File "/tmp/2.py", line 51, in <module>
>     table.validate()
>   File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in binary array
> {code}
> (2) table.to_pandas() gives a segmentation fault
>
> Here is a sample code that I am using:
> {code:python}
> from pyarrow import flight
> import os
> import json
>
> flight_endpoint = os.environ.get("flight_server_url", "grpc+tls://...local:443")
> print(flight_endpoint)
>
> class TokenClientAuthHandler(flight.ClientAuthHandler):
>     """An example implementation of authentication via handshake.
>
>     With the default constructor, the user token is read from the
>     environment: TokenClientAuthHandler().
>     You can also pass a user token as parameter to the constructor,
>     TokenClientAuthHandler(yourtoken).
>     """
>     def __init__(self, token: str = None):
>         super().__init__()
>         if token is not None:
>             strToken = 'Bearer {}'.format(token)
>         else:
>             strToken = 'Bearer {}'.format(os.environ.get("some_auth_token"))
>         self.token = strToken.encode('utf-8')
>
>     def authenticate(self, outgoing, incoming):
>         outgoing.write(self.token)
>         self.token = incoming.read()
>
>     def get_token(self):
>         return self.token
>
> readClient = flight.FlightClient(flight_endpoint)
> readClient.authenticate(TokenClientAuthHandler())
> cmd = json.dumps({...})
> descriptor = flight.FlightDescriptor.for_command(cmd)
> flightInfo = readClient.get_flight_info(descriptor)
> reader = readClient.do_get(flightInfo.endpoints[0].ticket)
> table = reader.read_all()
> print(table)
> print(table.num_columns)
> print(table.num_rows)
> table.validate()
> table.to_pandas()
> {code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15645) [Flight][Java][C++] Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498004#comment-17498004 ] Ravi Gummadi commented on ARROW-15645: -- Thanks for the details [~apitrou] . I will watch [https://issues.apache.org/jira/projects/ARROW/issues/ARROW-15778] and test on my environment once a fix for 15778 is available.
> [Flight][Java][C++] Data read through Flight is having endianness issue on
> s390x
>
> Key: ARROW-15645
> URL: https://issues.apache.org/jira/browse/ARROW-15645
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, FlightRPC, Java
> Affects Versions: 5.0.0
> Environment: Linux s390x (big endian)
> Reporter: Ravi Gummadi
> Priority: Major
>
> Am facing an endianness issue on s390x (big endian) when converting the data
> read through Flight to a pandas data frame.
> (1) table.validate() fails with error
> {code}
> Traceback (most recent call last):
>   File "/tmp/2.py", line 51, in <module>
>     table.validate()
>   File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in binary array
> {code}
> (2) table.to_pandas() gives a segmentation fault
>
> Here is a sample code that I am using:
> {code:python}
> from pyarrow import flight
> import os
> import json
>
> flight_endpoint = os.environ.get("flight_server_url", "grpc+tls://...local:443")
> print(flight_endpoint)
>
> class TokenClientAuthHandler(flight.ClientAuthHandler):
>     """An example implementation of authentication via handshake.
>
>     With the default constructor, the user token is read from the
>     environment: TokenClientAuthHandler().
>     You can also pass a user token as parameter to the constructor,
>     TokenClientAuthHandler(yourtoken).
>     """
>     def __init__(self, token: str = None):
>         super().__init__()
>         if token is not None:
>             strToken = 'Bearer {}'.format(token)
>         else:
>             strToken = 'Bearer {}'.format(os.environ.get("some_auth_token"))
>         self.token = strToken.encode('utf-8')
>
>     def authenticate(self, outgoing, incoming):
>         outgoing.write(self.token)
>         self.token = incoming.read()
>
>     def get_token(self):
>         return self.token
>
> readClient = flight.FlightClient(flight_endpoint)
> readClient.authenticate(TokenClientAuthHandler())
> cmd = json.dumps({...})
> descriptor = flight.FlightDescriptor.for_command(cmd)
> flightInfo = readClient.get_flight_info(descriptor)
> reader = readClient.do_get(flightInfo.endpoints[0].ticket)
> table = reader.read_all()
> print(table)
> print(table.num_columns)
> print(table.num_rows)
> table.validate()
> table.to_pandas()
> {code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15760) [C++] Avoid hard dependency on git in cmake (download tarballs from github instead)
[ https://issues.apache.org/jira/browse/ARROW-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15760: -- Fix Version/s: 8.0.0 > [C++] Avoid hard dependency on git in cmake (download tarballs from github > instead) > --- > > Key: ARROW-15760 > URL: https://issues.apache.org/jira/browse/ARROW-15760 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Jeroen van Straten >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > See https://github.com/apache/arrow/pull/12322#issuecomment-1048523391 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15760) [C++] Avoid hard dependency on git in cmake (download tarballs from github instead)
[ https://issues.apache.org/jira/browse/ARROW-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-15760: - Assignee: Jeroen van Straten (was: Matthijs Brobbel) > [C++] Avoid hard dependency on git in cmake (download tarballs from github > instead) > --- > > Key: ARROW-15760 > URL: https://issues.apache.org/jira/browse/ARROW-15760 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Jeroen van Straten >Priority: Major > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > See https://github.com/apache/arrow/pull/12322#issuecomment-1048523391 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15760) [C++] Avoid hard dependency on git in cmake (download tarballs from github instead)
[ https://issues.apache.org/jira/browse/ARROW-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-15760. --- Resolution: Fixed > [C++] Avoid hard dependency on git in cmake (download tarballs from github > instead) > --- > > Key: ARROW-15760 > URL: https://issues.apache.org/jira/browse/ARROW-15760 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Jeroen van Straten >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > See https://github.com/apache/arrow/pull/12322#issuecomment-1048523391 -- This message was sent by Atlassian Jira (v8.20.1#820001)