[jira] [Assigned] (ARROW-14729) [C++][Documentation] Update overview of Arrow components/layers

2022-02-25 Thread Pradeep Garigipati (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Garigipati reassigned ARROW-14729:
--

Assignee: Pradeep Garigipati

> [C++][Documentation] Update overview of Arrow components/layers
> ---
>
> Key: ARROW-14729
> URL: https://issues.apache.org/jira/browse/ARROW-14729
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Eduardo Ponce
>Assignee: Pradeep Garigipati
>Priority: Major
>  Labels: good-first-issue, good-second-issue, query-engine
> Fix For: 8.0.0
>
>
> New components have been added or modified in Arrow (e.g., the query engine), 
> so we should update the documentation that describes them. The overview of 
> Arrow layers is described in 
> [overview.rst|https://github.com/apache/arrow/blob/master/docs/source/cpp/overview.rst].



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-14890) [C++][Dataset] Add support for filter pushdown in the ORC Scanner

2022-02-25 Thread Pradeep Garigipati (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498412#comment-17498412
 ] 

Pradeep Garigipati commented on ARROW-14890:


Is this issue open for assignment, or has someone already begun work on it and 
the status just isn't updated?

> [C++][Dataset] Add support for filter pushdown in the ORC Scanner
> -
>
> Key: ARROW-14890
> URL: https://issues.apache.org/jira/browse/ARROW-14890
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: xiangxiang Shen
>Priority: Major
>  Labels: dataset, good-second-issue, orc
>
> In the Arrow dataset layer, filter pushdown can greatly improve file-reading 
> performance. We notice Parquet has implemented it: 
> https://github.com/apache/arrow/blob/35b3567e73423420a99dbe6116f000e3c77d2a4c/cpp/src/arrow/dataset/file_parquet.cc#L465-L484.
> But the ORC file format does not support filter pushdown yet; it currently 
> ignores the "filter" of ScanOptions.
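Filter pushdown of this kind is typically driven by per-row-group statistics. Below is a minimal plain-Python sketch of the idea (not the Arrow API; the `scan` function and the row-group layout are invented for illustration): a pushed-down predicate can eliminate whole row groups from a file before any values are decoded.

```python
def scan(row_groups, lower_bound):
    """Return values > lower_bound, skipping any row group whose
    max statistic already rules out every row it contains."""
    out = []
    for rg in row_groups:
        if rg["stats"]["max"] <= lower_bound:
            continue  # whole group eliminated without decoding its rows
        out.extend(v for v in rg["rows"] if v > lower_bound)
    return out

groups = [
    {"stats": {"min": 1, "max": 9}, "rows": [1, 5, 9]},
    {"stats": {"min": 10, "max": 20}, "rows": [10, 15, 20]},
]
print(scan(groups, 9))  # first group is skipped entirely
```

Without pushdown, the scanner would decode every row of every group and apply the filter afterwards; the win grows with the number of groups the statistics can exclude.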





[jira] [Commented] (ARROW-9404) [C++] Add support for Decimal16, Decimal32 and Decimal64

2022-02-25 Thread Ben Baumgold (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498356#comment-17498356
 ] 

Ben Baumgold commented on ARROW-9404:
-

Looks like https://github.com/apache/arrow/pull/8578 implemented this feature, 
but the PR seems abandoned.  It would be nice to find a way to push it over the 
finish line so Arrow can support Decimal[16|32|64].

> [C++] Add support for Decimal16, Decimal32 and Decimal64
> 
>
> Key: ARROW-9404
> URL: https://issues.apache.org/jira/browse/ARROW-9404
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Artem Alekseev
>Priority: Major
>
> It looks like arrow lacks support for decimal16, decimal32 and decimal64 
> types. Are there any reasons for that?





[jira] [Updated] (ARROW-15455) [C++] Cast between fixed size list type and variable size list

2022-02-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-15455:
---
Labels: good-second-issue kernel pull-request-available  (was: 
good-second-issue kernel)

> [C++] Cast between fixed size list type and variable size list 
> ---
>
> Key: ARROW-15455
> URL: https://issues.apache.org/jira/browse/ARROW-15455
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Jabari Booker
>Priority: Major
>  Labels: good-second-issue, kernel, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Casting from fixed size list to variable size list could be possible, I 
> think, but currently doesn't work:
> {code:python}
> >>> fixed_size = pa.array([[1, 2], [3, 4]], type=pa.list_(pa.int64(), 2))
> >>> fixed_size.cast(pa.list_(pa.int64()))
> ...
> ArrowNotImplementedError: Unsupported cast from fixed_size_list<item: int64>[2] to list using function cast_list
> {code}
> And in principle, a cast the other way around could also be possible if it is 
> checked that each list has the correct length.
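The relationship between the two layouts is mechanical: a fixed-size list of width k over n lists corresponds to a variable-size list whose offsets are 0, k, 2k, ..., n*k, and the reverse cast is valid only when every list length equals k. A small plain-Python sketch of just that bookkeeping (helper names invented for illustration, not the Arrow API):

```python
def fixed_to_offsets(n_lists, list_size):
    # A variable-size list array stores n_lists + 1 offsets; for a
    # fixed-size source they are simply multiples of list_size.
    return [i * list_size for i in range(n_lists + 1)]

def can_cast_to_fixed(offsets, list_size):
    # The variable -> fixed cast requires every list length to equal list_size.
    return all(b - a == list_size for a, b in zip(offsets, offsets[1:]))

print(fixed_to_offsets(2, 2))          # offsets for the example array above
print(can_cast_to_fixed([0, 2, 5], 2)) # last list has length 3, so invalid
```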





[jira] [Closed] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads

2022-02-25 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace closed ARROW-15785.
---
Resolution: Not A Problem

> [Benchmarks] Add conbench benchmark for single-file parquet reads
> -
>
> Key: ARROW-15785
> URL: https://issues.apache.org/jira/browse/ARROW-15785
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>
> Release 7.0.0 introduced a regression in parquet single file reads.  We 
> should add a macro-level benchmark that does single-file reads to help us 
> detect this in the future.





[jira] [Commented] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads

2022-02-25 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498355#comment-17498355
 ] 

Weston Pace commented on ARROW-15785:
-

Ah, good point.  Yes, that is the commit where the regression was introduced, 
and it appears that our macro benchmarks can indeed catch this case.  So no new 
benchmarks are needed.

> [Benchmarks] Add conbench benchmark for single-file parquet reads
> -
>
> Key: ARROW-15785
> URL: https://issues.apache.org/jira/browse/ARROW-15785
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>
> Release 7.0.0 introduced a regression in parquet single file reads.  We 
> should add a macro-level benchmark that does single-file reads to help us 
> detect this in the future.





[jira] [Resolved] (ARROW-15272) [Java] ArrowVectorIterator eats initialization exceptions when close fails

2022-02-25 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-15272.
--
Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12094
[https://github.com/apache/arrow/pull/12094]

> [Java] ArrowVectorIterator eats initialization exceptions when close fails
> --
>
> Key: ARROW-15272
> URL: https://issues.apache.org/jira/browse/ARROW-15272
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 6.0.1
>Reporter: Andrew Higgins
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In ArrowVectorIterator's create method exceptions thrown during initialize() 
> are eaten if there are further exceptions while closing the iterator.





[jira] [Created] (ARROW-15790) field's metadata is not written into Parquet file

2022-02-25 Thread Sifang Li (Jira)
Sifang Li created ARROW-15790:
-

 Summary: field's metadata is not written into Parquet file
 Key: ARROW-15790
 URL: https://issues.apache.org/jira/browse/ARROW-15790
 Project: Apache Arrow
  Issue Type: Bug
 Environment: Ubuntu
Reporter: Sifang Li


I used this code to test the metadata write and read-back behavior of a Parquet 
file:

[https://gist.github.com/dantrim/33f9f14d0b2d3ec45c022aa05f7a45ee]

 

The generated file does not have the metadata when I read the file back with the 
code below and print it out: 
 
{quote}std::shared_ptr<arrow::io::ReadableFile> infile;
PARQUET_ASSIGN_OR_THROW(infile,
arrow::io::ReadableFile::Open("./test.parquet", arrow::default_memory_pool()));

std::unique_ptr<parquet::arrow::FileReader> reader;
PARQUET_THROW_NOT_OK(
parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
std::shared_ptr<arrow::Table> table;
PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
EXPECT_EQ(frameCount, table->num_rows());
std::cout << "===" << table->schema()->ToString(true) << std::endl;

[jira] [Updated] (ARROW-15789) [C++] Update OpenTelemetry to v1.2.0

2022-02-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-15789:
---
Labels: pull-request-available  (was: )

> [C++] Update OpenTelemetry to v1.2.0
> 
>
> Key: ARROW-15789
> URL: https://issues.apache.org/jira/browse/ARROW-15789
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We're currently on v1.1.0 and there were some minor API changes in v1.1.1.





[jira] [Created] (ARROW-15789) [C++] Update OpenTelemetry to v1.2.0

2022-02-25 Thread David Li (Jira)
David Li created ARROW-15789:


 Summary: [C++] Update OpenTelemetry to v1.2.0
 Key: ARROW-15789
 URL: https://issues.apache.org/jira/browse/ARROW-15789
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: David Li
Assignee: David Li


We're currently on v1.1.0 and there were some minor API changes in v1.1.1.





[jira] [Comment Edited] (ARROW-14549) VectorSchemaRoot is not refreshed when value is null

2022-02-25 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498249#comment-17498249
 ] 

Bryan Cutler edited comment on ARROW-14549 at 2/25/22, 6:22 PM:


[~hu6360567] Calling `allocateNew()` will create new buffers, which is one way 
to clear previous results. If you don't want to allocate any new memory, you 
would need to zero out all the vectors by calling `zeroVector()` and 
`setValueCount(0)`. If you don't do either of these, the incorrect data you see 
is expected.


was (Author: bryanc):
[~hu6360567] Calling `allocateNew()` will create new buffers, which is one way 
to clear previous results. If you don't want to allocate any new memory, you 
would need to zero out all the vectors by calling `zeroVector()` and 
`setValueCount(0)`

> VectorSchemaRoot is not refreshed when value is null
> 
>
> Key: ARROW-14549
> URL: https://issues.apache.org/jira/browse/ARROW-14549
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 6.0.0
>Reporter: Wenbo Hu
>Priority: Major
>
> I'm using `arrow-jdbc` to convert query results from JDBC to Arrow.
>  But with the following code, unexpected behavior happens.
> Assuming a sqlite db, the 2nd row of col_2 and col_3 are null.
> |col_1|col_2|col_3|
> |1|abc|3.14|
> |2|NULL|NULL|
> As document suggests,
> {quote}populated data over and over into the same VectorSchemaRoot in a 
> stream of batches rather than creating a new VectorSchemaRoot instance each 
> time.
> {quote}
> *JdbcToArrowConfig* is set to reuse root.
> {code:java}
> public void querySql(String query, QueryOption option) throws Exception {
>  try (final java.sql.Connection conn = connectContainer.getConnection();
>  final Statement stmt = conn.createStatement();
>  final ResultSet rs = stmt.executeQuery(query)
>  ) {
>  // create config with reuse schema root and custom batch size from option
>  final JdbcToArrowConfig config = new 
> JdbcToArrowConfigBuilder().setAllocator(new 
> RootAllocator()).setCalendar(JdbcToArrowUtils.getUtcCalendar())
>  
> .setTargetBatchSize(option.getBatchSize()).setReuseVectorSchemaRoot(true).build();
>   final ArrowVectorIterator iterator = 
> JdbcToArrow.sqlToArrowVectorIterator(rs, config);
>while (iterator.hasNext()){ // retrieve result from iterator 
>  final VectorSchemaRoot root = iterator.next(); 
> option.getCallback().handleBatchResult(root); 
>  root.allocateNew(); // it has to be allocate new 
>    }
>   } catch (java.lang.Exception e){ throw new Exception(e.getMessage()); }
>  }
>  
>  ..
>  // batch_size is set to 1, then callback is called twice.
>  QueryOptions options = new QueryOption(1, 
>  root -> {
>  // if printer is not set, get schema, write header
>  if (printer == null) { 
>   final String[] headers = 
> root.getSchema().getFields().stream().map(Field::getName).toArray(String[]::new);
>  
>   printer = new CSVPrinter(writer, 
> CSVFormat.Builder.create(CSVFormat.DEFAULT).setHeader(headers).build()); 
>   }
>  
>  final int rows = root.getRowCount();
>  final List<FieldVector> fieldVectors = root.getFieldVectors();
>  
>  // iterate over rows
>  for (int i = 0; i < rows; i++) { 
>   final int rowId = i; 
>   final List<String> row = fieldVectors.stream().map(v -> 
> v.getObject(rowId)).map(String::valueOf).collect(Collectors.toList()); 
> printer.printRecord(row); 
>   }
>  });
>  
>  connection.querySql("SELECT * FROM test_db", options);
>  ..
> {code}
> If `root.allocateNew()` is called, the CSV output is as expected:
>  ```
>  column_1,column_2,column_3
>  1,abc,3.14
>  2,null,null
>  ```
>  Otherwise, the null values of the 2nd row retain the values of the 1st row:
>  ```
>  column_1,column_2,column_3
>  1,abc,3.14
>  2,abc,3.14
>  ```
> **Question: Is it expected to call `allocateNew` every time the schema root 
> is reused?**
> Without reusing the schema root, the following code works as expected.
> {code:java}
>  public void querySql(String query, QueryOption option) throws Exception {
>  try (final java.sql.Connection conn = connectContainer.getConnection();
>  final Statement stmt = conn.createStatement();
>  final ResultSet rs = stmt.executeQuery(query)) {
>  // create config without reuse schema root and custom batch size from 
> option
>  final JdbcToArrowConfig config = new 
> JdbcToArrowConfigBuilder().setAllocator(new 
> RootAllocator()).setCalendar(JdbcToArrowUtils.getUtcCalendar())
>  
> .setTargetBatchSize(option.getBatchSize()).setReuseVectorSchemaRoot(false).build();
>  
>  final ArrowVectorIterator iterator = 
> JdbcToArrow.sqlToArrowVectorIterator(rs, config);
>  while (iterator.hasNext()) {
>  // retrieve result from iterator
>  try (VectorSchemaRoot root

[jira] [Resolved] (ARROW-14549) VectorSchemaRoot is not refreshed when value is null

2022-02-25 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-14549.
--
Resolution: Not A Problem

> VectorSchemaRoot is not refreshed when value is null
> 
>
> Key: ARROW-14549
> URL: https://issues.apache.org/jira/browse/ARROW-14549
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 6.0.0
>Reporter: Wenbo Hu
>Priority: Major
>
> I'm using `arrow-jdbc` to convert query results from JDBC to Arrow.
>  But with the following code, unexpected behavior happens.
> Assuming a sqlite db, the 2nd row of col_2 and col_3 are null.
> |col_1|col_2|col_3|
> |1|abc|3.14|
> |2|NULL|NULL|
> As document suggests,
> {quote}populated data over and over into the same VectorSchemaRoot in a 
> stream of batches rather than creating a new VectorSchemaRoot instance each 
> time.
> {quote}
> *JdbcToArrowConfig* is set to reuse root.
> {code:java}
> public void querySql(String query, QueryOption option) throws Exception {
>  try (final java.sql.Connection conn = connectContainer.getConnection();
>  final Statement stmt = conn.createStatement();
>  final ResultSet rs = stmt.executeQuery(query)
>  ) {
>  // create config with reuse schema root and custom batch size from option
>  final JdbcToArrowConfig config = new 
> JdbcToArrowConfigBuilder().setAllocator(new 
> RootAllocator()).setCalendar(JdbcToArrowUtils.getUtcCalendar())
>  
> .setTargetBatchSize(option.getBatchSize()).setReuseVectorSchemaRoot(true).build();
>   final ArrowVectorIterator iterator = 
> JdbcToArrow.sqlToArrowVectorIterator(rs, config);
>while (iterator.hasNext()){ // retrieve result from iterator 
>  final VectorSchemaRoot root = iterator.next(); 
> option.getCallback().handleBatchResult(root); 
>  root.allocateNew(); // it has to be allocate new 
>    }
>   } catch (java.lang.Exception e){ throw new Exception(e.getMessage()); }
>  }
>  
>  ..
>  // batch_size is set to 1, then callback is called twice.
>  QueryOptions options = new QueryOption(1, 
>  root -> {
>  // if printer is not set, get schema, write header
>  if (printer == null) { 
>   final String[] headers = 
> root.getSchema().getFields().stream().map(Field::getName).toArray(String[]::new);
>  
>   printer = new CSVPrinter(writer, 
> CSVFormat.Builder.create(CSVFormat.DEFAULT).setHeader(headers).build()); 
>   }
>  
>  final int rows = root.getRowCount();
>  final List<FieldVector> fieldVectors = root.getFieldVectors();
>  
>  // iterate over rows
>  for (int i = 0; i < rows; i++) { 
>   final int rowId = i; 
>   final List<String> row = fieldVectors.stream().map(v -> 
> v.getObject(rowId)).map(String::valueOf).collect(Collectors.toList()); 
> printer.printRecord(row); 
>   }
>  });
>  
>  connection.querySql("SELECT * FROM test_db", options);
>  ..
> {code}
> If `root.allocateNew()` is called, the CSV output is as expected:
>  ```
>  column_1,column_2,column_3
>  1,abc,3.14
>  2,null,null
>  ```
>  Otherwise, the null values of the 2nd row retain the values of the 1st row:
>  ```
>  column_1,column_2,column_3
>  1,abc,3.14
>  2,abc,3.14
>  ```
> **Question: Is it expected to call `allocateNew` every time the schema root 
> is reused?**
> Without reusing the schema root, the following code works as expected.
> {code:java}
>  public void querySql(String query, QueryOption option) throws Exception {
>  try (final java.sql.Connection conn = connectContainer.getConnection();
>  final Statement stmt = conn.createStatement();
>  final ResultSet rs = stmt.executeQuery(query)) {
>  // create config without reuse schema root and custom batch size from 
> option
>  final JdbcToArrowConfig config = new 
> JdbcToArrowConfigBuilder().setAllocator(new 
> RootAllocator()).setCalendar(JdbcToArrowUtils.getUtcCalendar())
>  
> .setTargetBatchSize(option.getBatchSize()).setReuseVectorSchemaRoot(false).build();
>  
>  final ArrowVectorIterator iterator = 
> JdbcToArrow.sqlToArrowVectorIterator(rs, config);
>  while (iterator.hasNext()) {
>  // retrieve result from iterator
>  try (VectorSchemaRoot root = iterator.next()) { 
>   option.getCallback().handleBatchResult(root); root.allocateNew(); 
>   }
>}
>  } catch (java.lang.Exception e) { throw new Exception(e.getMessage()); }
> }
> {code}





[jira] [Commented] (ARROW-14549) VectorSchemaRoot is not refreshed when value is null

2022-02-25 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498249#comment-17498249
 ] 

Bryan Cutler commented on ARROW-14549:
--

[~hu6360567] Calling `allocateNew()` will create new buffers, which is one way 
to clear previous results. If you don't want to allocate any new memory, you 
would need to zero out all the vectors by calling `zeroVector()` and 
`setValueCount(0)`

> VectorSchemaRoot is not refreshed when value is null
> 
>
> Key: ARROW-14549
> URL: https://issues.apache.org/jira/browse/ARROW-14549
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 6.0.0
>Reporter: Wenbo Hu
>Priority: Major
>
> I'm using `arrow-jdbc` to convert query results from JDBC to Arrow.
>  But with the following code, unexpected behavior happens.
> Assuming a sqlite db, the 2nd row of col_2 and col_3 are null.
> |col_1|col_2|col_3|
> |1|abc|3.14|
> |2|NULL|NULL|
> As document suggests,
> {quote}populated data over and over into the same VectorSchemaRoot in a 
> stream of batches rather than creating a new VectorSchemaRoot instance each 
> time.
> {quote}
> *JdbcToArrowConfig* is set to reuse root.
> {code:java}
> public void querySql(String query, QueryOption option) throws Exception {
>  try (final java.sql.Connection conn = connectContainer.getConnection();
>  final Statement stmt = conn.createStatement();
>  final ResultSet rs = stmt.executeQuery(query)
>  ) {
>  // create config with reuse schema root and custom batch size from option
>  final JdbcToArrowConfig config = new 
> JdbcToArrowConfigBuilder().setAllocator(new 
> RootAllocator()).setCalendar(JdbcToArrowUtils.getUtcCalendar())
>  
> .setTargetBatchSize(option.getBatchSize()).setReuseVectorSchemaRoot(true).build();
>   final ArrowVectorIterator iterator = 
> JdbcToArrow.sqlToArrowVectorIterator(rs, config);
>while (iterator.hasNext()){ // retrieve result from iterator 
>  final VectorSchemaRoot root = iterator.next(); 
> option.getCallback().handleBatchResult(root); 
>  root.allocateNew(); // it has to be allocate new 
>    }
>   } catch (java.lang.Exception e){ throw new Exception(e.getMessage()); }
>  }
>  
>  ..
>  // batch_size is set to 1, then callback is called twice.
>  QueryOptions options = new QueryOption(1, 
>  root -> {
>  // if printer is not set, get schema, write header
>  if (printer == null) { 
>   final String[] headers = 
> root.getSchema().getFields().stream().map(Field::getName).toArray(String[]::new);
>  
>   printer = new CSVPrinter(writer, 
> CSVFormat.Builder.create(CSVFormat.DEFAULT).setHeader(headers).build()); 
>   }
>  
>  final int rows = root.getRowCount();
>  final List<FieldVector> fieldVectors = root.getFieldVectors();
>  
>  // iterate over rows
>  for (int i = 0; i < rows; i++) { 
>   final int rowId = i; 
>   final List<String> row = fieldVectors.stream().map(v -> 
> v.getObject(rowId)).map(String::valueOf).collect(Collectors.toList()); 
> printer.printRecord(row); 
>   }
>  });
>  
>  connection.querySql("SELECT * FROM test_db", options);
>  ..
> {code}
> If `root.allocateNew()` is called, the CSV output is as expected:
>  ```
>  column_1,column_2,column_3
>  1,abc,3.14
>  2,null,null
>  ```
>  Otherwise, the null values of the 2nd row retain the values of the 1st row:
>  ```
>  column_1,column_2,column_3
>  1,abc,3.14
>  2,abc,3.14
>  ```
> **Question: Is it expected to call `allocateNew` every time the schema root 
> is reused?**
> Without reusing the schema root, the following code works as expected.
> {code:java}
>  public void querySql(String query, QueryOption option) throws Exception {
>  try (final java.sql.Connection conn = connectContainer.getConnection();
>  final Statement stmt = conn.createStatement();
>  final ResultSet rs = stmt.executeQuery(query)) {
>  // create config without reuse schema root and custom batch size from 
> option
>  final JdbcToArrowConfig config = new 
> JdbcToArrowConfigBuilder().setAllocator(new 
> RootAllocator()).setCalendar(JdbcToArrowUtils.getUtcCalendar())
>  
> .setTargetBatchSize(option.getBatchSize()).setReuseVectorSchemaRoot(false).build();
>  
>  final ArrowVectorIterator iterator = 
> JdbcToArrow.sqlToArrowVectorIterator(rs, config);
>  while (iterator.hasNext()) {
>  // retrieve result from iterator
>  try (VectorSchemaRoot root = iterator.next()) { 
>   option.getCallback().handleBatchResult(root); root.allocateNew(); 
>   }
>}
>  } catch (java.lang.Exception e) { throw new Exception(e.getMessage()); }
> }
> {code}





[jira] [Resolved] (ARROW-14665) [Java] JdbcToArrowUtils ResultSet iteration bug

2022-02-25 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-14665.
--
Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 11667
[https://github.com/apache/arrow/pull/11667]

> [Java] JdbcToArrowUtils ResultSet iteration bug
> ---
>
> Key: ARROW-14665
> URL: https://issues.apache.org/jira/browse/ARROW-14665
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 6.0.0
>Reporter: Zac
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When specifying a target batch size, the [iteration 
> logic|https://github.com/apache/arrow/blob/ea42b9e0aa000238fff22fd48f06f3aa516b9f3f/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java#L266]
>  is currently broken:
> {code:java}
> while (rs.next() && readRowCount < config.getTargetBatchSize()) {
>   compositeConsumer.consume(rs);
>   readRowCount++;
> }
> {code}
> Calling next() on the result set will move the cursor forward to the next 
> row, even when we've reached the target batch size.
> For example, consider setting target batch size to 1, and query a table that 
> has three rows.
> On the first iteration, we'll successfully consume the first row. On the next 
> iteration, we'll move the cursor to row 2, but detect the read row count is 
> no longer < target batch size and return.
> Upon calling into the method again with the same result set, rs.next will be 
> called again which will result in successfully consuming row 3.
> *Problem:* row 2 is skipped! 
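The skip can be reproduced with a minimal cursor model in plain Python (the `Cursor` class and helper functions are invented for illustration, not the Arrow or JDBC API). One possible fix, sketched here, is to advance the cursor only when the row will actually be consumed:

```python
class Cursor:
    """Minimal stand-in for a JDBC ResultSet: next() advances the
    cursor and reports whether a row is available."""
    def __init__(self, rows):
        self._rows, self._pos = rows, -1

    def next(self):
        self._pos += 1
        return self._pos < len(self._rows)

    def current(self):
        return self._rows[self._pos]

def read_all_buggy(rows, batch_size):
    cur, batches = Cursor(rows), []
    while True:
        batch, count = [], 0
        # Bug (as described above): next() runs before the size check,
        # so the row it advanced past is dropped when the batch fills up.
        while cur.next() and count < batch_size:
            batch.append(cur.current())
            count += 1
        if not batch:
            return batches
        batches.append(batch)

def read_all_fixed(rows, batch_size):
    cur, batches, batch = Cursor(rows), [], []
    while cur.next():                 # advance only when we will consume
        batch.append(cur.current())
        if len(batch) == batch_size:  # close the batch without advancing
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)
    return batches

print(read_all_buggy([1, 2, 3], 1))  # [[1], [3]] -- row 2 is lost
print(read_all_fixed([1, 2, 3], 1))  # [[1], [2], [3]]
```

With a target batch size of 1 and three rows, the buggy loop loses the middle row exactly as the description says.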





[jira] [Resolved] (ARROW-15742) [Go] Implement 'bitmap_neon' with Arm64 GoLang Assembly

2022-02-25 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-15742.
---
Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12502
[https://github.com/apache/arrow/pull/12502]

> [Go] Implement 'bitmap_neon' with Arm64 GoLang Assembly 
> 
>
> Key: ARROW-15742
> URL: https://issues.apache.org/jira/browse/ARROW-15742
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Go
>Reporter: Yuqi Gu
>Assignee: Yuqi Gu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> 1. Implement 'extract_bits' with Arm64 GoLang Assembly. '_pext_u64' is the 
> x86 BMI2 intrinsic for extract_bits.
> There is no equivalent of the '_pext_u64' instruction on Arm64.
> The task is to implement an equivalent of '_pext_u64' in Arm64 assembly.
> 2. Implement 'levels_to_bitmap' with Arm64 GoLang Assembly for 
> greaterThanBitmapNEON
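For reference, `_pext_u64` (parallel bit extract) gathers the bits of a value selected by a mask into the contiguous low bits of the result. A scalar plain-Python model of those semantics follows; the Arm64 task above is to reproduce this behavior efficiently in assembly, not bit-by-bit like this sketch:

```python
def pext(value, mask):
    """Scalar model of x86 BMI2 _pext_u64: collect the bits of `value`
    at positions where `mask` is 1, packed into the low bits."""
    out, out_pos, bit = 0, 0, 0
    while mask >> bit:
        if (mask >> bit) & 1:
            out |= ((value >> bit) & 1) << out_pos
            out_pos += 1
        bit += 1
    return out

print(bin(pext(0b10110010, 0b11110000)))  # 0b1011: the four high bits, packed low
```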





[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects

2022-02-25 Thread Vibhatha Lakmal Abeykoon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498231#comment-17498231
 ] 

Vibhatha Lakmal Abeykoon commented on ARROW-15765:
--

Sure, I will give it a try and post what I find out. 

> [Python] Extracting Type information from Python Objects
> 
>
> Key: ARROW-15765
> URL: https://issues.apache.org/jira/browse/ARROW-15765
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> When creating user-defined functions or similar exercises where we want to 
> extract the Arrow data types from type hints, the existing Python API 
> has some limitations. 
> An example case is as follows:
> {code:python}
> def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array:
>     return pc.call_function("add", [array1, array2])
> {code}
> We want to extract the fact that array1 is a `pa.Array` of `pa.Int64Type`. 
> At the moment there is no straightforward way to get this done, so the idea 
> is to expose this feature to Python. 
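For context, Python's standard `typing` machinery can already surface the annotations themselves; the missing piece this issue describes is mapping them to Arrow types. A stdlib-only sketch (the `Int64Array` class here is a hypothetical stand-in for `pa.Int64Array`):

```python
import typing

class Int64Array:
    """Hypothetical stand-in for pa.Int64Array."""

def function(array1: Int64Array, array2: Int64Array) -> Int64Array:
    ...

# get_type_hints resolves the annotations of each parameter and the return
hints = typing.get_type_hints(function)
print(hints)
```

From `hints` a UDF registry could look up the corresponding Arrow `DataType` for each parameter; that lookup is the part the issue proposes exposing.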





[jira] [Created] (ARROW-15788) [C++][FlightRPC] Support alternative transports in the Flight benchmark

2022-02-25 Thread David Li (Jira)
David Li created ARROW-15788:


 Summary: [C++][FlightRPC] Support alternative transports in the 
Flight benchmark
 Key: ARROW-15788
 URL: https://issues.apache.org/jira/browse/ARROW-15788
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, FlightRPC
Reporter: David Li
Assignee: David Li


A follow-up to ARROW-15282.





[jira] [Resolved] (ARROW-15781) [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL

2022-02-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-15781.
---
Resolution: Fixed

Issue resolved by pull request 12509
[https://github.com/apache/arrow/pull/12509]

> [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL
> -
>
> Key: ARROW-15781
> URL: https://issues.apache.org/jira/browse/ARROW-15781
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> See https://github.com/apache/arrow/issues/12501





[jira] [Updated] (ARROW-15781) [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL

2022-02-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-15781:
--
Fix Version/s: 8.0.0

> [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL
> -
>
> Key: ARROW-15781
> URL: https://issues.apache.org/jira/browse/ARROW-15781
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> See https://github.com/apache/arrow/issues/12501





[jira] [Updated] (ARROW-15787) [C++] Temporal floor/ceil/round kernels could be optimised with templating

2022-02-25 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc updated ARROW-15787:
---
Priority: Minor  (was: Major)

> [C++] Temporal floor/ceil/round kernels could be optimised with templating
> --
>
> Key: ARROW-15787
> URL: https://issues.apache.org/jira/browse/ARROW-15787
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Rok Mihevc
>Priority: Minor
>
> [CeilTemporal, FloorTemporal, 
> RoundTemporal|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L728-L980]
>  kernels could probably be templated in a clean way. They also execute a 
> switch statement for every call instead of creating an operator at kernel 
> call time and only running that.
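The proposed change can be sketched in plain Python (function and parameter names invented for illustration): resolve the operator once per kernel call instead of branching per element, which is the run-time analogue of what C++ templating achieves at compile time.

```python
import math

def round_temporal_per_element(values, multiple, op_name):
    # Current pattern (sketched): a branch runs for every element.
    out = []
    for v in values:
        if op_name == "floor":
            out.append(math.floor(v / multiple) * multiple)
        elif op_name == "ceil":
            out.append(math.ceil(v / multiple) * multiple)
        else:
            out.append(round(v / multiple) * multiple)
    return out

def round_temporal_dispatch_once(values, multiple, op_name):
    # Suggested pattern: pick the operator once, then run a tight loop.
    ops = {
        "floor": lambda v: math.floor(v / multiple) * multiple,
        "ceil": lambda v: math.ceil(v / multiple) * multiple,
        "round": lambda v: round(v / multiple) * multiple,
    }
    op = ops[op_name]  # selected once per call, not once per element
    return [op(v) for v in values]

print(round_temporal_dispatch_once([7, 13], 5, "floor"))  # [5, 10]
```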





[jira] [Updated] (ARROW-15787) [C++] Temporal floor/ceil/round kernels could be optimised with templating

2022-02-25 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc updated ARROW-15787:
---
Labels: kernel  (was: )

> [C++] Temporal floor/ceil/round kernels could be optimised with templating
> --
>
> Key: ARROW-15787
> URL: https://issues.apache.org/jira/browse/ARROW-15787
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Minor
>  Labels: kernel
>
> [CeilTemporal, FloorTemporal, 
> RoundTemporal|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L728-L980]
>  kernels could probably be templated in a clean way. They also execute a 
> switch statement on every call, instead of creating an operator once at 
> kernel-creation time and then only running it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-15787) [C++] Temporal floor/ceil/round kernels could be optimised with templating

2022-02-25 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc updated ARROW-15787:
---
Component/s: C++

> [C++] Temporal floor/ceil/round kernels could be optimised with templating
> --
>
> Key: ARROW-15787
> URL: https://issues.apache.org/jira/browse/ARROW-15787
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Minor
>
> [CeilTemporal, FloorTemporal, 
> RoundTemporal|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L728-L980]
>  kernels could probably be templated in a clean way. They also execute a 
> switch statement on every call, instead of creating an operator once at 
> kernel-creation time and then only running it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15787) [C++] Temporal floor/ceil/round kernels could be optimised with templating

2022-02-25 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-15787:
--

 Summary: [C++] Temporal floor/ceil/round kernels could be 
optimised with templating
 Key: ARROW-15787
 URL: https://issues.apache.org/jira/browse/ARROW-15787
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Rok Mihevc


[CeilTemporal, FloorTemporal, 
RoundTemporal|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L728-L980]
 kernels could probably be templated in a clean way. They also execute a switch 
statement on every call, instead of creating an operator once at kernel-creation 
time and then only running it.
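The dispatch change suggested above can be sketched outside of C++ as well. The following Python snippet is only an illustrative analogy, not Arrow's API: the helper name `make_floor_operator` and its supported units are hypothetical. The point it shows is that the unit is resolved once, when the operator is built, rather than re-checked inside every call.

```python
from datetime import datetime

# Hypothetical sketch: resolve the rounding unit once, at operator-creation
# time, and return a closure that no longer needs a per-call switch.
def make_floor_operator(unit: str):
    if unit == "hour":
        return lambda ts: ts.replace(minute=0, second=0, microsecond=0)
    elif unit == "day":
        return lambda ts: ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")

# The switch above runs once; every subsequent call just runs the closure.
floor_hour = make_floor_operator("hour")
print(floor_hour(datetime(2022, 2, 25, 14, 16, 30)))  # 2022-02-25 14:00:00
```

In the C++ kernels, the analogous effect would come from templating on the unit (or building a function object at kernel-initialization time), so the hot loop contains no branching on the unit.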



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-15680) [C++] Temporal floor/ceil/round should accept week_starts_monday when rounding to multiple of week

2022-02-25 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc updated ARROW-15680:
---
Priority: Major  (was: Minor)

> [C++] Temporal floor/ceil/round should accept week_starts_monday when 
> rounding to multiple of week
> ---
>
> Key: ARROW-15680
> URL: https://issues.apache.org/jira/browse/ARROW-15680
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Rok Mihevc
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: kernel, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See ARROW-14821 and the [related 
> PR|https://github.com/apache/arrow/pull/12154].



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-15680) [C++] Temporal floor/ceil/round should accept week_starts_monday when rounding to multiple of week

2022-02-25 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc updated ARROW-15680:
---
Priority: Minor  (was: Major)

> [C++] Temporal floor/ceil/round should accept week_starts_monday when 
> rounding to multiple of week
> ---
>
> Key: ARROW-15680
> URL: https://issues.apache.org/jira/browse/ARROW-15680
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Rok Mihevc
>Assignee: Rok Mihevc
>Priority: Minor
>  Labels: kernel, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See ARROW-14821 and the [related 
> PR|https://github.com/apache/arrow/pull/12154].



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15666) [C++][Python][R] Add format inference option to StrptimeOptions

2022-02-25 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498152#comment-17498152
 ] 

Rok Mihevc commented on ARROW-15666:


Thanks for the warning, Matthew, much appreciated!
Looking at the utility-to-complexity ratio, this does seem like something we'd 
better avoid.

One idea would be to use the already existing pandas logic (if pandas is 
available at runtime) to do the format inference and then pass the inferred 
format to C++, performing the rest of the operation there. The same goes for 
lubridate in R.
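As a rough illustration of the "infer once, then hand a format string to C++" idea, here is a minimal Python sketch. It tries a short, hypothetical list of strptime formats rather than pandas' real inference logic, which is considerably more sophisticated:

```python
from datetime import datetime
from typing import Optional

# Illustrative candidate list only; real inference (pandas, lubridate)
# handles far more formats and ambiguities.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d",
    "%d/%m/%Y",
]

def guess_timestamp_format(sample: str) -> Optional[str]:
    # Return the first candidate format that parses the sample string.
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(sample, fmt)
            return fmt
        except ValueError:
            continue
    return None

print(guess_timestamp_format("2022-02-25 14:16:00"))  # %Y-%m-%d %H:%M:%S
```

The inferred format string would then be passed to the C++ strptime kernel, which does the actual parsing of the full column.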

> [C++][Python][R] Add format inference option to StrptimeOptions
> ---
>
> Key: ARROW-15666
> URL: https://issues.apache.org/jira/browse/ARROW-15666
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Rok Mihevc
>Priority: Major
>
> We want to have an option to infer timestamp format.
> See 
> [pandas.to_datetime|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html]
>  and lubridate 
> [parse_date_time|https://lubridate.tidyverse.org/reference/parse_date_time.html]
>  for examples.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads

2022-02-25 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498146#comment-17498146
 ] 

Jonathan Keane edited comment on ARROW-15785 at 2/25/22, 2:16 PM:
--

I think this is the PR that introduced the regression (though I might be 
totally off or it's a different regression...) 

https://github.com/apache/arrow/pull/11991#issuecomment-1009216946

And the conbench run: 
https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/

We should probably have the conbench bot alert more loudly when there are 
regressions of this magnitude. The 5% threshold is supposed to indicate that 
there's an issue, but we might have it set too low, leading to alarm fatigue, 
and/or we should alert louder when there are this many high-change benchmarks 
(e.g. the file-read benchmark z-scores range from -76 to -759, and we alert 
at -5).


was (Author: jonkeane):
I think this is the PR that introduced the regression (though I might be 
totally off or it's a different one...) 

https://github.com/apache/arrow/pull/11991#issuecomment-1009216946

And the conbench run: 
https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/

We should probably have the conbench bot alert more loudly when there are 
regressions of this magnitude. The 5% threshold is supposed to indicate that 
there's an issue, but we might have it set too low, leading to alarm fatigue, 
and/or we should alert louder when there are this many high-change benchmarks 
(e.g. the file-read benchmark z-scores range from -76 to -759, and we alert 
at -5).

> [Benchmarks] Add conbench benchmark for single-file parquet reads
> -
>
> Key: ARROW-15785
> URL: https://issues.apache.org/jira/browse/ARROW-15785
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>
> Release 7.0.0 introduced a regression in parquet single file reads.  We 
> should add a macro-level benchmark that does single-file reads to help us 
> detect this in the future.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads

2022-02-25 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498146#comment-17498146
 ] 

Jonathan Keane commented on ARROW-15785:


I think this is the PR that introduced the regression (though I might be 
totally off or it's a different one...) 

https://github.com/apache/arrow/pull/11991#issuecomment-1009216946

And the conbench run: 
https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/

We should probably have the conbench bot alert more loudly when there are 
regressions of this magnitude. The 5% threshold is supposed to indicate that 
there's an issue, but we might have it set too low, leading to alarm fatigue, 
and/or we should alert louder when there are this many high-change benchmarks 
(e.g. the file-read benchmark z-scores range from -76 to -759, and we alert 
at -5).
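The z-score alerting discussed here can be sketched as follows; the threshold and the data are illustrative only and do not reflect conbench's actual implementation:

```python
import statistics

# Hedged sketch of z-score-based regression alerting: compare a new
# benchmark result against the distribution of historical results, and
# alert when the z-score falls below a threshold.
def z_score(history, new_result):
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return (new_result - mean) / stdev

def is_regression(history, new_result, threshold=-5.0):
    # For throughput-style metrics, a strongly negative z-score means the
    # new result is far below the historical mean, i.e. a likely regression.
    return z_score(history, new_result) < threshold

history = [100.0, 101.0, 99.0, 100.5, 99.5]  # historical throughput samples
print(is_regression(history, 60.0))  # a large drop is flagged
```

The comment above is effectively asking for a second signal on top of this: not just "is any z-score below -5", but "how many benchmarks moved, and by how much", so that a batch of -76 to -759 scores alerts more loudly than a single borderline one.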

> [Benchmarks] Add conbench benchmark for single-file parquet reads
> -
>
> Key: ARROW-15785
> URL: https://issues.apache.org/jira/browse/ARROW-15785
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>
> Release 7.0.0 introduced a regression in parquet single file reads.  We 
> should add a macro-level benchmark that does single-file reads to help us 
> detect this in the future.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads

2022-02-25 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498144#comment-17498144
 ] 

Jonathan Keane commented on ARROW-15785:


Do the Python [1] and R [2] benchmarks for single file reads do this?

Oddly(?), the Python benchmarks do show a jump around January:
https://conbench.ursa.dev/benchmarks/8c5cc1a939d8485eb6c42af83f82c8c0/
https://conbench.ursa.dev/benchmarks/1b8d2dae6f664fd19579071a7cf7766b/

But the corresponding R ones do not: 
https://conbench.ursa.dev/benchmarks/ca493bf17af84ae5babd97f385b69afc/

[1] 
https://github.com/ursacomputing/benchmarks/blob/main/benchmarks/file_benchmark.py
[2] https://github.com/ursacomputing/arrowbench/blob/main/R/bm-read-file.R

> [Benchmarks] Add conbench benchmark for single-file parquet reads
> -
>
> Key: ARROW-15785
> URL: https://issues.apache.org/jira/browse/ARROW-15785
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>
> Release 7.0.0 introduced a regression in parquet single file reads.  We 
> should add a macro-level benchmark that does single-file reads to help us 
> detect this in the future.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-14943) [R] Bindings for lubridate's ddays, dhours, dminutes, dmonths, dweeks, dyears

2022-02-25 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim reassigned ARROW-14943:
---

Assignee: Alenka Frim

> [R] Bindings for lubridate's ddays, dhours, dminutes, dmonths, dweeks, dyears
> -
>
> Key: ARROW-14943
> URL: https://issues.apache.org/jira/browse/ARROW-14943
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Alenka Frim
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-14942) [R] Bindings for lubridate's dpicoseconds, dnanoseconds, dseconds, dmilliseconds, dmicroseconds

2022-02-25 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim reassigned ARROW-14942:
---

Assignee: Alenka Frim

> [R] Bindings for lubridate's dpicoseconds, dnanoseconds, dseconds, 
> dmilliseconds, dmicroseconds
> ---
>
> Key: ARROW-14942
> URL: https://issues.apache.org/jira/browse/ARROW-14942
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Alenka Frim
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects

2022-02-25 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498093#comment-17498093
 ] 

Antoine Pitrou commented on ARROW-15765:


Someone could indeed experiment with the typing generics approach and see if it 
works.

> [Python] Extracting Type information from Python Objects
> 
>
> Key: ARROW-15765
> URL: https://issues.apache.org/jira/browse/ARROW-15765
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> When creating user-defined functions, or in similar exercises where we want to 
> extract the Arrow data types from type hints, the existing Python API 
> has some limitations. 
> An example case is as follows:
> {code:python}
> def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array:
>     return pc.call_function("add", [array1, array2])
> {code}
> We want to extract the fact that array1 is a `pa.Array` of `pa.Int64Type`. 
> At the moment there is no straightforward way to get this done, 
> so the idea is to expose this feature to Python. 
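For reference, Python's standard `typing.get_type_hints` can already read such annotations at runtime; the sketch below uses a placeholder class instead of the real `pa.Int64Array` so it stays self-contained, and the mapping from annotations onto Arrow DataType objects is the part the issue proposes to add:

```python
import typing

# Placeholder standing in for pa.Int64Array, so the sketch has no
# pyarrow dependency. The real proposal would map such annotations
# onto Arrow DataType objects.
class Int64Array:
    pass

def add_arrays(array1: Int64Array, array2: Int64Array) -> Int64Array:
    ...

# Read the UDF's annotations at runtime.
hints = typing.get_type_hints(add_arrays)
print(hints["array1"] is Int64Array)  # True
print(hints["return"] is Int64Array)  # True
```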



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15098) [R] Add binding for lubridate::duration() and/or as.difftime()

2022-02-25 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498021#comment-17498021
 ] 

Dragoș Moldovan-Grünfeld commented on ARROW-15098:
--

This ticket should probably add bindings for {{base::difftime()}} too. 

> [R] Add binding for lubridate::duration() and/or as.difftime()
> --
>
> Key: ARROW-15098
> URL: https://issues.apache.org/jira/browse/ARROW-15098
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After ARROW-14941 we have support for the duration type; however, there is no 
> binding for {{lubridate::duration()}} or {{as.difftime()}} available in dplyr 
> evaluation that could create these objects. I'm actually not sure if we 
> should bind {{lubridate::duration}} since it returns a custom S4 class that's 
> identical in function to base R's difftime.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-15098) [R] Add binding for lubridate::duration() and/or as.difftime()

2022-02-25 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498021#comment-17498021
 ] 

Dragoș Moldovan-Grünfeld edited comment on ARROW-15098 at 2/25/22, 10:12 AM:
-

This ticket should probably cover adding bindings for {{base::difftime()}} too. 


was (Author: dragosmg):
This ticket should probably add bindings for {{base::difftime()}} too. 

> [R] Add binding for lubridate::duration() and/or as.difftime()
> --
>
> Key: ARROW-15098
> URL: https://issues.apache.org/jira/browse/ARROW-15098
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After ARROW-14941 we have support for the duration type; however, there is no 
> binding for {{lubridate::duration()}} or {{as.difftime()}} available in dplyr 
> evaluation that could create these objects. I'm actually not sure if we 
> should bind {{lubridate::duration}} since it returns a custom S4 class that's 
> identical in function to base R's difftime.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-14820) [R] Implement bindings for lubridate calculation functions

2022-02-25 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-14820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498015#comment-17498015
 ] 

Dragoș Moldovan-Grünfeld edited comment on ARROW-14820 at 2/25/22, 10:07 AM:
-

Hi [~eitsupi],

Thanks for filing the ticket. Once [PR 
12433|https://github.com/apache/arrow/pull/12433] is merged, you will be able to 
use {{date()}} to extract the date component of a timestamp. We aim to cover 
{{as.Date()}} as part of the same pull request.

The code snippet below should work once we merge the PR. Please note that 
{{base::as.Date()}} by default assumes you want the date in UTC:
{code:r}
library(dplyr)
library(lubridate)
library(arrow)

df <- tibble::tibble(
  col1 = lubridate::as_datetime("2010-08-03 00:50:50", tz = "Europe/London"),
)

# lubridate
df %>% 
  mutate(x = as_date(col1),
         y = as.Date(col1))
#> # A tibble: 1 × 3
#>   col1                x          y         
#>   <dttm>              <date>     <date>    
#> 1 2010-08-03 00:50:50 2010-08-03 2010-08-02

# arrow
df %>% 
  arrow_table() %>% 
  mutate(x = date(col1),
         y = as.Date(col1)) %>% 
  collect()
#> # A tibble: 1 × 3
#>   col1                x          y         
#>   <dttm>              <date>     <date>    
#> 1 2010-08-03 00:50:50 2010-08-03 2010-08-02
{code}
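The UTC behaviour of `base::as.Date()` noted above can also be demonstrated with plain Python; the fixed +01:00 offset below is a stand-in for Europe/London summer time, chosen to keep the sketch dependency-free:

```python
from datetime import datetime, timezone, timedelta

# 2010-08-03 00:50:50 at UTC+1 (a stand-in for Europe/London in summer).
london_like = timezone(timedelta(hours=1))
ts = datetime(2010, 8, 3, 0, 50, 50, tzinfo=london_like)

# Keeping the local zone gives August 3rd...
print(ts.date())                           # 2010-08-03
# ...but converting to UTC first (what base::as.Date() assumes) gives
# August 2nd, because 00:50 +01:00 is 23:50 UTC the previous day.
print(ts.astimezone(timezone.utc).date())  # 2010-08-02
```

This is exactly why the `y` column in the R output above lands on 2010-08-02 while the local-zone date is 2010-08-03.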


was (Author: dragosmg):
Hi [~eitsupi],

Thanks for filing the ticket. Once [PR 
12433|https://github.com/apache/arrow/pull/12433] is merged, you will be able to 
use {{date()}} to extract the date component of a timestamp. {{as.Date()}} is 
covered by the same pull request.

The code snippet below should work once we merge the PR. Please note that 
{{base::as.Date()}} by default assumes you want the date in UTC:
{code:r}
library(dplyr)
library(lubridate)
library(arrow)

df <- tibble::tibble(
  col1 = lubridate::as_datetime("2010-08-03 00:50:50", tz = "Europe/London"),
)

# lubridate
df %>% 
  mutate(x = as_date(col1),
         y = as.Date(col1))
#> # A tibble: 1 × 3
#>   col1                x          y         
#>   <dttm>              <date>     <date>    
#> 1 2010-08-03 00:50:50 2010-08-03 2010-08-02

# arrow
df %>% 
  arrow_table() %>% 
  mutate(x = date(col1),
         y = as.Date(col1)) %>% 
  collect()
#> # A tibble: 1 × 3
#>   col1                x          y         
#>   <dttm>              <date>     <date>    
#> 1 2010-08-03 00:50:50 2010-08-03 2010-08-02
{code}

> [R] Implement bindings for lubridate calculation functions
> --
>
> Key: ARROW-14820
> URL: https://issues.apache.org/jira/browse/ARROW-14820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-14820) [R] Implement bindings for lubridate calculation functions

2022-02-25 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-14820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498015#comment-17498015
 ] 

Dragoș Moldovan-Grünfeld commented on ARROW-14820:
--

Hi [~eitsupi],

Thanks for filing the ticket. Once [PR 
12433|https://github.com/apache/arrow/pull/12433] is merged, you will be able to 
use {{date()}} to extract the date component of a timestamp. {{as.Date()}} is 
covered by the same pull request.

The code snippet below should work once we merge the PR. Please note that 
{{base::as.Date()}} by default assumes you want the date in UTC:
{code:r}
library(dplyr)
library(lubridate)
library(arrow)

df <- tibble::tibble(
  col1 = lubridate::as_datetime("2010-08-03 00:50:50", tz = "Europe/London"),
)

# lubridate
df %>% 
  mutate(x = as_date(col1),
         y = as.Date(col1))
#> # A tibble: 1 × 3
#>   col1                x          y         
#>   <dttm>              <date>     <date>    
#> 1 2010-08-03 00:50:50 2010-08-03 2010-08-02

# arrow
df %>% 
  arrow_table() %>% 
  mutate(x = date(col1),
         y = as.Date(col1)) %>% 
  collect()
#> # A tibble: 1 × 3
#>   col1                x          y         
#>   <dttm>              <date>     <date>    
#> 1 2010-08-03 00:50:50 2010-08-03 2010-08-02
{code}

> [R] Implement bindings for lubridate calculation functions
> --
>
> Key: ARROW-14820
> URL: https://issues.apache.org/jira/browse/ARROW-14820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-15645) [Flight][Java][C++] Data read through Flight is having endianness issue on s390x

2022-02-25 Thread Ravi Gummadi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498004#comment-17498004
 ] 

Ravi Gummadi edited comment on ARROW-15645 at 2/25/22, 9:42 AM:


Yes. Both server and client are on s390x.

Thanks for the details, [~apitrou]. I will watch 
[https://issues.apache.org/jira/projects/ARROW/issues/ARROW-15778] and test in 
my environment once a fix for ARROW-15778 is available.


was (Author: ravidotg):
Thanks for the details, [~apitrou]. I will watch 
[https://issues.apache.org/jira/projects/ARROW/issues/ARROW-15778] and test in 
my environment once a fix for ARROW-15778 is available.

> [Flight][Java][C++] Data read through Flight is having endianness issue on 
> s390x
> 
>
> Key: ARROW-15645
> URL: https://issues.apache.org/jira/browse/ARROW-15645
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC, Java
>Affects Versions: 5.0.0
> Environment: Linux s390x (big endian)
>Reporter: Ravi Gummadi
>Priority: Major
>
> I am facing an endianness issue on s390x (big endian) when converting the data 
> read through Flight to a pandas data frame.
> (1) table.validate() fails with error
> {code}
> Traceback (most recent call last):
>   File "/tmp/2.py", line 51, in <module>
>     table.validate()
>   File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in 
> binary array
> {code}
> (2) table.to_pandas() gives a segmentation fault
> 
> Here is a sample code that I am using:
> {code:python}
> from pyarrow import flight
> import os
> import json
> flight_endpoint = os.environ.get("flight_server_url", 
> "grpc+tls://...local:443")
> print(flight_endpoint)
> #
> class TokenClientAuthHandler(flight.ClientAuthHandler):
>     """An example implementation of authentication via handshake.
>        With the default constructor, the user token is read from the 
> environment: TokenClientAuthHandler().
>        You can also pass a user token as parameter to the constructor, 
> TokenClientAuthHandler(yourtoken).
>     """
>     def __init__(self, token: str = None):
>         super().__init__()
>         if token is not None:
>             strToken = 'Bearer {}'.format(token)
>         else:
>             strToken = 'Bearer {}'.format(os.environ.get("some_auth_token"))
>         self.token = strToken.encode('utf-8')
>         #print(self.token)
>     def authenticate(self, outgoing, incoming):
>         outgoing.write(self.token)
>         self.token = incoming.read()
>     def get_token(self):
>         return self.token
>     
> readClient = flight.FlightClient(flight_endpoint)
> readClient.authenticate(TokenClientAuthHandler())
> cmd = json.dumps({...})
> descriptor = flight.FlightDescriptor.for_command(cmd)
> flightInfo = readClient.get_flight_info(descriptor)
> reader = readClient.do_get(flightInfo.endpoints[0].ticket)
> table = reader.read_all()
> print(table)
> print(table.num_columns)
> print(table.num_rows)
> table.validate()
> table.to_pandas()
> {code}
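As a rough illustration of how a byte-order mix-up produces errors like "Negative offsets in binary array", the following Python sketch packs a small little-endian int32 offset and misreads it as big-endian; it demonstrates the class of bug, not Arrow's actual code path:

```python
import struct

# A small little-endian int32 offset...
offset_le = struct.pack("<i", 200)           # bytes: c8 00 00 00
# ...reinterpreted with the wrong (big-endian) byte order becomes
# 0xC8000000, which as a signed 32-bit integer is negative.
(misread,) = struct.unpack(">i", offset_le)
print(misread)  # -939524096: a nonsense, negative offset
```

A validity check such as `table.validate()` then sees these nonsense values and reports negative offsets, exactly as in the traceback above.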



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15645) [Flight][Java][C++] Data read through Flight is having endianness issue on s390x

2022-02-25 Thread Ravi Gummadi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498004#comment-17498004
 ] 

Ravi Gummadi commented on ARROW-15645:
--

Thanks for the details, [~apitrou]. I will watch 
[https://issues.apache.org/jira/projects/ARROW/issues/ARROW-15778] and test in 
my environment once a fix for ARROW-15778 is available.

> [Flight][Java][C++] Data read through Flight is having endianness issue on 
> s390x
> 
>
> Key: ARROW-15645
> URL: https://issues.apache.org/jira/browse/ARROW-15645
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC, Java
>Affects Versions: 5.0.0
> Environment: Linux s390x (big endian)
>Reporter: Ravi Gummadi
>Priority: Major
>
> I am facing an endianness issue on s390x (big endian) when converting the data 
> read through Flight to a pandas data frame.
> (1) table.validate() fails with error
> {code}
> Traceback (most recent call last):
>   File "/tmp/2.py", line 51, in <module>
>     table.validate()
>   File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in 
> binary array
> {code}
> (2) table.to_pandas() gives a segmentation fault
> 
> Here is a sample code that I am using:
> {code:python}
> from pyarrow import flight
> import os
> import json
> flight_endpoint = os.environ.get("flight_server_url", 
> "grpc+tls://...local:443")
> print(flight_endpoint)
> #
> class TokenClientAuthHandler(flight.ClientAuthHandler):
>     """An example implementation of authentication via handshake.
>        With the default constructor, the user token is read from the 
> environment: TokenClientAuthHandler().
>        You can also pass a user token as parameter to the constructor, 
> TokenClientAuthHandler(yourtoken).
>     """
>     def __init__(self, token: str = None):
>         super().__init__()
>         if token is not None:
>             strToken = 'Bearer {}'.format(token)
>         else:
>             strToken = 'Bearer {}'.format(os.environ.get("some_auth_token"))
>         self.token = strToken.encode('utf-8')
>         #print(self.token)
>     def authenticate(self, outgoing, incoming):
>         outgoing.write(self.token)
>         self.token = incoming.read()
>     def get_token(self):
>         return self.token
>     
> readClient = flight.FlightClient(flight_endpoint)
> readClient.authenticate(TokenClientAuthHandler())
> cmd = json.dumps({...})
> descriptor = flight.FlightDescriptor.for_command(cmd)
> flightInfo = readClient.get_flight_info(descriptor)
> reader = readClient.do_get(flightInfo.endpoints[0].ticket)
> table = reader.read_all()
> print(table)
> print(table.num_columns)
> print(table.num_rows)
> table.validate()
> table.to_pandas()
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-15760) [C++] Avoid hard dependency on git in cmake (download tarballs from github instead)

2022-02-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-15760:
--
Fix Version/s: 8.0.0

> [C++] Avoid hard dependency on git in cmake (download tarballs from github 
> instead)
> ---
>
> Key: ARROW-15760
> URL: https://issues.apache.org/jira/browse/ARROW-15760
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Jeroen van Straten
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> See https://github.com/apache/arrow/pull/12322#issuecomment-1048523391



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-15760) [C++] Avoid hard dependency on git in cmake (download tarballs from github instead)

2022-02-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-15760:
-

Assignee: Jeroen van Straten  (was: Matthijs Brobbel)

> [C++] Avoid hard dependency on git in cmake (download tarballs from github 
> instead)
> ---
>
> Key: ARROW-15760
> URL: https://issues.apache.org/jira/browse/ARROW-15760
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Jeroen van Straten
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> See https://github.com/apache/arrow/pull/12322#issuecomment-1048523391



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15760) [C++] Avoid hard dependency on git in cmake (download tarballs from github instead)

2022-02-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-15760.
---
Resolution: Fixed

> [C++] Avoid hard dependency on git in cmake (download tarballs from github 
> instead)
> ---
>
> Key: ARROW-15760
> URL: https://issues.apache.org/jira/browse/ARROW-15760
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Jeroen van Straten
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> See https://github.com/apache/arrow/pull/12322#issuecomment-1048523391



--
This message was sent by Atlassian Jira
(v8.20.1#820001)