[jira] [Assigned] (ARROW-16848) Update ORC to 1.7.5

2022-06-16 Thread William Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Hyun reassigned ARROW-16848:


Assignee: William Hyun

> Update ORC to 1.7.5
> ---
>
> Key: ARROW-16848
> URL: https://issues.apache.org/jira/browse/ARROW-16848
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Affects Versions: 9.0.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16848) Update ORC to 1.7.5

2022-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16848:
---
Labels: pull-request-available  (was: )

> Update ORC to 1.7.5
> ---
>
> Key: ARROW-16848
> URL: https://issues.apache.org/jira/browse/ARROW-16848
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Affects Versions: 9.0.0
>Reporter: William Hyun
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16848) Update ORC to 1.7.5

2022-06-16 Thread William Hyun (Jira)
William Hyun created ARROW-16848:


 Summary: Update ORC to 1.7.5
 Key: ARROW-16848
 URL: https://issues.apache.org/jira/browse/ARROW-16848
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Java
Affects Versions: 9.0.0
Reporter: William Hyun






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16848) Update ORC to 1.7.5

2022-06-16 Thread William Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Hyun updated ARROW-16848:
-
Issue Type: Bug  (was: Improvement)

> Update ORC to 1.7.5
> ---
>
> Key: ARROW-16848
> URL: https://issues.apache.org/jira/browse/ARROW-16848
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Affects Versions: 9.0.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16847) [C++] Rename or fix compute/kernels/aggregate_{mode, quantile}.cc modules to actually be aggregate functions

2022-06-16 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-16847:


 Summary: [C++] Rename or fix compute/kernels/aggregate_{mode, 
quantile}.cc modules to actually be aggregate functions
 Key: ARROW-16847
 URL: https://issues.apache.org/jira/browse/ARROW-16847
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 9.0.0


These modules import VectorFunctions even though their file names state 
otherwise. Either they should implement aggregate functions, or the files should 
be renamed to indicate that they are vector functions.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-13388) [C++][Parquet] Add DELTA_LENGTH_BYTE_ARRAY

2022-06-16 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-13388:


Assignee: Muthunagappan

> [C++][Parquet] Add DELTA_LENGTH_BYTE_ARRAY
> --
>
> Key: ARROW-13388
> URL: https://issues.apache.org/jira/browse/ARROW-13388
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Parquet
>Affects Versions: 4.0.1
>Reporter: Jorge Leitão
>Assignee: Muthunagappan
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> While trying to read a utf8 column with DELTA_LENGTH_BYTE_ARRAY encoding, 
> pyarrow yields
> `OSError: Not yet implemented: Unsupported encoding.`
> It would be nice to have support for this encoding, since it lends itself 
> really well to the Arrow format: the values' encoding is equal to 
> Arrow's representation.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-12099) [Python] Explode array column

2022-06-16 Thread Nick Crews (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555329#comment-17555329
 ] 

Nick Crews edited comment on ARROW-12099 at 6/17/22 12:13 AM:
--

Small tweak to Guido's implementation (thank you for this!): If the table only 
has the one ListArray or MapArray column, then it crashes.

This handles that case:
{code:python}
import pyarrow as pa
import pyarrow.compute as pc
def explode_table(table, column):
    null_filled = pc.fill_null(table[column], [None])
    flattened = pc.list_flatten(null_filled)
    other_columns = list(table.schema.names)
    other_columns.remove(column)
    if len(other_columns) == 0:
        return pa.table({column: flattened})
    else:
        indices = pc.list_parent_indices(null_filled)
        result = table.select(other_columns).take(indices)
        result = result.append_column(
            pa.field(column, table.schema.field(column).type.value_type),
            flattened,
        )
        return result
{code}
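
For illustration, a minimal usage sketch of the {{explode_table}} helper above on a small made-up table (untested; the expected output is shown as a comment):
{code:python}
import pyarrow as pa

# Made-up example: one ordinary column and one list column with a null row.
table = pa.table({
    "id": [1, 2, 3],
    "values": [[1, 2], None, [3]],
})

exploded = explode_table(table, "values")
print(exploded.to_pydict())
# Expected (roughly): {'id': [1, 1, 2, 3], 'values': [1, 2, None, 3]}
{code}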


was (Author: JIRAUSER291113):
Small tweak to Guido's implementation (thank you for this!): If the table only 
has the one ListArray or MapArray column, then it crashes.

This handles that case:
{code:python}
import pyarrow as pa
import pyarrow.compute as pc

def explode_table(table, column):
    null_filled = pc.fill_null(table[column], [None])
    flattened = pc.list_flatten(null_filled)
    other_columns = list(table.schema.names)
    other_columns.remove(column)
    if len(other_columns) == 0:
        return pa.table({column: flattened})
    else:
        indices = pc.list_parent_indices(null_filled)
        result = table.select(other_columns).take(indices)
        result = result.append_column(
            pa.field(column, table.schema.field(column).type.value_type),
            flattened,
        )
        return result
{code}

> [Python] Explode array column
> -
>
> Key: ARROW-12099
> URL: https://issues.apache.org/jira/browse/ARROW-12099
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Malthe Borch
>Priority: Major
>
> In Apache Spark, 
> [explode|https://spark.apache.org/docs/latest/api/sql/index.html#explode] 
> separates the elements of an array column (or expression) into multiple rows.
> Note that each explode works at the top-level only (not recursively).
> This would also work with the existing 
> [flatten|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.flatten]
>  method to allow fully unnesting a 
> [pyarrow.StructArray|https://arrow.apache.org/docs/python/generated/pyarrow.StructArray.html#pyarrow-structarray].



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-12099) [Python] Explode array column

2022-06-16 Thread Nick Crews (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555329#comment-17555329
 ] 

Nick Crews commented on ARROW-12099:


Small tweak to Guido's implementation (thank you for this!): If the table only 
has the one ListArray or MapArray column, then it crashes.

This handles that case:
{code:python}
import pyarrow as pa
import pyarrow.compute as pc

def explode_table(table, column):
    null_filled = pc.fill_null(table[column], [None])
    flattened = pc.list_flatten(null_filled)
    other_columns = list(table.schema.names)
    other_columns.remove(column)
    if len(other_columns) == 0:
        return pa.table({column: flattened})
    else:
        indices = pc.list_parent_indices(null_filled)
        result = table.select(other_columns).take(indices)
        result = result.append_column(
            pa.field(column, table.schema.field(column).type.value_type),
            flattened,
        )
        return result
{code}

> [Python] Explode array column
> -
>
> Key: ARROW-12099
> URL: https://issues.apache.org/jira/browse/ARROW-12099
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Malthe Borch
>Priority: Major
>
> In Apache Spark, 
> [explode|https://spark.apache.org/docs/latest/api/sql/index.html#explode] 
> separates the elements of an array column (or expression) into multiple rows.
> Note that each explode works at the top-level only (not recursively).
> This would also work with the existing 
> [flatten|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.flatten]
>  method to allow fully unnesting a 
> [pyarrow.StructArray|https://arrow.apache.org/docs/python/generated/pyarrow.StructArray.html#pyarrow-structarray].



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16713) [C++] Pull join accumulation outside of HashJoinImpl

2022-06-16 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-16713.
-
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13332
[https://github.com/apache/arrow/pull/13332]

> [C++] Pull join accumulation outside of HashJoinImpl
> 
>
> Key: ARROW-16713
> URL: https://issues.apache.org/jira/browse/ARROW-16713
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Sasha Krassovsky
>Assignee: Sasha Krassovsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> This is part of the preparatory refactoring for spilling (ARROW-16389)
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16846) [Rust] Write blog post with Rust release highlights

2022-06-16 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-16846.
-
Resolution: Fixed

> [Rust] Write blog post with Rust release highlights
> ---
>
> Key: ARROW-16846
> URL: https://issues.apache.org/jira/browse/ARROW-16846
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Website
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See details here 
> https://github.com/apache/arrow-rs/issues/1808
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16846) [Rust] Write blog post with Rust release highlights

2022-06-16 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555295#comment-17555295
 ] 

Andrew Lamb commented on ARROW-16846:
-

Closed in https://github.com/apache/arrow-site/pull/220

> [Rust] Write blog post with Rust release highlights
> ---
>
> Key: ARROW-16846
> URL: https://issues.apache.org/jira/browse/ARROW-16846
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Website
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See details here 
> https://github.com/apache/arrow-rs/issues/1808
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16846) [Rust] Write blog post with Rust release highlights

2022-06-16 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-16846:
---

 Summary: [Rust] Write blog post with Rust release highlights
 Key: ARROW-16846
 URL: https://issues.apache.org/jira/browse/ARROW-16846
 Project: Apache Arrow
  Issue Type: Task
  Components: Website
Reporter: Andrew Lamb
Assignee: Andrew Lamb


See details here 

https://github.com/apache/arrow-rs/issues/1808

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16845) [C++] ArraySpan::IsNull/IsValid implementations are incorrect for union types

2022-06-16 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-16845:


 Summary: [C++] ArraySpan::IsNull/IsValid implementations are 
incorrect for union types
 Key: ARROW-16845
 URL: https://issues.apache.org/jira/browse/ARROW-16845
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 9.0.0


This is because the first buffer of a union array is not a validity bitmap. 
Follow-up work from ARROW-16756.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16770) [C++] Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0

2022-06-16 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555283#comment-17555283
 ] 

Yaron Gvili commented on ARROW-16770:
-

To make sure we're on the same page, my interpretation of this issue is that it 
shows there exists a non-contrived setup (simply whatever I happen to have been 
working with for some time now) where a normal Arrow test invocation leads to a 
not-so-trivial SIGSEGV. My intentions in filing this issue are to save time for 
whoever may run into this in the future and to try to find someone who knows 
how to fix it.

[~lidavidm], [~westonpace]: yes, as noted in the description, there is a mixup 
of GTest 1.10 and 1.11 in my setup (I think one was installed by pyarrow-dev 
and the other by Ubuntu's apt), yet the point is that I didn't do anything 
contrived to reach this setup, so it could happen to others.

Since this issue is currently not a blocker for me and I'm occupied with a 
large project, I'm fine with it staying on hold for the time being.

> [C++] Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0
> ---
>
> Key: ARROW-16770
> URL: https://issues.apache.org/jira/browse/ARROW-16770
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Priority: Major
>
> I built Arrow using the instructions in the Python development page, under 
> the pyarrow-dev environment, and found that `arrow-substrait-substrait-test` 
> fails with SIGSEGV - see gdb session below. The same Arrow builds and runs 
> correctly on my system, outside of pyarrow-dev. I suspect this is due to 
> something different about gtest 1.11.0 as compared to gtest 1.10.0 based on 
> the following observations:
>  # The backtrace in the gdb session shows gtest 1.11.0 is used.
>  # The backtrace also shows the error is deep inside gtest, working on an 
> `UnorderedElementsAre` expectation.
>  # My system, outside pyarrow-dev, uses gtest 1.10.0.
>  
> {noformat}
> $ gdb --args ./release/arrow-substrait-substrait-test 
> GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
> Copyright (C) 2020 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
>     .
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from ./release/arrow-substrait-substrait-test...
> (gdb) run
> Starting program: 
> /mnt/user1/tscontract/github/rtpsw/arrow/cpp/build/debug/release/arrow-substrait-substrait-test
>  
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> [New Thread 0x741ff700 (LWP 115128)]
> Running main() from 
> /home/conda/feedstock_root/build_artifacts/gtest_1647154636757/work/googletest/src/gtest_main.cc
> [==] Running 33 tests from 3 test suites.
> [--] Global test environment set-up.
> [--] 4 tests from ExtensionIdRegistryTest
> [ RUN      ] ExtensionIdRegistryTest.RegisterTempTypes
> [       OK ] ExtensionIdRegistryTest.RegisterTempTypes (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterTempFunctions
> [       OK ] ExtensionIdRegistryTest.RegisterTempFunctions (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterNestedTypes
> [       OK ] ExtensionIdRegistryTest.RegisterNestedTypes (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterNestedFunctions
> [       OK ] ExtensionIdRegistryTest.RegisterNestedFunctions (0 ms)
> [--] 4 tests from ExtensionIdRegistryTest (0 ms total)
> [--] 21 tests from Substrait
> [ RUN      ] Substrait.SupportedTypes
> [       OK ] Substrait.SupportedTypes (0 ms)
> [ RUN      ] Substrait.SupportedExtensionTypes
> [       OK ] Substrait.SupportedExtensionTypes (0 ms)
> [ RUN      ] Substrait.NamedStruct
> [       OK ] Substrait.NamedStruct (0 ms)
> [ RUN      ] Substrait.NoEquivalentArrowType
> [       OK ] Substrait.NoEquivalentArrowType (0 ms)
> [ RUN      ] Substrait.NoEquivalentSubstraitType
> [       OK ] Substrait.NoEquivalentSubstraitType (0 ms)
> [ RUN      ] Substrait.SupportedLiterals
> [       OK ] Substrait.SupportedLiterals (1 ms)
> [ RUN      ] Substrait.CannotDeserializeLiteral
> [       OK ] Substrait.CannotDeserializeLiteral (0 ms)
> [ RUN      ] Substrait.FieldRefRoundTrip
> [       OK ] 

[jira] [Commented] (ARROW-16822) Python Error: <>, exitCode: <139> when csv file converting parquet using pandas/pyarrow libraries

2022-06-16 Thread Mahesha Subrahamanya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555276#comment-17555276
 ] 

Mahesha Subrahamanya commented on ARROW-16822:
--

We did try with a traceback and an exception handler, but nothing worked here.

Pandas hands the chunk data over to pyarrow, which is responsible for 
converting it into a Parquet file. We suspect the failure happens during that 
conversion, but it does not surface the right error code or message, hence we 
need your help. Any pointers would be really appreciated. Since we ran into 
this issue, we couldn't deliver this project, as it depends on Python libraries 
like pandas/pyarrow.

> Python Error: <>, exitCode: <139> when csv file converting parquet using 
> pandas/pyarrow libraries
> -
>
> Key: ARROW-16822
> URL: https://issues.apache.org/jira/browse/ARROW-16822
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Mahesha Subrahamanya
>Priority: Blocker
> Attachments: convertCSV2Parquet.png
>
>
> Our main requirement is to read source files (structured, semi-structured, or 
> unstructured) residing in AWS S3 through the AWS Redshift database, so that 
> our customers have direct access to analyze the data quickly and seamlessly 
> for reporting purposes, without having to define schema info for each file.
> We have created a data lake (AWS S3) workspace where our customers dump huge 
> CSV/Parquet files (around 10-15 GB). We have developed a framework that uses 
> the pandas/pyarrow (Parquet) libraries to read source files in a chunked 
> manner, identify schema information (data type/length), and push it to AWS 
> Glue, so that the AWS Redshift database can talk to the S3 files seamlessly 
> and read them very quickly.
>  
> Following is the snippet of the Parquet conversion where I'm getting this 
> error. Please take a look.
>  
> read_csv_args = \{'filepath_or_buffer': src_object, 'chunksize': 
> self.chunkSizeLimit, 'encoding': 'UTF-8','on_bad_lines': 'error','sep': 
> fileDelimiter, 'low_memory': False, 'skip_blank_lines': True, 'memory_map': 
> True} # 'verbose': True , In order to enable memory consumption logging
>             
> if srcPath.endswith('.gz'):
>                 read_csv_args['compression'] = 'gzip'
>             if fileTextQualifier:
>                 read_csv_args['quotechar'] = fileTextQualifier
> with pd.read_csv(**read_csv_args) as reader:
>                 for chunk_number, chunk in enumerate(reader, 1):
>                     # To support shape-shifting for the incoming datafiles, 
> need to make sure match file with number of columns if not delete
>                     if glueMasterSchema is not None:
>                         sessionSchema=copy.deepcopy(glueMasterSchema) 
> #copying using deepcopy() method
>                         chunk.columns = chunk.columns.str.lower() # modifying 
> the column header of all columns to lowercase
>                         fileSchema = list(chunk.columns)
>                         for key in list(sessionSchema):
>                             if key not in fileSchema:
>                                 del sessionSchema[key]
>                         fields = []
>                         for col,dtypes in sessionSchema.items():
>                             fields.append(pa.field(col, dtypes))
>                         glue_schema = pa.schema(fields)
>                         # To identify the boolean datatype and convert back 
> to STRING which was done during the BF schema
>                         for cols in chunk.columns:
>                             try:
>                                 if chunk[cols].dtype =='bool':
>                                     chunk[cols] = chunk[cols].astype('str')
>                                 if chunk[cols].dtype =='object':
>                                     chunk[cols] = 
> chunk[cols].fillna('').astype('str').tolist()
>                             except (ParserError,ValueError,TypeError):
>                                 pass
>                     log.debug("chunk count", chunk_number, "chunk length", 
> len(chunk), 'glue_schema', glue_schema, 'Wrote file', targetKey)
>                     #log.debug("during pandas chunk data ", chunk,"df 
> schemas:", chunk.dtypes)
>                     table = pa.Table.from_pandas(chunk,  schema=glue_schema , 
> preserve_index=False)
>                     log.info('Glue schema:',glue_schema,'for a 
> [file:',targetKey|file:///',targetKey])
>                     log.info('pandas memory utilization during chunk process: 
> ', chunk.memory_usage().sum(), 'Bytes.','\n\n\n')
>                     # Guess the schema of the CSV file from the 

[jira] [Commented] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source

2022-06-16 Thread Earle Lyons (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555275#comment-17555275
 ] 

Earle Lyons commented on ARROW-16810:
-

Hi [~westonpace]! 

Good day to you! Thanks so much for the response and very helpful information. 
If I recall correctly, I tried to output the files to a new subdirectory (i.e. 
'/home/user/csv_files/pq_files') and the parquet file was discovered, but I did 
not try a new directory (i.e. '/home/user/data/pq_files'). 

I agree, passing a list of files is probably the best method given the 
available options. To your point, there are benefits and flexibility with 
including/excluding files using a list.

In the future, it would be wonderful if paths with wildcards and supported 
format extensions (i.e. /*.csv) could be handled in the dataset.dataset 
'source' parameter.

Thanks again!  :)

> [Python] PyArrow: write_dataset - Could not open CSV input source
> -
>
> Key: ARROW-16810
> URL: https://issues.apache.org/jira/browse/ARROW-16810
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
> Environment: Pop!_OS 20.04 LTS OS & Conda 4.11.0 /Mamba 0.23.0 
> Environment
>Reporter: Earle Lyons
>Priority: Minor
>
> Hi Arrow Community! 
> Happy Friday! I am a new user to Arrow, specifically using pyarrow. However, 
> I am very excited about the project. 
> I am experiencing issues with the '{*}write_dataset'{*} function from the 
> '{*}dataset{*}' module. Please forgive me, if this is a known issue. However, 
> I have searched the GitHub 'Issues', as well as Stack Overflow and I have not 
> identified a similar issue. 
> I have a directory that contains 90 CSV files (essentially one CSV for each 
> day between 2021-01-01 and 2021-03-31).  My objective was to read all the CSV 
> files into a dataset and write the dataset to a single Parquet file format. 
> Unfortunately, some of the CSV files contained nulls in some columns, which 
> presented some issues which were resolved by specifying DataTypes with the 
> following Stack Overflow solution:
> [How do I specify a dtype for all columns when reading a CSV file with 
> pyarrow?|https://stackoverflow.com/questions/71533197/how-do-i-specify-a-dtype-for-all-columns-when-reading-a-csv-file-with-pyarrow]
> The following code works on the first pass.
> {code:python}
> import pyarrow as pa
> import pyarrow.csv as csv
> import pyarrow.dataset as ds
> import re
> {code}
> {code:python}
> pa.__version__
> '8.0.0'
> {code}
> {code:python}
> column_types = {}
> csv_path = '/home/user/csv_files'
> field_re_pattern = "value_*"
> # Open a dataset with the 'csv_path' path and 'csv' file format
> # and assign to 'dataset1'
> dataset1 = ds.dataset(csv_path, format='csv')
> # Loop through each field in the 'dataset1' schema,
> # match the 'field_re_pattern' regex pattern in the field name,
> # and assign 'int64' DataType to the field.name in the 'column_types'
> # dictionary 
> for field in (field for field in dataset1.schema \
>               if re.match(field_re_pattern, field.name)):
>         column_types[field.name] = pa.int64()
> # Creates options for CSV data using the 'column_types' dictionary
> # This returns a 
> convert_options = csv.ConvertOptions(column_types=column_types)
> # Creates FileFormat for CSV using the 'convert_options' 
> # This returns a 
> custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)
> # Open a dataset with the 'csv_path' path, instead of using the 
> # 'csv' file format, use the 'custom_csv_format' and assign to 
> # 'dataset2'
> dataset2 = ds.dataset(csv_path, format=custom_csv_format)
> # Write the 'dataset2' to the 'csv_path' base directory in the 
> # 'parquet' format, and overwrite/ignore if the file exists
> ds.write_dataset(dataset2, base_dir=csv_path, format='parquet', 
> existing_data_behavior='overwrite_or_ignore')
> {code}
> As previously stated, on first pass, the code works and creates a single 
> parquet file (part-0.parquet) with the correct data, row count, and schema.
> However, if the code is run again, the following error is encountered:
> {code:python}
> ArrowInvalid: Could not open CSV input source 
> '/home/user/csv_files/part-0.parquet': Invalid: CSV parse error: Row #2: 
> Expected 4 columns, got 1: 6NQJRJV02XW$0Y8V p   A$A18CEBS
> 305DEM030TTW �5HZ50GCVJV1CSV
> {code}
> My interpretation of the error is that on the second pass the 'dataset2' 
> variable now includes the 'part-0.parquet' file (which can be confirmed with 
> the `dataset2.files` output showing the file) and the CSV reader is 
> attempting to parse/read the parquet file.
> If this is the case, is there an argument to ignore the parquet file and only 
> evaluate the CSV files? Also, 

[jira] [Updated] (ARROW-16822) Python Error: <>, exitCode: <139> when csv file converting parquet using pandas/pyarrow libraries

2022-06-16 Thread Mahesha Subrahamanya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahesha Subrahamanya updated ARROW-16822:
-
Attachment: convertCSV2Parquet.png

> Python Error: <>, exitCode: <139> when csv file converting parquet using 
> pandas/pyarrow libraries
> -
>
> Key: ARROW-16822
> URL: https://issues.apache.org/jira/browse/ARROW-16822
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Mahesha Subrahamanya
>Priority: Blocker
> Attachments: convertCSV2Parquet.png
>
>
> Our main requirement is to read source files (structured, semi-structured, or 
> unstructured) residing in AWS S3 through the AWS Redshift database, so that 
> our customers have direct access to analyze the data quickly and seamlessly 
> for reporting purposes, without having to define schema info for each file.
> We have created a data lake (AWS S3) workspace where our customers dump huge 
> CSV/Parquet files (around 10-15 GB). We have developed a framework that uses 
> the pandas/pyarrow (Parquet) libraries to read source files in a chunked 
> manner, identify schema information (data type/length), and push it to AWS 
> Glue, so that the AWS Redshift database can talk to the S3 files seamlessly 
> and read them very quickly.
>  
> Following is the snippet of the Parquet conversion where I'm getting this 
> error. Please take a look.
>  
> read_csv_args = \{'filepath_or_buffer': src_object, 'chunksize': 
> self.chunkSizeLimit, 'encoding': 'UTF-8','on_bad_lines': 'error','sep': 
> fileDelimiter, 'low_memory': False, 'skip_blank_lines': True, 'memory_map': 
> True} # 'verbose': True , In order to enable memory consumption logging
>             
> if srcPath.endswith('.gz'):
>                 read_csv_args['compression'] = 'gzip'
>             if fileTextQualifier:
>                 read_csv_args['quotechar'] = fileTextQualifier
> with pd.read_csv(**read_csv_args) as reader:
>                 for chunk_number, chunk in enumerate(reader, 1):
>                     # To support shape-shifting for the incoming datafiles, 
> need to make sure match file with number of columns if not delete
>                     if glueMasterSchema is not None:
>                         sessionSchema=copy.deepcopy(glueMasterSchema) 
> #copying using deepcopy() method
>                         chunk.columns = chunk.columns.str.lower() # modifying 
> the column header of all columns to lowercase
>                         fileSchema = list(chunk.columns)
>                         for key in list(sessionSchema):
>                             if key not in fileSchema:
>                                 del sessionSchema[key]
>                         fields = []
>                         for col,dtypes in sessionSchema.items():
>                             fields.append(pa.field(col, dtypes))
>                         glue_schema = pa.schema(fields)
>                         # To identify the boolean datatype and convert back 
> to STRING which was done during the BF schema
>                         for cols in chunk.columns:
>                             try:
>                                 if chunk[cols].dtype =='bool':
>                                     chunk[cols] = chunk[cols].astype('str')
>                                 if chunk[cols].dtype =='object':
>                                     chunk[cols] = 
> chunk[cols].fillna('').astype('str').tolist()
>                             except (ParserError,ValueError,TypeError):
>                                 pass
>                     log.debug("chunk count", chunk_number, "chunk length", 
> len(chunk), 'glue_schema', glue_schema, 'Wrote file', targetKey)
>                     #log.debug("during pandas chunk data ", chunk,"df 
> schemas:", chunk.dtypes)
>                     table = pa.Table.from_pandas(chunk,  schema=glue_schema , 
> preserve_index=False)
>                     log.info('Glue schema:',glue_schema,'for a 
> [file:',targetKey|file:///',targetKey])
>                     log.info('pandas memory utilization during chunk process: 
> ', chunk.memory_usage().sum(), 'Bytes.','\n\n\n')
>                     # Guess the schema of the CSV file from the first chunk
>                     #if pq_writer is None:
>                     if chunk_number == 1:
>                         #parquet_schema = table.schema
>                         # Open a Parquet file for writing
>                         pq_writer = pq.ParquetWriter(targetKey, 
> schema=glue_schema, compression='snappy') # In PyArrow we use, Snappy 
> generally results in better performance
>                         log.debug("table schema :", 
> pprint.pformat(table.schema).replace('\n', ',').replace('\r', ','),' for:', 

[jira] [Comment Edited] (ARROW-16822) Python Error: <>, exitCode: <139> when csv file converting parquet using pandas/pyarrow libraries

2022-06-16 Thread Mahesha Subrahamanya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555274#comment-17555274
 ] 

Mahesha Subrahamanya edited comment on ARROW-16822 at 6/16/22 7:54 PM:
---

!convertCSV2Parquet.png|width=602,height=354!


was (Author: JIRAUSER290869):
!convertCSV2Parquet.png!

> Python Error: <>, exitCode: <139> when csv file converting parquet using 
> pandas/pyarrow libraries
> -
>
> Key: ARROW-16822
> URL: https://issues.apache.org/jira/browse/ARROW-16822
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Mahesha Subrahamanya
>Priority: Blocker
> Attachments: convertCSV2Parquet.png
>
>
> Our main requirement is to read source files (structured, semi-structured, or 
> unstructured) residing in AWS S3 through the AWS Redshift database, so that 
> our customers have direct access to analyze the data quickly and seamlessly 
> for reporting purposes, without having to define schema info for each file.
> We have created a data lake (AWS S3) workspace where our customers dump huge 
> CSV/Parquet files (around 10-15 GB). We have developed a framework that uses 
> the pandas/pyarrow (Parquet) libraries to read source files in a chunked 
> manner, identify schema information (data type/length), and push it to AWS 
> Glue, so that the AWS Redshift database can talk to the S3 files seamlessly 
> and read them very quickly.
>  
> Following is the snippet of the Parquet conversion where I'm getting this 
> error. Please take a look.
>  
> read_csv_args = \{'filepath_or_buffer': src_object, 'chunksize': 
> self.chunkSizeLimit, 'encoding': 'UTF-8','on_bad_lines': 'error','sep': 
> fileDelimiter, 'low_memory': False, 'skip_blank_lines': True, 'memory_map': 
> True} # 'verbose': True , In order to enable memory consumption logging
>             
> if srcPath.endswith('.gz'):
>                 read_csv_args['compression'] = 'gzip'
>             if fileTextQualifier:
>                 read_csv_args['quotechar'] = fileTextQualifier
> with pd.read_csv(**read_csv_args) as reader:
>                 for chunk_number, chunk in enumerate(reader, 1):
>                     # To support shape-shifting for the incoming datafiles, 
> need to make sure match file with number of columns if not delete
>                     if glueMasterSchema is not None:
>                         sessionSchema=copy.deepcopy(glueMasterSchema) 
> #copying using deepcopy() method
>                         chunk.columns = chunk.columns.str.lower() # modifying 
> the column header of all columns to lowercase
>                         fileSchema = list(chunk.columns)
>                         for key in list(sessionSchema):
>                             if key not in fileSchema:
>                                 del sessionSchema[key]
>                         fields = []
>                         for col,dtypes in sessionSchema.items():
>                             fields.append(pa.field(col, dtypes))
>                         glue_schema = pa.schema(fields)
>                         # To identify the boolean datatype and convert back 
> to STRING which was done during the BF schema
>                         for cols in chunk.columns:
>                             try:
>                                 if chunk[cols].dtype =='bool':
>                                     chunk[cols] = chunk[cols].astype('str')
>                                 if chunk[cols].dtype =='object':
>                                     chunk[cols] = 
> chunk[cols].fillna('').astype('str').tolist()
>                             except (ParserError,ValueError,TypeError):
>                                 pass
>                     log.debug("chunk count", chunk_number, "chunk length", 
> len(chunk), 'glue_schema', glue_schema, 'Wrote file', targetKey)
>                     #log.debug("during pandas chunk data ", chunk,"df 
> schemas:", chunk.dtypes)
>                     table = pa.Table.from_pandas(chunk,  schema=glue_schema , 
> preserve_index=False)
>                     log.info('Glue schema:',glue_schema,'for a 
> [file:',targetKey|file:///',targetKey])
>                     log.info('pandas memory utilization during chunk process: 
> ', chunk.memory_usage().sum(), 'Bytes.','\n\n\n')
>                     # Guess the schema of the CSV file from the first chunk
>                     #if pq_writer is None:
>                     if chunk_number == 1:
>                         #parquet_schema = table.schema
>                         # Open a Parquet file for writing
>                         pq_writer = pq.ParquetWriter(targetKey, 
> schema=glue_schema, compression='snappy') # In PyArrow we use, Snappy 
> generally results 

[jira] [Commented] (ARROW-16822) Python Error: <>, exitCode: <139> when csv file converting parquet using pandas/pyarrow libraries

2022-06-16 Thread Mahesha Subrahamanya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555274#comment-17555274
 ] 

Mahesha Subrahamanya commented on ARROW-16822:
--

!convertCSV2Parquet.png!

> Python Error: <>, exitCode: <139> when csv file converting parquet using 
> pandas/pyarrow libraries
> -
>
> Key: ARROW-16822
> URL: https://issues.apache.org/jira/browse/ARROW-16822
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Mahesha Subrahamanya
>Priority: Blocker
> Attachments: convertCSV2Parquet.png
>
>
> Our main requirement is to read source files (structured, semi-structured, or 
> unstructured) residing in AWS S3 through the AWS Redshift database, so that 
> our customers have direct access to analyze the data quickly and seamlessly 
> for reporting purposes, without having to define schema info for each file.
> We have created a data lake (AWS S3) workspace where our customers dump huge 
> CSV/Parquet files (around 10-15 GB). We have developed a framework that uses 
> the pandas/pyarrow (Parquet) libraries to read source files in a chunked 
> manner, identify schema information (data type/length), and push it to AWS 
> Glue, so that the AWS Redshift database can talk to the S3 files seamlessly 
> and read them very quickly.
>  
> Following is the snippet of the Parquet conversion where I'm getting this 
> error. Please take a look.
>  
> read_csv_args = \{'filepath_or_buffer': src_object, 'chunksize': 
> self.chunkSizeLimit, 'encoding': 'UTF-8','on_bad_lines': 'error','sep': 
> fileDelimiter, 'low_memory': False, 'skip_blank_lines': True, 'memory_map': 
> True} # 'verbose': True , In order to enable memory consumption logging
>             
> if srcPath.endswith('.gz'):
>                 read_csv_args['compression'] = 'gzip'
>             if fileTextQualifier:
>                 read_csv_args['quotechar'] = fileTextQualifier
> with pd.read_csv(**read_csv_args) as reader:
>                 for chunk_number, chunk in enumerate(reader, 1):
>                     # To support shape-shifting for the incoming datafiles, 
> need to make sure match file with number of columns if not delete
>                     if glueMasterSchema is not None:
>                         sessionSchema=copy.deepcopy(glueMasterSchema) 
> #copying using deepcopy() method
>                         chunk.columns = chunk.columns.str.lower() # modifying 
> the column header of all columns to lowercase
>                         fileSchema = list(chunk.columns)
>                         for key in list(sessionSchema):
>                             if key not in fileSchema:
>                                 del sessionSchema[key]
>                         fields = []
>                         for col,dtypes in sessionSchema.items():
>                             fields.append(pa.field(col, dtypes))
>                         glue_schema = pa.schema(fields)
>                         # To identify the boolean datatype and convert back 
> to STRING which was done during the BF schema
>                         for cols in chunk.columns:
>                             try:
>                                 if chunk[cols].dtype =='bool':
>                                     chunk[cols] = chunk[cols].astype('str')
>                                 if chunk[cols].dtype =='object':
>                                     chunk[cols] = 
> chunk[cols].fillna('').astype('str').tolist()
>                             except (ParserError,ValueError,TypeError):
>                                 pass
>                     log.debug("chunk count", chunk_number, "chunk length", 
> len(chunk), 'glue_schema', glue_schema, 'Wrote file', targetKey)
>                     #log.debug("during pandas chunk data ", chunk,"df 
> schemas:", chunk.dtypes)
>                     table = pa.Table.from_pandas(chunk,  schema=glue_schema , 
> preserve_index=False)
>                     log.info('Glue schema:',glue_schema,'for a 
> [file:',targetKey|file:///',targetKey])
>                     log.info('pandas memory utilization during chunk process: 
> ', chunk.memory_usage().sum(), 'Bytes.','\n\n\n')
>                     # Guess the schema of the CSV file from the first chunk
>                     #if pq_writer is None:
>                     if chunk_number == 1:
>                         #parquet_schema = table.schema
>                         # Open a Parquet file for writing
>                         pq_writer = pq.ParquetWriter(targetKey, 
> schema=glue_schema, compression='snappy') # In PyArrow we use, Snappy 
> generally results in better performance
>                         log.debug("table schema :", 
> pprint.pformat(table.schema).replace('\n', 

[jira] [Commented] (ARROW-16822) Python Error: <>, exitCode: <139> when csv file converting parquet using pandas/pyarrow libraries

2022-06-16 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555267#comment-17555267
 ] 

Weston Pace commented on ARROW-16822:
-

Can you raise an error other than SystemExit, or provide a traceback?  
This is a rather large snippet of code to parse through to figure out which 
line might be failing.  Also, the exit code you are mentioning (139) does not 
seem like something that pyarrow would set; pyarrow doesn't really 
interact with exit codes.
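
For context, exit code 139 usually means the process was killed by SIGSEGV (128 + 11), which an ordinary Python exception handler cannot catch. A minimal sketch of one way to capture a traceback for such a native crash, using the standard-library {{faulthandler}} module (the conversion step below is a placeholder):
{code:python}
import faulthandler

# Dump the Python stack of all threads to stderr if the process receives
# a fatal signal such as SIGSEGV, before it exits.
faulthandler.enable()

# ... placeholder: run the CSV -> Parquet conversion here ...
{code}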

> Python Error: <>, exitCode: <139> when csv file converting parquet using 
> pandas/pyarrow libraries
> -
>
> Key: ARROW-16822
> URL: https://issues.apache.org/jira/browse/ARROW-16822
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Mahesha Subrahamanya
>Priority: Blocker
>
> Our main requirement is to read source files (structured, semi-structured, or 
> unstructured) residing in AWS S3 through the AWS Redshift database, so that 
> our customers have direct access to analyze the data quickly and seamlessly 
> for reporting purposes, without having to define schema info for each file.
> We have created a data lake (AWS S3) workspace where our customers dump huge 
> CSV/Parquet files (around 10-15 GB). We have developed a framework that uses 
> the pandas/pyarrow (Parquet) libraries to read source files in a chunked 
> manner, identify schema information (data type/length), and push it to AWS 
> Glue, so that the AWS Redshift database can talk to the S3 files seamlessly 
> and read them very quickly.
>  
> Following is the snippet of the Parquet conversion where I'm getting this 
> error. Please take a look.
>  
> read_csv_args = \{'filepath_or_buffer': src_object, 'chunksize': 
> self.chunkSizeLimit, 'encoding': 'UTF-8','on_bad_lines': 'error','sep': 
> fileDelimiter, 'low_memory': False, 'skip_blank_lines': True, 'memory_map': 
> True} # 'verbose': True , In order to enable memory consumption logging
>             
> if srcPath.endswith('.gz'):
>                 read_csv_args['compression'] = 'gzip'
>             if fileTextQualifier:
>                 read_csv_args['quotechar'] = fileTextQualifier
> with pd.read_csv(**read_csv_args) as reader:
>                 for chunk_number, chunk in enumerate(reader, 1):
>                     # To support shape-shifting for the incoming datafiles, 
> need to make sure match file with number of columns if not delete
>                     if glueMasterSchema is not None:
>                         sessionSchema=copy.deepcopy(glueMasterSchema) 
> #copying using deepcopy() method
>                         chunk.columns = chunk.columns.str.lower() # modifying 
> the column header of all columns to lowercase
>                         fileSchema = list(chunk.columns)
>                         for key in list(sessionSchema):
>                             if key not in fileSchema:
>                                 del sessionSchema[key]
>                         fields = []
>                         for col,dtypes in sessionSchema.items():
>                             fields.append(pa.field(col, dtypes))
>                         glue_schema = pa.schema(fields)
>                         # To identify the boolean datatype and convert back 
> to STRING which was done during the BF schema
>                         for cols in chunk.columns:
>                             try:
>                                 if chunk[cols].dtype =='bool':
>                                     chunk[cols] = chunk[cols].astype('str')
>                                 if chunk[cols].dtype =='object':
>                                     chunk[cols] = 
> chunk[cols].fillna('').astype('str').tolist()
>                             except (ParserError,ValueError,TypeError):
>                                 pass
>                     log.debug("chunk count", chunk_number, "chunk length", 
> len(chunk), 'glue_schema', glue_schema, 'Wrote file', targetKey)
>                     #log.debug("during pandas chunk data ", chunk,"df 
> schemas:", chunk.dtypes)
>                     table = pa.Table.from_pandas(chunk,  schema=glue_schema , 
> preserve_index=False)
>                     log.info('Glue schema:',glue_schema,'for a 
> [file:',targetKey|file:///',targetKey])
>                     log.info('pandas memory utilization during chunk process: 
> ', chunk.memory_usage().sum(), 'Bytes.','\n\n\n')
>                     # Guess the schema of the CSV file from the first chunk
>                     #if pq_writer is None:
>                     if chunk_number == 1:
>                         #parquet_schema = table.schema
>                         # Open a Parquet file for writing
>                         pq_writer = 

[jira] [Commented] (ARROW-16770) [C++] Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0

2022-06-16 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555265#comment-17555265
 ] 

Weston Pace commented on ARROW-16770:
-

I don't know if it's possible but can you confirm the gmock library in 
pyarrow-dev matches your gtest version?

> [C++] Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0
> ---
>
> Key: ARROW-16770
> URL: https://issues.apache.org/jira/browse/ARROW-16770
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Priority: Major
>
> I built Arrow using the instructions in the Python development page, under 
> the pyarrow-dev environment, and found that `arrow-substrait-substrait-test` 
> fails with SIGSEGV - see gdb session below. The same Arrow builds and runs 
> correctly on my system, outside of pyarrow-dev. I suspect this is due to 
> something different about gtest 1.11.0 as compared to gtest 1.10.0 based on 
> the following observations:
>  # The backtrace in the gdb session shows gtest 1.11.0 is used.
>  # The backtrace also shows the error is deep inside gtest, working on an 
> `UnorderedElementsAre` expectation.
>  # My system, outside pyarrow-dev, uses gtest 1.10.0.
>  
> {noformat}
> $ gdb --args ./release/arrow-substrait-substrait-test 
> GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
> Copyright (C) 2020 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
>     .
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from ./release/arrow-substrait-substrait-test...
> (gdb) run
> Starting program: 
> /mnt/user1/tscontract/github/rtpsw/arrow/cpp/build/debug/release/arrow-substrait-substrait-test
>  
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> [New Thread 0x741ff700 (LWP 115128)]
> Running main() from 
> /home/conda/feedstock_root/build_artifacts/gtest_1647154636757/work/googletest/src/gtest_main.cc
> [==] Running 33 tests from 3 test suites.
> [--] Global test environment set-up.
> [--] 4 tests from ExtensionIdRegistryTest
> [ RUN      ] ExtensionIdRegistryTest.RegisterTempTypes
> [       OK ] ExtensionIdRegistryTest.RegisterTempTypes (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterTempFunctions
> [       OK ] ExtensionIdRegistryTest.RegisterTempFunctions (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterNestedTypes
> [       OK ] ExtensionIdRegistryTest.RegisterNestedTypes (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterNestedFunctions
> [       OK ] ExtensionIdRegistryTest.RegisterNestedFunctions (0 ms)
> [--] 4 tests from ExtensionIdRegistryTest (0 ms total)
> [--] 21 tests from Substrait
> [ RUN      ] Substrait.SupportedTypes
> [       OK ] Substrait.SupportedTypes (0 ms)
> [ RUN      ] Substrait.SupportedExtensionTypes
> [       OK ] Substrait.SupportedExtensionTypes (0 ms)
> [ RUN      ] Substrait.NamedStruct
> [       OK ] Substrait.NamedStruct (0 ms)
> [ RUN      ] Substrait.NoEquivalentArrowType
> [       OK ] Substrait.NoEquivalentArrowType (0 ms)
> [ RUN      ] Substrait.NoEquivalentSubstraitType
> [       OK ] Substrait.NoEquivalentSubstraitType (0 ms)
> [ RUN      ] Substrait.SupportedLiterals
> [       OK ] Substrait.SupportedLiterals (1 ms)
> [ RUN      ] Substrait.CannotDeserializeLiteral
> [       OK ] Substrait.CannotDeserializeLiteral (0 ms)
> [ RUN      ] Substrait.FieldRefRoundTrip
> [       OK ] Substrait.FieldRefRoundTrip (1 ms)
> [ RUN      ] Substrait.RecursiveFieldRef
> [       OK ] Substrait.RecursiveFieldRef (0 ms)
> [ RUN      ] Substrait.FieldRefsInExpressions
> [       OK ] Substrait.FieldRefsInExpressions (0 ms)
> [ RUN      ] Substrait.CallSpecialCaseRoundTrip
> [       OK ] Substrait.CallSpecialCaseRoundTrip (0 ms)
> [ RUN      ] Substrait.CallExtensionFunction
> [       OK ] Substrait.CallExtensionFunction (0 ms)
> [ RUN      ] Substrait.ReadRel
> Thread 1 "arrow-substrait" received signal SIGSEGV, Segmentation fault.
> 0x555b02e6 in 
> testing::internal::MatcherBase std::char_traits, std::allocator > const&>::MatchAndExplain 
> (listener=0x7fffb3a0, x=..., 
>     this=) at 
> 

[jira] [Updated] (ARROW-16424) [C++] Update uri_path parsing in FromProto

2022-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16424:
---
Labels: pull-request-available substrait  (was: substrait)

> [C++] Update uri_path parsing in FromProto
> --
>
> Key: ARROW-16424
> URL: https://issues.apache.org/jira/browse/ARROW-16424
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ariana Villegas
>Assignee: Sanjiban Sengupta
>Priority: Minor
>  Labels: pull-request-available, substrait
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The FromProto function in {{arrow/engine/substrait/relation_internal.cc}} parses 
> {{uri_path}} with {{string_view}} utilities. However, this should be done with 
> the {{Uri}} class from {{arrow/util/uri.h}}.
> {code:c++}
> else if (util::string_view{path}.ends_with(".arrow")) {
>   format = std::make_shared();
> } else if (util::string_view{path}.ends_with(".feather")) {
>   format = std::make_shared();
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source

2022-06-16 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555262#comment-17555262
 ] 

Weston Pace commented on ARROW-16810:
-

I think the most common thing we see for this kind of repartitioning is to 
write the output files to a new directory.  That way a future discovery won't 
see both files.

The {{dataset()}} function can also accept a list of files.  So you should be 
able to use the python {{glob}} module to do your own discovery and then pass 
the list of files to pyarrow instead of a directory.  I slightly prefer this 
method, since there are many different ways to list and exclude files, and we 
don't add much value doing this ourselves versus doing it in Python.

That being said, we did add some support for glob parsing in the C++ lib to 
handle Substrait (which can pass in paths like **/*.csv), so it might not be 
too hard to add support for option #3 that you supplied.
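
A rough sketch of the list-of-files approach, reusing the {{/home/user/csv_files}} path from the report (the output directory name is a placeholder):
{code:python}
import glob

import pyarrow.dataset as ds

# Discover the CSV files in Python so that Parquet files written into the
# same tree later are never picked up by the CSV reader.
csv_files = sorted(glob.glob('/home/user/csv_files/*.csv'))
dataset = ds.dataset(csv_files, format='csv')

# Writing to a separate directory also keeps future discovery clean.
ds.write_dataset(dataset, base_dir='/home/user/csv_files/pq_files',
                 format='parquet')
{code}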

> [Python] PyArrow: write_dataset - Could not open CSV input source
> -
>
> Key: ARROW-16810
> URL: https://issues.apache.org/jira/browse/ARROW-16810
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
> Environment: Pop!_OS 20.04 LTS OS & Conda 4.11.0 /Mamba 0.23.0 
> Environment
>Reporter: Earle Lyons
>Priority: Minor
>
> Hi Arrow Community! 
> Happy Friday! I am a new user to Arrow, specifically using pyarrow. However, 
> I am very excited about the project. 
> I am experiencing issues with the '{*}write_dataset'{*} function from the 
> '{*}dataset{*}' module. Please forgive me, if this is a known issue. However, 
> I have searched the GitHub 'Issues', as well as Stack Overflow and I have not 
> identified a similar issue. 
> I have a directory that contains 90 CSV files (essentially one CSV for each 
> day between 2021-01-01 and 2021-03-31).  My objective was to read all the CSV 
> files into a dataset and write the dataset to a single Parquet file format. 
> Unfortunately, some of the CSV files contained nulls in some columns, which 
> presented some issues which were resolved by specifying DataTypes with the 
> following Stack Overflow solution:
> [How do I specify a dtype for all columns when reading a CSV file with 
> pyarrow?|https://stackoverflow.com/questions/71533197/how-do-i-specify-a-dtype-for-all-columns-when-reading-a-csv-file-with-pyarrow]
> The following code works on the first pass.
> {code:python}
> import pyarrow as pa
> import pyarrow.csv as csv
> import pyarrow.dataset as ds
> import re
> {code}
> {code:python}
> pa.__version__
> '8.0.0'
> {code}
> {code:python}
> column_types = {}
> csv_path = '/home/user/csv_files'
> field_re_pattern = "value_*"
> # Open a dataset with the 'csv_path' path and 'csv' file format
> # and assign to 'dataset1'
> dataset1 = ds.dataset(csv_path, format='csv')
> # Loop through each field in the 'dataset1' schema,
> # match the 'field_re_pattern' regex pattern in the field name,
> # and assign 'int64' DataType to the field.name in the 'column_types'
> # dictionary 
> for field in (field for field in dataset1.schema \
>               if re.match(field_re_pattern, field.name)):
>         column_types[field.name] = pa.int64()
> # Creates options for CSV data using the 'column_types' dictionary
> # This returns a 
> convert_options = csv.ConvertOptions(column_types=column_types)
> # Creates FileFormat for CSV using the 'convert_options' 
> # This returns a 
> custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)
> # Open a dataset with the 'csv_path' path, instead of using the 
> # 'csv' file format, use the 'custom_csv_format' and assign to 
> # 'dataset2'
> dataset2 = ds.dataset(csv_path, format=custom_csv_format)
> # Write the 'dataset2' to the 'csv_path' base directory in the 
> # 'parquet' format, and overwrite/ignore if the file exists
> ds.write_dataset(dataset2, base_dir=csv_path, format='parquet', 
> existing_data_behavior='overwrite_or_ignore')
> {code}
> As previously stated, on first pass, the code works and creates a single 
> parquet file (part-0.parquet) with the correct data, row count, and schema.
> However, if the code is run again, the following error is encountered:
> {code:python}
> ArrowInvalid: Could not open CSV input source 
> '/home/user/csv_files/part-0.parquet': Invalid: CSV parse error: Row #2: 
> Expected 4 columns, got 1: 6NQJRJV02XW$0Y8V p   A$A18CEBS
> 305DEM030TTW �5HZ50GCVJV1CSV
> {code}
> My interpretation of the error is that on the second pass the 'dataset2' 
> variable now includes the 'part-0.parquet' file (which can be confirmed with 
> the `dataset2.files` output showing the file) and the CSV reader is 
> attempting to parse/read the parquet file.
> If this is the case, is there an 

[jira] [Created] (ARROW-16844) [C++][Python] Implement to/from substrait for Expression

2022-06-16 Thread Will Jones (Jira)
Will Jones created ARROW-16844:
--

 Summary: [C++][Python] Implement to/from substrait for Expression
 Key: ARROW-16844
 URL: https://issues.apache.org/jira/browse/ARROW-16844
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Will Jones


DataFusion has the ability to convert between Substrait expressions and its 
own internal expressions. (See: 
[https://github.com/datafusion-contrib/datafusion-substrait] .) It would be 
cool if we had a similar conversion for Acero's Expression class.

This might unlock allowing datafusion-python to easily use PyArrow datasets, by 
using Substrait as intermediate format to pass down filter and projections from 
Datafusion into the scanner. (See early draft here: 
[https://github.com/datafusion-contrib/datafusion-python/pull/21].)

One problem is that it's unclear what should be the type of the object in 
Python representing the Substrait expression. IIUC Python doesn't have direct 
bindings to the Substrait protobuf.

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16831) [Go] ipc.Reader should panic for invalid string array offsets

2022-06-16 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-16831.
---
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13381
[https://github.com/apache/arrow/pull/13381]

> [Go] ipc.Reader should panic for invalid string array offsets
> -
>
> Key: ARROW-16831
> URL: https://issues.apache.org/jira/browse/ARROW-16831
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Affects Versions: 8.0.0
>Reporter: Chris Hoff
>Assignee: Chris Hoff
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> ipc.Reader will silently accept string columns with invalid offsets. This 
> results in a panic later when attempting to access the table or write it with 
> ipc.Writer.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16843) [Python][CSV] CSV reader performs unsafe type conversion

2022-06-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555182#comment-17555182
 ] 

Antoine Pitrou commented on ARROW-16843:


Yes, making the inference abstract and overridable would be another possible 
way. Needs someone to look at the current inference internals and devise an API 
to make it overridable.


> [Python][CSV] CSV reader performs unsafe type conversion
> 
>
> Key: ARROW-16843
> URL: https://issues.apache.org/jira/browse/ARROW-16843
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Thomas Buhrmann
>Priority: Major
>
> Hi, I've noticed that although pa.scalar and pa.array behave correctly when 
> given the largest possible (uint64) value (i.e. they fail correctly when 
> trying to cast to float e.g.), the CSV reader happily converts strings 
> representing uint64 values to float (see example below). Is this intended? 
> Would it be possible to have a safe-conversion-only option?
> The problem is that at the moment the only safe option to read a CSV whose 
> types are not known in advance is to read without any conversion (string 
> only) and perform the type inference oneself.
> It would be ok if Uint64 types couldn't be inferred, as long as the 
> corresponding columns aren't coerced in a destructive manner to float. I.e., 
> if they were left as string columns, one could then implement a custom 
> conversion, while still benefiting from the correct and automatic conversion 
> of the remaining columns.
>  
> The following correctly rejects the float type for uint64 values:
> {code:java}
> import pyarrow as pa
> uint64_max = 18_446_744_073_709_551_615
> type_ = pa.uint64()
> uint64_scalar = pa.scalar(uint64_max, type=type_)
> uint64_array = pa.array([uint64_max], type=type_)
> try:
>     f = pa.scalar(uint64_max, type=pa.float64())
> except Exception as exc:
>     print(exc)
>     
> try:
>     f = pa.scalar(uint64_max // 2, type=pa.float64())
> except Exception as exc:
>     print(exc) {code}
> {code:java}
> >> PyLong is too large to fit int64
> >> Integer value 9223372036854775807 is outside of the range exactly 
> >> representable by a IEEE 754 double precision value
> {code}
> The CSV reader, on the other hand, doesn't infer UInt64 types (which is fine, 
> as documented here 
> [https://arrow.apache.org/docs/cpp/csv.html#data-types),|https://arrow.apache.org/docs/cpp/csv.html#data-types)]
>   but does coerce values to float which shouldn't be coercable according to 
> above examples:
> {code:java}
> import io
> csv = "int64,uint64\n0,0\n4294967295,18446744073709551615"
> tbl = pa.csv.read_csv(io.BytesIO(csv.encode("utf-8")))
> print(tbl.schema)
> print(tbl.column("uint64")[1] == uint64_scalar)
> print(tbl.column("uint64")[1].cast(pa.uint64())) {code}
> {code:java}
> int64: int64
> uint64: double
> False
> 0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-16843) [Python][CSV] CSV reader performs unsafe type conversion

2022-06-16 Thread Thomas Buhrmann (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555180#comment-17555180
 ] 

Thomas Buhrmann edited comment on ARROW-16843 at 6/16/22 4:23 PM:
--

Or even expose the type inference itself in some way, so one could simply read 
all columns as strings and then use the underlying type inference on a column 
by column basis, using additional custom logic. I'm currently creating an 
additional inference layer, e.g., that also infers list types from string 
columns, timestamps with non-iso formats, downcasts ints to the smallest 
possible type etc. (the uint64 case is the only "problem" I had so far fwiw..)


was (Author: buhrmann):
Or even expose the type inference itself in some way, so one could simply read 
all columns as strings and then use the underlying type inference on a column 
by column basis, using additional custom logic. I'm currently creating an 
additional inference layer, e.g., that also infers list types from string 
columns, timestamps with non-iso formats, downcasts ints to the smallest 
possible type etc. (the uint64 case if the only "problem" I had so far fwiw..)
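
A minimal sketch of that kind of custom inference layer (not the author's actual 
code): force every column to string, then try safe casts in a fixed order so 
uint64-range values are preserved instead of being coerced to float:

{code:python}
import io

import pyarrow as pa
import pyarrow.csv as csv

raw = b"int64,uint64\n0,0\n4294967295,18446744073709551615"

# Force every column to string so the built-in inference never runs.
names = raw.split(b"\n", 1)[0].decode().split(",")
all_strings = csv.ConvertOptions(column_types={n: pa.string() for n in names})
tbl = csv.read_csv(io.BytesIO(raw), convert_options=all_strings)

def infer(col, candidates=(pa.int64(), pa.uint64(), pa.float64())):
    # Try candidate types in order; cast() is safe by default and raises on
    # overflow or parse failure, so the column stays a string if nothing fits.
    for typ in candidates:
        try:
            return col.cast(typ)
        except (pa.ArrowInvalid, pa.ArrowNotImplementedError):
            continue
    return col

tbl = pa.table({n: infer(tbl.column(n)) for n in tbl.column_names})
print(tbl.schema)  # int64: int64, uint64: uint64
{code}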

> [Python][CSV] CSV reader performs unsafe type conversion
> 
>
> Key: ARROW-16843
> URL: https://issues.apache.org/jira/browse/ARROW-16843
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Thomas Buhrmann
>Priority: Major
>
> Hi, I've noticed that although pa.scalar and pa.array behave correctly when 
> given the largest possible (uint64) value (i.e. they fail correctly when 
> trying to cast to float e.g.), the CSV reader happily converts strings 
> representing uint64 values to float (see example below). Is this intended? 
> Would it be possible to have a safe-conversion-only option?
> The problem is that at the moment the only safe option to read a CSV whose 
> types are not known in advance is to read without any conversion (string 
> only) and perform the type inference oneself.
> It would be ok if Uint64 types couldn't be inferred, as long as the 
> corresponding columns aren't coerced in a destructive manner to float. I.e., 
> if they were left as string columns, one could then implement a custom 
> conversion, while still benefiting from the correct and automatic conversion 
> of the remaining columns.
>  
> The following correctly rejects the float type for uint64 values:
> {code:java}
> import pyarrow as pa
> uint64_max = 18_446_744_073_709_551_615
> type_ = pa.uint64()
> uint64_scalar = pa.scalar(uint64_max, type=type_)
> uint64_array = pa.array([uint64_max], type=type_)
> try:
>     f = pa.scalar(uint64_max, type=pa.float64())
> except Exception as exc:
>     print(exc)
>     
> try:
>     f = pa.scalar(uint64_max // 2, type=pa.float64())
> except Exception as exc:
>     print(exc) {code}
> {code:java}
> >> PyLong is too large to fit int64
> >> Integer value 9223372036854775807 is outside of the range exactly 
> >> representable by a IEEE 754 double precision value
> {code}
> The CSV reader, on the other hand, doesn't infer UInt64 types (which is fine, 
> as documented here 
> [https://arrow.apache.org/docs/cpp/csv.html#data-types),|https://arrow.apache.org/docs/cpp/csv.html#data-types)]
>   but does coerce values to float which shouldn't be coercable according to 
> above examples:
> {code:java}
> import io
> csv = "int64,uint64\n0,0\n4294967295,18446744073709551615"
> tbl = pa.csv.read_csv(io.BytesIO(csv.encode("utf-8")))
> print(tbl.schema)
> print(tbl.column("uint64")[1] == uint64_scalar)
> print(tbl.column("uint64")[1].cast(pa.uint64())) {code}
> {code:java}
> int64: int64
> uint64: double
> False
> 0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16843) [Python][CSV] CSV reader performs unsafe type conversion

2022-06-16 Thread Thomas Buhrmann (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555180#comment-17555180
 ] 

Thomas Buhrmann commented on ARROW-16843:
-

Or even expose the type inference itself in some way, so one could simply read 
all columns as strings and then use the underlying type inference on a column 
by column basis, using additional custom logic. I'm currently creating an 
additional inference layer, e.g., that also infers list types from string 
columns, timestamps with non-iso formats, downcasts ints to the smallest 
possible type etc. (the uint64 case if the only "problem" I had so far fwiw..)

> [Python][CSV] CSV reader performs unsafe type conversion
> 
>
> Key: ARROW-16843
> URL: https://issues.apache.org/jira/browse/ARROW-16843
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Thomas Buhrmann
>Priority: Major
>
> Hi, I've noticed that although pa.scalar and pa.array behave correctly when 
> given the largest possible (uint64) value (i.e. they fail correctly when 
> trying to cast to float e.g.), the CSV reader happily converts strings 
> representing uint64 values to float (see example below). Is this intended? 
> Would it be possible to have a safe-conversion-only option?
> The problem is that at the moment the only safe option to read a CSV whose 
> types are not known in advance is to read without any conversion (string 
> only) and perform the type inference oneself.
> It would be ok if Uint64 types couldn't be inferred, as long as the 
> corresponding columns aren't coerced in a destructive manner to float. I.e., 
> if they were left as string columns, one could then implement a custom 
> conversion, while still benefiting from the correct and automatic conversion 
> of the remaining columns.
>  
> The following correctly rejects the float type for uint64 values:
> {code:java}
> import pyarrow as pa
> uint64_max = 18_446_744_073_709_551_615
> type_ = pa.uint64()
> uint64_scalar = pa.scalar(uint64_max, type=type_)
> uint64_array = pa.array([uint64_max], type=type_)
> try:
>     f = pa.scalar(uint64_max, type=pa.float64())
> except Exception as exc:
>     print(exc)
>     
> try:
>     f = pa.scalar(uint64_max // 2, type=pa.float64())
> except Exception as exc:
>     print(exc) {code}
> {code:java}
> >> PyLong is too large to fit int64
> >> Integer value 9223372036854775807 is outside of the range exactly 
> >> representable by a IEEE 754 double precision value
> {code}
> The CSV reader, on the other hand, doesn't infer UInt64 types (which is fine, 
> as documented here 
> [https://arrow.apache.org/docs/cpp/csv.html#data-types),|https://arrow.apache.org/docs/cpp/csv.html#data-types)]
>   but does coerce values to float which shouldn't be coercable according to 
> above examples:
> {code:java}
> import io
> csv = "int64,uint64\n0,0\n4294967295,18446744073709551615"
> tbl = pa.csv.read_csv(io.BytesIO(csv.encode("utf-8")))
> print(tbl.schema)
> print(tbl.column("uint64")[1] == uint64_scalar)
> print(tbl.column("uint64")[1].cast(pa.uint64())) {code}
> {code:java}
> int64: int64
> uint64: double
> False
> 0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-14846) [R] Bindings for lubridate's stamp, stamp_date, and stamp_time

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14846:
--
Parent: ARROW-16841
Issue Type: Sub-task  (was: Improvement)

> [R] Bindings for lubridate's stamp, stamp_date, and stamp_time
> --
>
> Key: ARROW-14846
> URL: https://issues.apache.org/jira/browse/ARROW-14846
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16843) [Python][CSV] CSV reader performs unsafe type conversion

2022-06-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555175#comment-17555175
 ] 

Antoine Pitrou commented on ARROW-16843:


An actual list of inferrable types wouldn't exactly work, by the way, because 
datatypes are not the inference granularity.

> [Python][CSV] CSV reader performs unsafe type conversion
> 
>
> Key: ARROW-16843
> URL: https://issues.apache.org/jira/browse/ARROW-16843
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Thomas Buhrmann
>Priority: Major
>
> Hi, I've noticed that although pa.scalar and pa.array behave correctly when 
> given the largest possible (uint64) value (i.e. they fail correctly when 
> trying to cast to float e.g.), the CSV reader happily converts strings 
> representing uint64 values to float (see example below). Is this intended? 
> Would it be possible to have a safe-conversion-only option?
> The problem is that at the moment the only safe option to read a CSV whose 
> types are not known in advance is to read without any conversion (string 
> only) and perform the type inference oneself.
> It would be ok if Uint64 types couldn't be inferred, as long as the 
> corresponding columns aren't coerced in a destructive manner to float. I.e., 
> if they were left as string columns, one could then implement a custom 
> conversion, while still benefiting from the correct and automatic conversion 
> of the remaining columns.
>  
> The following correctly rejects the float type for uint64 values:
> {code:java}
> import pyarrow as pa
> uint64_max = 18_446_744_073_709_551_615
> type_ = pa.uint64()
> uint64_scalar = pa.scalar(uint64_max, type=type_)
> uint64_array = pa.array([uint64_max], type=type_)
> try:
>     f = pa.scalar(uint64_max, type=pa.float64())
> except Exception as exc:
>     print(exc)
>     
> try:
>     f = pa.scalar(uint64_max // 2, type=pa.float64())
> except Exception as exc:
>     print(exc) {code}
> {code:java}
> >> PyLong is too large to fit int64
> >> Integer value 9223372036854775807 is outside of the range exactly 
> >> representable by a IEEE 754 double precision value
> {code}
> The CSV reader, on the other hand, doesn't infer UInt64 types (which is fine, 
> as documented here 
> [https://arrow.apache.org/docs/cpp/csv.html#data-types),|https://arrow.apache.org/docs/cpp/csv.html#data-types)]
>   but does coerce values to float which shouldn't be coercable according to 
> above examples:
> {code:java}
> import io
> csv = "int64,uint64\n0,0\n4294967295,18446744073709551615"
> tbl = pa.csv.read_csv(io.BytesIO(csv.encode("utf-8")))
> print(tbl.schema)
> print(tbl.column("uint64")[1] == uint64_scalar)
> print(tbl.column("uint64")[1].cast(pa.uint64())) {code}
> {code:java}
> int64: int64
> uint64: double
> False
> 0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16843) [Python][CSV] CSV reader performs unsafe type conversion

2022-06-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555173#comment-17555173
 ] 

Antoine Pitrou commented on ARROW-16843:


bq. Another possibility would be to pass a list of inferrable types (so one 
could exclude float64), in addition to the explicit column_types parameter.

Yes, I think this would be the better solution and would address more use cases.

> [Python][CSV] CSV reader performs unsafe type conversion
> 
>
> Key: ARROW-16843
> URL: https://issues.apache.org/jira/browse/ARROW-16843
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Thomas Buhrmann
>Priority: Major
>
> Hi, I've noticed that although pa.scalar and pa.array behave correctly when 
> given the largest possible (uint64) value (i.e. they fail correctly when 
> trying to cast to float e.g.), the CSV reader happily converts strings 
> representing uint64 values to float (see example below). Is this intended? 
> Would it be possible to have a safe-conversion-only option?
> The problem is that at the moment the only safe option to read a CSV whose 
> types are not known in advance is to read without any conversion (string 
> only) and perform the type inference oneself.
> It would be ok if Uint64 types couldn't be inferred, as long as the 
> corresponding columns aren't coerced in a destructive manner to float. I.e., 
> if they were left as string columns, one could then implement a custom 
> conversion, while still benefiting from the correct and automatic conversion 
> of the remaining columns.
>  
> The following correctly rejects the float type for uint64 values:
> {code:java}
> import pyarrow as pa
> uint64_max = 18_446_744_073_709_551_615
> type_ = pa.uint64()
> uint64_scalar = pa.scalar(uint64_max, type=type_)
> uint64_array = pa.array([uint64_max], type=type_)
> try:
>     f = pa.scalar(uint64_max, type=pa.float64())
> except Exception as exc:
>     print(exc)
>     
> try:
>     f = pa.scalar(uint64_max // 2, type=pa.float64())
> except Exception as exc:
>     print(exc) {code}
> {code:java}
> >> PyLong is too large to fit int64
> >> Integer value 9223372036854775807 is outside of the range exactly 
> >> representable by a IEEE 754 double precision value
> {code}
> The CSV reader, on the other hand, doesn't infer UInt64 types (which is fine, 
> as documented here 
> [https://arrow.apache.org/docs/cpp/csv.html#data-types),|https://arrow.apache.org/docs/cpp/csv.html#data-types)]
>   but does coerce values to float which shouldn't be coercable according to 
> above examples:
> {code:java}
> import io
> csv = "int64,uint64\n0,0\n4294967295,18446744073709551615"
> tbl = pa.csv.read_csv(io.BytesIO(csv.encode("utf-8")))
> print(tbl.schema)
> print(tbl.column("uint64")[1] == uint64_scalar)
> print(tbl.column("uint64")[1].cast(pa.uint64())) {code}
> {code:java}
> int64: int64
> uint64: double
> False
> 0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16843) [Python][CSV] CSV reader performs unsafe type conversion

2022-06-16 Thread Thomas Buhrmann (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555172#comment-17555172
 ] 

Thomas Buhrmann commented on ARROW-16843:
-

You're right, this also performs destructive conversion:
{code:java}
pa.scalar("18446744073709551615").cast(pa.float64()) {code}
{noformat}
 >> 

{noformat}
Which is why I think it would be good to have an option to not perform certain 
conversions automatically if they have the potential to be destructive (in the 
sense that one cannot cast back to string or another type without loss of 
information), even if the default may be destructive. E.g. it is quite common 
to have ID columns in the uint64 range, which at the moment cannot be read 
using the CSV reader (without disabling all type inference). 

Another possibility would be to pass a list of inferrable types (so one could 
exclude float64), in addition to the explicit [column_types 
parameter|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow3csv14ConvertOptions12column_typesE].
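
For reference, the existing column_types escape hatch already covers the 
ID-column case when the column name is known up front (the names below are made 
up for illustration):

{code:python}
import io

import pyarrow as pa
import pyarrow.csv as csv

raw = b"id,value\n18446744073709551615,1.5"

# Pin the problematic column explicitly so inference never coerces it to float64.
opts = csv.ConvertOptions(column_types={"id": pa.uint64()})
tbl = csv.read_csv(io.BytesIO(raw), convert_options=opts)
print(tbl.schema)  # id: uint64, value: double
{code}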

> [Python][CSV] CSV reader performs unsafe type conversion
> 
>
> Key: ARROW-16843
> URL: https://issues.apache.org/jira/browse/ARROW-16843
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Thomas Buhrmann
>Priority: Major
>
> Hi, I've noticed that although pa.scalar and pa.array behave correctly when 
> given the largest possible (uint64) value (i.e. they fail correctly when 
> trying to cast to float e.g.), the CSV reader happily converts strings 
> representing uint64 values to float (see example below). Is this intended? 
> Would it be possible to have a safe-conversion-only option?
> The problem is that at the moment the only safe option to read a CSV whose 
> types are not known in advance is to read without any conversion (string 
> only) and perform the type inference oneself.
> It would be ok if Uint64 types couldn't be inferred, as long as the 
> corresponding columns aren't coerced in a destructive manner to float. I.e., 
> if they were left as string columns, one could then implement a custom 
> conversion, while still benefiting from the correct and automatic conversion 
> of the remaining columns.
>  
> The following correctly rejects the float type for uint64 values:
> {code:java}
> import pyarrow as pa
> uint64_max = 18_446_744_073_709_551_615
> type_ = pa.uint64()
> uint64_scalar = pa.scalar(uint64_max, type=type_)
> uint64_array = pa.array([uint64_max], type=type_)
> try:
>     f = pa.scalar(uint64_max, type=pa.float64())
> except Exception as exc:
>     print(exc)
>     
> try:
>     f = pa.scalar(uint64_max // 2, type=pa.float64())
> except Exception as exc:
>     print(exc) {code}
> {code:java}
> >> PyLong is too large to fit int64
> >> Integer value 9223372036854775807 is outside of the range exactly 
> >> representable by a IEEE 754 double precision value
> {code}
> The CSV reader, on the other hand, doesn't infer UInt64 types (which is fine, 
> as documented here 
> [https://arrow.apache.org/docs/cpp/csv.html#data-types),|https://arrow.apache.org/docs/cpp/csv.html#data-types)]
>   but does coerce values to float which shouldn't be coercable according to 
> above examples:
> {code:java}
> import io
> csv = "int64,uint64\n0,0\n4294967295,18446744073709551615"
> tbl = pa.csv.read_csv(io.BytesIO(csv.encode("utf-8")))
> print(tbl.schema)
> print(tbl.column("uint64")[1] == uint64_scalar)
> print(tbl.column("uint64")[1].cast(pa.uint64())) {code}
> {code:java}
> int64: int64
> uint64: double
> False
> 0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-13133) [R] Add support for locale-specific day of week (and month of year?) returns from timestamp accessor functions

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-13133:
--
Parent: ARROW-16841
Issue Type: Sub-task  (was: Improvement)

> [R] Add support for locale-specific day of week (and month of year?) returns 
> from timestamp accessor functions
> --
>
> Key: ARROW-13133
> URL: https://issues.apache.org/jira/browse/ARROW-13133
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: good-first-issue
>
> The R binding for the wday date accessor added in this PR 
> [https://github.com/apache/arrow/pull/10507] currently doesn't support 
> returning the string representation of the day of the week (e.g. "Mon") and 
> only supports the numeric representation (e.g. 1).
> We should implement this, though discussion should be had about whether this 
> belongs at the R or C++ level.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16843) [Python][CSV] CSV reader performs unsafe type conversion

2022-06-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555150#comment-17555150
 ] 

Antoine Pitrou commented on ARROW-16843:


Note that's not really the same situation: in the first case, you are 
converting between number types, where you presumably expect the conversion to 
be exact. In the second case, you are converting from string to number, where 
it's not unreasonable to accept some possible precision loss.

It's also likely that raising an error would break a non-tiny number of 
real-world CSV reading use cases.
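
For comparison, plain Python string-to-float parsing makes the same trade-off, 
silently rounding to the nearest representable double:

{code:python}
s = "18446744073709551615"
f = float(s)              # no error is raised
print(f)                  # 1.8446744073709552e+19
print(int(f) == int(s))   # False: the exact integer value is lost
{code}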


> [Python][CSV] CSV reader performs unsafe type conversion
> 
>
> Key: ARROW-16843
> URL: https://issues.apache.org/jira/browse/ARROW-16843
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Thomas Buhrmann
>Priority: Major
>
> Hi, I've noticed that although pa.scalar and pa.array behave correctly when 
> given the largest possible (uint64) value (i.e. they fail correctly when 
> trying to cast to float e.g.), the CSV reader happily converts strings 
> representing uint64 values to float (see example below). Is this intended? 
> Would it be possible to have a safe-conversion-only option?
> The problem is that at the moment the only safe option to read a CSV whose 
> types are not known in advance is to read without any conversion (string 
> only) and perform the type inference oneself.
> It would be ok if Uint64 types couldn't be inferred, as long as the 
> corresponding columns aren't coerced in a destructive manner to float. I.e., 
> if they were left as string columns, one could then implement a custom 
> conversion, while still benefiting from the correct and automatic conversion 
> of the remaining columns.
>  
> The following correctly rejects the float type for uint64 values:
> {code:java}
> import pyarrow as pa
> uint64_max = 18_446_744_073_709_551_615
> type_ = pa.uint64()
> uint64_scalar = pa.scalar(uint64_max, type=type_)
> uint64_array = pa.array([uint64_max], type=type_)
> try:
>     f = pa.scalar(uint64_max, type=pa.float64())
> except Exception as exc:
>     print(exc)
>     
> try:
>     f = pa.scalar(uint64_max // 2, type=pa.float64())
> except Exception as exc:
>     print(exc) {code}
> {code:java}
> >> PyLong is too large to fit int64
> >> Integer value 9223372036854775807 is outside of the range exactly 
> >> representable by a IEEE 754 double precision value
> {code}
> The CSV reader, on the other hand, doesn't infer UInt64 types (which is fine, 
> as documented here 
> [https://arrow.apache.org/docs/cpp/csv.html#data-types),|https://arrow.apache.org/docs/cpp/csv.html#data-types)]
>   but does coerce values to float which shouldn't be coercable according to 
> above examples:
> {code:java}
> import io
> csv = "int64,uint64\n0,0\n4294967295,18446744073709551615"
> tbl = pa.csv.read_csv(io.BytesIO(csv.encode("utf-8")))
> print(tbl.schema)
> print(tbl.column("uint64")[1] == uint64_scalar)
> print(tbl.column("uint64")[1].cast(pa.uint64())) {code}
> {code:java}
> int64: int64
> uint64: double
> False
> 0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16843) [Python][CSV] CSV reader performs unsafe type conversion

2022-06-16 Thread Thomas Buhrmann (Jira)
Thomas Buhrmann created ARROW-16843:
---

 Summary: [Python][CSV] CSV reader performs unsafe type conversion
 Key: ARROW-16843
 URL: https://issues.apache.org/jira/browse/ARROW-16843
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 8.0.0
Reporter: Thomas Buhrmann


Hi, I've noticed that although pa.scalar and pa.array behave correctly when 
given the largest possible (uint64) value (i.e. they fail correctly when trying 
to cast to float e.g.), the CSV reader happily converts strings representing 
uint64 values to float (see example below). Is this intended? Would it be 
possible to have a safe-conversion-only option?

The problem is that at the moment the only safe option to read a CSV whose 
types are not known in advance is to read without any conversion (string only) 
and perform the type inference oneself.

It would be ok if Uint64 types couldn't be inferred, as long as the 
corresponding columns aren't coerced in a destructive manner to float. I.e., if 
they were left as string columns, one could then implement a custom conversion, 
while still benefiting from the correct and automatic conversion of the 
remaining columns.

 

The following correctly rejects the float type for uint64 values:
{code:java}
import pyarrow as pa

uint64_max = 18_446_744_073_709_551_615

type_ = pa.uint64()
uint64_scalar = pa.scalar(uint64_max, type=type_)
uint64_array = pa.array([uint64_max], type=type_)

try:
    f = pa.scalar(uint64_max, type=pa.float64())
except Exception as exc:
    print(exc)
    
try:
    f = pa.scalar(uint64_max // 2, type=pa.float64())
except Exception as exc:
    print(exc) {code}
{code:java}
>> PyLong is too large to fit int64
>> Integer value 9223372036854775807 is outside of the range exactly 
>> representable by a IEEE 754 double precision value
{code}
The CSV reader, on the other hand, doesn't infer UInt64 types (which is fine, 
as documented here 
[https://arrow.apache.org/docs/cpp/csv.html#data-types),|https://arrow.apache.org/docs/cpp/csv.html#data-types)]
  but does coerce values to float which shouldn't be coercable according to 
above examples:
{code:java}
import io

csv = "int64,uint64\n0,0\n4294967295,18446744073709551615"
tbl = pa.csv.read_csv(io.BytesIO(csv.encode("utf-8")))

print(tbl.schema)
print(tbl.column("uint64")[1] == uint64_scalar)
print(tbl.column("uint64")[1].cast(pa.uint64())) {code}
{code:java}
int64: int64
uint64: double

False
0
{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (ARROW-16840) [CI] replace actions/setup-ruby with ruby/setup-ruby

2022-06-16 Thread Jacob Wujciak-Jens (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Wujciak-Jens closed ARROW-16840.
--
Resolution: Duplicate

> [CI] replace actions/setup-ruby with ruby/setup-ruby
> 
>
> Key: ARROW-16840
> URL: https://issues.apache.org/jira/browse/ARROW-16840
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Blocker
> Fix For: 9.0.0
>
>
> [actions/setup-ruby|https://github.com/actions/setup-ruby] is deprecated and 
> should no longer be used. [ruby/setup-ruby|https://github.com/ruby/setup-ruby] 
> is the maintained version.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Deleted] (ARROW-16398) [R] [Doc] Write a blogpost for the lubridate functionality

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina deleted ARROW-16398:
--


> [R] [Doc] Write a blogpost for the lubridate functionality 
> ---
>
> Key: ARROW-16398
> URL: https://issues.apache.org/jira/browse/ARROW-16398
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation, R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>
> Write a blogpost on the lubridate functionality currently available in arrow.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16319) [R] [Docs] Document the lubridate functions we support in {arrow}

2022-06-16 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-16319:
--

Assignee: Stephanie Hazlitt  (was: Dragoș Moldovan-Grünfeld)

> [R] [Docs] Document the lubridate functions we support in {arrow}
> -
>
> Key: ARROW-16319
> URL: https://issues.apache.org/jira/browse/ARROW-16319
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Stephanie Hazlitt
>Priority: Major
> Fix For: 9.0.0
>
>
> Add documentation around the {{lubridate}} functionality supported in 
> {{arrow}}. Could be made up of:
> * a blogpost 
> * a more in-depth piece of documentation



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-15807) [R] Update as.Date() to support direct casting from double/float

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-15807:
--
Parent Issue: ARROW-16841  (was: ARROW-15805)

> [R] Update as.Date() to support direct casting from double/float
> 
>
> Key: ARROW-15807
> URL: https://issues.apache.org/jira/browse/ARROW-15807
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> This might need support from C++ as, at present, it is not possible to cast 
> directly from a double. Currently, we are flooring the double to an integer 
> as an intermediate step. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (ARROW-16418) [R] Refactor the difftime() and as.diffime() bindings

2022-06-16 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-16418.
--
Resolution: Won't Fix

> [R] Refactor the difftime() and as.diffime() bindings 
> --
>
> Key: ARROW-16418
> URL: https://issues.apache.org/jira/browse/ARROW-16418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>
> ARROW-16060 is solved and these 2 functions have high cyclomatic complexity



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-15804) [R] Update as.Date() to support several tryFormats

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-15804:
--
Parent Issue: ARROW-15805  (was: ARROW-16841)

> [R] Update as.Date() to support several tryFormats
> --
>
> Key: ARROW-15804
> URL: https://issues.apache.org/jira/browse/ARROW-15804
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> We could use {{coalesce()}} to cycle through several {{formats}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-15804) [R] Update as.Date() to support several tryFormats

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-15804:
--
Parent Issue: ARROW-16841  (was: ARROW-15805)

> [R] Update as.Date() to support several tryFormats
> --
>
> Key: ARROW-15804
> URL: https://issues.apache.org/jira/browse/ARROW-15804
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> We could use {{coalesce()}} to cycle through several {{formats}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-14819) [R] Binding for lubridate::qday

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14819:
--
Parent: ARROW-16841
Issue Type: Sub-task  (was: Improvement)

> [R] Binding for lubridate::qday
> ---
>
> Key: ARROW-14819
> URL: https://issues.apache.org/jira/browse/ARROW-14819
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: good-first-issue
>
> This can be implemented via use of floor_date etc - see the original 
> lubridate implementation



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (ARROW-14820) [R] Implement bindings for lubridate calculation functions

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina closed ARROW-14820.
-
Resolution: Invalid

> [R] Implement bindings for lubridate calculation functions
> --
>
> Key: ARROW-14820
> URL: https://issues.apache.org/jira/browse/ARROW-14820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
> Fix For: 9.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-14821) [R] Implement bindings for lubridate's floor_date, ceiling_date, and round_date

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14821:
--
Parent Issue: ARROW-16841  (was: ARROW-14820)

> [R] Implement bindings for lubridate's floor_date, ceiling_date, and 
> round_date
> ---
>
> Key: ARROW-14821
> URL: https://issues.apache.org/jira/browse/ARROW-14821
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Danielle Navarro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 25h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16803) [R][CI] Fix caching for R mingw build

2022-06-16 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-16803.

Resolution: Fixed

Issue resolved by pull request 13379
[https://github.com/apache/arrow/pull/13379]

> [R][CI] Fix caching for R mingw build
> -
>
> Key: ARROW-16803
> URL: https://issues.apache.org/jira/browse/ARROW-16803
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, R
>Affects Versions: 8.0.0
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Caching is not working for the libarrow mingw builds due to overlapping cache 
> keys: 
> https://github.com/apache/arrow/runs/6819961123?check_suite_focus=true#step:21:2



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Deleted] (ARROW-16842) Lubridate features for version 10.0.0

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina deleted ARROW-16842:
--


> Lubridate features for version 10.0.0
> -
>
> Key: ARROW-16842
> URL: https://issues.apache.org/jira/browse/ARROW-16842
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Alessandro Molina
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-14948) [R] Implement lubridate's %m+%, %m-%, add_with_rollback, and addition and subtraction with timestamp

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14948:
--
Parent Issue: ARROW-16841  (was: ARROW-16842)

> [R] Implement lubridate's %m+%, %m-%, add_with_rollback, and addition and 
> subtraction with timestamp
> 
>
> Key: ARROW-14948
> URL: https://issues.apache.org/jira/browse/ARROW-14948
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Reopened] (ARROW-14821) [R] Implement bindings for lubridate's floor_date, ceiling_date, and round_date

2022-06-16 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reopened ARROW-14821:
-

Still has open second PR: [https://github.com/apache/arrow/pull/12154] so 
reopening.

> [R] Implement bindings for lubridate's floor_date, ceiling_date, and 
> round_date
> ---
>
> Key: ARROW-14821
> URL: https://issues.apache.org/jira/browse/ARROW-14821
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Danielle Navarro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 25h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-14948) [R] Implement lubridate's %m+%, %m-%, add_with_rollback, and addition and subtraction with timestamp

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14948:
--
Parent: ARROW-16842
Issue Type: Sub-task  (was: Improvement)

> [R] Implement lubridate's %m+%, %m-%, add_with_rollback, and addition and 
> subtraction with timestamp
> 
>
> Key: ARROW-14948
> URL: https://issues.apache.org/jira/browse/ARROW-14948
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16842) Lubridate features for version 10.0.0

2022-06-16 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-16842:
-

 Summary: Lubridate features for version 10.0.0
 Key: ARROW-16842
 URL: https://issues.apache.org/jira/browse/ARROW-16842
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Alessandro Molina






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16841) [R] Additional Lubridate Capabilities

2022-06-16 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16841:
---
Issue Type: Wish  (was: Bug)

> [R] Additional Lubridate Capabilities
> -
>
> Key: ARROW-16841
> URL: https://issues.apache.org/jira/browse/ARROW-16841
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, R
>Affects Versions: 9.0.0
>Reporter: Alessandro Molina
>Priority: Major
>
> Umbrella Ticket for the remaining lubridate work.
> This is functionality that we have scoped, but we have decided to wait to 
> implement until it is requested by someone proactively.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16841) [R] Additional Lubridate Capabilities

2022-06-16 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16841:
---
Description: 
Umbrella Ticket for the remaining lubridate work.

This is functionality that we have scoped, but we have decided to wait to 
implement until it is requested by someone proactively.

  was:
Umbrella Ticket for the remaining lubridate work.

Most of the work here will be triggered by explicit user requests


> [R] Additional Lubridate Capabilities
> -
>
> Key: ARROW-16841
> URL: https://issues.apache.org/jira/browse/ARROW-16841
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 9.0.0
>Reporter: Alessandro Molina
>Priority: Major
>
> Umbrella Ticket for the remaining lubridate work.
> This is functionality that we have scoped, but we have decided to wait to 
> implement until it is requested by someone proactively.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16841) [R] Additional Lubridate Capabilities

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16841:
--
Description: 
Umbrella Ticket for the remaining lubridate work.

Most of the work here will be triggered by explicit user requests

  was:Umbrella Ticket for the remaining lubridate work


> [R] Additional Lubridate Capabilities
> -
>
> Key: ARROW-16841
> URL: https://issues.apache.org/jira/browse/ARROW-16841
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 9.0.0
>Reporter: Alessandro Molina
>Priority: Major
>
> Umbrella Ticket for the remaining lubridate work.
> Most of the work here will be triggered by explicit user requests



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16155) [R] lubridate functions for 9.0.0

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16155:
--
Description: 
Umbrella ticket for lubridate functions in 9.0.0

Future work that is not going to happen in v9 is recorded under 
https://issues.apache.org/jira/browse/ARROW-16841

  was:Umbrella ticket for lubridate functions in 9.0.0


> [R] lubridate functions for 9.0.0
> -
>
> Key: ARROW-16155
> URL: https://issues.apache.org/jira/browse/ARROW-16155
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Alessandro Molina
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>
> Umbrella ticket for lubridate functions in 9.0.0
> Future work that is not going to happen in v9 is recorded under 
> https://issues.apache.org/jira/browse/ARROW-16841



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16440) [R] Implement bindings for lubridate's parse_date_time2

2022-06-16 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555088#comment-17555088
 ] 

Jonathan Keane commented on ARROW-16440:


What's special about `parse_date_time2()` compared to `parse_date_time()`?

> [R] Implement bindings for lubridate's parse_date_time2
> ---
>
> Key: ARROW-16440
> URL: https://issues.apache.org/jira/browse/ARROW-16440
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>
> Split from ARROW-14848



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16653) [R] All formats are supported with the lubridate `parse_date_time` binding

2022-06-16 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555086#comment-17555086
 ] 

Jonathan Keane commented on ARROW-16653:


What formats do we currently not support?

> [R] All formats are supported with the lubridate `parse_date_time` binding
> --
>
> Key: ARROW-16653
> URL: https://issues.apache.org/jira/browse/ARROW-16653
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 8.0.1
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Critical
> Fix For: 9.0.0
>
>
> Ensure:
> - all formats supported and tested



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16440) [R] Implement bindings for lubridate's parse_date_time2

2022-06-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16440:
--
Parent Issue: ARROW-16841  (was: ARROW-14847)

> [R] Implement bindings for lubridate's parse_date_time2
> ---
>
> Key: ARROW-16440
> URL: https://issues.apache.org/jira/browse/ARROW-16440
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>
> Split from ARROW-14848



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16841) [R] Additional Lubridate Capabilities

2022-06-16 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-16841:
-

 Summary: [R] Additional Lubridate Capabilities
 Key: ARROW-16841
 URL: https://issues.apache.org/jira/browse/ARROW-16841
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, R
Affects Versions: 9.0.0
Reporter: Alessandro Molina


Umbrella Ticket for the remaining lubridate work



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16840) [CI] replace actions/setup-ruby with ruby/setup-ruby

2022-06-16 Thread Jacob Wujciak-Jens (Jira)
Jacob Wujciak-Jens created ARROW-16840:
--

 Summary: [CI] replace actions/setup-ruby with ruby/setup-ruby
 Key: ARROW-16840
 URL: https://issues.apache.org/jira/browse/ARROW-16840
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Jacob Wujciak-Jens
Assignee: Jacob Wujciak-Jens
 Fix For: 9.0.0


[actions/setup-ruby|https://github.com/actions/setup-ruby] is deprecated and 
should no longer be used. [ruby/setup-ruby|https://github.com/ruby/setup-ruby] 
is the maintained version.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16706) [Python] Expose RankOptions

2022-06-16 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-16706.

Resolution: Fixed

Issue resolved by pull request 13327
[https://github.com/apache/arrow/pull/13327]

> [Python] Expose RankOptions
> ---
>
> Key: ARROW-16706
> URL: https://issues.apache.org/jira/browse/ARROW-16706
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Raúl Cumplido
>Priority: Critical
>  Labels: good-first-issue, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Followup to ARROW-16234



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16521) [C++][R][Python] Configure curl timeout policy for S3

2022-06-16 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-16521:
--
Summary: [C++][R][Python] Configure curl timeout policy for S3  (was: 
[C++][R] Configure curl timeout policy for S3)

> [C++][R][Python] Configure curl timeout policy for S3
> -
>
> Key: ARROW-16521
> URL: https://issues.apache.org/jira/browse/ARROW-16521
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python, R
>Affects Versions: 7.0.0
>Reporter: Carl Boettiger
>Assignee: Ziheng Wang
>Priority: Major
>  Labels: good-first-issue, good-second-issue, 
> pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Is it possible for the user to increase the timeout allowed on the curl 
> settings when accessing S3 records?  The default setting appears to be more 
> aggressive than most other S3 clients I use, which means that I see a lot 
> more failures on arrow-based operations than the other clients see.  I'm not 
> seeing how this can be increased though?
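
(For what it's worth, a sketch of the timeout knobs that newer pyarrow builds 
expose on S3FileSystem; whether connect_timeout/request_timeout are available 
and sufficient depends on the installed version, and the bucket/region below 
are only illustrative.)

{code:python}
import pyarrow.dataset as ds
from pyarrow import fs

# Raise the S3 client timeouts (in seconds) before handing the filesystem to a dataset.
s3 = fs.S3FileSystem(region="us-east-1", connect_timeout=30, request_timeout=60)
dataset = ds.dataset("my-bucket/path/to/data", format="parquet", filesystem=s3)
{code}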



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-16823) [C++] Arrow Substrait enhancements for UDF

2022-06-16 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554982#comment-17554982
 ] 

Yaron Gvili edited comment on ARROW-16823 at 6/16/22 9:26 AM:
--

[~vibhatha], before I address your points, I think it would help if I first 
write up my view of how nested registries would be used, in general and in the 
context of UDFs.

In general, a nested registry is created and passed to a new scope which is 
free to modify it without affecting its parent registries. This can be thought 
of as passing-by-value, as long as parent registries remain constant while the 
new scope is alive, and indeed this is the recommended way of using nested 
registries. With this way of use, registry nesting has the following desirable 
properties:
 # Value-semantics: modifications are restricted to the passed "value".
 # Recursive: repeated nesting works as expected.
 # Thread-safety: a nested registry can be safely passed to a thread.

In the context of UDFs, a nested registry is created for temporarily 
registering UDFs for the lifetime of a separate scope in which they will be 
used. In a typical use case, this scope is for deserialization and execution of 
a Substrait plan. In this use case, one creates nested (function and 
extension-id) registries and uses them to deserialize a Substrait plan, 
register UDFs for this plan, and execute the plan, then drops the nested 
registries.

It is no accident that the above properties make nested registries powerful 
enough to cleanly support much more complex future use cases. I envision 
modular Substrait plans:
 * a Substrait plan can be shared (from author to its users)
 * shared Substrait plans can be gathered in libraries/modules
 * a Substrait plan can include invocations of other shared Substrait plans

and that they will become important for boosting user productivity with Arrow.

While this is my long-term vision, the current issue is about preparation for 
upcoming end-to-end Ibis/Ibis-Substrait/PyArrow support for Python-UDFs that 
I'm currently working on.

Now to your points.

> I think for general usage of UDFs we could also keep a temporary registry 
> which is in the scope of the application and it get destroyed when the 
> application ends it's life.

A single registry for UDFs would go against the design goal of modularity. It 
would require support for unregistration, which is error-prone. See also the 
discussion in ARROW-16211.

> Thinking about a simple example to reflect the usage.

This is actually an example more complex than the 
single-Substrait-plan-with-UDFs one that I described above.

> Visually GFR->TF1, GFR->TF2 or GFR->TF1->TF2 right?

I think the right organization for your example is that each nested registry 
has the global one as its parent. Each of the 3 stages has its own set of UDFs 
to register.

> What if TF1 destroyed, that means TF2 get detached from the GFR, are we going 
> to correct that relationship when we remove TF1. Are we planning to handle 
> this or is this irrelevant?

When following the recommended way of using nested registries that I described 
above, even in a case of repeated nesting like GFR->TF1->TF2, it is incorrect 
to even modify, let alone drop, TF1 while TF2 is alive.

> Considering the practical usage, I assume what should happen is, when I ask 
> for function `f1` to be called, it should scan through the global, then go 
> level by level on the scoped and retrieve the function once located. Is this 
> right?

It's the other way around. In the case of GFR->TF1->TF2, the function is first 
looked up in TF2, then in TF1, and finally in GFR. This way, modifications to 
TF2 take precedence, which is what one expects from value-semantics.
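
To make the lookup order concrete, here is a minimal self-contained sketch 
(illustration only, not Arrow's actual FunctionRegistry API; all names are 
made up) of a child-first lookup over a GFR->TF1->TF2 chain:

{code:cpp}
#include <iostream>
#include <map>
#include <optional>
#include <string>

// Toy registry with a parent pointer; NOT Arrow's FunctionRegistry.
class ToyRegistry {
 public:
  explicit ToyRegistry(const ToyRegistry* parent = nullptr) : parent_(parent) {}

  // Register (or shadow) a function in this registry only; parents are untouched.
  void Register(const std::string& name, const std::string& impl) {
    functions_[name] = impl;
  }

  // Child-first lookup: check this registry, then walk up the parent chain.
  std::optional<std::string> Lookup(const std::string& name) const {
    auto it = functions_.find(name);
    if (it != functions_.end()) return it->second;
    if (parent_ != nullptr) return parent_->Lookup(name);
    return std::nullopt;
  }

 private:
  const ToyRegistry* parent_;
  std::map<std::string, std::string> functions_;
};

int main() {
  ToyRegistry gfr;                        // "global" function registry
  gfr.Register("add", "gfr::add");

  ToyRegistry tf1(&gfr);                  // GFR -> TF1
  tf1.Register("my_udf", "tf1::my_udf");

  ToyRegistry tf2(&tf1);                  // GFR -> TF1 -> TF2
  tf2.Register("my_udf", "tf2::my_udf");  // shadows TF1's registration

  std::cout << *tf2.Lookup("my_udf") << "\n";  // prints tf2::my_udf (TF2 wins)
  std::cout << *tf2.Lookup("add") << "\n";     // prints gfr::add (falls back to GFR)
  // Note: gfr and tf1 are never modified by tf2's registrations.
  return 0;
}
{code}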

>  For Python UDF users or R UDF users, do we have to do anything special where 
> we expose the FunctionRegistry (I guess we don't have to, but curious)...

Eventually, the end-user should typically just invoke a single function to 
execute a Substrait plan. If the Substrait plan has UDFs, their registration 
into fresh nested registries will be automated (I have this locally worked out 
for Python-UDFs, and presumably R-UDFs should work out similarly). The 
facilities we discuss here are for developers and should eventually be hidden 
from the end-user.

> In addition, I have this general question, depending on the usage, should we 
> keep a separate temporary function registry for Substrait UDF users, plain 
> UDF users (directly using Arrow), in future there could be similar cases 
> where we need to support...

As described above, the recommended way is to create nested registries for a 
scope, not for a class-of-use (like Substrait-UDF-use and plain-UDF-use).

> Diving a little deep into the parallel case, we are going to have separate 
> scoped registry for each instance. I would say that is efficient for 
> communication and there is no sync issues. May be the intended use is 
> multiple plans with non-overlapping functions? ...



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Assigned] (ARROW-9392) [C++] Document more of the compute layer

2022-06-16 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9392:
-

Assignee: (was: Eduardo Ponce)

> [C++] Document more of the compute layer
> 
>
> Key: ARROW-9392
> URL: https://issues.apache.org/jira/browse/ARROW-9392
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Antoine Pitrou
>Priority: Major
>
> Ideally, we should add:
> * a description and examples of how to call compute functions
> * an API reference for concrete C++ functions such as {{Cast}}, 
> {{NthToIndices}}, etc.
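
As a rough illustration of the kind of example such documentation could 
include (a sketch only; exact headers and option-passing details may differ 
across Arrow versions), calling compute functions by name through 
{{arrow::compute::CallFunction}} might look like:

{code:cpp}
#include <iostream>
#include <memory>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/api.h>

arrow::Status RunExample() {
  // Build a small Int64 array: [1, 2, 3, 4].
  arrow::Int64Builder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 3, 4}));
  std::shared_ptr<arrow::Array> values;
  ARROW_RETURN_NOT_OK(builder.Finish(&values));

  // Call an aggregate compute function by name through the generic entry point.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum sum,
                        arrow::compute::CallFunction("sum", {values}));
  std::cout << "sum = " << sum.scalar()->ToString() << std::endl;

  // Cast the same array to float64, again by name, passing FunctionOptions.
  arrow::compute::CastOptions cast_options =
      arrow::compute::CastOptions::Safe(arrow::float64());
  ARROW_ASSIGN_OR_RAISE(
      arrow::Datum as_double,
      arrow::compute::CallFunction("cast", {values}, &cast_options));
  std::cout << "cast = " << as_double.make_array()->ToString() << std::endl;
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = RunExample();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}
{code}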



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-9392) [C++] Document more of the compute layer

2022-06-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554978#comment-17554978
 ] 

Antoine Pitrou commented on ARROW-9392:
---

Ok, I'm gonna close for now. Thanks for the heads-up [~octalene]

> [C++] Document more of the compute layer
> 
>
> Key: ARROW-9392
> URL: https://issues.apache.org/jira/browse/ARROW-9392
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Antoine Pitrou
>Assignee: Eduardo Ponce
>Priority: Major
>
> Ideally, we should add:
> * a description and examples of how to call compute functions
> * an API reference for concrete C++ functions such as {{Cast}}, 
> {{NthToIndices}}, etc.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (ARROW-9392) [C++] Document more of the compute layer

2022-06-16 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-9392.
-
Resolution: Done

> [C++] Document more of the compute layer
> 
>
> Key: ARROW-9392
> URL: https://issues.apache.org/jira/browse/ARROW-9392
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Antoine Pitrou
>Priority: Major
>
> Ideally, we should add:
> * a description and examples of how to call compute functions
> * an API reference for concrete C++ functions such as {{Cast}}, 
> {{NthToIndices}}, etc.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16717) [C++] Enable Arrow to share an application's jemalloc instance

2022-06-16 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-16717.

Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13373
[https://github.com/apache/arrow/pull/13373]

> [C++] Enable Arrow to share an application's jemalloc instance
> --
>
> Key: ARROW-16717
> URL: https://issues.apache.org/jira/browse/ARROW-16717
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Ian Cook
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Is there any good way for an application that uses Arrow to tell Arrow to 
> share its own jemalloc instance?
> Normally when Arrow uses jemalloc, it creates its own instance of it. The 
> only method that I think would work to get Arrow to use the same jemalloc 
> instance as the application would be to do something hacky like this:
>  * Patch {{cpp/src/arrow/memory_pool.cc}} to use the application's jemalloc 
> header
>  * Configure the application's jemalloc instance to use the same 
> configuration that Arrow uses, as shown in {{je_arrow_malloc_conf}}
>  * Patch {{ThirdpartyToolchain.cmake}} to use the application's jemalloc 
> instance
> Is there any non-hacky way to achieve this? If not, would it be possible to 
> add a feature to Arrow to enable the user to tell it to use a specific 
> jemalloc instance at runtime?
> The benefit of this would be that a unified jemalloc instance could (at least 
> hypothetically) allocate memory more efficiently than two separate ones 
> running on the same machine.
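
A rough sketch of the second bullet above (aligning the application's jemalloc 
configuration with Arrow's), assuming an unprefixed jemalloc build; the option 
string is a stand-in, and the real values would need to be copied from 
{{je_arrow_malloc_conf}} in {{cpp/src/arrow/memory_pool.cc}} for the Arrow 
version in use:

{code:cpp}
#include <cstddef>
#include <cstdio>
#include <cstdlib>

#include <jemalloc/jemalloc.h>

// jemalloc reads this global at startup (see jemalloc's MALLOC_CONF docs).
// The options below are illustrative placeholders, not Arrow's actual settings.
extern "C" {
const char* malloc_conf = "background_thread:true,dirty_decay_ms:1000";
}

int main() {
  // Sanity check: read one option back through mallctl.
  bool background_thread = false;
  std::size_t sz = sizeof(background_thread);
  if (mallctl("opt.background_thread", &background_thread, &sz, nullptr, 0) == 0) {
    std::printf("opt.background_thread = %s\n", background_thread ? "true" : "false");
  }
  void* p = std::malloc(64);  // served by the application's configured jemalloc
  std::free(p);
  return 0;
}
{code}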



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16837) [C++] Investigate performance regressions observed in Unique, VisitArraySpanInline

2022-06-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554968#comment-17554968
 ] 

Antoine Pitrou commented on ARROW-16837:


Note that some benchmarks produce unstable numbers, so this might just be a 
false positive.

> [C++] Investigate performance regressions observed in Unique, 
> VisitArraySpanInline
> --
>
> Key: ARROW-16837
> URL: https://issues.apache.org/jira/browse/ARROW-16837
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 9.0.0
>
>
> See discussion in https://github.com/apache/arrow/pull/13364



--
This message was sent by Atlassian Jira
(v8.20.7#820007)