[jira] [Updated] (ARROW-18253) [C++][Parquet] Improve bounds checking on some inputs
[ https://issues.apache.org/jira/browse/ARROW-18253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18253: --- Labels: pull-request-available (was: ) > [C++][Parquet] Improve bounds checking on some inputs > - > > Key: ARROW-18253 > URL: https://issues.apache.org/jira/browse/ARROW-18253 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In some cases we don't check for a lower bound of 0; on some non-performance-critical > paths we only have DCHECKs; and, while unlikely, in some cases we cast > from size_t to int32, which can overflow. Adding some safety checks here would > be useful. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18253) [C++][Parquet] Improve bounds checking on some inputs
Micah Kornfield created ARROW-18253: --- Summary: [C++][Parquet] Improve bounds checking on some inputs Key: ARROW-18253 URL: https://issues.apache.org/jira/browse/ARROW-18253 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Micah Kornfield Assignee: Micah Kornfield In some cases we don't check for a lower bound of 0; on some non-performance-critical paths we only have DCHECKs; and, while unlikely, in some cases we cast from size_t to int32, which can overflow. Adding some safety checks here would be useful. -- This message was sent by Atlassian Jira (v8.20.10#820010)
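The kind of check the issue proposes can be illustrated with a short sketch (Python used only for illustration; `checked_int32_cast` is a hypothetical helper, not the Arrow C++ code):

```python
INT32_MAX = 2**31 - 1

def checked_int32_cast(n):
    """Narrow a size to int32, raising instead of silently overflowing.

    Mirrors the safety checks the issue describes: reject negative sizes
    (the missing lower bound of 0) and values that do not fit in a signed
    32-bit integer (the size_t -> int32 overflow case).
    """
    if n < 0:
        raise ValueError(f"size must be non-negative, got {n}")
    if n > INT32_MAX:
        raise OverflowError(f"{n} does not fit in int32")
    return n
```

The point is that the check runs unconditionally, unlike a DCHECK, which is compiled out of release builds.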
[jira] [Resolved] (ARROW-18248) [CI][Release] Use GitHub token to avoid API rate limit
[ https://issues.apache.org/jira/browse/ARROW-18248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-18248. -- Fix Version/s: 11.0.0 Resolution: Fixed Issue resolved by pull request 14588 [https://github.com/apache/arrow/pull/14588] > [CI][Release] Use GitHub token to avoid API rate limit > -- > > Key: ARROW-18248 > URL: https://issues.apache.org/jira/browse/ARROW-18248 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > e.g.: > https://github.com/apache/arrow/actions/runs/3387695588/jobs/5628769268#step:7:25 > {noformat} > Error: test_vote(SourceTest): OpenURI::HTTPError: 403 rate limit exceeded > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17820) [C++] Implement arithmetic kernels on List(number)
[ https://issues.apache.org/jira/browse/ARROW-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629225#comment-17629225 ] Weston Pace edited comment on ARROW-17820 at 11/4/22 11:24 PM: --- {quote} Such an approach doesn't really fit our kernels / Acero, I think. One option could be to have a generic kernel to "map" another kernel on the list values. where you can pass the function name you want to apply, and a FunctionOptions object matching the kernel. Would something like this be possible technically? {quote} Yes, I think that should be possible for unary kernels. Though I think mapping a single kernel (as opposed to a single expression) might be a bit limiting, though maybe it isn't so bad. For example, what if a user wants to do something like {{map(lambda f: f.upper() * 2, ["a", "b", "c"])}} Another thing is that it should be valid to use n-ary functions too provided the other arguments are scalars. This discussion has come up in Substrait with respect to lambdas (https://github.com/substrait-io/substrait/issues/349). Perhaps the "map function" for {{List}} could be an expression bound to a schema of "\{item: T\}" (e.g. so you could do {{field_ref(0)}} or {{field_ref("item")}}). Though if the map function is an expression then a kernel would have to execute an entire expression which may or may not be doable (I've reached the limit of my imagination for a Friday :) was (Author: westonpace): {quote} Such an approach doesn't really fit our kernels / Acero, I think. One option could be to have a generic kernel to "map" another kernel on the list values. where you can pass the function name you want to apply, and a FunctionOptions object matching the kernel. Would something like this be possible technically? {quote} Yes, I think that should be possible for unary kernels. Though I think mapping a single kernel (as opposed to a single expression) might be a bit limiting, though maybe it isn't so bad. 
For example, what if a user wants to do something like {{map(lambda f: f.upper() * 2, ["a", "b", "c"])}} Another thing is that it should be valid to use n-ary functions too provided the other arguments are scalars. This discussion has come up in Substrait with respect to lambdas (https://github.com/substrait-io/substrait/issues/349). Perhaps the "map function" for {{List}} could be an expression bound to a schema of "{item: T}" (e.g. so you could do {{field_ref(0)}} or {{field_ref("item")}}). Though if the map function is an expression then a kernel would have to execute an entire expression which may or may not be doable (I've reached the limit of my imagination for a Friday :) > [C++] Implement arithmetic kernels on List(number) > -- > > Key: ARROW-17820 > URL: https://issues.apache.org/jira/browse/ARROW-17820 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Adam Lippai >Priority: Major > Labels: kernel, query-engine > > eg. rounding in list(float64()), similar to a map or foreach -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17820) [C++] Implement arithmetic kernels on List(number)
[ https://issues.apache.org/jira/browse/ARROW-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629225#comment-17629225 ] Weston Pace commented on ARROW-17820: - {quote} Such an approach doesn't really fit our kernels / Acero, I think. One option could be to have a generic kernel to "map" another kernel on the list values. where you can pass the function name you want to apply, and a FunctionOptions object matching the kernel. Would something like this be possible technically? {quote} Yes, I think that should be possible for unary kernels. Though I think mapping a single kernel (as opposed to a single expression) might be a bit limiting, though maybe it isn't so bad. For example, what if a user wants to do something like {{map(lambda f: f.upper() * 2, ["a", "b", "c"])}} Another thing is that it should be valid to use n-ary functions too provided the other arguments are scalars. This discussion has come up in Substrait with respect to lambdas (https://github.com/substrait-io/substrait/issues/349). Perhaps the "map function" for {{List}} could be an expression bound to a schema of "{item: T}" (e.g. so you could do {{field_ref(0)}} or {{field_ref("item")}}). Though if the map function is an expression then a kernel would have to execute an entire expression which may or may not be doable (I've reached the limit of my imagination for a Friday :) > [C++] Implement arithmetic kernels on List(number) > -- > > Key: ARROW-17820 > URL: https://issues.apache.org/jira/browse/ARROW-17820 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Adam Lippai >Priority: Major > Labels: kernel, query-engine > > eg. rounding in list(float64()), similar to a map or foreach -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-18207) [Ruby] RubyGems for 10.0.0 aren't updated yet
[ https://issues.apache.org/jira/browse/ARROW-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-18207. -- Fix Version/s: 11.0.0 Resolution: Fixed Published. > [Ruby] RubyGems for 10.0.0 aren't updated yet > - > > Key: ARROW-18207 > URL: https://issues.apache.org/jira/browse/ARROW-18207 > Project: Apache Arrow > Issue Type: Bug > Components: Ruby >Affects Versions: 10.0.0 >Reporter: Noah Horton >Assignee: Kouhei Sutou >Priority: Major > Fix For: 11.0.0 > > > 10.0.0 just released, meaning that all install scripts that use the > 'latest' tag are getting it. > Yet rubygems.org is still running with the 9.0.0 version a week after 10.0.0 was > released. > The build scripts need to start updating rubygems.org automatically, or guide > users to a bundler config like > {code:ruby} > gem "red-arrow", github: "apache/arrow", glob: "ruby/red-arrow/*.gemspec", > require: "arrow", tag: 'apache-arrow-10.0.0' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-18016) [CI] Add sccache to r jobs
[ https://issues.apache.org/jira/browse/ARROW-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-18016. -- Resolution: Fixed Issue resolved by pull request 14570 [https://github.com/apache/arrow/pull/14570] > [CI] Add sccache to r jobs > -- > > Key: ARROW-18016 > URL: https://issues.apache.org/jira/browse/ARROW-18016 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Jacob Wujciak-Jens >Assignee: Jacob Wujciak-Jens >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 7h 40m > Remaining Estimate: 0h > > Building on the work in ARROW-17021 we can now activate sccache on more > builds and save more time! To keep the PRs reviewable I have reduced this to > only R jobs and will open follow-ups for the other tasks. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629192#comment-17629192 ] Arthur Passos commented on ARROW-17459: --- Hi [~willjones127]. I have implemented your suggestion of GetRecordBatchReader and, at first, things seemed to work as expected. Recently, an issue regarding parquet data has been reported, and reverting to the ReadRowGroup solution seems to address it. This might be a misuse of the arrow library on my side, even though I have read the API docs and it looks correct. My question is pretty much: should there be a difference in the output when using the two APIs? > [C++] Support nested data conversions for chunked array > --- > > Key: ARROW-17459 > URL: https://issues.apache.org/jira/browse/ARROW-17459 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Arthur Passos >Assignee: Arthur Passos >Priority: Blocker > > `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not > implemented for chunked array outputs". It fails on > [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95]) > Data schema is: > {code:java} > optional group fields_map (MAP) = 217 { > repeated group key_value { > required binary key (STRING) = 218; > optional binary value (STRING) = 219; > } > } > fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047 > fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963 > {code} > Is there a way to work around this issue in the cpp lib? > In any case, I am willing to implement this, but I need some guidance. I am > very new to parquet (as in started reading about it yesterday). > > Probably related to: https://issues.apache.org/jira/browse/ARROW-10958 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17487) [Python][Packaging] 3.11 wheels
[ https://issues.apache.org/jira/browse/ARROW-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-17487. -- Fix Version/s: 11.0.0 Resolution: Fixed Issue resolved by pull request 14499 [https://github.com/apache/arrow/pull/14499] > [Python][Packaging] 3.11 wheels > --- > > Key: ARROW-17487 > URL: https://issues.apache.org/jira/browse/ARROW-17487 > Project: Apache Arrow > Issue Type: Wish > Components: Packaging, Python >Reporter: Saugat Pachhai >Assignee: Raúl Cumplido >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 12h 50m > Remaining Estimate: 0h > > Hi. > Could you please provide wheels with Python 3.11 support? > Python 3.11.0-rc1 was released: > [https://www.python.org/downloads/release/python-3110rc1/.|https://www.python.org/downloads/release/python-3110rc1/], > and from this release onward, there won't be any ABI changes. > > There will be no ABI changes from this point forward in the 3.11 series and > > the goal is that there will be as few code changes as possible. > So, I think it should be safe to release wheel. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17820) [C++] Implement arithmetic kernels on List(number)
[ https://issues.apache.org/jira/browse/ARROW-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-17820: -- Summary: [C++] Implement arithmetic kernels on List(number) (was: Implement arithmetic kernels on List(number)) > [C++] Implement arithmetic kernels on List(number) > -- > > Key: ARROW-17820 > URL: https://issues.apache.org/jira/browse/ARROW-17820 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Adam Lippai >Priority: Major > Labels: kernel, query-engine > > eg. rounding in list(float64()), similar to a map or foreach -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17820) Implement arithmetic kernels on List(number)
[ https://issues.apache.org/jira/browse/ARROW-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629186#comment-17629186 ] Joris Van den Bossche commented on ARROW-17820: --- It would be nice if we had a way for all unary scalar kernels to be applied to list arrays (indeed by being applied to the single child array of flat values). I think in SQL one could do this with a subquery with unnesting and aggregating again (eg https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#creating_arrays_from_subqueries, although that example is actually not a unary kernel but a binary one). Such an approach doesn't really fit our kernels / Acero, I think. One option could be to have a generic kernel to "map" another kernel on the list values. Like {code} list_map_function(list_array, "kernel_name", FunctionOptions) {code} where you can pass the function name you want to apply, and a FunctionOptions object matching the kernel. Would something like this be possible technically? Another option could be to directly register list type for unary kernels? (in many cases there is no ambiguity: we would expect the function to be applied to each value in the list, rather than to each list. For example, for {{round(list)}} or {{ascii_lower(list)}}) > Implement arithmetic kernels on List(number) > > > Key: ARROW-17820 > URL: https://issues.apache.org/jira/browse/ARROW-17820 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Adam Lippai >Priority: Major > Labels: kernel, query-engine > > eg. rounding in list(float64()), similar to a map or foreach -- This message was sent by Atlassian Jira (v8.20.10#820010)
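The proposed map-over-list-values behaviour can be pinned down with a plain-Python sketch (a hypothetical pure-Python stand-in, not an Acero kernel; here the "kernel" is simply a Python callable, and lists are Python lists with `None` for nulls):

```python
def list_map_function(list_array, fn):
    """Apply a unary function to every value of every list.

    Nulls are propagated at both nesting levels: a null list stays a
    null list, and a null value inside a list stays a null value.
    """
    return [
        None if lst is None else [None if v is None else fn(v) for v in lst]
        for lst in list_array
    ]

# e.g. rounding inside a list-of-float column, as the issue asks for
data = [[1.26, 2.51], None, [3.74, None]]
rounded = list_map_function(data, round)  # -> [[1, 3], None, [4, None]]
```

A real kernel would operate on the flat child array of values and reuse the list offsets and validity bitmaps, rather than iterating per element.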
[jira] [Updated] (ARROW-17820) Implement arithmetic kernels on List(number)
[ https://issues.apache.org/jira/browse/ARROW-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-17820: -- Labels: kernel query-engine (was: ) > Implement arithmetic kernels on List(number) > > > Key: ARROW-17820 > URL: https://issues.apache.org/jira/browse/ARROW-17820 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Adam Lippai >Priority: Major > Labels: kernel, query-engine > > eg. rounding in list(float64()), similar to a map or foreach -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18252) Add Acero test to ensure names in the root relation of Substrait plans are retained
Bryce Mecum created ARROW-18252: --- Summary: Add Acero test to ensure names in the root relation of Substrait plans are retained Key: ARROW-18252 URL: https://issues.apache.org/jira/browse/ARROW-18252 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Bryce Mecum Currently, Acero retains the names in the root relation when it executes a Substrait plan, but there isn't a unit test in place to prevent a future regression. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments
[ https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-18246: --- Component/s: Documentation > [Python][Docs] PyArrow table join docstring typos for left and right suffix > arguments > - > > Key: ARROW-18246 > URL: https://issues.apache.org/jira/browse/ARROW-18246 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation >Reporter: d33bs >Assignee: Will Jones >Priority: Minor > Labels: docs-impacting, documentation, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Hello, thank you for all the amazing work on Arrow! I'd like to report a > potential issue with PyArrow's Table Join docstring which may make it > confusing for others to read. This content is I believe translated into the > documentation website as well. > The content which needs to be corrected may be found starting at: > [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737] > The block currently reads: > {code:java} > left_suffix : str, default None > Which suffix to add to right column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffic to add to the left column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > It could be improved with the following: > {code:java} > left_suffix : str, default None > Which suffix to add to left column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffix to add to the right column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > Please let me know if I may clarify or if there are any questions on the > above. Thanks again for your help! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
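As the corrected docstring states, each suffix applies to colliding names on its own side. A hypothetical pure-Python helper (not the pyarrow implementation) illustrating those semantics:

```python
def apply_join_suffixes(left_cols, right_cols, left_suffix="", right_suffix=""):
    """Rename colliding column names the way the corrected docstring describes.

    left_suffix is appended to colliding names on the left side,
    right_suffix to colliding names on the right side; non-colliding
    names are left untouched.
    """
    colliding = set(left_cols) & set(right_cols)
    new_left = [c + left_suffix if c in colliding else c for c in left_cols]
    new_right = [c + right_suffix if c in colliding else c for c in right_cols]
    return new_left, new_right
```

(In a real join the key columns are merged rather than suffixed; this sketch only shows which side each suffix belongs to.)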
[jira] [Updated] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments
[ https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-18246: --- Fix Version/s: 11.0.0 > [Python][Docs] PyArrow table join docstring typos for left and right suffix > arguments > - > > Key: ARROW-18246 > URL: https://issues.apache.org/jira/browse/ARROW-18246 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation >Reporter: d33bs >Assignee: Will Jones >Priority: Minor > Labels: docs-impacting, documentation, pull-request-available > Fix For: 11.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Hello, thank you for all the amazing work on Arrow! I'd like to report a > potential issue with PyArrow's Table Join docstring which may make it > confusing for others to read. This content is I believe translated into the > documentation website as well. > The content which needs to be corrected may be found starting at: > [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737] > The block currently reads: > {code:java} > left_suffix : str, default None > Which suffix to add to right column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffic to add to the left column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > It could be improved with the following: > {code:java} > left_suffix : str, default None > Which suffix to add to left column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffix to add to the right column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > Please let me know if I may clarify or if there are any questions on the > above. Thanks again for your help! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe
[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629119#comment-17629119 ] Antoine Pitrou commented on ARROW-17984: [~marsupialtail] Can you post the backtrace for _all_ threads? (using "thread apply all bt" as suggested by [~westonpace]). Since that will probably be quite long I suggest posting it on a site such as https://gist.github.com/ and posting the link here. > pq.read_table doesn't seem to be thread safe > > > Key: ARROW-17984 > URL: https://issues.apache.org/jira/browse/ARROW-17984 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet >Affects Versions: 9.0.0 >Reporter: Ziheng Wang >Priority: Major > Attachments: _usr_bin_python3.8.1000.crash > > > Before PR: [https://github.com/apache/arrow/pull/13799] gets merged in > master, I am using multithreading to improve read bandwidth from S3. Even > after that PR gets merged, I probably will still try to use multithreading to > some extent. > However pq.read_table from S3 doesn't seem to be thread safe. Seems like it > uses the new dataset reader under the hood. I cannot provide a reproduction, > not a stable one anyway. 
But this is roughly the script I have been using
> ~~~
> def get_next_batch(self, mapper_id, pos=None):
>     def download(file):
>         return pq.read_table("s3://" + self.bucket + "/" + file,
>                              columns=self.columns, filters=self.filters)
>
>     executor = concurrent.futures.ThreadPoolExecutor(max_workers=self.workers)
>     futures = {executor.submit(download, file): file for file in my_files}
>     for future in concurrent.futures.as_completed(futures):
>         yield future.result()
> ~~~
> The errors all have to do with malloc segfaults which makes me suspect the connection object is being reused across different pq.read_table invocations in different threads
> ```
> (InputReaderNode pid=25001, ip=172.31.60.29) malloc_consolidate(): invalid chunk size
> (InputReaderNode pid=25001, ip=172.31.60.29) *** SIGABRT received at time=1665464922 on cpu 9 ***
> (InputReaderNode pid=25001, ip=172.31.60.29) PC: @ 0x7f9a480a803b (unknown) raise
> (InputReaderNode pid=25001, ip=172.31.60.29) @ 0x7f9a480a80c0 4160 (unknown)
> (InputReaderNode pid=25001, ip=172.31.60.29) @ 0x7f9a480fa32c (unknown) (unknown)
> ```
> Note, this multithreaded code is running inside a Ray actor process, but that > shouldn't be a problem. -- This message was sent by Atlassian Jira (v8.20.10#820010)
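While the root cause is investigated, one way to test the hypothesis that concurrent reads are the trigger is to serialize the suspect call behind a lock. A stdlib-only sketch (the `download` body here is a stand-in for the real `pq.read_table` call; all names are illustrative):

```python
import concurrent.futures
import threading

_read_lock = threading.Lock()

def download(file):
    # Stand-in for pq.read_table(...): the lock guarantees at most one
    # thread is inside the possibly non-thread-safe call at a time,
    # while the rest of the pipeline stays concurrent.
    with _read_lock:
        return f"table-for-{file}"

def read_all(files, workers=4):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(download, f): f for f in files}
        return [fut.result() for fut in concurrent.futures.as_completed(futures)]
```

If the crashes disappear with the lock in place, that is strong evidence the shared state (e.g. a connection object) is the problem; it is a diagnostic, not a fix, since it gives up read parallelism.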
[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629101#comment-17629101 ] Lucas Mation commented on ARROW-18242: -- ok, same error as before:
{noformat}
arrow_table(x = '00001976') %>% mutate(y = dmy(x)) %>% collect()
# A tibble: 1 x 2
  x        y
1 00001976 1975-11-30
{noformat}
> [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lucas Mation >Priority: Critical > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of `lubridate::dmy`: > an invalid date such as '00001976' is being parsed as a valid (and completely > unrelated) date. > #in R > '00001976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('00001976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy(x)) %>% collect > q2 > x2 > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '00001976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
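The expected strict behaviour (NA for "00001976", a real date for "30111976") can be sketched with the Python stdlib, since `datetime.strptime` rejects zero day and month fields. `strict_dmy` is a hypothetical reference helper, not lubridate or the arrow kernel:

```python
from datetime import datetime

def strict_dmy(s):
    """Parse a ddmmyyyy string; return None (NA) for any invalid input.

    Requires exactly 8 digits, then lets strptime validate the fields:
    '00001976' fails because day 00 and month 00 are not valid.
    """
    if len(s) != 8 or not s.isdigit():
        return None
    try:
        return datetime.strptime(s, "%d%m%Y").date()
    except ValueError:
        return None
```

The bug report's wrong answer, 1975-11-30, is what you get if day 0 / month 0 are silently "normalized" backwards from 1976 instead of rejected.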
[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629048#comment-17629048 ] Nicola Crane commented on ARROW-18242: -- Sorry, my bad, I accidentally used {{ymd()}} there instead of {{dmy()}}. Mind giving it a go with {{dmy()}}? > [R] arrow implementation of lubridate::dmy parses invalid date "1976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lucas Mation >Priority: Critical > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of the `lubridate::dmy`. > An invalid date such as '1976' is being parsed as a valid (and completely > unrelated) date. > #in R > '1976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('1976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy) %>% collect > q2 > x > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '1976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629045#comment-17629045 ] Lucas Mation commented on ARROW-18242: -- So the error only occurs when the data is being read from the parquet file, then collected > [R] arrow implementation of lubridate::dmy parses invalid date "1976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lucas Mation >Priority: Critical > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of the `lubridate::dmy`. > An invalid date such as '1976' is being parsed as a valid (and completely > unrelated) date. > #in R > '1976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('1976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy) %>% collect > q2 > x > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '1976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629044#comment-17629044 ] Lucas Mation commented on ARROW-18242: -- Weirdly, this returns correct output:
{noformat}
arrow_table(x = '00001976') %>%
+   mutate(y = ymd(x)) %>%
+   collect()
# A tibble: 1 x 2
  x        y
1 00001976 NA
{noformat}
> [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lucas Mation >Priority: Critical > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of `lubridate::dmy`: > an invalid date such as '00001976' is being parsed as a valid (and completely > unrelated) date. > #in R > '00001976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('00001976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy(x)) %>% collect > q2 > x2 > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '00001976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments
[ https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18246: --- Labels: docs-impacting documentation pull-request-available (was: docs-impacting documentation) > [Python][Docs] PyArrow table join docstring typos for left and right suffix > arguments > - > > Key: ARROW-18246 > URL: https://issues.apache.org/jira/browse/ARROW-18246 > Project: Apache Arrow > Issue Type: Bug >Reporter: d33bs >Assignee: Will Jones >Priority: Minor > Labels: docs-impacting, documentation, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Hello, thank you for all the amazing work on Arrow! I'd like to report a > potential issue with PyArrow's Table Join docstring which may make it > confusing for others to read. This content is I believe translated into the > documentation website as well. > The content which needs to be corrected may be found starting at: > [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737] > The block currently reads: > {code:java} > left_suffix : str, default None > Which suffix to add to right column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffic to add to the left column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > It could be improved with the following: > {code:java} > left_suffix : str, default None > Which suffix to add to left column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffix to add to the right column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > Please let me know if I may clarify or if there are any questions on the > above. Thanks again for your help! 
> -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments
[ https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629039#comment-17629039 ] Will Jones commented on ARROW-18246: Thanks for reporting. I have created an update fixing those and a couple other issues in the docs. > [Python][Docs] PyArrow table join docstring typos for left and right suffix > arguments > - > > Key: ARROW-18246 > URL: https://issues.apache.org/jira/browse/ARROW-18246 > Project: Apache Arrow > Issue Type: Bug >Reporter: d33bs >Assignee: Will Jones >Priority: Minor > Labels: docs-impacting, documentation > > Hello, thank you for all the amazing work on Arrow! I'd like to report a > potential issue with PyArrow's Table Join docstring which may make it > confusing for others to read. This content is I believe translated into the > documentation website as well. > The content which needs to be corrected may be found starting at: > [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737] > The block currently reads: > {code:java} > left_suffix : str, default None > Which suffix to add to right column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffic to add to the left column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > It could be improved with the following: > {code:java} > left_suffix : str, default None > Which suffix to add to left column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffix to add to the right column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > Please let me know if I may clarify or if there are any questions on the > above. Thanks again for your help! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629025#comment-17629025 ] Nicola Crane edited comment on ARROW-18242 at 11/4/22 2:54 PM: --- I can't test this example as I don't have a Windows machine available to me now, but I'm guessing we get the same problem using this tiny reprex: {code:r} arrow_table(x = '1976') %>% mutate(y = ymd(x)) %>% collect() {code} was (Author: thisisnic): I can't test this example as I don't have a Windows machine available to me now, but I'm guessing we get the same problem using this tiny reprex: {{arrow_table(x = '1976') %>% mutate(y = ymd(x)) %>% collect()}} > [R] arrow implementation of lubridate::dmy parses invalid date "1976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lucas Mation >Priority: Critical > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of the `lubridate::dmy`. > An invalid date such as '1976' is being parsed as a valid (and completely > unrelated) date. > #in R > '1976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('1976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy) %>% collect > q2 > x > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '1976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629025#comment-17629025 ] Nicola Crane commented on ARROW-18242: -- I can't test this example as I don't have a Windows machine available to me now, but I'm guessing we get the same problem using this tiny reprex: {{arrow_table(x = '1976') %>% mutate(y = ymd(x)) %>% collect()}} > [R] arrow implementation of lubridate::dmy parses invalid date "1976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lucas Mation >Priority: Critical > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of the `lubridate::dmy`. > An invalid date such as '1976' is being parsed as a valid (and completely > unrelated) date. > #in R > '1976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('1976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy) %>% collect > q2 > x > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '1976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-18242: - Component/s: R > [R] arrow implementation of lubridate::dmy parses invalid date "1976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lucas Mation >Priority: Critical > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of the `lubridate::dmy`. > An invalid date such as '1976' is being parsed as a valid (and completely > unrelated) date. > #in R > '1976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('1976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy) %>% collect > q2 > x > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '1976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629023#comment-17629023 ] Nicola Crane commented on ARROW-18242: -- OK, my best guess as to what is going on here is that the original lubridate implementation uses a custom C parser to process these datetimes, whereas in the Arrow implementation some of this work is delegated to whichever external library Arrow depends on for datetimes, which is why there's a difference between Windows and Linux. We might be able to add some additional pre-processing steps to our bindings (or to the regex that their setup code produces) to prevent this. > [R] arrow implementation of lubridate::dmy parses invalid date "1976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Critical > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of the `lubridate::dmy`. > An invalid date such as '1976' is being parsed as a valid (and completely > unrelated) date. > #in R > '1976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('1976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy) %>% collect > q2 > x > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '1976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
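The pre-processing idea mentioned in the comment above can be sketched in a few lines. This is a hypothetical illustration, not Arrow's actual implementation: a strict ddmmyyyy check that rejects "00001976" the way lubridate's custom C parser does, before any lenient date library ever sees the string.

```python
import re
from datetime import date

# Hypothetical strict pre-check (illustration only): day must be 01-31,
# month 01-12, year four digits. A lenient backend parser never runs on
# strings that fail this regex.
DMY_RE = re.compile(r"^(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[0-2])(\d{4})$")

def parse_dmy_strict(s):
    """Return a date for a strict ddmmyyyy string, or None (like NA)."""
    m = DMY_RE.match(s)
    if m is None:
        return None
    d, mth, y = (int(g) for g in m.groups())
    try:
        return date(y, mth, d)
    except ValueError:  # e.g. the 31st of a 30-day month
        return None

parse_dmy_strict("00001976")  # None: day "00" is rejected
parse_dmy_strict("30111976")  # date(1976, 11, 30)
```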
[jira] [Assigned] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments
[ https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-18246: -- Assignee: Will Jones > [Python][Docs] PyArrow table join docstring typos for left and right suffix > arguments > - > > Key: ARROW-18246 > URL: https://issues.apache.org/jira/browse/ARROW-18246 > Project: Apache Arrow > Issue Type: Bug >Reporter: d33bs >Assignee: Will Jones >Priority: Minor > Labels: docs-impacting, documentation > > Hello, thank you for all the amazing work on Arrow! I'd like to report a > potential issue with PyArrow's Table Join docstring which may make it > confusing for others to read. This content is I believe translated into the > documentation website as well. > The content which needs to be corrected may be found starting at: > [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737] > The block currently reads: > {code:java} > left_suffix : str, default None > Which suffix to add to right column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffic to add to the left column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > It could be improved with the following: > {code:java} > left_suffix : str, default None > Which suffix to add to left column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffix to add to the right column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > Please let me know if I may clarify or if there are any questions on the > above. Thanks again for your help! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-18245) wheels for PyArrow + Python 3.11
[ https://issues.apache.org/jira/browse/ARROW-18245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones closed ARROW-18245. -- Resolution: Duplicate Hello! This is being actively worked on in ARROW-17487. I've closed this ticket since it duplicates that one. > wheels for PyArrow + Python 3.11 > > > Key: ARROW-18245 > URL: https://issues.apache.org/jira/browse/ARROW-18245 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 10.0.0 > Environment: Linux RH8 >Reporter: Aleksandar >Priority: Minor > > Hi, > May we know the plan for when the PyPI pyarrow 10 package will have build > dependencies installed as part of the package? Right now the pyarrow 10 > package has no wheels for Python 3.11.0. > Maybe this is not the right forum, but someone is maintaining and packaging > these things for developers. > Thanks much and sorry for intruding ... -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18228) AWS Error SLOW_DOWN during PutObject operation
[ https://issues.apache.org/jira/browse/ARROW-18228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629016#comment-17629016 ] Will Jones commented on ARROW-18228: If you are still getting errors, it might be worth slowing your app down somewhat so that you stop hitting these errors. [https://aws.amazon.com/premiumsupport/knowledge-center/http-5xx-errors-s3/] [https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html] [https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html] I'm not sure if we have any other settings to limit concurrent requests or tune the backoff strategy, but that might be helpful for cases like this. > AWS Error SLOW_DOWN during PutObject operation > -- > > Key: ARROW-18228 > URL: https://issues.apache.org/jira/browse/ARROW-18228 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 10.0.0 >Reporter: Vadym Dytyniak >Priority: Major > > We use Dask to parallelise read/write operations and pyarrow to write dataset > from worker nodes. > After pyarrow released version 10.0.0, our data flows automatically switched > to the latest version and some of them started to fail with the following > error: > {code:java} > File "/usr/local/lib/python3.10/dist-packages/org/store/storage.py", line > 768, in _write_partition > ds.write_dataset( > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 988, in write_dataset > _filesystemdataset_write( > File "pyarrow/_dataset.pyx", line 2859, in > pyarrow._dataset._filesystemdataset_write > check_status(CFileSystemDataset.Write(c_options, c_scanner)) > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > raise IOError(message) > OSError: When creating key 'equities.us.level2.by_security/' in bucket > 'org-prod': AWS Error SLOW_DOWN during PutObject operation: Please reduce > your request rate.
{code} > In total flow failed many times: most failed with the error above, but one > failed with: > {code:java} > File "/usr/local/lib/python3.10/dist-packages/chronos/store/storage.py", line > 857, in _load_partition > table = ds.dataset( > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 752, in dataset > return _filesystem_dataset(source, **kwargs) > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 444, in _filesystem_dataset > fs, paths_or_selector = _ensure_single_source(source, filesystem) > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 411, in _ensure_single_source > file_info = filesystem.get_file_info(path) > File "pyarrow/_fs.pyx", line 564, in pyarrow._fs.FileSystem.get_file_info > info = GetResultValue(self.fs.GetFileInfo(path)) > File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > return check_status(status) > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > raise IOError(message) > OSError: When getting information for key > 'ns/date=2022-10-31/channel=4/feed=A/9f41f928eedc431ca695a7ffe5fc60c2-0.parquet' > in bucket 'org-poc': AWS Error NETWORK_CONNECTION during HeadObject > operation: curlCode: 28, Timeout was reached {code} > > Do you have any idea what was changed for dataset write between 9.0.0 and > 10.0.0 to help us to fix the issue? -- This message was sent by Atlassian Jira (v8.20.10#820010)
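The backoff tuning mentioned in the comment above can also be approximated on the client side. The following is a hedged sketch of retry with exponential backoff and jitter, the standard mitigation for S3 503 SlowDown responses; `with_backoff` and its parameters are illustrative, not a pyarrow or AWS SDK API.

```python
import random
import time

# Hypothetical retry wrapper (not a pyarrow setting): retries an operation
# on SLOW_DOWN errors, doubling the delay each attempt and adding jitter
# so many workers do not retry in lockstep.
def with_backoff(op, max_retries=5, base_delay=0.1, max_delay=2.0):
    """Call op(), retrying on SLOW_DOWN errors with jittered backoff."""
    for attempt in range(max_retries):
        try:
            return op()
        except OSError as exc:
            if "SLOW_DOWN" not in str(exc) or attempt == max_retries - 1:
                raise  # unrelated error, or out of retries
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))
```

In the scenario above, `op` could wrap the `ds.write_dataset` call made by each Dask worker; reducing the number of concurrent writers addresses the same SLOW_DOWN responses at the source.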
[jira] [Updated] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-18242: - Priority: Critical (was: Major) > [R] arrow implementation of lubridate::dmy parses invalid date "1976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Critical > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of the `lubridate::dmy`. > An invalid date such as '1976' is being parsed as a valid (and completely > unrelated) date. > #in R > '1976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('1976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy) %>% collect > q2 > x > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '1976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18251) [CI][Python] AMD64 macOS 11 Python 3 job fails on master on pip install
[ https://issues.apache.org/jira/browse/ARROW-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629013#comment-17629013 ] Joris Van den Bossche commented on ARROW-18251: --- Not directly an idea. This build has been failing for some time (with a cython test failure) but it seems only recently started to fail with this installation issue. >From commits on master, 4 days ago test failure: >https://github.com/apache/arrow/actions/runs/3363608185/jobs/5576970377 3 days ago installation failure: https://github.com/apache/arrow/actions/runs/3372998317/jobs/5597074003 A relevant difference is a pip 22.2.2 -> 22.3 update. > [CI][Python] AMD64 macOS 11 Python 3 job fails on master on pip install > --- > > Key: ARROW-18251 > URL: https://issues.apache.org/jira/browse/ARROW-18251 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Raúl Cumplido >Priority: Critical > Fix For: 11.0.0 > > > Currently the job for AMD64 macOS 11 Python 3 is failing: > [https://github.com/apache/arrow/actions/runs/3388587979/jobs/5630747309] > with: > {code:java} > + python3 -m pip install --no-deps --no-build-isolation -vv . 
> ~/work/arrow/arrow/python ~/work/arrow/arrow > Using pip 22.3 from > /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pip > (python 3.11) > Non-user install because site-packages writeable > Created temporary directory: > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw > Initialized build tracking at > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw > Created build tracker: > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw > Entered build tracker: > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw > Created temporary directory: > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-install-9ku2dtx5 > Created temporary directory: > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-ephem-wheel-cache-za6jhm0e > Processing /Users/runner/work/arrow/arrow/python > Added file:///Users/runner/work/arrow/arrow/python to build tracker > '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw' > Created temporary directory: > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16 > Preparing metadata (pyproject.toml): started > Running command Preparing metadata (pyproject.toml) > running dist_info > creating > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info > writing > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/PKG-INFO > writing dependency_links to > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/dependency_links.txt > writing entry points to > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/entry_points.txt > writing requirements to > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/requires.txt > 
writing top-level names to > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/top_level.txt > writing manifest file > '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt' > reading manifest template 'MANIFEST.in' > warning: no previously-included files matching '*.so' found anywhere in > distribution > warning: no previously-included files matching '*.pyc' found anywhere in > distribution > warning: no previously-included files matching '*~' found anywhere in > distribution > warning: no previously-included files matching '#*' found anywhere in > distribution > warning: no previously-included files matching '.DS_Store' found anywhere > in distribution > no previously-included directories found matching '.asv' > adding license file '../LICENSE.txt' > adding license file '../NOTICE.txt' > writing manifest file > '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt' > creating > '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow-11.0.0.dev55+g8e3a1e1b7.dist-info' > error: invalid command 'bdist_wheel' > error: subprocess-exited-with-error > > × Preparing metadata (pyproject.toml) did not run successfully. > │ exit code: 1 > ╰─> See above for output. > > note: This error originates from a subprocess, and is
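A hedged reading of the log above: "error: invalid command 'bdist_wheel'" usually means setuptools cannot resolve the bdist_wheel command, which is supplied by the separate wheel package; since the job runs pip with --no-build-isolation, pip does not install build requirements, so they must already be present in the environment. Whether this is the actual root cause here is a guess; the check below is a generic diagnostic sketch, not part of the Arrow build.

```python
# Hypothetical diagnostic (illustration only): with --no-build-isolation,
# build requirements such as 'wheel' must be pre-installed, so verify that
# the 'wheel' distribution is importable before invoking the build.
import importlib.util

def build_env_has_wheel():
    """Return True if the 'wheel' package is importable."""
    return importlib.util.find_spec("wheel") is not None

if not build_env_has_wheel():
    print("wheel missing: run 'python3 -m pip install wheel' before building")
```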
[jira] [Commented] (ARROW-18251) [CI][Python] AMD64 macOS 11 Python 3 job fails on master on pip install
[ https://issues.apache.org/jira/browse/ARROW-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629008#comment-17629008 ] Raúl Cumplido commented on ARROW-18251: --- [~jorisvandenbossche] [~alenka] Any idea what might be the cause for this failures? I will try and investigate > [CI][Python] AMD64 macOS 11 Python 3 job fails on master on pip install > --- > > Key: ARROW-18251 > URL: https://issues.apache.org/jira/browse/ARROW-18251 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Raúl Cumplido >Priority: Critical > Fix For: 11.0.0 > > > Currently the job for AMD64 macOS 11 Python 3 is failing: > [https://github.com/apache/arrow/actions/runs/3388587979/jobs/5630747309] > with: > {code:java} > + python3 -m pip install --no-deps --no-build-isolation -vv . > ~/work/arrow/arrow/python ~/work/arrow/arrow > Using pip 22.3 from > /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pip > (python 3.11) > Non-user install because site-packages writeable > Created temporary directory: > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw > Initialized build tracking at > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw > Created build tracker: > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw > Entered build tracker: > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw > Created temporary directory: > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-install-9ku2dtx5 > Created temporary directory: > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-ephem-wheel-cache-za6jhm0e > Processing /Users/runner/work/arrow/arrow/python > Added file:///Users/runner/work/arrow/arrow/python to build tracker > '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw' > Created temporary directory: > 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16 > Preparing metadata (pyproject.toml): started > Running command Preparing metadata (pyproject.toml) > running dist_info > creating > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info > writing > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/PKG-INFO > writing dependency_links to > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/dependency_links.txt > writing entry points to > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/entry_points.txt > writing requirements to > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/requires.txt > writing top-level names to > /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/top_level.txt > writing manifest file > '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt' > reading manifest template 'MANIFEST.in' > warning: no previously-included files matching '*.so' found anywhere in > distribution > warning: no previously-included files matching '*.pyc' found anywhere in > distribution > warning: no previously-included files matching '*~' found anywhere in > distribution > warning: no previously-included files matching '#*' found anywhere in > distribution > warning: no previously-included files matching '.DS_Store' found anywhere > in distribution > no previously-included directories found matching '.asv' > adding license file '../LICENSE.txt' > adding license file '../NOTICE.txt' > writing manifest file > '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt' > creating > 
'/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow-11.0.0.dev55+g8e3a1e1b7.dist-info' > error: invalid command 'bdist_wheel' > error: subprocess-exited-with-error > > × Preparing metadata (pyproject.toml) did not run successfully. > │ exit code: 1 > ╰─> See above for output. > > note: This error originates from a subprocess, and is likely not a problem > with pip. > full command: /usr/local/bin/python3 > /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pip/_vendor/pep517/in_process/_in_process.py > prepare_metadata_for_build_wheel > /var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/tmpteb6e2qe > cwd: /Users/runner/work/arrow/arrow/python >
[jira] [Created] (ARROW-18251) [CI][Python] AMD64 macOS 11 Python 3 job fails on master on pip install
Raúl Cumplido created ARROW-18251: - Summary: [CI][Python] AMD64 macOS 11 Python 3 job fails on master on pip install Key: ARROW-18251 URL: https://issues.apache.org/jira/browse/ARROW-18251 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Python Reporter: Raúl Cumplido Fix For: 11.0.0 Currently the job for AMD64 macOS 11 Python 3 is failing: [https://github.com/apache/arrow/actions/runs/3388587979/jobs/5630747309] with: {code:java} + python3 -m pip install --no-deps --no-build-isolation -vv . ~/work/arrow/arrow/python ~/work/arrow/arrow Using pip 22.3 from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pip (python 3.11) Non-user install because site-packages writeable Created temporary directory: /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw Initialized build tracking at /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw Created build tracker: /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw Entered build tracker: /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw Created temporary directory: /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-install-9ku2dtx5 Created temporary directory: /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-ephem-wheel-cache-za6jhm0e Processing /Users/runner/work/arrow/arrow/python Added file:///Users/runner/work/arrow/arrow/python to build tracker '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw' Created temporary directory: /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16 Preparing metadata (pyproject.toml): started Running command Preparing metadata (pyproject.toml) running dist_info creating /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info writing 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/PKG-INFO writing dependency_links to /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/dependency_links.txt writing entry points to /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/entry_points.txt writing requirements to /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/requires.txt writing top-level names to /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/top_level.txt writing manifest file '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt' reading manifest template 'MANIFEST.in' warning: no previously-included files matching '*.so' found anywhere in distribution warning: no previously-included files matching '*.pyc' found anywhere in distribution warning: no previously-included files matching '*~' found anywhere in distribution warning: no previously-included files matching '#*' found anywhere in distribution warning: no previously-included files matching '.DS_Store' found anywhere in distribution no previously-included directories found matching '.asv' adding license file '../LICENSE.txt' adding license file '../NOTICE.txt' writing manifest file '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt' creating '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow-11.0.0.dev55+g8e3a1e1b7.dist-info' error: invalid command 'bdist_wheel' error: subprocess-exited-with-error × Preparing metadata (pyproject.toml) did not run successfully. │ exit code: 1 ╰─> See above for output. note: This error originates from a subprocess, and is likely not a problem with pip. 
full command: /usr/local/bin/python3 /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pip/_vendor/pep517/in_process/_in_process.py prepare_metadata_for_build_wheel /var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/tmpteb6e2qe cwd: /Users/runner/work/arrow/arrow/python Preparing metadata (pyproject.toml): finished with status 'error' error: metadata-generation-failed× Encountered error while generating package metadata. ╰─> See above for output.note: This is an issue with the package mentioned above, not pip. hint: See above for details. Exception information: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pip/_internal/operations/build/metadata.py", line 35, in generate_metadata distinfo_dir =
[jira] [Assigned] (ARROW-17487) [Python][Packaging] 3.11 wheels
[ https://issues.apache.org/jira/browse/ARROW-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido reassigned ARROW-17487: - Assignee: Raúl Cumplido > [Python][Packaging] 3.11 wheels > --- > > Key: ARROW-17487 > URL: https://issues.apache.org/jira/browse/ARROW-17487 > Project: Apache Arrow > Issue Type: Wish > Components: Packaging, Python >Reporter: Saugat Pachhai >Assignee: Raúl Cumplido >Priority: Major > Labels: pull-request-available > Time Spent: 11h 40m > Remaining Estimate: 0h > > Hi. > Could you please provide wheels with Python 3.11 support? > Python 3.11.0-rc1 was released: > [https://www.python.org/downloads/release/python-3110rc1/], > and from this release onward, there won't be any ABI changes. > > There will be no ABI changes from this point forward in the 3.11 series and > > the goal is that there will be as few code changes as possible. > So, I think it should be safe to release wheels. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628993#comment-17628993 ] Nicola Crane commented on ARROW-18242: -- Aha, this could be related to Windows as we've had some issues in the past there. I'll keep digging. > [R] arrow implementation of lubridate::dmy parses invalid date "1976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Major > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of the `lubridate::dmy`. > An invalid date such as '1976' is being parsed as a valid (and completely > unrelated) date. > #in R > '1976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('1976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy) %>% collect > q2 > x > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '1976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628973#comment-17628973 ] Lucas Mation commented on ARROW-18242: -- Weird. I can replicate the error both on arrow 10.0.0 (on the server) and the arrow dev nightly build (on my PC). Both are Windows machines.

{code:r}
#For the server:
arrow::arrow_info()
Arrow package version: 10.0.0

Capabilities:
dataset    TRUE
substrait  FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo        FALSE
bz2        TRUE
jemalloc   FALSE
mimalloc   TRUE

Arrow options():
arrow.use_threads  FALSE

Memory:
Allocator  mimalloc
Current    1.12 Kb
Max        25.77 Kb

Runtime:
SIMD Level           avx2
Detected SIMD Level  avx2

Build:
C++ Library Version   10.0.0
C++ Compiler          GNU
C++ Compiler Version  10.3.0
Git ID                aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0
{code}

{code:r}
#For the PC:
arrow_info()
Arrow package version: 10.0.0.10050

Capabilities:
dataset    TRUE
substrait  FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo        FALSE
bz2        TRUE
jemalloc   FALSE
mimalloc   TRUE

Arrow options():
arrow.use_threads  FALSE

Memory:
Allocator  mimalloc
Current    128 bytes
Max        25.52 Kb

Runtime:
SIMD Level           avx2
Detected SIMD Level  avx2

Build:
C++ Library Version   11.0.0-SNAPSHOT
C++ Compiler          GNU
C++ Compiler Version  10.3.0
Git ID                5e53978b56aa13f9c033f2e849cc22f2aed6e2d3
{code}
> [R] arrow implementation of lubridate::dmy parses invalid date "1976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Major > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of the `lubridate::dmy`. > An invalid date such as '1976' is being parsed as a valid (and completely > unrelated) date. 
> #in R > '1976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('1976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy) %>% collect > q2 > x > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '1976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
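For reference, the failure mode reported above — a day-month-year parser accepting an invalid string as a date — can be contrasted with a strict parse that rejects it. A minimal Python sketch (illustrative only; this is not Arrow's or lubridate's implementation, and `dmy` here is a hypothetical helper):

```python
from datetime import datetime

def dmy(s):
    """Strict day-month-year parse: invalid inputs map to None (NA-like)."""
    try:
        return datetime.strptime(s, "%d%m%Y").date()
    except ValueError:
        return None

print(dmy("00001976"))  # None: day 00 / month 00 is not a valid date
print(dmy("30111976"))  # 1976-11-30
print(dmy("01011976"))  # 1976-01-01
```

The expected behaviour the reporter describes corresponds to the `None` branch: an unparseable string should surface as missing, not as a nearby valid date.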
[jira] [Comment Edited] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628963#comment-17628963 ] Nicola Crane edited comment on ARROW-18242 at 11/4/22 1:07 PM: --- I can't replicate this with 10.0.0 either. FWIW I re-organised the code a bit here as the {{mutate()}} call there gave me an error. {code:r} library(dplyr) library(data.table) library(lubridate) library(arrow) '1976' %>% dmy() #> Warning: All formats failed to parse. No formats found. #> [1] NA #In arrow q <- data.table(x=c('1976','30111976','01011976')) q %>% write_dataset('q') q2 <- open_dataset('q') %>% mutate(x2=dmy(x)) %>% collect() q2 #> x x2 #> 1: 1976 #> 2: 30111976 1976-11-30 #> 3: 01011976 1976-01-01 {code} [~lucasmation] If you run the code in the way I've rewritten it above, do you get anything different? Also, can you confirm which version of Arrow you are using? You can use {{arrow::arrow_info()}} to find it if you're not sure. was (Author: thisisnic): I can't replicate this with 10.0.0 either. FWIW I re-organised the code a bit here as the {{mutate()}} call there gave me an error. {code:r} library(dplyr) library(data.table) library(lubridate) library(arrow) '1976' %>% dmy() #In arrow q <- data.table(x=c('1976','30111976','01011976')) q %>% write_dataset('q') q2 <- open_dataset('q') %>% mutate(x2=dmy(x)) %>% collect() q2 {code} [~lucasmation] If you run the code in the way I've rewritten it above, do you get anything different? Also, can you confirm which version of Arrow you are using? You can use {{arrow::arrow_info()}} to find it if you're not sure. > [R] arrow implementation of lubridate::dmy parses invalid date "1976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Major > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of the `lubridate::dmy`. 
> An invalid date such as '1976' is being parsed as a valid (and completely > unrelated) date. > #in R > '1976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('1976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy) %>% collect > q2 > x > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '1976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628963#comment-17628963 ] Nicola Crane commented on ARROW-18242: -- I can't replicate this with 10.0.0 either. FWIW I re-organised the code a bit here as the {{mutate()}} call there gave me an error. {code:r} library(dplyr) library(data.table) library(lubridate) library(arrow) '1976' %>% dmy() #In arrow q <- data.table(x=c('1976','30111976','01011976')) q %>% write_dataset('q') q2 <- open_dataset('q') %>% mutate(x2=dmy(x)) %>% collect() q2 {code} [~lucasmation] If you run the code in the way I've rewritten it above, do you get anything different? Also, can you confirm which version of Arrow you are using? You can use {{arrow::arrow_info()}} to find it if you're not sure. > [R] arrow implementation of lubridate::dmy parses invalid date "1976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Major > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of the `lubridate::dmy`. > An invalid date such as '1976' is being parsed as a valid (and completely > unrelated) date. > #in R > '1976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('1976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy) %>% collect > q2 > x > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '1976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date
[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628961#comment-17628961 ] Nicola Crane commented on ARROW-18242: -- Interesting, I tried to replicate this with 9.0.0 but could not. Will try with 10.0.0 shortly. > [R] arrow implementation of lubridate::dmy parses invalid date "1976" as > date > - > > Key: ARROW-18242 > URL: https://issues.apache.org/jira/browse/ARROW-18242 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Major > > Sorry for so many issues, but I think this is another bug. > Wrong behavior of the arrow implementation of the `lubridate::dmy`. > An invalid date such as '1976' is being parsed as a valid (and completely > unrelated) date. > #in R > '1976' %>% dmy > [1] NA > Warning message: > All formats failed to parse. No formats found. > #In arrow > q <- data.table(x=c('1976','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy) %>% collect > q2 > x > 1: 1975-11-30 > 2: 1976-11-30 > 3: 1976-01-01 > #notice '1976' is an invalid date. First row of x2 should be NA!!! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-18228) AWS Error SLOW_DOWN during PutObject operation
[ https://issues.apache.org/jira/browse/ARROW-18228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628957#comment-17628957 ] Vadym Dytyniak edited comment on ARROW-18228 at 11/4/22 12:54 PM: -- [~willjones127] It helped. Thanks. Do you recommend using this strategy, or does it mean that we exceeded the rate limit and should review our implementation? was (Author: JIRAUSER297843): [~willjones127] It helped. Do you recommend to use this strategy or it means that we exceed rate limit and should review our implementation? > AWS Error SLOW_DOWN during PutObject operation > -- > > Key: ARROW-18228 > URL: https://issues.apache.org/jira/browse/ARROW-18228 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 10.0.0 >Reporter: Vadym Dytyniak >Priority: Major > > We use Dask to parallelise read/write operations and pyarrow to write dataset > from worker nodes. > After pyarrow released version 10.0.0, our data flows automatically switched > to the latest version and some of them started to fail with the following > error: > {code:java} > File "/usr/local/lib/python3.10/dist-packages/org/store/storage.py", line > 768, in _write_partition > ds.write_dataset( > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 988, in write_dataset > _filesystemdataset_write( > File "pyarrow/_dataset.pyx", line 2859, in > pyarrow._dataset._filesystemdataset_write > check_status(CFileSystemDataset.Write(c_options, c_scanner)) > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > raise IOError(message) > OSError: When creating key 'equities.us.level2.by_security/' in bucket > 'org-prod': AWS Error SLOW_DOWN during PutObject operation: Please reduce > your request rate. 
{code} > In total flow failed many times: most failed with the error above, but one > failed with: > {code:java} > File "/usr/local/lib/python3.10/dist-packages/chronos/store/storage.py", line > 857, in _load_partition > table = ds.dataset( > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 752, in dataset > return _filesystem_dataset(source, **kwargs) > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 444, in _filesystem_dataset > fs, paths_or_selector = _ensure_single_source(source, filesystem) > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 411, in _ensure_single_source > file_info = filesystem.get_file_info(path) > File "pyarrow/_fs.pyx", line 564, in pyarrow._fs.FileSystem.get_file_info > info = GetResultValue(self.fs.GetFileInfo(path)) > File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > return check_status(status) > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > raise IOError(message) > OSError: When getting information for key > 'ns/date=2022-10-31/channel=4/feed=A/9f41f928eedc431ca695a7ffe5fc60c2-0.parquet' > in bucket 'org-poc': AWS Error NETWORK_CONNECTION during HeadObject > operation: curlCode: 28, Timeout was reached {code} > > Do you have any idea what was changed for dataset write between 9.0.0 and > 10.0.0 to help us to fix the issue? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18228) AWS Error SLOW_DOWN during PutObject operation
[ https://issues.apache.org/jira/browse/ARROW-18228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628957#comment-17628957 ] Vadym Dytyniak commented on ARROW-18228: [~willjones127] It helped. Do you recommend using this strategy, or does it mean that we exceeded the rate limit and should review our implementation? > AWS Error SLOW_DOWN during PutObject operation > -- > > Key: ARROW-18228 > URL: https://issues.apache.org/jira/browse/ARROW-18228 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 10.0.0 >Reporter: Vadym Dytyniak >Priority: Major > > We use Dask to parallelise read/write operations and pyarrow to write dataset > from worker nodes. > After pyarrow released version 10.0.0, our data flows automatically switched > to the latest version and some of them started to fail with the following > error: > {code:java} > File "/usr/local/lib/python3.10/dist-packages/org/store/storage.py", line > 768, in _write_partition > ds.write_dataset( > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 988, in write_dataset > _filesystemdataset_write( > File "pyarrow/_dataset.pyx", line 2859, in > pyarrow._dataset._filesystemdataset_write > check_status(CFileSystemDataset.Write(c_options, c_scanner)) > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > raise IOError(message) > OSError: When creating key 'equities.us.level2.by_security/' in bucket > 'org-prod': AWS Error SLOW_DOWN during PutObject operation: Please reduce > your request rate. 
{code} > In total flow failed many times: most failed with the error above, but one > failed with: > {code:java} > File "/usr/local/lib/python3.10/dist-packages/chronos/store/storage.py", line > 857, in _load_partition > table = ds.dataset( > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 752, in dataset > return _filesystem_dataset(source, **kwargs) > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 444, in _filesystem_dataset > fs, paths_or_selector = _ensure_single_source(source, filesystem) > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 411, in _ensure_single_source > file_info = filesystem.get_file_info(path) > File "pyarrow/_fs.pyx", line 564, in pyarrow._fs.FileSystem.get_file_info > info = GetResultValue(self.fs.GetFileInfo(path)) > File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > return check_status(status) > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > raise IOError(message) > OSError: When getting information for key > 'ns/date=2022-10-31/channel=4/feed=A/9f41f928eedc431ca695a7ffe5fc60c2-0.parquet' > in bucket 'org-poc': AWS Error NETWORK_CONNECTION during HeadObject > operation: curlCode: 28, Timeout was reached {code} > > Do you have any idea what was changed for dataset write between 9.0.0 and > 10.0.0 to help us to fix the issue? -- This message was sent by Atlassian Jira (v8.20.10#820010)
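The mitigation referenced in this thread is not quoted here, but S3 throttling errors like SLOW_DOWN are conventionally handled by retrying with exponential backoff and jitter. A generic sketch, independent of pyarrow (the function and parameter names are illustrative, not a pyarrow API):

```python
import random
import time

def with_backoff(op, max_attempts=5, base_delay=0.01):
    """Retry op() on OSError, sleeping base_delay * 2**attempt * jitter between tries."""
    for attempt in range(max_attempts):
        try:
            return op()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

Recent pyarrow releases also expose a `retry_strategy` argument on `pyarrow.fs.S3FileSystem` (e.g. `AwsStandardS3RetryStrategy`), which applies the same idea inside the filesystem layer rather than around the whole write.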
[jira] [Commented] (ARROW-18244) [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel matching input types
[ https://issues.apache.org/jira/browse/ARROW-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628955#comment-17628955 ] Lucas Mation commented on ARROW-18244: -- Sure, created on ticket [#18250] > [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel > matching input types > - > > Key: ARROW-18244 > URL: https://issues.apache.org/jira/browse/ARROW-18244 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Assignee: Neal Richardson >Priority: Major > Fix For: 11.0.0 > > > > ``` > q <- data.table(x=c('','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=ifelse(x=='',NA,x)) %>% collect > Error in `collect()`: > ! NotImplemented: Function 'if_else' has no kernel matching input types > (bool, bool, string) > Run `rlang::last_error()` to see where the error occurred. > ``` > In [Functions available in Arrow dplyr > queries|https://arrow.apache.org/docs/r/reference/acero.html] it is stated > that ifelse() is available, but the error above suggests otherwise. > Update, "str_replace" is another function that is supposedly available in > 10.0.0 but is not (or does not seem to be): > ``` > q2 <- 'q' %>% open_dataset %>% mutate(x2=x %>% str_replace('',NA)) %>% collect > Error: Expression x %>% str_replace("", NA) not supported in Arrow > Call collect() first to pull data into R. > ``` > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18250) [R] mutate(x2=x %>% str_replace('^\s*$',NA_character_)) Does not replicate behaviour of R
Lucas Mation created ARROW-18250: Summary: [R] mutate(x2=x %>% str_replace('^\s*$',NA_character_)) Does not replicate behaviour of R Key: ARROW-18250 URL: https://issues.apache.org/jira/browse/ARROW-18250 Project: Apache Arrow Issue Type: Bug Reporter: Lucas Mation

```
q <- data.table(x=c('','1','2'))
q %>% write_dataset('q')

#in R
q %>% mutate(x2=x %>% str_replace('^\s*$',NA_character_))
   x x2
1:
2: 1  1
3: 2  2

#in arrow
q2 <- 'q' %>% open_dataset %>% mutate(x2=x %>% str_replace('^\s*$',NA_character_)) %>% collect
q2
   x x2
1:
2: 1  1
3: 2  2
```
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18244) [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel matching input types
[ https://issues.apache.org/jira/browse/ARROW-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628945#comment-17628945 ] Neal Richardson commented on ARROW-18244: - I'm not sure that's a valid use of string replacement since that's replacing substrings into the elements of the character vector, not necessarily replacing the entire element (ifelse is what you want for that), but we can detect and reject that case. Could you please open a new issue for that? Thanks! > [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel > matching input types > - > > Key: ARROW-18244 > URL: https://issues.apache.org/jira/browse/ARROW-18244 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Assignee: Neal Richardson >Priority: Major > Fix For: 11.0.0 > > > > ``` > q <- data.table(x=c('','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=ifelse(x=='',NA,x)) %>% collect > Error in `collect()`: > ! NotImplemented: Function 'if_else' has no kernel matching input types > (bool, bool, string) > Run `rlang::last_error()` to see where the error occurred. > ``` > In [Functions available in Arrow dplyr > queries|https://arrow.apache.org/docs/r/reference/acero.html] it is stated > that ifelse() is available, but the error above suggests otherwise. > Update, "str_replace" is another function that is supposedly available in > 10.0.0 but is not (or does not seem to be): > ``` > q2 <- 'q' %>% open_dataset %>% mutate(x2=x %>% str_replace('',NA)) %>% collect > Error: Expression x %>% str_replace("", NA) not supported in Arrow > Call collect() first to pull data into R. > ``` > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18244) [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel matching input types
[ https://issues.apache.org/jira/browse/ARROW-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628919#comment-17628919 ] Lucas Mation commented on ARROW-18244: -- [~npr] OK. Then I guess we should close this issue, since ifelse is fixed in the dev version. But str_replace (and maybe this should be a ticket of its own) does not replicate R's behaviour in this case:

```
q <- data.table(x=c('','1','2'))
q %>% write_dataset('q')

#in R
q %>% mutate(x2=x %>% str_replace('^\\s*$',NA_character_))
   x x2
1:
2: 1  1
3: 2  2

#in arrow
q2 <- 'q' %>% open_dataset %>% mutate(x2=x %>% str_replace('^\\s*$',NA_character_)) %>% collect
q2
   x x2
1:
2: 1  1
3: 2  2
```
> [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel > matching input types > - > > Key: ARROW-18244 > URL: https://issues.apache.org/jira/browse/ARROW-18244 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Assignee: Neal Richardson >Priority: Major > Fix For: 11.0.0 > > > > ``` > q <- data.table(x=c('','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=ifelse(x=='',NA,x)) %>% collect > Error in `collect()`: > ! NotImplemented: Function 'if_else' has no kernel matching input types > (bool, bool, string) > Run `rlang::last_error()` to see where the error occurred. > ``` > In [Functions available in Arrow dplyr > queries|https://arrow.apache.org/docs/r/reference/acero.html] it is stated > that ifelse() is available, but the error above suggests otherwise. > Update, "str_replace" is another function that is supposedly available in > 10.0.0 but is not (or does not seem to be): > ``` > q2 <- 'q' %>% open_dataset %>% mutate(x2=x %>% str_replace('',NA)) %>% collect > Error: Expression x %>% str_replace("", NA) not supported in Arrow > Call collect() first to pull data into R. > ``` > -- This message was sent by Atlassian Jira (v8.20.10#820010)
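Neal Richardson's distinction above — substring replacement versus mapping a whole element to missing — can be illustrated outside R. A Python sketch of the two semantics (illustrative only, not stringr's or Arrow's implementation):

```python
import re

xs = ["", "1", "2"]

# Substring replacement (the str_replace model): substituting a match with
# another string can never yield a "missing" value, only another string.
# Mapping whole elements to missing requires an explicit condition instead,
# which is the ifelse-style operation:
out = [None if re.fullmatch(r"\s*", s) else s for s in xs]
print(out)  # [None, '1', '2']
```

This is why the comment suggests ifelse for "replace the entire element with NA", while str_replace with an NA replacement is treated as an invalid (and detectable) use.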
[jira] [Resolved] (ARROW-18244) [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel matching input types
[ https://issues.apache.org/jira/browse/ARROW-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-18244. - Fix Version/s: 11.0.0 Assignee: Neal Richardson Resolution: Fixed > [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel > matching input types > - > > Key: ARROW-18244 > URL: https://issues.apache.org/jira/browse/ARROW-18244 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Assignee: Neal Richardson >Priority: Major > Fix For: 11.0.0 > > > > ``` > q <- data.table(x=c('','30111976','01011976')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x2=ifelse(x=='',NA,x)) %>% collect > Error in `collect()`: > ! NotImplemented: Function 'if_else' has no kernel matching input types > (bool, bool, string) > Run `rlang::last_error()` to see where the error occurred. > ``` > In [Functions available in Arrow dplyr > queries|https://arrow.apache.org/docs/r/reference/acero.html] it is stated > that ifelse() is available, but the error above suggests otherwise. > Update, "str_replace" is another function that is supposedly available in > 10.0.0 but is not (or does not seem to be): > ``` > q2 <- 'q' %>% open_dataset %>% mutate(x2=x %>% str_replace('',NA)) %>% collect > Error: Expression x %>% str_replace("", NA) not supported in Arrow > Call collect() first to pull data into R. > ``` > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-13938) [C++] Date and datetime types should autocast from strings
[ https://issues.apache.org/jira/browse/ARROW-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-13938. - Fix Version/s: 11.0.0 Assignee: Neal Richardson Resolution: Fixed > [C++] Date and datetime types should autocast from strings > -- > > Key: ARROW-13938 > URL: https://issues.apache.org/jira/browse/ARROW-13938 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Jonathan Keane >Assignee: Neal Richardson >Priority: Major > Labels: kernel > Fix For: 11.0.0 > > > When comparing dates and datetimes, people frequently expect that a string > (formatted as ISO8601) will auto-cast and compare to dates and times. > Examples in R: > {code:r} > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > arr <- Array$create(as.Date(c("1974-04-06", "1988-05-09"))) > arr > "1980-01-01" > #> Error: NotImplemented: Function greater has no kernel matching input types > (array[date32[day]], scalar[string]) > # creating the scalar as a date works, of course > arr > Scalar$create(as.Date("1980-01-01")) > #> Array > #> > #> [ > #> false, > #> true > #> ] > # datatimes also do not auto-cast > arr <- Array$create(Sys.time()) > arr > "1980-01-01 00:00:00" > #> Error: NotImplemented: Function greater has no kernel matching input types > (array[timestamp[us]], scalar[string]) > # or a more real-world example > library(dplyr) > #> > #> Attaching package: 'dplyr' > #> The following objects are masked from 'package:stats': > #> > #> filter, lag > #> The following objects are masked from 'package:base': > #> > #> intersect, setdiff, setequal, union > mtcars$date <- as.Date(c("1974-04-06", "1988-05-09")) > ds <- InMemoryDataset$create(mtcars) > ds %>% > filter(date > "1980-01-01") %>% > collect() > #> Error: NotImplemented: Function greater has no kernel matching input types > (array[date32[day]], scalar[string]) > {code} -- This message was 
sent by Atlassian Jira (v8.20.10#820010)
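The resolved behaviour in ARROW-13938 — comparing a date column against an ISO-8601 string by treating the string as a date — amounts to casting the string before comparing. A generic Python sketch of that cast-then-compare idea (stdlib only; this is not Arrow's kernel-dispatch code):

```python
from datetime import date

# Comparing dates to strings is only meaningful after parsing the string
# into the same type; ISO-8601 makes the parse unambiguous.
dates = [date(1974, 4, 6), date(1988, 5, 9)]
cutoff = date.fromisoformat("1980-01-01")
print([d > cutoff for d in dates])  # [False, True]
```

The autocast in Arrow performs this string-to-date conversion implicitly inside comparison kernels, which is what the R examples in the ticket expect.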
[jira] [Updated] (ARROW-18241) [C++] Add cast option to return null for values that can't convert
[ https://issues.apache.org/jira/browse/ARROW-18241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-18241: Component/s: C++ > [C++] Add cast option to return null for values that can't convert > -- > > Key: ARROW-18241 > URL: https://issues.apache.org/jira/browse/ARROW-18241 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Reporter: Lucas Mation >Priority: Major > > I am importing a dataset with arrow, and then converting variable types. But > I got an error message because the `arrow` implementation of `as.integer` > can't handle empty strings (which is legal in base R). Is this a bug? > {code:r} > #In R > '' %>% as.integer() > [1] NA > > #in arrow > q <- data.table(x=c('','1','2')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x=as.integer(x)) %>% collect > Error in `collect()`: > ! Invalid: Failed to parse string: '' as a scalar of type int32 > Run `rlang::last_error()` to see where the error occurred. > {code} > Update: tryed to preprocess x with `ifelse` but it also did not work. > {code:r} > 'q' %>% open_dataset %>% mutate(x= ifelse(x=='',NA,x)) %>% > mutate(x=as.integer(x)) %>% collect > Error in `collect()`: > ! NotImplemented: Function 'if_else' has no kernel matching input types > (bool, bool, string) > Run `rlang::last_error()` to see where the error occurred. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18241) [C++] Add cast option to return null for values that can't convert
[ https://issues.apache.org/jira/browse/ARROW-18241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-18241: Summary: [C++] Add cast option to return null for values that can't convert (was: [R] as.integer can't handdle empty character cels (ex c(''))) > [C++] Add cast option to return null for values that can't convert > -- > > Key: ARROW-18241 > URL: https://issues.apache.org/jira/browse/ARROW-18241 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lucas Mation >Priority: Major > > I am importing a dataset with arrow, and then converting variable types. But > I got an error message because the `arrow` implementation of `as.integer` > can't handle empty strings (which is legal in base R). Is this a bug? > {code:r} > #In R > '' %>% as.integer() > [1] NA > > #in arrow > q <- data.table(x=c('','1','2')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x=as.integer(x)) %>% collect > Error in `collect()`: > ! Invalid: Failed to parse string: '' as a scalar of type int32 > Run `rlang::last_error()` to see where the error occurred. > {code} > Update: tryed to preprocess x with `ifelse` but it also did not work. > {code:r} > 'q' %>% open_dataset %>% mutate(x= ifelse(x=='',NA,x)) %>% > mutate(x=as.integer(x)) %>% collect > Error in `collect()`: > ! NotImplemented: Function 'if_else' has no kernel matching input types > (bool, bool, string) > Run `rlang::last_error()` to see where the error occurred. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18241) [R] as.integer can't handdle empty character cels (ex c(''))
[ https://issues.apache.org/jira/browse/ARROW-18241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628905#comment-17628905 ] Neal Richardson commented on ARROW-18241: - > "I agree this would be a nice option to have." > sure. But should be the > default behavior, as that is what happens in base R, no? Sorry, that was ambiguous. We would need the C++ cast function to support an option to return nulls for values that can't be converted, rather than just error. If/when that option exists, then yes, we would make that default in R. I'll rename this ticket to be about that feature. > [R] as.integer can't handdle empty character cels (ex c('')) > > > Key: ARROW-18241 > URL: https://issues.apache.org/jira/browse/ARROW-18241 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lucas Mation >Priority: Major > > I am importing a dataset with arrow, and then converting variable types. But > I got an error message because the `arrow` implementation of `as.integer` > can't handle empty strings (which is legal in base R). Is this a bug? > {code:r} > #In R > '' %>% as.integer() > [1] NA > > #in arrow > q <- data.table(x=c('','1','2')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x=as.integer(x)) %>% collect > Error in `collect()`: > ! Invalid: Failed to parse string: '' as a scalar of type int32 > Run `rlang::last_error()` to see where the error occurred. > {code} > Update: tryed to preprocess x with `ifelse` but it also did not work. > {code:r} > 'q' %>% open_dataset %>% mutate(x= ifelse(x=='',NA,x)) %>% > mutate(x=as.integer(x)) %>% collect > Error in `collect()`: > ! NotImplemented: Function 'if_else' has no kernel matching input types > (bool, bool, string) > Run `rlang::last_error()` to see where the error occurred. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18241) [R] as.integer can't handdle empty character cels (ex c(''))
[ https://issues.apache.org/jira/browse/ARROW-18241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628902#comment-17628902 ] Lucas Mation commented on ARROW-18241: -- [~npr], thanks. 1) "I agree this would be a nice option to have." > Sure, but it should be the default behavior, as that is what happens in base R, no? 2) "it should work as you typed it on the development version" > Tested; works. Thanks. 3) "On the released version, you can make it work by explicitly making the NA be a string so the types match" > Tested; works. > [R] as.integer can't handdle empty character cels (ex c('')) > > > Key: ARROW-18241 > URL: https://issues.apache.org/jira/browse/ARROW-18241 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lucas Mation >Priority: Major > > I am importing a dataset with arrow, and then converting variable types. But > I got an error message because the `arrow` implementation of `as.integer` > can't handle empty strings (which is legal in base R). Is this a bug? > {code:r} > #In R > '' %>% as.integer() > [1] NA > > #in arrow > q <- data.table(x=c('','1','2')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x=as.integer(x)) %>% collect > Error in `collect()`: > ! Invalid: Failed to parse string: '' as a scalar of type int32 > Run `rlang::last_error()` to see where the error occurred. > {code} > Update: tryed to preprocess x with `ifelse` but it also did not work. > {code:r} > 'q' %>% open_dataset %>% mutate(x= ifelse(x=='',NA,x)) %>% > mutate(x=as.integer(x)) %>% collect > Error in `collect()`: > ! NotImplemented: Function 'if_else' has no kernel matching input types > (bool, bool, string) > Run `rlang::last_error()` to see where the error occurred. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18241) [R] as.integer can't handdle empty character cels (ex c(''))
[ https://issues.apache.org/jira/browse/ARROW-18241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lucas Mation updated ARROW-18241: - Description: I am importing a dataset with arrow, and then converting variable types. But I got an error message because the `arrow` implementation of `as.integer` can't handle empty strings (which is legal in base R). Is this a bug? {code:r} #In R '' %>% as.integer() [1] NA #in arrow q <- data.table(x=c('','1','2')) q %>% write_dataset('q') q2 <- 'q' %>% open_dataset %>% mutate(x=as.integer(x)) %>% collect Error in `collect()`: ! Invalid: Failed to parse string: '' as a scalar of type int32 Run `rlang::last_error()` to see where the error occurred. {code} Update: tryed to preprocess x with `ifelse` but it also did not work. {code:r} 'q' %>% open_dataset %>% mutate(x= ifelse(x=='',NA,x)) %>% mutate(x=as.integer(x)) %>% collect Error in `collect()`: ! NotImplemented: Function 'if_else' has no kernel matching input types (bool, bool, string) Run `rlang::last_error()` to see where the error occurred. {code} was: I am importing a dataset with arrow, and then converting variable types. But I got an error message because the `arrow` implementation of `as.integer` can't handle empty strings (which is legal in base R). Is this a bug? {code:r} #In R '' %>% as.integer() [1] NA #in arrow q <- data.table(x=c('','1','2')) q %>% write_dataset('q') q2 <- 'q' %>% open_dataset %>% mutate(x=as.integer(x)) %>% collect Error in `collect()`: ! Invalid: Failed to parse string: '' as a scalar of type int32 Run `rlang::last_error()` to see where the error occurred. {code} Update: tryed to preprocess x with `ifelse` but it also did not work. {code:r} paste0(p2,'/q') %>% open_dataset %>% mutate(x= ifelse(x=='',NA,x)) %>% mutate(x=as.integer(x)) %>% collect Error in `collect()`: ! NotImplemented: Function 'if_else' has no kernel matching input types (bool, bool, string) Run `rlang::last_error()` to see where the error occurred. 
{code} > [R] as.integer can't handdle empty character cels (ex c('')) > > > Key: ARROW-18241 > URL: https://issues.apache.org/jira/browse/ARROW-18241 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lucas Mation >Priority: Major > > I am importing a dataset with arrow, and then converting variable types. But > I got an error message because the `arrow` implementation of `as.integer` > can't handle empty strings (which is legal in base R). Is this a bug? > {code:r} > #In R > '' %>% as.integer() > [1] NA > > #in arrow > q <- data.table(x=c('','1','2')) > q %>% write_dataset('q') > q2 <- 'q' %>% open_dataset %>% mutate(x=as.integer(x)) %>% collect > Error in `collect()`: > ! Invalid: Failed to parse string: '' as a scalar of type int32 > Run `rlang::last_error()` to see where the error occurred. > {code} > Update: tryed to preprocess x with `ifelse` but it also did not work. > {code:r} > 'q' %>% open_dataset %>% mutate(x= ifelse(x=='',NA,x)) %>% > mutate(x=as.integer(x)) %>% collect > Error in `collect()`: > ! NotImplemented: Function 'if_else' has no kernel matching input types > (bool, bool, string) > Run `rlang::last_error()` to see where the error occurred. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
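The cast option requested in this ticket — return null instead of raising for values that cannot be converted — is a "safe cast" pattern. A minimal Python sketch of the semantics (the helper name is illustrative, not an Arrow API):

```python
def to_int_or_none(s):
    """Cast a string to int, mapping unparseable values to None (NA-like)."""
    try:
        return int(s)
    except ValueError:
        return None

print([to_int_or_none(v) for v in ["", "1", "2"]])  # [None, 1, 2]
```

This mirrors base R's `as.integer('')` returning NA with a warning, versus the reported Arrow behaviour of failing the whole query on the first unparseable value.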
[jira] [Updated] (ARROW-18249) Update vcpkg port to arrow 10.0.0
[ https://issues.apache.org/jira/browse/ARROW-18249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernhard Manfred Gruber updated ARROW-18249: Summary: Update vcpkg port to arrow 10.0.0 (was: Update vcpkg port to arrow 10.0) > Update vcpkg port to arrow 10.0.0 > - > > Key: ARROW-18249 > URL: https://issues.apache.org/jira/browse/ARROW-18249 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 10.0.0 >Reporter: Bernhard Manfred Gruber >Priority: Minor > > Please update the [vcpkg|https://github.com/microsoft/vcpkg] port of arrow to > the newly released version 10.0.0. The current version on vcpkg is 9.0.0. > I found this documentation on how to do it: > https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Updatingthevcpkgport > I need this in order to update the xsimd port on vcpkg, which another > downstream project of mine depends on. Thank you! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18249) Update vcpkg port to arrow 10.0
Bernhard Manfred Gruber created ARROW-18249: --- Summary: Update vcpkg port to arrow 10.0 Key: ARROW-18249 URL: https://issues.apache.org/jira/browse/ARROW-18249 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 10.0.0 Reporter: Bernhard Manfred Gruber Please update the [vcpkg|https://github.com/microsoft/vcpkg] port of arrow to the newly released version 10.0.0. The current version on vcpkg is 9.0.0. I found this documentation on how to do it: https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Updatingthevcpkgport I need this in order to update the xsimd port on vcpkg, which another downstream project of mine depends on. Thank you! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-12739) [C++] Function to combine Arrays row-wise into ListArray
[ https://issues.apache.org/jira/browse/ARROW-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jin Shang reassigned ARROW-12739: - Assignee: Jin Shang > [C++] Function to combine Arrays row-wise into ListArray > > > Key: ARROW-12739 > URL: https://issues.apache.org/jira/browse/ARROW-12739 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Jin Shang >Priority: Major > Labels: kernel, query-engine > > Add a variadic function that would take 2+ Arrays and combine/transpose them > row-wise into a ListArray. For example: > Input: > {code:java} > Array<String>: ["foo", "push"] > Array<String>: ["bar", "pop"] > {code} > Output: > {code:java} > ListArray<Array<String>> > [ > ["foo","bar"], > ["push","pop"] > ] > {code} > This is similar to the StructArray constructor which takes a list of Arrays > and names (but in this case it would only need to take a list of Arrays). -- This message was sent by Atlassian Jira (v8.20.10#820010)
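The row-wise combine/transpose proposed in ARROW-12739 can be sketched in plain Python (an illustration of the requested semantics, not the Arrow kernel; the function name is hypothetical):

```python
def combine_rowwise(*arrays):
    """Transpose N equal-length arrays into one list-of-lists,
    analogous to combining Arrays row-wise into a ListArray."""
    if not arrays:
        raise ValueError("at least one input array is required")
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all input arrays must have the same length")
    # Row i of the output collects element i from every input array.
    return [list(row) for row in zip(*arrays)]


print(combine_rowwise(["foo", "push"], ["bar", "pop"]))
# [['foo', 'bar'], ['push', 'pop']]
```

As the issue notes, this mirrors the StructArray constructor's shape (a list of child arrays), except the output groups values into per-row lists rather than named fields.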
[jira] [Updated] (ARROW-18248) [CI][Release] Use GitHub token to avoid API rate limit
[ https://issues.apache.org/jira/browse/ARROW-18248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18248: --- Labels: pull-request-available (was: ) > [CI][Release] Use GitHub token to avoid API rate limit > -- > > Key: ARROW-18248 > URL: https://issues.apache.org/jira/browse/ARROW-18248 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > e.g.: > https://github.com/apache/arrow/actions/runs/3387695588/jobs/5628769268#step:7:25 > {noformat} > Error: test_vote(SourceTest): OpenURI::HTTPError: 403 rate limit exceeded > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18248) [CI][Release] Use GitHub token to avoid API rate limit
Kouhei Sutou created ARROW-18248: Summary: [CI][Release] Use GitHub token to avoid API rate limit Key: ARROW-18248 URL: https://issues.apache.org/jira/browse/ARROW-18248 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Kouhei Sutou Assignee: Kouhei Sutou e.g.: https://github.com/apache/arrow/actions/runs/3387695588/jobs/5628769268#step:7:25 {noformat} Error: test_vote(SourceTest): OpenURI::HTTPError: 403 rate limit exceeded {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18185) [C++][Compute] Support KEEP_NULL option for compute::Filter
[ https://issues.apache.org/jira/browse/ARROW-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628751#comment-17628751 ] Joris Van den Bossche commented on ARROW-18185: --- > What about implementing this as a specialized optimization for the if_else AAS > case, That sounds good to me > [C++][Compute] Support KEEP_NULL option for compute::Filter > --- > > Key: ARROW-18185 > URL: https://issues.apache.org/jira/browse/ARROW-18185 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Jin Shang >Assignee: Jin Shang >Priority: Minor > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > The current Filter implementation always drops the filtered values. In some > use cases, it's desirable for the output array to have the same size as the > input array. So I added a new option FilterOptions::KEEP_NULL where the > filtered-out values are kept as nulls. > For example, with input [1, 2, 3] and filter [true, false, true], the current > implementation will output [1, 3], and with the new option it will output [1, > null, 3]. > This option is simpler to implement since we only need to construct a new > validity bitmap and reuse the input buffers and child arrays, except for > dense union arrays, which don't have validity bitmaps. > According to the benchmark results, filtering with FilterOptions::KEEP_NULL is > also faster in most cases, so users can choose this option for better > performance when dropping filtered values is not required. -- This message was sent by Atlassian Jira (v8.20.10#820010)
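The two filter behaviors discussed in ARROW-18185 can be sketched in plain Python (a model of the semantics only, not the Arrow C++ kernel; the option names mirror the issue's FilterOptions):

```python
def filter_array(values, mask, null_selection="DROP"):
    """Model compute::Filter: DROP removes filtered-out values,
    KEEP_NULL replaces them with None so output length == input length."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    if null_selection == "DROP":
        return [v for v, keep in zip(values, mask) if keep]
    if null_selection == "KEEP_NULL":
        return [v if keep else None for v, keep in zip(values, mask)]
    raise ValueError(f"unknown option: {null_selection}")


print(filter_array([1, 2, 3], [True, False, True]))               # [1, 3]
print(filter_array([1, 2, 3], [True, False, True], "KEEP_NULL"))  # [1, None, 3]
```

KEEP_NULL preserves row alignment with other columns, which is why only a new validity bitmap is needed while the value buffers can be reused unchanged.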
[jira] [Updated] (ARROW-12739) [C++] Function to combine Arrays row-wise into ListArray
[ https://issues.apache.org/jira/browse/ARROW-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-12739: -- Labels: kernel query-engine (was: ) > [C++] Function to combine Arrays row-wise into ListArray > > > Key: ARROW-12739 > URL: https://issues.apache.org/jira/browse/ARROW-12739 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Priority: Major > Labels: kernel, query-engine > > Add a variadic function that would take 2+ Arrays and combine/transpose them > row-wise into a ListArray. For example: > Input: > {code:java} > Array<String>: ["foo", "push"] > Array<String>: ["bar", "pop"] > {code} > Output: > {code:java} > ListArray<Array<String>> > [ > ["foo","bar"], > ["push","pop"] > ] > {code} > This is similar to the StructArray constructor which takes a list of Arrays > and names (but in this case it would only need to take a list of Arrays). -- This message was sent by Atlassian Jira (v8.20.10#820010)