[jira] [Created] (ARROW-9076) [Rust] Async CSV reader
Sergey Todyshev created ARROW-9076: -- Summary: [Rust] Async CSV reader Key: ARROW-9076 URL: https://issues.apache.org/jira/browse/ARROW-9076 Project: Apache Arrow Issue Type: New Feature Reporter: Sergey Todyshev The rust-csv crate recently added an async implementation of its CSV reader. It would be nice to have one in the arrow crate as well. It is extremely useful in an application that needs to parse large CSV files in WebAssembly. It would be nice to have an async JSON reader as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9075) [C++] Optimize Filter implementation
Wes McKinney created ARROW-9075: --- Summary: [C++] Optimize Filter implementation Key: ARROW-9075 URL: https://issues.apache.org/jira/browse/ARROW-9075 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 I split this off from ARROW-5760
float 16
Hi, There seem to be two competing standards for 16-bit floats: - bfloat16: https://en.wikipedia.org/wiki/Bfloat16_floating-point_format - IEEE: https://en.wikipedia.org/wiki/IEEE_754-2008_revision Was there any thought on how this could be handled? Would it make sense to add some kind of DataType attribute to the HALF_FLOAT? Cheers, Pierre
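For reference, the two formats split their 16 bits differently: IEEE 754 binary16 uses 1 sign, 5 exponent and 10 fraction bits, while bfloat16 uses 1 sign, 8 exponent and 7 fraction bits, i.e. the upper half of a float32. A minimal sketch using only the Python standard library; the helper names are hypothetical, and the bfloat16 conversion here truncates rather than rounds, which is an assumption made to keep the example short:

```python
import struct

def to_ieee_half_bits(x: float) -> int:
    """Bit pattern of x encoded as IEEE 754 binary16 (struct format 'e')."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def to_bfloat16_bits(x: float) -> int:
    """Bit pattern of x as bfloat16: the upper 16 bits of its float32 encoding."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def from_bfloat16_bits(bits: int) -> float:
    """Decode a bfloat16 bit pattern by padding it back out to a float32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

# The same value yields different 16-bit patterns in the two formats.
for value in (1.0, 3.140625, 65504.0):
    print(value, hex(to_ieee_half_bits(value)), hex(to_bfloat16_bits(value)))
```

Because the same 16 bits decode to different numbers depending on the format, a reader has to know which one was used; that is the ambiguity a type-level attribute would resolve.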
[jira] [Created] (ARROW-9074) [GLib] Add missing arrow-json check
Kouhei Sutou created ARROW-9074: --- Summary: [GLib] Add missing arrow-json check Key: ARROW-9074 URL: https://issues.apache.org/jira/browse/ARROW-9074 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou
[jira] [Created] (ARROW-9073) [C++] RapidJSON include directory detection doesn't work with RapidJSONConfig.cmake
Kouhei Sutou created ARROW-9073: --- Summary: [C++] RapidJSON include directory detection doesn't work with RapidJSONConfig.cmake Key: ARROW-9073 URL: https://issues.apache.org/jira/browse/ARROW-9073 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Kouhei Sutou Assignee: Kouhei Sutou
[jira] [Created] (ARROW-9072) [C++][Gandiva][MinGW] Enable crashed tests
Kouhei Sutou created ARROW-9072: --- Summary: [C++][Gandiva][MinGW] Enable crashed tests Key: ARROW-9072 URL: https://issues.apache.org/jira/browse/ARROW-9072 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva Reporter: Kouhei Sutou Some Gandiva tests crash with MinGW. They are disabled in {{ci/scripts/cpp_test.sh}}. We should fix the causes of the crashes and enable these tests.
Re: [DISCUSS] Move JIRA notifications to separate mailing list?
I'm openly not very sympathetic toward people who don't take time to set up e-mail filters but I support having two e-mail lists: * One having new issues only. I think that active developers need to see new issues to create awareness of what others are doing in the project, so I think we should really encourage people to subscribe to this list (and set up an e-mail filter if they don't want the e-mails coming into their inbox). While I think having less "noise" on dev@ is a good thing (even though it's only "noise" if you don't set up e-mail filters) I'm concerned that this action will decrease developer engagement in the project. There are of course other ways [1] to subscribe to the JIRA activity feed if getting notifications in Slack or Zulip is your thing. * One having all JIRA traffic (i.e. what is currently at https://lists.apache.org/list.html?iss...@arrow.apache.org) [1]: https://github.com/ursa-labs/jira-zulip-bridge
Re: [DISCUSS] Move JIRA notifications to separate mailing list?
I would welcome a separate list, but only with notifications of new JIRA issues. I am not interested in generic JIRA traffic. Regards Antoine.
Re: [DISCUSS] Move JIRA notifications to separate mailing list?
Hi Neal, On 08/06/2020 19:43, Neal Richardson wrote: I've noticed that some other Apache projects have a separate mailing list for JIRA notifications (Spark, for example, has iss...@spark.apache.org). The result is that the dev@ mailing list is focused on actual discussions threads (like this!), votes, and other official business. Would we be interested in doing the same? I have been lazy and not set up any anti-JIRA filters in the few weeks that I have been a member of this mailing list. Deleting JIRA notifications has fast become the most popular activity that my email client sees :-). So from the perspective of a new member of the community, I can see how some might find this a turn-off, and maybe even be dissuaded from participation - obviously not something anyone here would want. I'd certainly support a dedicated list for JIRA notifications. -- Adam Szmigin
Re: [DISCUSS] Move JIRA notifications to separate mailing list?
And if you're like me, and this message got filtered out of your inbox because it is from dev@ and contains "JIRA" in the subject, well, maybe that demonstrates the problem ;)
[DISCUSS] Move JIRA notifications to separate mailing list?
Hi all, I've noticed that some other Apache projects have a separate mailing list for JIRA notifications (Spark, for example, has iss...@spark.apache.org). The result is that the dev@ mailing list is focused on actual discussion threads (like this!), votes, and other official business. Would we be interested in doing the same? In my opinion, the status quo is not great. The dev@ archives ( https://lists.apache.org/list.html?dev@arrow.apache.org) aren't that readable/browseable to me, and if I want to see what's going on in JIRA, I go to JIRA. In fact, the first thing I/we recommend to people signing up for the mailing list is to set up email filters to exclude the JIRA noise. Having a separate mailing list will make it easier for people to manage their own information streams. The counterargument is that moving JIRA traffic to a separate mailing list, requiring an additional subscribe action, might mean that developers miss out on things like new issues being created. I'm not personally worried about this because I suspect that many of us already aren't using the mailing list to stay on top of JIRA issues, and that those who want the JIRA stream in their email can easily opt in (subscribe). But I'm interested in the community's opinions on this. Thoughts? Neal
[jira] [Created] (ARROW-9071) [C++] MakeArrayOfNull makes invalid ListArray
Zhuo Peng created ARROW-9071: Summary: [C++] MakeArrayOfNull makes invalid ListArray Key: ARROW-9071 URL: https://issues.apache.org/jira/browse/ARROW-9071 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Zhuo Peng

One way to reproduce this bug is:

>>> import pyarrow as pa
>>> a = pa.array([[1, 2]])
>>> b = pa.array([None, None], type=pa.null())
>>> t1 = pa.Table.from_arrays([a], ["a"])
>>> t2 = pa.Table.from_arrays([b], ["b"])
>>> pa.concat_tables([t1, t2], promote=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 2138, in pyarrow.lib.concat_tables
  File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 0: In chunk 1: Invalid: List child array invalid: Invalid: Buffer #1 too small in array of type int64 and length 2: expected at least 16 byte(s), got 12

This happens because concat_tables(promote=True) calls MakeArrayOfNull: [https://github.com/apache/arrow/blob/ec3bae18157723411bb772fca628cbd02eea5c25/cpp/src/arrow/table.cc#L647]

The code here seems incorrect: [https://github.com/apache/arrow/blob/ec3bae18157723411bb772fca628cbd02eea5c25/cpp/src/arrow/array/util.cc#L218]

The length of the child array of a ListArray is not necessarily equal to the length of the ListArray.
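The independence of the two lengths follows from the list layout: a list array of length n holds n + 1 offsets into a single child array, and only the last offset constrains how long the child must be. The following is a pure-Python sketch of that layout, not the Arrow C++ code; the class and function names are hypothetical. It shows that an all-null list array of length n can be paired with a zero-length child:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SketchListArray:
    """Toy model of a variable-size list array: validity, offsets, child values."""
    validity: List[bool]
    offsets: List[int]
    values: List[int]

    def __len__(self) -> int:
        return len(self.validity)

    def validate(self) -> None:
        # n lists need n + 1 non-decreasing offsets, and the last offset
        # must not point past the end of the child array.
        assert len(self.offsets) == len(self) + 1
        assert all(a <= b for a, b in zip(self.offsets, self.offsets[1:]))
        assert self.offsets[-1] <= len(self.values)

    def to_pylist(self) -> List[Optional[List[int]]]:
        return [
            self.values[self.offsets[i]:self.offsets[i + 1]] if valid else None
            for i, valid in enumerate(self.validity)
        ]

def make_list_array_of_null(length: int) -> SketchListArray:
    # Every slot is null, so all offsets are 0 and the child array stays empty.
    return SketchListArray([False] * length, [0] * (length + 1), [])

one_list = SketchListArray([True], [0, 2], [1, 2])  # [[1, 2]]: length 1, child length 2
nulls = make_list_array_of_null(2)                  # [None, None]: length 2, child length 0
for arr in (one_list, nulls):
    arr.validate()
    print(len(arr), len(arr.values), arr.to_pylist())
```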
[jira] [Created] (ARROW-9070) [C++] StructScalar needs field accessor methods
Neal Richardson created ARROW-9070: -- Summary: [C++] StructScalar needs field accessor methods Key: ARROW-9070 URL: https://issues.apache.org/jira/browse/ARROW-9070 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Neal Richardson Fix For: 1.0.0 The minmax compute function returns a struct with fields "min" and "max". So to write an R binding for the {{min()}} method on arrow objects, I call "minmax" and then take the "min" field from the result. However, at least from my reading of scalar.h compared with array_nested.h, there are no field/GetFieldByName/etc. methods for StructScalar, so I can't get it.
[jira] [Created] (ARROW-9069) [C++] MakeArrayFromScalar can't handle struct
Neal Richardson created ARROW-9069: -- Summary: [C++] MakeArrayFromScalar can't handle struct Key: ARROW-9069 URL: https://issues.apache.org/jira/browse/ARROW-9069 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Neal Richardson Fix For: 1.0.0 The R bindings translate data to/from Scalars by using the Array methods already implemented: to go from R object to a Scalar, it creates a length-1 Array and then slices out the 0th element with GetScalar(); to go from Scalar to R object, it calls MakeArrayFromScalar and then the as.vector method on that Array (in R, there is no scalar type anyway, only length-1 vectors). This generally works fine but if I get a Struct scalar (as the minmax compute function returns), I can't do anything with it because MakeArrayFromScalar doesn't work with structs.
[jira] [Created] (ARROW-9068) [C++][Dataset] Simplify Partitioning interface
Francois Saint-Jacques created ARROW-9068: - Summary: [C++][Dataset] Simplify Partitioning interface Key: ARROW-9068 URL: https://issues.apache.org/jira/browse/ARROW-9068 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Francois Saint-Jacques The `int segment` of `Partitioning::Parse` should not be exposed to the user. KeyValuePartitioning should be a private Impl interface, not in public headers. The same applies to `Partitioning::Format`.
[jira] [Created] (ARROW-9067) [C++] Create reusable branchless / vectorized index boundschecking functions
Wes McKinney created ARROW-9067: --- Summary: [C++] Create reusable branchless / vectorized index boundschecking functions Key: ARROW-9067 URL: https://issues.apache.org/jira/browse/ARROW-9067 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 It is possible to do branch-free index boundschecking in batches for better performance. I am implementing this as part of the Take/Filter optimization (so please wait until I have PRs up for this work), but these functions can be moved somewhere more general purpose and used in places where we are currently boundschecking inside inner loops.
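As an illustration of the general idea only (not the implementation referred to in the issue, which had not been posted when this was written), a batched check can fold every index in a batch into a single flag and test that flag once per batch, instead of branching on each element. A sketch in Python, where the function name and batch size are assumptions; Python offers no control over branch elimination, so this shows the logic rather than the performance benefit:

```python
from typing import Sequence

def batch_indices_in_bounds(indices: Sequence[int], length: int, batch_size: int = 8) -> bool:
    """Return True if every index i satisfies 0 <= i < length.

    Within a batch the out-of-bounds flags are OR-ed together without an
    early exit; the result is branched on only once per batch.
    """
    for start in range(0, len(indices), batch_size):
        out_of_bounds = 0
        for i in indices[start:start + batch_size]:
            # Both comparisons yield booleans; `|` combines them as integers.
            out_of_bounds |= (i < 0) | (i >= length)
        if out_of_bounds:
            return False
    return True

print(batch_indices_in_bounds([0, 3, 2, 1], length=4))
print(batch_indices_in_bounds([0, 4, 2, 1], length=4))
print(batch_indices_in_bounds([0, -1, 2, 1], length=4))
```

In a compiled language the inner loop has no data-dependent branch, which is what makes it amenable to vectorization.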
[jira] [Created] (ARROW-9066) [Python] Raise correct error in isnull()
Uwe Korn created ARROW-9066: --- Summary: [Python] Raise correct error in isnull() Key: ARROW-9066 URL: https://issues.apache.org/jira/browse/ARROW-9066 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.17.1 Reporter: Uwe Korn Assignee: Uwe Korn
[NIGHTLY] Arrow Build Report for Job nightly-2020-06-08-0
Arrow Build Report for Job nightly-2020-06-08-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0 Failed Tasks: - centos-7-aarch64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-travis-centos-7-aarch64 - centos-7-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-github-centos-7-amd64 - centos-8-aarch64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-travis-centos-8-aarch64 - centos-8-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-github-centos-8-amd64 - homebrew-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-travis-homebrew-cpp - homebrew-r-autobrew: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-travis-homebrew-r-autobrew - test-conda-cpp-valgrind: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-github-test-conda-cpp-valgrind - test-conda-python-3.7-dask-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-github-test-conda-python-3.7-dask-latest - test-conda-python-3.7-spark-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-github-test-conda-python-3.7-spark-master - test-conda-python-3.7-turbodbc-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-github-test-conda-python-3.7-turbodbc-latest - test-conda-python-3.7-turbodbc-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-github-test-conda-python-3.7-turbodbc-master - test-conda-python-3.8-dask-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-github-test-conda-python-3.8-dask-master - test-conda-python-3.8-jpype: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-github-test-conda-python-3.8-jpype - wheel-manylinux1-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-wheel-manylinux1-cp35m - wheel-manylinux1-cp36m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-wheel-manylinux1-cp36m - wheel-manylinux1-cp37m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-wheel-manylinux1-cp37m - wheel-manylinux1-cp38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-wheel-manylinux1-cp38 - wheel-manylinux2010-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-wheel-manylinux2010-cp35m - wheel-manylinux2010-cp37m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-wheel-manylinux2010-cp37m - wheel-manylinux2010-cp38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-wheel-manylinux2010-cp38 - wheel-manylinux2014-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-wheel-manylinux2014-cp35m - wheel-manylinux2014-cp36m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-wheel-manylinux2014-cp36m - wheel-manylinux2014-cp38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-wheel-manylinux2014-cp38 Succeeded Tasks: - centos-6-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-github-centos-6-amd64 - conda-clean: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-conda-clean - conda-linux-gcc-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-conda-linux-gcc-py36 - conda-linux-gcc-py37: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-conda-linux-gcc-py37 - conda-linux-gcc-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-conda-linux-gcc-py38 - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-conda-osx-clang-py37 - conda-osx-clang-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-conda-osx-clang-py38 - conda-win-vs2015-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-conda-win-vs2015-py36 - conda-win-vs2015-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-08-0-azure-conda-win-vs2015-py37 - conda-win-vs2015-py38: URL:
[jira] [Created] (ARROW-9065) Support parsing date32 in dataset partition folders
Dave Hirschfeld created ARROW-9065: -- Summary: Support parsing date32 in dataset partition folders Key: ARROW-9065 URL: https://issues.apache.org/jira/browse/ARROW-9065 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Dave Hirschfeld

I have some data which is partitioned by year/month/date. It would be useful if the date could be automatically parsed:

```python
In [17]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.date32())])

In [18]: partition = DirectoryPartitioning(schema)

In [19]: partition.parse("/2020/06/2020-06-08")
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
 in
----> 1 partition.parse("/2020/06/2020-06-08")

~\envs\dev\lib\site-packages\pyarrow\_dataset.pyx in pyarrow._dataset.Partitioning.parse()

~\envs\dev\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~\envs\dev\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: parsing scalars of type date32[day]
```

Not a big issue since you can just use string and convert, but nevertheless it would be nice if it Just Worked.

```python
In [22]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.string())])

In [23]: partition = DirectoryPartitioning(schema)

In [24]: partition.parse("/2020/06/2020-06-08")
Out[24]:
```
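The "use string and convert" workaround can be sketched with the standard library alone: split the path into its segments, then convert the last one to a date and, if needed, to the day count that a date32 value stores (days since 1970-01-01). The function name and the fixed year/month/day layout are assumptions for illustration; this does not use the pyarrow API:

```python
import datetime
from typing import Dict

EPOCH = datetime.date(1970, 1, 1)

def parse_partition_path(path: str) -> Dict[str, object]:
    """Parse '/<year>/<month>/<YYYY-MM-DD>' into typed partition values."""
    year, month, day = path.strip("/").split("/")
    day_value = datetime.date.fromisoformat(day)
    return {
        "year": int(year),
        "month": int(month),
        "day": day_value,
        # date32 stores the number of days since the UNIX epoch.
        "day_as_date32": (day_value - EPOCH).days,
    }

print(parse_partition_path("/2020/06/2020-06-08"))
```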
[jira] [Created] (ARROW-9064) optimization debian package manager tweaks
Pratik Raj created ARROW-9064: - Summary: optimization debian package manager tweaks Key: ARROW-9064 URL: https://issues.apache.org/jira/browse/ARROW-9064 Project: Apache Arrow Issue Type: Improvement Reporter: Pratik Raj By default, "apt" and "apt-get" on Ubuntu and Debian based systems install recommended, but not suggested, packages. Passing the "--no-install-recommends" option tells apt-get not to treat recommended packages as dependencies to install. This results in smaller downloads and smaller installations. See the Ubuntu Blog post at https://ubuntu.com/blog/we-reduced-our-docker-images-by-60-with-no-install-recommends
[jira] [Created] (ARROW-9063) [Python][C++] Order of files are not respected using the new pyarrow.dataset
William Liu created ARROW-9063: -- Summary: [Python][C++] Order of files are not respected using the new pyarrow.dataset Key: ARROW-9063 URL: https://issues.apache.org/jira/browse/ARROW-9063 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 0.17.1 Environment: ubuntu-18.04 Reporter: William Liu

Say we have multiple parquet files under the same folder (a.parquet, b.parquet, c.parquet). If I pass a list of file paths into either of the two statements below

{code:java}
ds = pq.ParquetDataset(fps, use_legacy_dataset=False)
ds = pyarrow.dataset.dataset(fps)
{code}

then the rows of the resulting table do not follow the order of the files; they look like: ......aaa......aaa...ccc..bbb...
[jira] [Created] (ARROW-9062) [Rust] Support to read JSON into dictionary type
Sven Wagner-Boysen created ARROW-9062: - Summary: [Rust] Support to read JSON into dictionary type Key: ARROW-9062 URL: https://issues.apache.org/jira/browse/ARROW-9062 Project: Apache Arrow Issue Type: Sub-task Reporter: Sven Wagner-Boysen

Currently, a JSON reader built from a schema that uses the dictionary type for one of its fields fails with JsonError("struct types are not yet supported"):

{code:java}
let builder = ReaderBuilder::new().with_schema(..);
let mut reader: Reader<File> = builder.build::<File>(File::open(path).unwrap()).unwrap();
let rb = reader.next().unwrap();
{code}

Suggested solution: support reading into a dictionary in the JSON reader: [https://github.com/apache/arrow/blob/master/rust/arrow/src/json/reader.rs#L368]