Re: [C++] Enhancements to random Array/ChunkedArray/Table generator as a separate PR?

2021-01-31 Thread Micah Kornfield
I think it is OK to have a separate PR for random nested data generation.
We wanted to do this for parquet as well, but didn't get to it.  Instead we
constructed a very detailed set of nesting level tests.

On Sun, Jan 31, 2021 at 9:12 PM Ying Zhou  wrote:

> Hi,
>
> As part of reducing test size in this pull request
> (https://github.com/apache/arrow/pull/8648), which contains the ORC writer
> for C++ and Python, I wrote a random chunked array generator and a random
> table generator. To reduce test size to ideal levels it will be necessary
> to improve arrow::random::RandomArrayGenerator::ArrayOf to support nested
> types. I don’t think such work belongs in the ORC writer PR. Shall I
> first try to get this PR to pass and then file a separate one with
> improvements in arrow/testing/random, or shall I file them together as
> one PR? Thanks!
>
> Ying


Re: [C++] Shall we modify the ORC reader?

2021-01-31 Thread Micah Kornfield
It probably makes sense to make this option configurable.  I think it is OK
to change the default to use Maps.  My guess is the initial ORC
implementation predated having a Map type in the specification.

On Thu, Jan 28, 2021 at 9:28 AM Ying Zhou  wrote:

> Hi,
>
> Thanks a lot, Deepak!
>
> I really want to edit the ORC reader to read ORC MAPs as Arrow MAPs now
> and it’s not a serious hassle to do so. Is there anyone who needs the
> read-ORC-maps-as-lists-of-structs functionality? If not, I will likely do
> it in my current PR.
>
> Ying
>
> > On Jan 19, 2021, at 8:45 PM, Deepak Majeti wrote:
> >
> > Hi Ying,
> >
> > I can help review/merge any ORC C++ contributions.
> >
> >
> > On Thu, Jan 14, 2021 at 6:57 PM Ying Zhou  wrote:
> >
> >> Well, I haven’t found any. Thankfully ORC does work and I can figure out
> >> how it works by testing using simple examples. However I have never
> >> managed to contact the ORC community at all. They have never responded
> >> to any of my emails to d...@orc.apache.org. I do want to add Snappy
> >> write support (which was actually already done 2 years ago by someone
> >> else but due to lack of unit testing it was never merged into master. I
> >> can write the tests.) and maybe Decimal256 to ORC C++ if they are
> >> willing to review and merge them. If anyone has successfully contacted
> >> the ORC community please let me know how.
> >>
> >> Best,
> >> Ying
> >>
> >>> On Jan 14, 2021, at 8:39 AM, Antoine Pitrou wrote:
> >>>
> >>>
> >>> Hi Ying,
> >>>
> >>> Is there a semantic description of the ORC data types somewhere?
> >>> I've read through https://orc.apache.org/docs/types.html and
> >>> https://orc.apache.org/specification/ORCv1/ but those docs don't seem
> >>> to explain the intent and constraints of each of the data types.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>>
> >>>
> >>> On Mon, 11 Jan 2021 21:15:05 -0500
> >>> Ying Zhou  wrote:
> >>>> Thanks! What about 3?
> >>>> Shall we convert ORC maps to Arrow maps as opposed to lists of structs
> >>>> with fields of the structs named ‘key’ and ‘value’?
> >>>>
> >>>>> On Jan 10, 2021, at 6:45 PM, Jacques Nadeau wrote:
> >>>>>
> >>>>> I don't think 1 & 2 make sense. I don't think there are a lot of users
> >>>>> reading 2gb strings or lists with 2B objects in them. Saying we just
> >>>>> don't support that pattern seems fine for now. I also believe the
> >>>>> string and list types have better cross-language support than the
> >>>>> large variants.
> >>>>>
> >>>>> On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> While finishing the ORC writer in C++ I found that the ORC reader
> >>>>>> treats certain types in rather awkward ways. Hence I filed this Jira
> >>>>>> ticket: https://issues.apache.org/jira/browse/ARROW-7
> >>>>>>
> >>>>>> After starting to work on ORC tickets mostly filed by myself I began
> >>>>>> to worry that the type mappings in the ORC reader might already be
> >>>>>> used by users of Arrow. I wonder whether we should grandfather the
> >>>>>> issues or gradually switch to a new type mapping.
> >>>>>>
> >>>>>> Here are my proposed changes:
> >>>>>> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING
> >>>>>> type instead of STRING type since it is large.
> >>>>>> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST
> >>>>>> type instead of LIST type since it is large.
> >>>>>> 3. The ORC MAP type should be converted to the Arrow MAP type
> >>>>>> instead of list of structs with hardcoded field names as long as
> >>>>>> the offsets fit into int32. Otherwise we shouldn't return OK.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ying
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >
> > --
> > regards,
> > Deepak Majeti
>
>
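
Proposal 3 in the thread above hinges on ORC map offsets fitting into
int32 before they can back an Arrow MAP. A minimal sketch of that check
follows. It does not use the Arrow or ORC APIs; the NarrowOffsets helper and
its signature are hypothetical, and a real reader would presumably report the
failure through its own status type rather than a bool.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Arrow or ORC): narrow 64-bit offsets to
// 32-bit offsets. Returns false and leaves *out untouched if any offset is
// negative or does not fit into int32, so the caller can return a non-OK
// result instead of silently truncating.
bool NarrowOffsets(const std::vector<int64_t>& offsets,
                   std::vector<int32_t>* out) {
  assert(out != nullptr);
  std::vector<int32_t> narrowed;
  narrowed.reserve(offsets.size());
  for (int64_t offset : offsets) {
    if (offset < 0 || offset > std::numeric_limits<int32_t>::max()) {
      return false;
    }
    narrowed.push_back(static_cast<int32_t>(offset));
  }
  *out = std::move(narrowed);
  return true;
}
```

Checking every offset rather than only the last one keeps the sketch correct
even if the input offsets are not monotonically non-decreasing.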


[C++] Enhancements to random Array/ChunkedArray/Table generator as a separate PR?

2021-01-31 Thread Ying Zhou
Hi,

As part of reducing test size in this pull request
(https://github.com/apache/arrow/pull/8648), which contains the ORC writer for
C++ and Python, I wrote a random chunked array generator and a random table
generator. To reduce test size to ideal levels it will be necessary to improve
arrow::random::RandomArrayGenerator::ArrayOf to support nested types. I don’t
think such work belongs in the ORC writer PR. Shall I first try to get this PR
to pass and then file a separate one with improvements in arrow/testing/random,
or shall I file them together as one PR? Thanks!

Ying
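
To illustrate what random generation of a nested type involves, here is a
minimal standalone sketch that builds a random list-of-int32 column as an
offsets buffer plus a flat child values buffer. It deliberately does not use
arrow::random::RandomArrayGenerator; the RandomListColumn struct, the
MakeRandomLists function, and the value range are all assumptions made for
illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for a list<int32> column: list i spans
// values[offsets[i]] up to (but excluding) values[offsets[i + 1]].
struct RandomListColumn {
  std::vector<int32_t> offsets;
  std::vector<int32_t> values;
};

// Generate num_lists lists, each with a random length in
// [0, max_list_length]. A fixed seed makes the output reproducible, which
// keeps test failures debuggable. The caller must keep the total number of
// values within the int32 range.
RandomListColumn MakeRandomLists(int64_t num_lists, int32_t max_list_length,
                                 uint32_t seed) {
  assert(num_lists >= 0 && max_list_length >= 0);
  std::mt19937 rng(seed);
  std::uniform_int_distribution<int32_t> length_dist(0, max_list_length);
  std::uniform_int_distribution<int32_t> value_dist(-1000, 1000);

  RandomListColumn column;
  column.offsets.reserve(static_cast<size_t>(num_lists) + 1);
  column.offsets.push_back(0);
  for (int64_t i = 0; i < num_lists; ++i) {
    const int32_t length = length_dist(rng);
    for (int32_t j = 0; j < length; ++j) {
      column.values.push_back(value_dist(rng));
    }
    // Each offset marks the end of the list that was just generated.
    column.offsets.push_back(static_cast<int32_t>(column.values.size()));
  }
  return column;
}
```

Deeper nesting follows the same pattern: the child buffer becomes another
offsets-plus-values pair generated recursively.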

Re: [RUST] Arrow guide

2021-01-31 Thread Wes McKinney
To state the obvious, it would be great to have some community-maintained
documentation (beyond generated API docs) for the Rust library. Writing
documentation almost always causes the quality of a code base to improve
because the process brings up rough edges, inconsistencies, or missing
features.

On Sun, Jan 31, 2021 at 11:47 AM Benjamin Blodgett <
benjaminblodg...@gmail.com> wrote:

> This is great, thanks for this!
>
> On Sun, Jan 31, 2021 at 9:25 AM Fernando Herrera <
> fernando.j.herr...@gmail.com> wrote:
>
> > Hi all,
> >
> > During the past months I have been trying to read and understand the code
> > base for the Rust implementation of Arrow. At the beginning I was just
> > reading the code and figuring out what each part or module was used for.
> > Unfortunately this approach didn't work very well and I had to start from
> > scratch. The next time while trying to understand it I was also writing
> > descriptions of the things I was studying and how to implement them. This
> > approach led me to writing up a small Arrow guide.
> >
> > At this point it is not complete and has several chapters missing, but
> > that's the point of this mail. I was wondering if someone who wants to
> > work (or is already working) on the Rust side would like to help me make
> > the guide better and richer.
> >
> > The first sections can be found here:
> > https://elferherrera.github.io/arrow_guide/introduction.html
> >
> > And the repo is here:
> > https://github.com/elferherrera/arrow_guide/
> >
> > The guide at the moment is written with mdbook and uses the doc-comment
> > crate to check all the code. Also, the book is pulling the Arrow crate
> > from git directly, so it is always reading the most recent API.
> >
> > I hope someone finds these writings useful and if you are willing to help
> > me just let me know.
> >
> > Thanks,
> > Fernando
> >
>


Re: [RUST] Arrow guide

2021-01-31 Thread Benjamin Blodgett
This is great, thanks for this!

On Sun, Jan 31, 2021 at 9:25 AM Fernando Herrera <
fernando.j.herr...@gmail.com> wrote:

> Hi all,
>
> During the past months I have been trying to read and understand the code
> base for the Rust implementation of Arrow. At the beginning I was just
> reading the code and figuring out what each part or module was used for.
> Unfortunately this approach didn't work very well and I had to start from
> scratch. The next time while trying to understand it I was also writing
> descriptions of the things I was studying and how to implement them. This
> approach led me to writing up a small Arrow guide.
>
> At this point it is not complete and has several chapters missing, but
> that's the point of this mail. I was wondering if someone who wants to work
> (or is already working) on the Rust side would like to help me make the
> guide better and richer.
>
> The first sections can be found here:
> https://elferherrera.github.io/arrow_guide/introduction.html
>
> And the repo is here:
> https://github.com/elferherrera/arrow_guide/
>
> The guide at the moment is written with mdbook and uses the doc-comment
> crate to check all the code. Also, the book is pulling the Arrow crate from
> git directly, so it is always reading the most recent API.
>
> I hope someone finds these writings useful and if you are willing to help
> me just let me know.
>
> Thanks,
> Fernando
>


[RUST] Arrow guide

2021-01-31 Thread Fernando Herrera
Hi all,

During the past months I have been trying to read and understand the code
base for the Rust implementation of Arrow. At the beginning I was just
reading the code and figuring out what each part or module was used for.
Unfortunately this approach didn't work very well and I had to start from
scratch. The next time while trying to understand it I was also writing
descriptions of the things I was studying and how to implement them. This
approach led me to writing up a small Arrow guide.

At this point it is not complete and has several chapters missing, but that's
the point of this mail. I was wondering if someone who wants to work (or
is already working) on the Rust side would like to help me make the guide
better and richer.

The first sections can be found here:
https://elferherrera.github.io/arrow_guide/introduction.html

And the repo is here:
https://github.com/elferherrera/arrow_guide/

The guide at the moment is written with mdbook and uses the doc-comment
crate to check all the code. Also, the book is pulling the Arrow crate from
git directly, so it is always reading the most recent API.

I hope someone finds these writings useful and if you are willing to help
me just let me know.

Thanks,
Fernando


[NIGHTLY] Arrow Build Report for Job nightly-2021-01-31-0

2021-01-31 Thread Crossbow


Arrow Build Report for Job nightly-2021-01-31-0

All tasks: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0

Failed Tasks:
- conda-linux-gcc-py36-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-drone-conda-linux-gcc-py36-aarch64
- conda-linux-gcc-py37-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-drone-conda-linux-gcc-py37-aarch64
- conda-linux-gcc-py38-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-drone-conda-linux-gcc-py38-aarch64
- conda-linux-gcc-py39-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-drone-conda-linux-gcc-py39-aarch64
- gandiva-jar-ubuntu:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-github-gandiva-jar-ubuntu
- python-sdist:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-github-python-sdist
- test-conda-python-3.7-hdfs-3.2:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-github-test-conda-python-3.7-hdfs-3.2
- test-conda-python-3.7-spark-branch-3.0:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-github-test-conda-python-3.7-spark-branch-3.0
- test-conda-python-3.8-jpype:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-github-test-conda-python-3.8-jpype
- test-ubuntu-18.04-docs:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-test-ubuntu-18.04-docs
- test-ubuntu-18.04-python-3:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-test-ubuntu-18.04-python-3

Succeeded Tasks:
- centos-7-amd64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-github-centos-7-amd64
- centos-8-amd64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-github-centos-8-amd64
- conda-clean:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-clean
- conda-linux-gcc-py36-cpu-r36:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-linux-gcc-py36-cpu-r36
- conda-linux-gcc-py36-cuda:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu-r40:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-linux-gcc-py37-cpu-r40
- conda-linux-gcc-py37-cuda:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-linux-gcc-py38-cuda
- conda-linux-gcc-py39-cpu:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-linux-gcc-py39-cpu
- conda-linux-gcc-py39-cuda:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-linux-gcc-py39-cuda
- conda-osx-clang-py36-r36:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-osx-clang-py36-r36
- conda-osx-clang-py37-r40:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-osx-clang-py37-r40
- conda-osx-clang-py38:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-osx-clang-py38
- conda-osx-clang-py39:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-osx-clang-py39
- conda-win-vs2017-py36-r36:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-win-vs2017-py36-r36
- conda-win-vs2017-py37-r40:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-win-vs2017-py37-r40
- conda-win-vs2017-py38:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-azure-conda-win-vs2017-py38
- debian-buster-amd64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-github-debian-buster-amd64
- example-cpp-minimal-build-static-system-dependency:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-github-example-cpp-minimal-build-static-system-dependency
- example-cpp-minimal-build-static:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-31-0-github-example-cpp-minimal-build-static
- g