Re: [C++] RandomArrayGenerator::List bugs

2021-02-07 Thread Ying Zhou
A Jira ticket on this bug has been filed: 
https://issues.apache.org/jira/browse/ARROW-11548 
 

> On Feb 7, 2021, at 3:29 PM, Ying Zhou  wrote:
> 
> Hi,
> 
> Recently I found a weird bug in RandomArrayGenerator.
> 
> RandomArrayGenerator::List consistently produces ListArrays with their length 
> 1 below what they should be according to their documentation. Moreover the 
> bitmaps we have are weird.
> 
> Here is a simple test:
> 
> TEST(TestAdapterWriteNested, ListTest) {
>   int64_t num_rows = 2;
>   static constexpr random::SeedType kRandomSeed2 = 0x0ff1ce;
>   arrow::random::RandomArrayGenerator rand(kRandomSeed2);
>   std::shared_ptr<Array> value_array = rand.ArrayOf(int32(), 2 * num_rows, 0.2);
>   std::shared_ptr<Array> array = rand.List(*value_array, num_rows, 1);
>   RecordProperty("bitmap",*(array->null_bitmap_data()));
>   RecordProperty("length",array->length());
>   RecordProperty("array",array->ToString());
> }
> 
> Here are the results:
> 
>  timestamp="2021-02-07T15:23:16" classname="TestAdapterWriteNested">
> 
> 
> 
> 
> 
> 
> 
> Here is what RandomArrayGenerator::List should do:
> 
>   /// \brief Generate a random ListArray
>   ///
>   /// \param[in] values The underlying values array
>   /// \param[in] size The size of the generated list array
>   /// \param[in] null_probability the probability of a list value being null
>   /// \param[in] force_empty_nulls if true, null list entries must have 0 length
>   ///
>   /// \return a generated Array
>   std::shared_ptr<Array> List(const Array& values, int64_t size, double null_probability,
>                               bool force_empty_nulls = false);
> 
> Note that the generator failed in at least three aspects:
> 1. The length of the generated array is too low.
> 2. Even when null_probability is set to 1 there are still 1s in the bitmap. 
> 3. The size of the bitmap is larger than the size of the Array.
> 
> I’d like to know where we can find tests for arrow/testing/random. If they 
> are absent I need to write them.
> 
> Thanks,
> Ying
> 



[C++] RandomArrayGenerator::List bugs

2021-02-07 Thread Ying Zhou
Hi,

Recently I found a weird bug in RandomArrayGenerator.

RandomArrayGenerator::List consistently produces ListArrays with their length 1 
below what they should be according to their documentation. Moreover the 
bitmaps we have are weird.

Here is a simple test:

TEST(TestAdapterWriteNested, ListTest) {
  int64_t num_rows = 2;
  static constexpr random::SeedType kRandomSeed2 = 0x0ff1ce;
  arrow::random::RandomArrayGenerator rand(kRandomSeed2);
  std::shared_ptr<Array> value_array = rand.ArrayOf(int32(), 2 * num_rows, 0.2);
  std::shared_ptr<Array> array = rand.List(*value_array, num_rows, 1);
  RecordProperty("bitmap",*(array->null_bitmap_data()));
  RecordProperty("length",array->length());
  RecordProperty("array",array->ToString());
}

Here are the results:









Here is what RandomArrayGenerator::List should do:

  /// \brief Generate a random ListArray
  ///
  /// \param[in] values The underlying values array
  /// \param[in] size The size of the generated list array
  /// \param[in] null_probability the probability of a list value being null
  /// \param[in] force_empty_nulls if true, null list entries must have 0 length
  ///
  /// \return a generated Array
  std::shared_ptr<Array> List(const Array& values, int64_t size, double null_probability,
                              bool force_empty_nulls = false);

Note that the generator failed in at least three aspects:
1. The length of the generated array is too low.
2. Even when null_probability is set to 1 there are still 1s in the bitmap. 
3. The size of the bitmap is larger than the size of the Array.

I’d like to know where we can find tests for arrow/testing/random. If they are 
absent I need to write them.

Thanks,
Ying



Re: Bintray sunsetting

2021-02-07 Thread Sutou Kouhei
FYI: We have a JIRA issue for it:
  https://issues.apache.org/jira/browse/ARROW-11499

In 
  "Bintray sunsetting" on Sat, 6 Feb 2021 17:15:35 -0600,
  Wes McKinney  wrote:

> Appears that JFrog is sunsetting Bintray, so we will need to sort out
> alternative hosting for Linux packages for the 4.0.0 release:
> 
> https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/


Re: Arrow papers

2021-02-07 Thread Wes McKinney
Thanks for sharing these. I was aware of the Microsoft Magpie paper
but not the TU Dresden paper. It would be great to see some academic
groups engage in adding in-memory compression / encodings to the Arrow
format properly in collaboration with the Apache community.

On Sun, Feb 7, 2021 at 12:14 PM Julian Hyde  wrote:
>
> A couple of interesting Arrow-related papers have appeared at conferences 
> recently:
> Integrating Lightweight Compression Capabilities into Apache Arrow [1]
> Magpie: Python at Speed and Scale using Cloud Backends [2]
>
> I’m sharing them so that people are aware of the evolving state-of-the-art.
>
> Julian
>
> [1] https://www.researchgate.net/publication/342996896_Integrating_Lightweight_Compression_Capabilities_into_Apache_Arrow
>
> [2] http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf


Arrow papers

2021-02-07 Thread Julian Hyde
A couple of interesting Arrow-related papers have appeared at conferences 
recently:
Integrating Lightweight Compression Capabilities into Apache Arrow [1]
Magpie: Python at Speed and Scale using Cloud Backends [2]

I’m sharing them so that people are aware of the evolving state-of-the-art.

Julian

[1] https://www.researchgate.net/publication/342996896_Integrating_Lightweight_Compression_Capabilities_into_Apache_Arrow

[2] http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

2021-02-07 Thread Fernando Herrera
Hi Jorge,

I tried running the code you pasted, but it didn't compile. I get the following
error:

the trait `AsRef<[u8]>` is not implemented for `[i32; 2i32]`


I had to change it to this to compile:

let buffer = Buffer::from(&[0u8, 2]);
let data = ArrayData::new(DataType::Int64, 10, None, None, 0,
    vec![buffer], vec![]);
let array = Float64Array::from(Arc::new(data));
println!("{:?}", array.value(1))


I didn't get an error out of it.

I do agree with you that there are several instances of unsafe in the code
that are not properly justified and that may lead to more problems in
the future.

Another thing that I have noticed is the pattern used in the API where a
struct has a constructor called new and another called
new_with_options. The second function exists only for callers who want to
pass an options object. This could be simplified by using an
enum with all the possible options and using an Option<&[NEW ENUM]> for the
optional parameters.
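
A minimal sketch of what that could look like. The `Reader` struct, the `ReaderOption` enum, and the default values are hypothetical names chosen for illustration; they are not taken from the arrow crate:

```rust
/// Hypothetical set of options (an assumption for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ReaderOption {
    BatchSize(usize),
    HasHeader(bool),
}

/// Hypothetical struct that would otherwise need `new` and `new_with_options`.
#[derive(Debug, PartialEq)]
pub struct Reader {
    pub batch_size: usize,
    pub has_header: bool,
}

impl Reader {
    /// Single constructor: `None` keeps the defaults, `Some(&[...])`
    /// overrides only the options that are listed.
    pub fn new(options: Option<&[ReaderOption]>) -> Self {
        // Assumed defaults, for illustration only.
        let mut reader = Reader { batch_size: 1024, has_header: true };
        for option in options.unwrap_or(&[]) {
            match *option {
                ReaderOption::BatchSize(n) => reader.batch_size = n,
                ReaderOption::HasHeader(h) => reader.has_header = h,
            }
        }
        reader
    }
}
```

With this shape, `Reader::new(None)` gives the defaults and `Reader::new(Some(&[ReaderOption::BatchSize(64)]))` changes only the batch size, so a separate `new_with_options` is not needed.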

I think it does make sense to think about the crate and plan it better to
avoid these issues and improve it further.

Regards,
Fernando

On Sun, Feb 7, 2021 at 1:42 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> Over the past 4 months, I have been growing more and more frustrated by the
> amount of undefined behaviour that I am finding and fixing on the Rust
> implementation. I would like to open the discussion of a broader overview
> about the problem in light of our current knowledge and what Rust enables
> as well as offer a solution to the bigger problem.
>
> Just to give you a gist of the seriousness of the issue, the following
> currently compiles, runs, and is undefined behavior in Rust:
>
> let buffer = Buffer::from(&[0i32, 2i32]);
> let data = ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);
> let array = Float64Array::from(Arc::new(data));
> println!("{:?}", array.value(1));
>
> I would like to propose a major refactor of the crate around physical
> traits, Buffer, MutableBuffer and ArrayData to make our code type-safe at
> compile time, thereby avoiding things like the example above from happening
> again.
>
> So far, I was able to reproduce all core features of the arrow crate
> (nested types, dynamic typing, FFI, memory alignment, performance) by using
> `Buffer<T>` instead of `Buffer` and removing `ArrayData` and
> `RawPointer` altogether.
>
> Safety-wise, it significantly limits the usage of `unsafe` on higher end
> APIs, it has a single transmute (the bit chunk iterator one), and a
> guaranteed safe public API (which is not the case in our master, as shown
> above).
>
> Performance-wise, it yields a 1.3x improvement over the current master
> (after this fix of UB on the
> take kernel, 1.7x prior to it) for the `take` kernel for primitives. I
> should have other major performance improvements.
>
> API-wise, it simplifies the traits that we have for memory layout as well
> as the handling of bitmaps, offsets, etc.
>
> The proposal is drafted as a README on a
> repo that I created specifically for this from the ground up, and the full
> set of changes are in a PR <
> https://github.com/jorgecarleitao/arrow2/pull/1>
> so that anyone can view and comment on it. I haven't made any PR to master
> because this is too large to track as a diff against master, and is beyond
> the point, anyways.
>
> I haven't ported most of the crate as I only tried the non-trivial features
> (memory layout, bitmaps, FFI, dynamic typing, nested types).
>
> I would highly appreciate your thoughts about it.
>
> Best,
> Jorge
>


[Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

2021-02-07 Thread Jorge Cardoso Leitão
Hi,

Over the past 4 months, I have been growing more and more frustrated by the
amount of undefined behaviour that I am finding and fixing on the Rust
implementation. I would like to open the discussion of a broader overview
about the problem in light of our current knowledge and what Rust enables
as well as offer a solution to the bigger problem.

Just to give you a gist of the seriousness of the issue, the following
currently compiles, runs, and is undefined behavior in Rust:

let buffer = Buffer::from(&[0i32, 2i32]);
let data = ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);
let array = Float64Array::from(Arc::new(data));
println!("{:?}", array.value(1));
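
To make the size mismatch concrete: two `i32` values occupy 8 bytes, while an array declared with 10 64-bit values needs 80 bytes, so reading element 1 touches bytes 8 to 16, beyond what was written into the buffer. Below is a minimal sketch of the kind of length check that would reject such input. The function name and signature are assumptions for illustration only, not an API of the arrow crate:

```rust
use std::mem::size_of;

/// Hypothetical helper (not part of the arrow crate): checks that a values
/// buffer of `buffer_len_bytes` bytes can hold `len` elements of type `T`
/// starting at element `offset`.
pub fn check_values_buffer<T>(
    buffer_len_bytes: usize,
    len: usize,
    offset: usize,
) -> Result<(), String> {
    // Bytes needed to address elements `offset..offset + len`,
    // guarding against arithmetic overflow.
    let required = len
        .checked_add(offset)
        .and_then(|n| n.checked_mul(size_of::<T>()))
        .ok_or_else(|| "length overflows usize".to_string())?;
    if buffer_len_bytes < required {
        return Err(format!(
            "buffer holds {} bytes but {} are required",
            buffer_len_bytes, required
        ));
    }
    Ok(())
}
```

For the example above, `check_values_buffer::<i64>(8, 10, 0)` returns an error, whereas the same 8 bytes read as two `i32` values pass the check.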

I would like to propose a major refactor of the crate around physical
traits, Buffer, MutableBuffer and ArrayData to make our code type-safe at
compile time, thereby avoiding things like the example above from happening
again.

So far, I was able to reproduce all core features of the arrow crate
(nested types, dynamic typing, FFI, memory alignment, performance) by using
`Buffer<T>` instead of `Buffer` and removing `ArrayData` and
`RawPointer` altogether.
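
A minimal sketch of the general idea of a typed buffer, written for illustration only. It is not the code from the proposal; the `NativeType` trait, the `PrimitiveArray` struct, and every method name here are assumptions:

```rust
use std::sync::Arc;

/// Marker trait for fixed-width element types (hypothetical).
pub trait NativeType: Copy + std::fmt::Debug + 'static {}
impl NativeType for i32 {}
impl NativeType for i64 {}
impl NativeType for f64 {}

/// An immutable, cheaply clonable buffer whose element type is part of
/// its Rust type, so it cannot be reinterpreted without an explicit cast.
#[derive(Debug, Clone)]
pub struct Buffer<T: NativeType> {
    data: Arc<[T]>,
}

impl<T: NativeType> Buffer<T> {
    pub fn from_slice(values: &[T]) -> Self {
        Self { data: Arc::from(values) }
    }

    pub fn len(&self) -> usize {
        self.data.len()
    }

    /// Bounds-checked access; returns `None` when `i` is out of range.
    pub fn get(&self, i: usize) -> Option<T> {
        self.data.get(i).copied()
    }
}

/// An array of primitives: its length is derived from the buffer, so the
/// declared length and the stored data cannot disagree.
pub struct PrimitiveArray<T: NativeType> {
    values: Buffer<T>,
}

impl<T: NativeType> PrimitiveArray<T> {
    pub fn new(values: Buffer<T>) -> Self {
        Self { values }
    }

    pub fn len(&self) -> usize {
        self.values.len()
    }

    pub fn value(&self, i: usize) -> Option<T> {
        self.values.get(i)
    }
}
```

In this sketch, building a `PrimitiveArray<f64>` from a `Buffer<i32>` is a compile-time type error, and an out-of-range index yields `None` instead of reading past the data.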

Safety-wise, it significantly limits the usage of `unsafe` on higher end
APIs, it has a single transmute (the bit chunk iterator one), and a
guaranteed safe public API (which is not the case in our master, as shown
above).

Performance-wise, it yields a 1.3x improvement over the current master
(after this fix of UB on the
take kernel, 1.7x prior to it) for the `take` kernel for primitives. I
should have other major performance improvements.

API-wise, it simplifies the traits that we have for memory layout as well
as the handling of bitmaps, offsets, etc.

The proposal is drafted as a README on a
repo that I created specifically for this from the ground up, and the full
set of changes are in a PR <https://github.com/jorgecarleitao/arrow2/pull/1>
so that anyone can view and comment on it. I haven't made any PR to master
because this is too large to track as a diff against master, and is beyond
the point, anyways.

I haven't ported most of the crate as I only tried the non-trivial features
(memory layout, bitmaps, FFI, dynamic typing, nested types).

I would highly appreciate your thoughts about it.

Best,
Jorge


[NIGHTLY] Arrow Build Report for Job nightly-2021-02-07-0

2021-02-07 Thread Crossbow


Arrow Build Report for Job nightly-2021-02-07-0

All tasks: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0

Failed Tasks:
- conda-linux-gcc-py36-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-drone-conda-linux-gcc-py36-aarch64
- conda-linux-gcc-py37-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-drone-conda-linux-gcc-py37-aarch64
- conda-linux-gcc-py38-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-drone-conda-linux-gcc-py38-aarch64
- conda-linux-gcc-py39-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-drone-conda-linux-gcc-py39-aarch64
- conda-win-vs2017-py36-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-win-vs2017-py36-r36
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-github-test-conda-python-3.7-dask-latest
- test-conda-python-3.8-dask-master:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-github-test-conda-python-3.8-dask-master
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-github-test-conda-python-3.8-jpype
- test-ubuntu-18.04-docs:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-test-ubuntu-18.04-docs

Succeeded Tasks:
- centos-7-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-github-centos-7-amd64
- centos-8-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-clean
- conda-linux-gcc-py36-cpu-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-linux-gcc-py36-cpu-r36
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-linux-gcc-py37-cpu-r40
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-linux-gcc-py38-cuda
- conda-linux-gcc-py39-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-linux-gcc-py39-cpu
- conda-linux-gcc-py39-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-linux-gcc-py39-cuda
- conda-osx-clang-py36-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-osx-clang-py36-r36
- conda-osx-clang-py37-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-osx-clang-py37-r40
- conda-osx-clang-py38:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-osx-clang-py38
- conda-osx-clang-py39:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-osx-clang-py39
- conda-win-vs2017-py37-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-win-vs2017-py37-r40
- conda-win-vs2017-py38:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-azure-conda-win-vs2017-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-github-debian-buster-amd64
- example-cpp-minimal-build-static-system-dependency:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-github-example-cpp-minimal-build-static-system-dependency
- example-cpp-minimal-build-static:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-github-example-cpp-minimal-build-static
- gandiva-jar-osx:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-github-gandiva-jar-osx
- gandiva-jar-ubuntu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-github-gandiva-jar-ubuntu
- homebrew-cpp:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-07-0-github-homebrew-cpp
- homebrew-r-autobrew:
  URL