[jira] [Created] (ARROW-6114) Datatypes are not preserved when a pandas dataframe is partitioned and saved as a parquet file using pyarrow

2019-08-01 Thread Naga (JIRA)
Naga created ARROW-6114:
---

 Summary: Datatypes are not preserved when a pandas dataframe is 
partitioned and saved as a parquet file using pyarrow
 Key: ARROW-6114
 URL: https://issues.apache.org/jira/browse/ARROW-6114
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
 Environment: Python 3.7.3
pyarrow 0.14.1
Reporter: Naga


h3. Datatypes are not preserved when a pandas data frame is *partitioned* and 
saved as a parquet file using pyarrow, but they are preserved when the data 
frame is not partitioned.

*Case 1: Saving a partitioned dataset - Data Types are NOT preserved*
{code:java}
# Saving a pandas DataFrame to local disk as a partitioned parquet file using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols = ['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, partition_cols=partition_cols,
                    preserve_index=False)

# Loading a partitioned parquet dataset from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}
*Output:*
{code:java}
Datatypes before saving the dataset
age int64
name object
dtype: object

Datatypes after loading the dataset
name object
age category
dtype: object
{code}
From the above output, we can see that the data type for age is int64 in the 
original pandas data frame, but it changed to category when we saved the 
dataset to local disk and loaded it back.
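
A possible workaround (a hedged sketch, not part of the original report): since 
the partition column comes back as a pandas category, it can be cast back to 
its original dtype after loading.

{code:java}
# Hedged workaround sketch: cast the partition column back to int64 after
# loading. Assumes the partitioned dataset written in Case 1 above.
import pyarrow.parquet as pq

df = pq.ParquetDataset('test', filesystem=None).read_pandas().to_pandas()
df['age'] = df['age'].astype('int64')
print(df.dtypes)  # age is int64 again
{code}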
*Case 2: Non-partitioned dataset - Data types are preserved*
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print('Saving a Pandas Dataframe to Local as a parquet file without '
      'partitioning using pyarrow')
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, preserve_index=False)

# Loading a non-partitioned parquet file from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}
*Output:*
{code:java}
Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
Datatypes before saving the dataset
age int64
name object
dtype: object

Datatypes after loading the dataset
age int64
name object
dtype: object
{code}

*Versions*
 * Python 3.7.3
 * pyarrow 0.14.1





[jira] [Created] (ARROW-6113) [Java] Support vector deduplicate function

2019-08-01 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6113:
---

 Summary: [Java] Support vector deduplicate function
 Key: ARROW-6113
 URL: https://issues.apache.org/jira/browse/ARROW-6113
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Remove adjacent duplicated elements from a vector. This function can be used, 
for example, to find distinct values or to compress the vector data.
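
A hedged sketch of the intended semantics, illustrated in Python rather than 
the proposed Java API (the helper name is hypothetical):

{code}
# Hypothetical sketch of adjacent-deduplication semantics; the real feature
# would be a Java function operating on Arrow vectors.
from itertools import groupby

def dedup_adjacent(values):
    # Keep one element from each run of equal adjacent values.
    return [key for key, _group in groupby(values)]

print(dedup_adjacent([1, 1, 2, 2, 2, 3, 1]))    # [1, 2, 3, 1]
# On sorted input, adjacent deduplication yields the distinct values:
print(dedup_adjacent(sorted([3, 1, 3, 2, 1])))  # [1, 2, 3]
{code}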





[jira] [Created] (ARROW-6112) [Java] Update APIs to support 64-bit address space

2019-08-01 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-6112:
--

 Summary: [Java] Update APIs to support 64-bit address space
 Key: ARROW-6112
 URL: https://issues.apache.org/jira/browse/ARROW-6112
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Micah Kornfield
Assignee: Micah Kornfield


The Arrow spec allows a 64-bit address range for buffers (and arrays); we 
should support this at the API level in Java even if the current Netty backing 
buffers don't support it.





[jira] [Created] (ARROW-6110) [Java] Support LargeList Type and add integration test with C++

2019-08-01 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-6110:
--

 Summary: [Java] Support LargeList Type and add integration test 
with C++
 Key: ARROW-6110
 URL: https://issues.apache.org/jira/browse/ARROW-6110
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Micah Kornfield
 Fix For: 0.15.0








[jira] [Created] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2019-08-01 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-6111:
--

 Summary: [Java] Support LargeVarChar and LargeBinary types and add 
integration test with C++
 Key: ARROW-6111
 URL: https://issues.apache.org/jira/browse/ARROW-6111
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Micah Kornfield
 Fix For: 0.15.0








[jira] [Created] (ARROW-6109) [Integration] Docker image for integration testing can't be built on windows

2019-08-01 Thread Paddy Horan (JIRA)
Paddy Horan created ARROW-6109:
--

 Summary: [Integration] Docker image for integration testing can't 
be built on windows
 Key: ARROW-6109
 URL: https://issues.apache.org/jira/browse/ARROW-6109
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Integration
Reporter: Paddy Horan
Assignee: Paddy Horan
 Fix For: 1.0.0


Git for Windows checks files out with Windows line endings and converts them 
back when checking them in.

This causes issues in the bash scripts (which are copied from the Windows file 
system into the image) that we use to build the "arrow_integration_xenial_base" 
image when using Docker on Windows.
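
A common mitigation (a hedged sketch, not necessarily the fix adopted for this 
issue) is a .gitattributes rule forcing LF endings for shell scripts:

{code}
# .gitattributes (sketch): always check out shell scripts with LF endings,
# so bash scripts copied into Docker images work on Windows checkouts.
*.sh text eol=lf
{code}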





[jira] [Created] (ARROW-6108) [C++] Appveyor Build_Debug configuration is hanging in C++ unit tests

2019-08-01 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6108:
---

 Summary: [C++] Appveyor Build_Debug configuration is hanging in 
C++ unit tests
 Key: ARROW-6108
 URL: https://issues.apache.org/jira/browse/ARROW-6108
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


Not sure which patch introduced this, but here is one master build where it 
occurs:

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/26413929/job/sws48m0603ujwya1

The commit before this patch seems to have been OK.





[jira] [Created] (ARROW-6107) [Go] ipc.Writer Option to skip appending data buffers

2019-08-01 Thread Nick Poorman (JIRA)
Nick Poorman created ARROW-6107:
---

 Summary: [Go] ipc.Writer Option to skip appending data buffers
 Key: ARROW-6107
 URL: https://issues.apache.org/jira/browse/ARROW-6107
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Nick Poorman


For cases where we have a known shared-memory region, it would be great if the 
ipc.Writer (and by extension ipc.Reader?) had the ability to write out 
everything but the actual buffers holding the data. That way we could still 
utilize the IPC mechanisms to communicate without having to serialize all the 
underlying data across the wire.

 

This seems like it should be possible, since the `RecordBatch` flatbuffers only 
contain the metadata and the underlying data buffers are appended later. We 
just need to skip appending the underlying data buffers.

 

[~sbinet] thoughts?





[jira] [Created] (ARROW-6106) Scala lang support

2019-08-01 Thread Boris V.Kuznetsov (JIRA)
Boris V.Kuznetsov created ARROW-6106:


 Summary: Scala lang support
 Key: ARROW-6106
 URL: https://issues.apache.org/jira/browse/ARROW-6106
 Project: Apache Arrow
  Issue Type: Wish
Reporter: Boris V.Kuznetsov


I ported testArrowStream.java to Scala Specs2 and added it to the PR.

Please see more details in my [PR|https://github.com/apache/arrow/pull/4989].

I'm ready to port other tests as well and add an SBT file.

 





[jira] [Created] (ARROW-6105) [C++][Parquet][Python] Add test case showing dictionary-encoded subfields in nested type

2019-08-01 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6105:
---

 Summary: [C++][Parquet][Python] Add test case showing 
dictionary-encoded subfields in nested type
 Key: ARROW-6105
 URL: https://issues.apache.org/jira/browse/ARROW-6105
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


As a follow-up to ARROW-6077 -- this is fixed but not yet fully tested. To 
contain the scope of ARROW-6077, I will add a test as a follow-up.





Re: [DISCUSS] Release cadence and release vote conventions

2019-08-01 Thread Wes McKinney
I agree. In my experience as RM, I have found the involvement of Maven
in the release process to be a nuisance. I think it makes more sense
in Java-only projects.

On Thu, Aug 1, 2019 at 2:54 PM Andy Grove  wrote:
>
> I'll start taking a look at the maven issue. We might not want to use maven
> release plugin given that we control the version number already across this
> repository via other means.
>
> On Wed, Jul 31, 2019 at 4:26 PM Sutou Kouhei  wrote:
>
> > Hi,
> >
> > Sorry for not replying to this thread.
> >
> > I think that the biggest problem is related to our Java
> > package.
> >
> >
> > We'll be able to resolve the GPG key problem by creating a
> > GPG key only for nightly release test. We can share the test
> > GPG key publicly because it's just for testing.
> >
> > It'll work for our binary artifacts and APT/Yum repositories
> > but won't work for our Java package. I don't know where the
> > GPG key is used in our Java package...
> >
> >
> > We'll be able to resolve the Git commit problem by creating
> > a cloned Git repository for test. It's done in our
> > dev/release/00-prepare-test.rb[1].
> >
> > [1]
> > https://github.com/apache/arrow/blob/master/dev/release/00-prepare-test.rb#L30
> >
> > The biggest problem for the Git commit is our Java package
> > requires "apache-arrow-${VERSION}" tag on
> > https://github.com/apache/arrow . (Right?)
> > I think that "mvn release:perform" in
> > dev/release/01-perform.sh does so but I don't know the
> > details of "mvn release:perform"...
> >
> >
> > More details:
> >
> > dev/release/00-prepare.sh:
> >
> > We'll be able to run this automatically when we can resolve
> > the above GPG key problem in our Java package. We can
> > resolve the Git commit problem by creating a cloned Git
> > repository.
> >
> > dev/release/01-prepare.sh:
> >
> > We'll be able to run this automatically when we can resolve
> > the above Git commit ("apache-arrow-${VERSION}" tag) problem
> > in our Java package.
> >
> > dev/release/02-source.sh:
> >
> > We'll be able to run this automatically by creating a GPG
> > key for nightly release test. We'll use Bintray to upload RC
> > source archive instead of dist.apache.org. Ah, we need a
> > Bintray API key for this. It must be secret.
> >
> > dev/release/03-binary.sh:
> >
> > We'll be able to run this automatically by creating a GPG
> > key for nightly release test. We need a Bintray API key too.
> >
> > We need to improve this to support the nightly release test. It
> > uses "XXX-rc" such as "debian-rc" for the Bintray "package" name.
> > It should use "XXX-nightly" such as "debian-nightly" for
> > nightly release test instead.
> >
> > dev/release/post-00-release.sh:
> >
> > We'll be able to skip this.
> >
> > dev/release/post-01-upload.sh:
> >
> > We'll be able to skip this.
> >
> > dev/release/post-02-binary.sh:
> >
> > We'll be able to run this automatically by creating Bintray
> > "packages" for nightly release and use them. We can create
> > "XXX-nightly-release" ("debian-nightly-release") Bintray
> > "packages" and use them instead of "XXX" ("debian") Bintray
> > "packages".
> >
> > "debian" Bintray "package": https://bintray.com/apache/debian/
> >
> > We need to improve this to support nightly release.
> >
> > dev/release/post-03-website.sh:
> >
> > We'll be able to run this automatically by creating a cloned
> > Git repository for test.
> >
> > It's better that we have a Web site to show generated pages.
> > We can create
> > https://github.com/apache/arrow-site/tree/asf-site/nightly
> > and use it, but I don't like it because arrow-site would gain
> > a commit day by day.
> > Can we prepare a Web site for this? (arrow-nightly.ursalabs.org?)
> >
> > dev/release/post-04-rubygems.sh:
> >
> > We may be able to use GitHub Package Registry[2] to upload
> > RubyGems. We can use "pre-release" package feature of
> > https://rubygems.org/ but it's not suitable for
> > nightly. It's for RC or beta release.
> >
> > [2] https://github.blog/2019-05-10-introducing-github-package-registry/
> >
> > dev/release/post-05-js.sh:
> >
> > We may be able to use GitHub Package Registry[2] to upload
> > npm packages.
> >
> > dev/release/post-06-csharp.sh:
> >
> > We may be able to use GitHub Package Registry[2] to upload
> > NuGet packages.
> >
> > dev/release/post-07-rust.sh:
> >
> > I don't have any idea. But it must be run
> > automatically. It has always failed, and I needed to run each
> > command manually.
> >
> > dev/release/post-08-remove-rc.sh:
> >
> > We'll be able to skip this.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "Re: [DISCUSS] Release cadence and release vote conventions" on Wed, 31
> > Jul 2019 15:35:57 -0500,
> >   Wes McKinney  wrote:
> >
> > > The PMC member and their GPG keys need to be in the loop at some
> > > point. The release artifacts can be produced by some kind of CI/CD
> > > system so long as the PMC member has confidence in the security of
> > > those artifacts before signing them. For example, we build the
> > > official binary packages on public CI services and then download and
> > > sign them with Crossbow. I think the same could be done in theory with
> > > the source release but we'd first need to figure out what to do about
> > > the parts that create git commits.

[jira] [Created] (ARROW-6104) [Rust] [DataFusion] Don't allow bare_trait_objects

2019-08-01 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6104:
-

 Summary: [Rust] [DataFusion] Don't allow bare_trait_objects
 Key: ARROW-6104
 URL: https://issues.apache.org/jira/browse/ARROW-6104
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.15.0


Need to remove "#![allow(bare_trait_objects)]" from lib.rs and fix the 
resulting compiler warnings.





[jira] [Created] (ARROW-6103) [Java] Do we really want to use the maven release plugin?

2019-08-01 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6103:
-

 Summary: [Java] Do we really want to use the maven release plugin?
 Key: ARROW-6103
 URL: https://issues.apache.org/jira/browse/ARROW-6103
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.15.0


For reference, I'm filing this issue to track investigation work around this:
{code:java}
The biggest problem for the Git commit is our Java package
requires "apache-arrow-${VERSION}" tag on
https://github.com/apache/arrow . (Right?)
I think that "mvn release:perform" in
dev/release/01-perform.sh does so but I don't know the
details of "mvn release:perform"...{code}





Re: [DISCUSS] Release cadence and release vote conventions

2019-08-01 Thread Andy Grove
I'll start taking a look at the maven issue. We might not want to use maven
release plugin given that we control the version number already across this
repository via other means.

On Wed, Jul 31, 2019 at 4:26 PM Sutou Kouhei  wrote:

> Hi,
>
> Sorry for not replying to this thread.
>
> I think that the biggest problem is related to our Java
> package.
>
>
> We'll be able to resolve the GPG key problem by creating a
> GPG key only for nightly release test. We can share the test
> GPG key publicly because it's just for testing.
>
> It'll work for our binary artifacts and APT/Yum repositories
> but won't work for our Java package. I don't know where the
> GPG key is used in our Java package...
>
>
> We'll be able to resolve the Git commit problem by creating
> a cloned Git repository for test. It's done in our
> dev/release/00-prepare-test.rb[1].
>
> [1]
> https://github.com/apache/arrow/blob/master/dev/release/00-prepare-test.rb#L30
>
> The biggest problem for the Git commit is our Java package
> requires "apache-arrow-${VERSION}" tag on
> https://github.com/apache/arrow . (Right?)
> I think that "mvn release:perform" in
> dev/release/01-perform.sh does so but I don't know the
> details of "mvn release:perform"...
>
>
> More details:
>
> dev/release/00-prepare.sh:
>
> We'll be able to run this automatically when we can resolve
> the above GPG key problem in our Java package. We can
> resolve the Git commit problem by creating a cloned Git
> repository.
>
> dev/release/01-prepare.sh:
>
> We'll be able to run this automatically when we can resolve
> the above Git commit ("apache-arrow-${VERSION}" tag) problem
> in our Java package.
>
> dev/release/02-source.sh:
>
> We'll be able to run this automatically by creating a GPG
> key for nightly release test. We'll use Bintray to upload RC
> source archive instead of dist.apache.org. Ah, we need a
> Bintray API key for this. It must be secret.
>
> dev/release/03-binary.sh:
>
> We'll be able to run this automatically by creating a GPG
> key for nightly release test. We need a Bintray API key too.
>
> We need to improve this to support the nightly release test. It
> uses "XXX-rc" such as "debian-rc" for the Bintray "package" name.
> It should use "XXX-nightly" such as "debian-nightly" for
> nightly release test instead.
>
> dev/release/post-00-release.sh:
>
> We'll be able to skip this.
>
> dev/release/post-01-upload.sh:
>
> We'll be able to skip this.
>
> dev/release/post-02-binary.sh:
>
> We'll be able to run this automatically by creating Bintray
> "packages" for nightly release and use them. We can create
> "XXX-nightly-release" ("debian-nightly-release") Bintray
> "packages" and use them instead of "XXX" ("debian") Bintray
> "packages".
>
> "debian" Bintray "package": https://bintray.com/apache/debian/
>
> We need to improve this to support nightly release.
>
> dev/release/post-03-website.sh:
>
> We'll be able to run this automatically by creating a cloned
> Git repository for test.
>
> It's better that we have a Web site to show generated pages.
> We can create
> https://github.com/apache/arrow-site/tree/asf-site/nightly
> and use it, but I don't like it because arrow-site would gain
> a commit day by day.
> Can we prepare a Web site for this? (arrow-nightly.ursalabs.org?)
>
> dev/release/post-04-rubygems.sh:
>
> We may be able to use GitHub Package Registry[2] to upload
> RubyGems. We can use "pre-release" package feature of
> https://rubygems.org/ but it's not suitable for
> nightly. It's for RC or beta release.
>
> [2] https://github.blog/2019-05-10-introducing-github-package-registry/
>
> dev/release/post-05-js.sh:
>
> We may be able to use GitHub Package Registry[2] to upload
> npm packages.
>
> dev/release/post-06-csharp.sh:
>
> We may be able to use GitHub Package Registry[2] to upload
> NuGet packages.
>
> dev/release/post-07-rust.sh:
>
> I don't have any idea. But it must be run
> automatically. It has always failed, and I needed to run each
> command manually.
>
> dev/release/post-08-remove-rc.sh:
>
> We'll be able to skip this.
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS] Release cadence and release vote conventions" on Wed, 31
> Jul 2019 15:35:57 -0500,
>   Wes McKinney  wrote:
>
> > The PMC member and their GPG keys need to be in the loop at some
> > point. The release artifacts can be produced by some kind of CI/CD
> > system so long as the PMC member has confidence in the security of
> > those artifacts before signing them. For example, we build the
> > official binary packages on public CI services and then download and
> > sign them with Crossbow. I think the same could be done in theory with
> > the source release but we'd first need to figure out what to do about
> > the parts that create git commits.
> >
> > On Wed, Jul 31, 2019 at 11:23 AM Andy Grove 
> wrote:
> >>
> >> To what extent would it be possible to automate the release process via
> >> CICD?
> >>
> >> On Wed, Jul 31, 2019 at 9:19 AM Wes McKinney 
> wrote:
> >>
> >> > I think one thing that would h

[jira] [Created] (ARROW-6102) [Testing] Add partitioned CSV file to arrow-testing repo

2019-08-01 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6102:
-

 Summary: [Testing] Add partitioned CSV file to arrow-testing repo
 Key: ARROW-6102
 URL: https://issues.apache.org/jira/browse/ARROW-6102
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Integration
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.15.0


I need to add a partitioned CSV file to arrow-testing for use in parallel query 
unit tests in DataFusion





Re: [DISCUSS] Add GetFlightSchema to Flight RPC

2019-08-01 Thread Bryan Cutler
Sounds good to me, I would just echo what others have said.

On Thu, Aug 1, 2019 at 8:17 AM Ryan Murray  wrote:

> Thanks Wes,
>
> The descriptor is only there to maintain a bit of symmetry with
> GetFlightInfo. Happy to remove it; I don't think it's necessary, and a few
> people already agree. Likewise with the method name: I am neutral on the
> naming and can call it whatever the community is happy with.
>
> Best,
> Ryan
>
> On Thu, Aug 1, 2019 at 3:56 PM Wes McKinney  wrote:
>
> > I'm generally supportive of adding the new RPC endpoint.
> >
> > To make a couple points from the document
> >
> > * I'm not sure what the purpose of returning the FlightDescriptor is,
> > but I haven't thought too hard about it
> > * The Schema consists of a single IPC message -- dictionaries will
> > appear in the actual DoGet stream. To motivate why this is --
> > different endpoints might have different dictionaries corresponding to
> > fields in the schema, to have static/constant dictionaries in a
> > distributed Flight setting is likely to be impractical. I summarize
> > the issue as "dictionaries are data, not metadata".
> > * I would be OK calling this GetSchema instead of GetFlightSchema but
> > either is okay
> >
> > - Wes
> >
> > On Thu, Aug 1, 2019 at 8:08 AM David Li  wrote:
> > >
> > > Hi Ryan,
> > >
> > > Thanks for writing this up! I made a couple of minor comments in the
> > > doc/implementation, but overall I'm in favor of having this RPC
> > > method.
> > >
> > > Best,
> > > David
> > >
> > > On 8/1/19, Ryan Murray  wrote:
> > > > Hi All,
> > > >
> > > > Please see the attached document for a proposed addition to the
> Flight
> > > > RPC[1]. This is the result of a previous mailing list discussion[2].
> > > >
> > > > I have created the Pull Request[3] to make the proposal a little more
> > > > concrete.
> > > > 
> > > > Please let me know if you have any questions or concerns.
> > > >
> > > > Best,
> > > > Ryan
> > > >
> > > > [1]:
> > > >
> >
> https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing
> > > > [2]:
> > > >
> >
> https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E
> > > > [3]: https://github.com/apache/arrow/pull/4980
> > > >
> >
>
>
> --
>
> Ryan Murray  | Principal Consulting Engineer
>
> +447540852009 | rym...@dremio.com
>
>


[jira] [Created] (ARROW-6101) [Rust] [DataFusion] Create physical plan from logical plan

2019-08-01 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6101:
-

 Summary: [Rust] [DataFusion] Create physical plan from logical plan
 Key: ARROW-6101
 URL: https://issues.apache.org/jira/browse/ARROW-6101
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Andy Grove
Assignee: Andy Grove


Once the physical plan is in place and can be executed, I will implement logic 
to convert the logical plan to a physical plan and remove the legacy code for 
directly executing a logical plan.





Re: Metadata orderedness?

2019-08-01 Thread Wes McKinney
I think that orderedness should not matter for equality testing.
Semantically I think that this field is supposed to be dictionary-like
and the keys are intended to be unique (but this isn't stipulated in
Schema.fbs at the moment)
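
For illustration, a hedged sketch of order-insensitive comparison under that 
uniqueness assumption (plain Python, not the C++ API):

{code}
# Hedged sketch: with unique keys, key/value metadata can be compared as a
# mapping, which makes equality insensitive to pair order.
a = [('k1', 'v1'), ('k2', 'v2')]
b = [('k2', 'v2'), ('k1', 'v1')]
print(a == b)              # False: list comparison is order-sensitive
print(dict(a) == dict(b))  # True: mapping comparison ignores order
{code}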

On Thu, Aug 1, 2019 at 10:47 AM Antoine Pitrou  wrote:
>
>
> Hello,
>
> Is key/value metadata (as attached to fields) supposed to be ordered or
> unordered?  In the C++ codebase currently, order is significant in
> KeyValueMetadata::Equals().
>
> Regards
>
> Antoine.


[jira] [Created] (ARROW-6100) [Rust] Pin to specific Rust nightly release

2019-08-01 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6100:
-

 Summary: [Rust] Pin to specific Rust nightly release
 Key: ARROW-6100
 URL: https://issues.apache.org/jira/browse/ARROW-6100
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.15.0


Builds are currently non-deterministic because rust-toolchain contains 
"nightly", meaning "use the latest nightly release of Rust". This can cause 
seemingly random build failures in CI. I propose we modify rust-toolchain to 
refer to a specific nightly release, e.g. "nightly-2019-07-31", so that builds 
are deterministic.

We can update this nightly version when needed (e.g. to pick up new features) 
as part of the regular PR process.
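
As a sketch, the entire rust-toolchain file at the repository root would then 
contain just the pinned release name (the plain-text format takes no comments):

{code}
nightly-2019-07-31
{code}

Rustup picks this file up automatically, so every build of the repository uses 
exactly that toolchain.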





[jira] [Created] (ARROW-6099) [JAVA] Has the ability to not using slf4j logging framework

2019-08-01 Thread Haowei Yu (JIRA)
Haowei Yu created ARROW-6099:


 Summary: [JAVA] Has the ability to not using slf4j logging 
framework
 Key: ARROW-6099
 URL: https://issues.apache.org/jira/browse/ARROW-6099
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Affects Versions: 0.14.1
Reporter: Haowei Yu


Currently, the Java library calls the slf4j API directly, and there is no 
abstraction layer. This means users need to install slf4j as a requirement even 
if they don't use slf4j at all.

It would be best to change the slf4j dependency to provided scope and log 
content only if an slf4j jar file is provided at runtime.
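
A hedged sketch of the Maven change being suggested (the version shown is 
illustrative only):

{code:java}
<!-- Sketch: declare slf4j-api with provided scope so it is available at
     compile time but is not forced on consumers as a runtime dependency. -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <version>1.7.25</version>
  <scope>provided</scope>
</dependency>
{code}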





[jira] [Created] (ARROW-6098) [C++] Partially mitigating CPU scaling effects in benchmarks

2019-08-01 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6098:
---

 Summary: [C++] Partially mitigating CPU scaling effects in 
benchmarks
 Key: ARROW-6098
 URL: https://issues.apache.org/jira/browse/ARROW-6098
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


We have a lot of benchmarks that return results based on a single iteration


{code}
(arrow-3.7) 10:46 ~/code/arrow/cpp/build  (master)$ ./release/arrow-builder-benchmark --benchmark_filter=Dict
2019-08-01 10:46:03
Running ./release/arrow-builder-benchmark
Run on (12 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 256K (x6)
  L3 Unified 12288K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
------------------------------------------------------------------------------
BuildInt64DictionaryArrayRandom      622889286 ns  622864485 ns          1   411.004MB/s
BuildInt64DictionaryArraySequential  546764048 ns  545992395 ns          1   468.871MB/s
BuildInt64DictionaryArraySimilar     737759293 ns  737696850 ns          1   347.026MB/s
BuildStringDictionaryArray           985433473 ns  985363901 ns          1   346.608MB/s
(arrow-3.7) 10:46 ~/code/arrow/cpp/build  (master)$ ./release/arrow-builder-benchmark --benchmark_filter=Dict
2019-08-01 10:46:09
Running ./release/arrow-builder-benchmark
Run on (12 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 256K (x6)
  L3 Unified 12288K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
------------------------------------------------------------------------------
BuildInt64DictionaryArrayRandom      527063570 ns  527044023 ns          1   485.728MB/s
BuildInt64DictionaryArraySequential  566285427 ns  566270336 ns          1   452.081MB/s
BuildInt64DictionaryArraySimilar     762954193 ns  762332297 ns          1   335.812MB/s
BuildStringDictionaryArray           991095766 ns  991018875 ns          1   344.63MB/s
{code}

I'm sure the result here is being heavily affected by CPU scaling, but I think 
we can mitigate the impact of CPU scaling by using `MinTime`. I find that 
adding `MinTime(1.0)` to these particular benchmarks makes them more consistent.





Metadata orderedness?

2019-08-01 Thread Antoine Pitrou


Hello,

Is key/value metadata (as attached to fields) supposed to be ordered or
unordered?  In the C++ codebase currently, order is significant in
KeyValueMetadata::Equals().

Regards

Antoine.


Re: [DISCUSS] Add GetFlightSchema to Flight RPC

2019-08-01 Thread Ryan Murray
Thanks Wes,

The descriptor is only there to maintain a bit of symmetry with
GetFlightInfo. Happy to remove it; I don't think it's necessary, and a few
people already agree. Likewise with the method name: I am neutral on the
naming and can call it whatever the community is happy with.

Best,
Ryan

On Thu, Aug 1, 2019 at 3:56 PM Wes McKinney  wrote:

> I'm generally supportive of adding the new RPC endpoint.
>
> To make a couple points from the document
>
> * I'm not sure what the purpose of returning the FlightDescriptor is,
> but I haven't thought too hard about it
> * The Schema consists of a single IPC message -- dictionaries will
> appear in the actual DoGet stream. To motivate why this is --
> different endpoints might have different dictionaries corresponding to
> fields in the schema, to have static/constant dictionaries in a
> distributed Flight setting is likely to be impractical. I summarize
> the issue as "dictionaries are data, not metadata".
> * I would be OK calling this GetSchema instead of GetFlightSchema but
> either is okay
>
> - Wes
>
> On Thu, Aug 1, 2019 at 8:08 AM David Li  wrote:
> >
> > Hi Ryan,
> >
> > Thanks for writing this up! I made a couple of minor comments in the
> > doc/implementation, but overall I'm in favor of having this RPC
> > method.
> >
> > Best,
> > David
> >
> > On 8/1/19, Ryan Murray  wrote:
> > > Hi All,
> > >
> > > Please see the attached document for a proposed addition to the Flight
> > > RPC[1]. This is the result of a previous mailing list discussion[2].
> > >
> > > I have created the Pull Request[3] to make the proposal a little more
> > > concrete.
> > > 
> > > Please let me know if you have any questions or concerns.
> > >
> > > Best,
> > > Ryan
> > >
> > > [1]:
> > >
> https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing
> > > [2]:
> > >
> https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E
> > > [3]: https://github.com/apache/arrow/pull/4980
> > >
>


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com




[jira] [Created] (ARROW-6097) [Java] Avro adapter implement unions type

2019-08-01 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6097:
-

 Summary: [Java] Avro adapter implement unions type
 Key: ARROW-6097
 URL: https://issues.apache.org/jira/browse/ARROW-6097
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Ji Liu
Assignee: Ji Liu


Support converting union types like ["string"], ["string", "int"] and nullable 
["string", "int", "null"].





Re: [DISCUSS] Add GetFlightSchema to Flight RPC

2019-08-01 Thread Wes McKinney
I'm generally supportive of adding the new RPC endpoint.

To make a couple points from the document

* I'm not sure what the purpose of returning the FlightDescriptor is,
but I haven't thought too hard about it
* The Schema consists of a single IPC message -- dictionaries will
appear in the actual DoGet stream. To motivate why this is --
different endpoints might have different dictionaries corresponding to
fields in the schema, to have static/constant dictionaries in a
distributed Flight setting is likely to be impractical. I summarize
the issue as "dictionaries are data, not metadata".
* I would be OK calling this GetSchema instead of GetFlightSchema but
either is okay

- Wes

On Thu, Aug 1, 2019 at 8:08 AM David Li  wrote:
>
> Hi Ryan,
>
> Thanks for writing this up! I made a couple of minor comments in the
> doc/implementation, but overall I'm in favor of having this RPC
> method.
>
> Best,
> David
>
> On 8/1/19, Ryan Murray  wrote:
> > Hi All,
> >
> > Please see the attached document for a proposed addition to the Flight
> > RPC[1]. This is the result of a previous mailing list discussion[2].
> >
> > I have created the Pull Request[3] to make the proposal a little more
> > concrete.
> > 
> > Please let me know if you have any questions or concerns.
> >
> > Best,
> > Ryan
> >
> > [1]:
> > https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing
> > [2]:
> > https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E
> > [3]: https://github.com/apache/arrow/pull/4980
> >


[jira] [Created] (ARROW-6096) [C++] Remove dependency on boost regex library

2019-08-01 Thread Hatem Helal (JIRA)
Hatem Helal created ARROW-6096:
--

 Summary: [C++] Remove dependency on boost regex library
 Key: ARROW-6096
 URL: https://issues.apache.org/jira/browse/ARROW-6096
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Hatem Helal
Assignee: Hatem Helal


There appears to be only one place where the boost regex library is used:

[cpp/src/parquet/metadata.cc|https://github.com/apache/arrow/blob/eb73b962e42b5ae6983bf026ebf825f1f707e245/cpp/src/parquet/metadata.cc#L32]

I think this can be replaced by the C++11 regex library.





[jira] [Created] (ARROW-6095) [C++] Python subproject ignores ARROW_TEST_LINKAGE

2019-08-01 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6095:


 Summary: [C++] Python subproject ignores ARROW_TEST_LINKAGE
 Key: ARROW-6095
 URL: https://issues.apache.org/jira/browse/ARROW-6095
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


The Python subproject links to arrow_python_shared and other shared libraries 
regardless of ARROW_TEST_LINKAGE: 
https://github.com/apache/arrow/blob/eb5dd50/cpp/src/arrow/python/CMakeLists.txt#L131-L132





Re: [DISCUSS] Add GetFlightSchema to Flight RPC

2019-08-01 Thread David Li
Hi Ryan,

Thanks for writing this up! I made a couple of minor comments in the
doc/implementation, but overall I'm in favor of having this RPC
method.

Best,
David

On 8/1/19, Ryan Murray  wrote:
> Hi All,
>
> Please see the attached document for a proposed addition to the Flight
> RPC[1]. This is the result of a previous mailing list discussion[2].
>
> I have created the Pull Request[3] to make the proposal a little more
> concrete.
> 
> Please let me know if you have any questions or concerns.
>
> Best,
> Ryan
>
> [1]:
> https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing
> [2]:
> https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E
> [3]: https://github.com/apache/arrow/pull/4980
>


Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

2019-08-01 Thread Edward Loper
Brian: yes, you're correct.  Sorry, I've been playing around with a couple
different ways to extend things, and was conflating them when I wrote my
response.  For this proposal, the dimension must have the same size for all
items in a given record batch.

As suggested by Francois and Wes, I will look into using the ExtensionType
to implement this proposal.

-Edward

On Wed, Jul 31, 2019 at 1:00 PM Brian Hulette  wrote:

> I'm a little confused about the proposal now. If the unknown dimension
> doesn't have to be the same within a record batch, how would you be able to
> deduce it with the approach you described (dividing the logical length of
> the values array by the length of the record batch)?
>
> On Wed, Jul 31, 2019 at 8:24 AM Wes McKinney  wrote:
>
> > I agree this sounds like a good application for ExtensionType. At
> > minimum, ExtensionType can be used to develop a working version of
> > what you need to help guide further discussions.
> >
> > On Mon, Jul 29, 2019 at 2:29 PM Francois Saint-Jacques
> >  wrote:
> > >
> > > Hello,
> > >
> > > if each record has a different size, then I suggest to just use a
> > > Struct<Dim, List<T>> where Dim is a struct (or expand in the outer
> > > struct). You can probably add your own logic with the recently
> > > introduced ExtensionType [1].
> > >
> > > François
> > > [1]
> > > https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/extension_type.h
> > >
> > > On Mon, Jul 29, 2019 at 3:15 PM Edward Loper  wrote:
> > > >
> > > > The intention is that each individual record could have a different
> > > > size. This could be consistent within a given batch, but wouldn't
> > > > need to be. For example, if I wanted to send a 3-channel image, but
> > > > the image size may vary for each record, then I could use
> > > > FixedSizeList<FixedSizeList<FixedSizeList<uint8>[3]>[-1]>[-1].
> > > >
> > > > On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette  wrote:
> > > >
> > > > > This isn't really relevant but I feel compelled to point it out -
> > > > > the FixedSizeList type has actually been in the Arrow spec for a
> > > > > while, but it was only implemented in JS and Java initially. It
> > > > > was implemented in C++ just a few months ago.
> > > > >
> > > >
> > > > Thanks for the clarification -- I was going based on the blame
> > > > history for Layout.rst, but I guess it just didn't get officially
> > > > documented there until the C++ implementation was added.
> > > >
> > > > -Edward
> > > >
> > > >
> > > > > On Mon, Jul 29, 2019 at 7:01 AM Edward Loper  wrote:
> > > > >
> > > > > > The FixedSizeList type, which was added to Arrow a few months
> > > > > > ago, is an array where each slot contains a fixed-size sequence
> > > > > > of values. It is specified as FixedSizeList<T>[N], where T is a
> > > > > > child type and N is a signed int32 that specifies the length of
> > > > > > each list.
> > > > > >
> > > > > > This is useful for encoding fixed-size tensors. E.g., if I have
> > > > > > a 100x8x10 tensor, then I can encode it as
> > > > > > FixedSizeList<FixedSizeList<FixedSizeList<T>[10]>[8]>[100].
> > > > > >
> > > > > > But I'm also interested in encoding tensors where some dimension
> > > > > > sizes are not known in advance. It seems to me that
> > > > > > FixedSizeList could be extended to support this fairly easily,
> > > > > > by simply defining that N=-1 means "each array slot has the same
> > > > > > length, but that length is not known in advance." So e.g. we
> > > > > > could encode a 100x?x10 tensor as
> > > > > > FixedSizeList<FixedSizeList<FixedSizeList<T>[10]>[-1]>[100].
> > > > > >
> > > > > > Since these N=-1 row-lengths are not encoded in the type, we
> > > > > > need some way to determine what they are. Luckily, every Field
> > > > > > in the schema has a corresponding FieldNode in the message; and
> > > > > > those FieldNodes can be used to deduce the row lengths. In
> > > > > > particular, the row length must be equal to the length of the
> > > > > > child node divided by the length of the FixedSizeList. E.g., if
> > > > > > we have a FixedSizeList<byte>[-1] array with the values [[1, 2],
> > > > > > [3, 4], [5, 6]] then the message representation is:
> > > > > >
> > > > > > * Length: 3, Null count: 0
> > > > > > * Null bitmap buffer: Not required
> > > > > > * Values array (byte array):
> > > > > > * Length: 6, Null count: 0
> > > > > > * Null bitmap buffer: Not required
> > > > > > * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding bytes>]
> > > > > >
> > > > > > So we can deduce that the row length is 6/3=2.
> > > > > >
> > > > > > It looks to me like it would be fairly easy to add support for
> > > > > > this. E.g., in the FixedSizeListArray constructor in C++, if
> > > > > > list_type()->list_size() is -1, then set list_size_ to
> > > > > > values.length()/length. There would be no changes to the
> > > > > > schema.fbs/message.fbs files -- we wo

[DISCUSS] Add GetFlightSchema to Flight RPC

2019-08-01 Thread Ryan Murray
Hi All,

Please see the attached document for a proposed addition to the Flight
RPC[1]. This is the result of a previous mailing list discussion[2].

I have created the Pull Request[3] to make the proposal a little more
concrete.

Please let me know if you have any questions or concerns.

Best,
Ryan

[1]:
https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing
[2]:
https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E
[3]: https://github.com/apache/arrow/pull/4980


[jira] [Created] (ARROW-6094) Add GetFlightSchema to Flight RPC

2019-08-01 Thread Ryan Murray (JIRA)
Ryan Murray created ARROW-6094:
--

 Summary: Add GetFlightSchema to Flight RPC
 Key: ARROW-6094
 URL: https://issues.apache.org/jira/browse/ARROW-6094
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, FlightRPC, Java, Python
Reporter: Ryan Murray
Assignee: Ryan Murray
 Fix For: 0.15.0


Implement GetFlightSchema as per 
https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing

and 
https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E





[jira] [Created] (ARROW-6093) [Java] reduce branches in algo for first match in VectorRangeSearcher

2019-08-01 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-6093:
-

 Summary: [Java] reduce branches in algo for first match in 
VectorRangeSearcher
 Key: ARROW-6093
 URL: https://issues.apache.org/jira/browse/ARROW-6093
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Pindikura Ravindra


This is a follow-up Jira for the improvement suggested by [~fsaintjacques] in 
the PR:

[https://github.com/apache/arrow/pull/4925]





[jira] [Created] (ARROW-6092) [C++] unit test failure due to unexpected result

2019-08-01 Thread Lee June Woo (JIRA)
Lee June Woo created ARROW-6092:
---

 Summary: [C++] unit test failure due to unexpected result 
 Key: ARROW-6092
 URL: https://issues.apache.org/jira/browse/ARROW-6092
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
 Environment: python 2.7.13 in Mac OS
Reporter: Lee June Woo
 Attachments: arrow-python-test.txt

Please check the attached test log "arrow-python-test.txt".

The unit tests were executed with Python 2.7.13. I'm not sure whether this 
project still supports that version.



[ RUN      ] CheckPyError.TestStatus
/Users/OpenSourceProject/arrow_repo/arrow/cpp/src/arrow/python/python-test.cc:94: Failure
Expected equality of these values:
  detail->ToString()
    Which is: "Python exception: exceptions.TypeError"
  expected_detail
    Which is: "Python exception: TypeError"
/Users/OpenSourceProject/arrow_repo/arrow/cpp/src/arrow/python/python-test.cc:94: Failure
Expected equality of these values:
  detail->ToString()
    Which is: "Python exception: exceptions.NotImplementedError"
  expected_detail
    Which is: "Python exception: NotImplementedError"
[  FAILED  ] CheckPyError.TestStatus (0 ms)
[ RUN      ] CheckPyError.TestStatusNoGIL
/Users/OpenSourceProject/arrow_repo/arrow/cpp/src/arrow/python/python-test.cc:144: Failure
Expected equality of these values:
  st.detail()->ToString()
    Which is: "Python exception: exceptions.ZeroDivisionError"
  "Python exception: ZeroDivisionError"
[  FAILED  ] CheckPyError.TestStatusNoGIL (0 ms)
[--] 2 tests from CheckPyError (0 ms total)

[--] 1 test from RestorePyError
[ RUN      ] RestorePyError.Basics
/Users/OpenSourceProject/arrow_repo/arrow/cpp/src/arrow/python/python-test.cc:154: Failure
Expected equality of these values:
  st.detail()->ToString()
    Which is: "Python exception: exceptions.ZeroDivisionError"


