Re: [DISCUSS] PR Backlog reduction

2019-05-29 Thread Micah Kornfield
That sounds great, Wes. Thank you and your team for taking it on.

Can you clarify: would you prefer this approach to the one I proposed
above (i.e. should I delete the spreadsheet), or are they complementary?

Thanks,
Micah

On Wed, May 29, 2019 at 12:07 PM Wes McKinney  wrote:

> On the call today we discussed possibly repurposing the Spark PR
> dashboard application for our use
>
> * https://github.com/databricks/spark-pr-dashboard
> * https://spark-prs.appspot.com/
>
> This is a project that my team could take on this year sometime
>
> On Wed, May 29, 2019 at 4:12 AM Fan Liya  wrote:
> >
> > Sounds like a great idea. I am interested in Java PRs.
> >
> > Best,
> > Liya Fan
> >
> > On Wed, May 29, 2019 at 1:28 PM Micah Kornfield 
> > wrote:
> >
> > > Sorry for the delay.  I created
> > >
> > >
> https://docs.google.com/spreadsheets/d/146lDg11c5ohgVkrOglrb42a1JB0Gm1qBRbnoDlvB8QY/edit#gid=0
> > > as a
> > > simple way to distribute old PRs. If you are interested in helping,
> > > please add a comment under the language and I'll add you.
> > >
> > > PMC/Committers, I can share edit access if you let me know which e-mail
> > > account I should grant access to.
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Tue, May 21, 2019 at 9:22 PM Micah Kornfield  >
> > > wrote:
> > >
> > > > I agree on hand curation for now.
> > > >
> > > >  I'll try to set up a sign-up spreadsheet for shepherding old PRs and
> once
> > > > that's done, assign reviewers/ping old PRs.  I expect to have something
> to
> > > > share by the weekend.
> > > >
> > > > On Tuesday, May 21, 2019, Wes McKinney  wrote:
> > > >
> > > >> I think maintainers or contributors should be responsible for
> closing
> > > >> PRs, it also helps with backlog curation (sometimes when a stale PR
> is
> > > >> closed the JIRA may also be closed if it's a Won't Fix)
> > > >>
> > > >> On Tue, May 21, 2019 at 1:12 PM Antoine Pitrou 
> > > >> wrote:
> > > >> >
> > > >> >
> > > >> >
> > > >> > Le 21/05/2019 à 20:02, Neal Richardson a écrit :
> > > >> > > Automatically close stale PRs? https://github.com/probot/stale
> > > >> >
> > > >> > That doesn't sound like a good idea to me.
> > > >> >
> > > >> > Regards
> > > >> >
> > > >> > Antoine.
> > > >>
> > > >
> > >
>


[jira] [Created] (ARROW-5445) [Website] Remove language that encourages pinning a version

2019-05-29 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-5445:
--

 Summary: [Website] Remove language that encourages pinning a 
version
 Key: ARROW-5445
 URL: https://issues.apache.org/jira/browse/ARROW-5445
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Neal Richardson
 Fix For: 1.0.0


See [https://github.com/apache/arrow/pull/4411#discussion_r288804415]. Whenever 
we decide to stop threatening to break APIs (1.0 release or otherwise), purge 
any recommendations like this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5444) [Release][Website] After 0.14 release, update what is an "official" release

2019-05-29 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-5444:
--

 Summary: [Release][Website] After 0.14 release, update what is an 
"official" release
 Key: ARROW-5444
 URL: https://issues.apache.org/jira/browse/ARROW-5444
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Affects Versions: 0.13.0
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.14.0


See [https://github.com/apache/arrow/pull/4411#discussion_r288846129]. C++ and 
GLib (C) Packages for Debian GNU/Linux, Ubuntu and CentOS should be "official". 
And maybe we can make some of the others official too?





Re: ARROW-4714: Providing JNI interface to Read ORC file via Arrow C++

2019-05-29 Thread Yurui Zhou
Hey guys:

Currently all the comments have been resolved and all the builds and tests 
pass.

Are there any other general comments regarding these changes?

Yurui
On 21 May 2019, 10:36 AM +0800, Yurui Zhou , wrote:
> Hi Micah:
>
> Thanks for the response. According to our benchmark, cpp-orc is on 
> average 1% to 10% slower than java-orc,
> while the on-heap to off-heap memory conversion overhead can easily outweigh 
> such a performance difference.
> We are currently also working on some performance improvement patches to 
> cpp-orc to make sure it achieves at least the same performance as java-orc.
>
> Thanks
> Yurui
> On 20 May 2019, 9:22 PM +0800, Micah Kornfield , wrote:
> > Hi Yurui,
> > This is cool, I will try to leave some comments tonight.
> >
> > Reading the JIRA it references the conversion from on-heap to off heap
> > memory being the performance issue. Now that Arrow Java can point at
> > arbitrary memory do you know the performance delta between java-orc and
> > cpp-orc? (I'm wondering if we should do something similar for parquet-cpp)
> >
> > Thanks,
> > Micah
> >
> > On Monday, May 20, 2019, Yurui Zhou  wrote:
> >
> > > Hi Guys:
> > >
> > > I just created a PR with WIP changes about adding JNI interface for
> > > reading orc files.
> > >
> > > All the major changes have been done and I would like some early feedback
> > > from the community.
> > >
> > > Feel free to take a look and leave your feedback.
> > > https://github.com/apache/arrow/pull/4348
> > >
> > > Some cleanup and unit tests will be added in follow-up iterations.
> > >
> > > Thanks
> > > Yurui
> > >
> > >


[jira] [Created] (ARROW-5443) [Gandiva][Crossbow] Turn parquet encryption off

2019-05-29 Thread Praveen Kumar Desabandu (JIRA)
Praveen Kumar Desabandu created ARROW-5443:
--

 Summary: [Gandiva][Crossbow] Turn parquet encryption off
 Key: ARROW-5443
 URL: https://issues.apache.org/jira/browse/ARROW-5443
 Project: Apache Arrow
  Issue Type: Task
  Components: Packaging
Reporter: Praveen Kumar Desabandu
Assignee: Praveen Kumar Desabandu
 Fix For: 0.14.0


Turn off Parquet Encryption for the Gandiva crossbow builds, since it pulls in 
additional dependencies that are not currently required/present.





Re: [DISCUSS] PR Backlog reduction

2019-05-29 Thread Wes McKinney
On the call today we discussed possibly repurposing the Spark PR
dashboard application for our use

* https://github.com/databricks/spark-pr-dashboard
* https://spark-prs.appspot.com/

This is a project that my team could take on this year sometime

On Wed, May 29, 2019 at 4:12 AM Fan Liya  wrote:
>
> Sounds like a great idea. I am interested in Java PRs.
>
> Best,
> Liya Fan
>
> On Wed, May 29, 2019 at 1:28 PM Micah Kornfield 
> wrote:
>
> > Sorry for the delay.  I created
> >
> > https://docs.google.com/spreadsheets/d/146lDg11c5ohgVkrOglrb42a1JB0Gm1qBRbnoDlvB8QY/edit#gid=0
> > as a
> > simple way to distribute old PRs. If you are interested in helping, please
> > add a comment under the language and I'll add you.
> >
> > PMC/Committers, I can share edit access if you let me know which e-mail
> > account I should grant access to.
> >
> > Thanks,
> > Micah
> >
> > On Tue, May 21, 2019 at 9:22 PM Micah Kornfield 
> > wrote:
> >
> > > I agree on hand curation for now.
> > >
> > > >  I'll try to set up a sign-up spreadsheet for shepherding old PRs and once
> > > > that's done, assign reviewers/ping old PRs.  I expect to have something to
> > > share by the weekend.
> > >
> > > On Tuesday, May 21, 2019, Wes McKinney  wrote:
> > >
> > >> I think maintainers or contributors should be responsible for closing
> > >> PRs, it also helps with backlog curation (sometimes when a stale PR is
> > >> closed the JIRA may also be closed if it's a Won't Fix)
> > >>
> > >> On Tue, May 21, 2019 at 1:12 PM Antoine Pitrou 
> > >> wrote:
> > >> >
> > >> >
> > >> >
> > >> > Le 21/05/2019 à 20:02, Neal Richardson a écrit :
> > >> > > Automatically close stale PRs? https://github.com/probot/stale
> > >> >
> > >> > That doesn't sound like a good idea to me.
> > >> >
> > >> > Regards
> > >> >
> > >> > Antoine.
> > >>
> > >
> >


Re: Arrow sync call tomorrow (May 29) at 12:00 US/Eastern, 16:00 UTC

2019-05-29 Thread Neal Richardson
Attendees:
* Bryan Cutler
* François Saint-Jacques
* John Muehlhausen
* Neal Richardson
* Praveen Kumar
* Wes McKinney

John: Raised question of custom metadata in file footer (see
https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E).
Agreed to discuss on the mailing list.

Wes: PR backlog management (see
https://lists.apache.org/thread.html/b667cb745c74467a5391edc767e7ca91bfb78387702e9d805eac0e5e@%3Cdev.arrow.apache.org%3E).
Raised the example of https://spark-prs.appspot.com/. Neal volunteered to
spend a couple hours trying to point that (
https://github.com/databricks/spark-pr-dashboard) at the Arrow GitHub and
JIRA. Also raised the issue that visibility into issues is only part of the
problem, that we may need to define clearer procedures and responsibilities
for reviewing and merging PRs. Often unclear who needs to take action next:
better visibility is part of the solution but probably not enough on its
own.


On Tue, May 28, 2019 at 12:35 PM Neal Richardson <
neal.p.richard...@gmail.com> wrote:

> Hi all,
> Reminder that the biweekly Arrow call is tomorrow at
> https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes will
> be sent out to the mailing list afterwards.
>
> Neal
>


Re: Column/Partition Pruning

2019-05-29 Thread Russell Jurney
I've got things working like this:

# Test ticker
ticker = 'AAPL'

from pyarrow.parquet import ParquetDataset

stocks_close_ds = ParquetDataset(
    'data/v4.parquet',
    filters=[('Ticker', '=', ticker)]
)
table = stocks_close_ds.read()
stocks_close_df = table.to_pandas()

stocks_close_df.head() # prints the filtered pandas.DataFrame


I'll look at getting this working in pandas.

Thanks,
Russell Jurney @rjurney
russell.jur...@gmail.com | datasyndrome.com


On Tue, May 28, 2019 at 6:56 PM Wes McKinney  wrote:

> hi Russell -- yes, you can use ParquetDataset directly and read to pandas.
>
> We have been discussing a more extensive Datasets framework in C++
> that will also support multiple file formats and pluggable partition
> schemes, read more at
>
>
> https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit
>
> On Tue, May 28, 2019 at 8:21 PM Russell Jurney 
> wrote:
> >
> > Thanks, Joris. It looks like filters isn't a valid argument for
> > pandas.read_parquet. Is it possible to instantiate a
> > pyarrow.parquet.ParquetDataset and then convert it to a pandas.DataFrame
> > and have the same effect?
> >
> > I filed an issue here: https://github.com/pandas-dev/pandas/issues/26551
> >
> > Thanks,
> > Russell Jurney @rjurney
> > russell.jur...@gmail.com | datasyndrome.com
> >
> >
> > On Mon, May 27, 2019 at 11:06 PM Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> > > Hi Russel,
> > >
> > > Yes and no. When specifying a column selection with read_parquet,
> indeed
> > > only the relevant columns will be loaded (since Parquet is a columnar
> > > storage, this is possible).
> > > But the filtering you show is done on the returned pandas DataFrame.
> And
> > > currently, pandas does not support any lazy operations, so the
> dataframe
> > > returned by read_parquet (stocks_close_df) is the full, materialized
> > > dataframe on which you then filter a subset.
> > >
> > > But, filtering could also be done *when* reading the parquet file(s),
> to
> > > actually prevent reading everything into memory. However, this is only
> > > partly implemented in pyarrow at this moment. If you have a dataset
> > > consisting of partitioned files in nested directories (Hive like),
> pyarrow
> > > can filter on which files to read. See the "filters" keyword of
> > > ParquetDataset (
> > >
> > >
> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
> > > ).
> > > I am only not fully sure you can already use this through the pandas
> > > interface, it might be you need to use the pyarrow interface directly
> (in
> > > which case, feel free to open an issue on the pandas issue tracker).
> > > For filtering row groups within files, this is not yet implemented,
> there
> > > is an open issue: https://issues.apache.org/jira/browse/ARROW-1796.
> > >
> > > Best,
> > > Joris
> > >
> > > Op di 28 mei 2019 om 03:26 schreef Russell Jurney <
> > > russell.jur...@gmail.com
> > > >:
> > >
> > > > Hello, I am wondering if pandas.read_parquet(engine='pyarrow') takes
> > > > advantage of Parquet by only loading the relevant columns and by
> using
> > > the
> > > > partition column(s) sub-directories if a partition column is
> included in
> > > > the load and then filtered on later? Looking at the code for
> > > > pandas.read_parquet it is not clear.
> > > >
> > > > For example something like:
> > > >
> > > > stocks_close_df = pd.read_parquet(
> > > >     'data/v4.parquet',
> > > >     columns=['DateTime', 'Close', 'Ticker'],
> > > >     engine='pyarrow'
> > > > )
> > > >
> > > > # Filter the data to just this ticker
> > > > stocks_close_df = stocks_close_df[stocks_close_df.Ticker == ticker][[
> > > >     'DateTime', 'Close']]
> > > >
> > > > Thanks,
> > > > Russell Jurney @rjurney
> > > > russell.jur...@gmail.com | datasyndrome.com
> > > >
> > >
>


[jira] [Created] (ARROW-5442) [Website] Clarify what makes a release artifact "official"

2019-05-29 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-5442:
--

 Summary: [Website] Clarify what makes a release artifact "official"
 Key: ARROW-5442
 URL: https://issues.apache.org/jira/browse/ARROW-5442
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Neal Richardson
Assignee: Neal Richardson


See discussion here: 
[https://github.com/apache/arrow/pull/4401#discussion_r288348562]

In order to minimize FUD, add some explanation and don't shout "unofficial" 
everywhere. 
[https://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support?action=show=Distribution]
 is one model.





Re: [DISCUSS] Parquet C++/Rust: Rename Parquet::LogicalType to Parquet::ConvertedType

2019-05-29 Thread Francois Saint-Jacques
+1 on renaming to avoid confusion, even at the cost of breaking some API.

On Wed, May 29, 2019 at 10:09 AM Wes McKinney  wrote:
>
> I'm in favor of making the change -- it's slightly disruptive for
> library-users, but the fix is no more complicated than a
> search-and-replace. When the C++ project was started, the LogicalType
> union didn't exist and "LogicalType" seemed like a more appropriate
> name for ConvertedType.
>
> On Wed, May 29, 2019 at 7:11 AM Wes McKinney  wrote:
> >
> > You all probably want to join d...@parquet.apache.org and have the
> > discussion there. From a governance perspective that's where we need
> > to talk about making breaking changes to the Parquet C++ library
> >
> > LogicalType was introduced into the Parquet format in October 2017 to
> > be a more flexible and future-proof replacement for the original
> > ConvertedType metadata, see
> >
> > https://github.com/apache/parquet-format/commit/863875e0be3237c6aa4ed71733d54c91a51deabe
> >
> > Support and forward/backwards compatibility for the new LogicalType
> > union was just developed in PARQUET-1411
> >
> > https://github.com/apache/arrow/commit/38b1ddfb7f5def825ac57c8f27ffe5afa7fcb483
> >
> > On Wed, May 29, 2019 at 4:44 AM Joris Van den Bossche
> >  wrote:
> > >
> > > Yes, the LogicalType is newer than ConvertedType in the parquet format, 
> > > and
> > > was until recently not implemented in parquet-cpp.
> > > The problem is that originally, the parquet thrift::ConvertedType was
> > > implemented in parquet-cpp as LogicalType. Now, support is added in
> > > parquet-cpp for this newer parquet thrift::LogicalType, but the obvious
> > > name for that in parquet-cpp was already taken. Therefore, it was added as
> > > parquet::LogicalAnnotation. See this PR for context:
> > > https://github.com/apache/arrow/pull/4185
> > >
> > > So Deepak's question is if we can rename parquet-cpp's 
> > > parquet::LogicalType
> > > to parquet::ConvertedType (to match the thrift format naming), so we can
> > > actually use the logical name parquet::LogicalType instead of
> > > parquet::LogicalAnnotation for the new implementation.
> > > And to avoid the confusion we are having here ..
> > >
> > > But renaming like that would be hard break in parquet-cpp for libraries
> > > depending on that, though. But I can't really assess the impact of that.
> > >
> > > Best,
> > > Joris
> > >
> > > Op wo 29 mei 2019 om 11:04 schreef Antoine Pitrou :
> > >
> > > >
> > > > Le 29/05/2019 à 10:47, Deepak Majeti a écrit :
> > > > > "ConvertedType" term is used by the parquet specification below. This
> > > > type
> > > > > is used to map client data types to Parquet types.
> > > > >
> > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L48
> > > >
> > > > But apparently there's also "LogicalType":
> > > >
> > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L315
> > > >
> > > > "LogicalType annotations to replace ConvertedType"
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >


Re: [DISCUSS] Parquet C++/Rust: Rename Parquet::LogicalType to Parquet::ConvertedType

2019-05-29 Thread Chao Sun
I'm +1 on the change for the Rust side as well. It probably won't be as
disruptive as the C++ side.

On Wed, May 29, 2019 at 7:09 AM Wes McKinney  wrote:

> I'm in favor of making the change -- it's slightly disruptive for
> library-users, but the fix is no more complicated than a
> search-and-replace. When the C++ project was started, the LogicalType
> union didn't exist and "LogicalType" seemed like a more appropriate
> name for ConvertedType.
>
> On Wed, May 29, 2019 at 7:11 AM Wes McKinney  wrote:
> >
> > You all probably want to join d...@parquet.apache.org and have the
> > discussion there. From a governance perspective that's where we need
> > to talk about making breaking changes to the Parquet C++ library
> >
> > LogicalType was introduced into the Parquet format in October 2017 to
> > be a more flexible and future-proof replacement for the original
> > ConvertedType metadata, see
> >
> >
> https://github.com/apache/parquet-format/commit/863875e0be3237c6aa4ed71733d54c91a51deabe
> >
> > Support and forward/backwards compatibility for the new LogicalType
> > union was just developed in PARQUET-1411
> >
> >
> https://github.com/apache/arrow/commit/38b1ddfb7f5def825ac57c8f27ffe5afa7fcb483
> >
> > On Wed, May 29, 2019 at 4:44 AM Joris Van den Bossche
> >  wrote:
> > >
> > > Yes, the LogicalType is newer than ConvertedType in the parquet
> format, and
> > > was until recently not implemented in parquet-cpp.
> > > The problem is that originally, the parquet thrift::ConvertedType was
> > > implemented in parquet-cpp as LogicalType. Now, support is added in
> > > parquet-cpp for this newer parquet thrift::LogicalType, but the obvious
> > > name for that in parquet-cpp was already taken. Therefore, it was
> added as
> > > parquet::LogicalAnnotation. See this PR for context:
> > > https://github.com/apache/arrow/pull/4185
> > >
> > > So Deepak's question is if we can rename parquet-cpp's
> parquet::LogicalType
> > > to parquet::ConvertedType (to match the thrift format naming), so we
> can
> > > actually use the logical name parquet::LogicalType instead of
> > > parquet::LogicalAnnotation for the new implementation.
> > > And to avoid the confusion we are having here ..
> > >
> > > But renaming like that would be hard break in parquet-cpp for libraries
> > > depending on that, though. But I can't really assess the impact of
> that.
> > >
> > > Best,
> > > Joris
> > >
> > > Op wo 29 mei 2019 om 11:04 schreef Antoine Pitrou  >:
> > >
> > > >
> > > > Le 29/05/2019 à 10:47, Deepak Majeti a écrit :
> > > > > "ConvertedType" term is used by the parquet specification below.
> This
> > > > type
> > > > > is used to map client data types to Parquet types.
> > > > >
> > > >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L48
> > > >
> > > > But apparently there's also "LogicalType":
> > > >
> > > >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L315
> > > >
> > > > "LogicalType annotations to replace ConvertedType"
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
>


[jira] [Created] (ARROW-5441) [C++] Implement FindArrowFlight.cmake

2019-05-29 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5441:
-

 Summary: [C++] Implement FindArrowFlight.cmake
 Key: ARROW-5441
 URL: https://issues.apache.org/jira/browse/ARROW-5441
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, FlightRPC
Affects Versions: 0.13.0
Reporter: Antoine Pitrou


This is required so that downstream CMake users can write:
{code}
find_package(ArrowFlight)
{code}

to find the Flight library paths. In turn this will allow 
{{PYARROW_BUNDLE_ARROW_CPP}} to work with Arrow Flight.





[jira] [Created] (ARROW-5440) Rust Parquet requiring libstd-xxx.so dependency on centos

2019-05-29 Thread Tenzin Rigden (JIRA)
Tenzin Rigden created ARROW-5440:


 Summary: Rust Parquet requiring libstd-xxx.so dependency on centos
 Key: ARROW-5440
 URL: https://issues.apache.org/jira/browse/ARROW-5440
 Project: Apache Arrow
  Issue Type: Bug
 Environment: CentOS Linux release 7.6.1810 (Core) 
Reporter: Tenzin Rigden
 Attachments: parquet-test-libstd.tar.gz

Hello,

In the rust parquet implementation ([https://github.com/sunchao/parquet-rs]) on 
CentOS, the compiled binary has a `libstd-<hash>.so` shared library dependency 
that is causing issues, since that shared library is only found in the rustup 
directory. This `libstd-<hash>.so` dependency isn't present in any other rust 
binaries I've made before. It means I can't run this binary anywhere rustup 
isn't installed with that exact libstd library.

This is not an issue on Mac.

I've attached the rust files and here is the command line output below.

```
[centos@_ parquet-test]$ cat /etc/centos-release
CentOS Linux release 7.6.1810 (Core)

[centos@_ parquet-test]$ rustc --version
rustc 1.36.0-nightly (e70d5386d 2019-05-27)

[centos@_ parquet-test]$ ldd target/release/parquet-test
    linux-vdso.so.1 =>  (0x7ffd02fee000)
    libstd-44988553032616b2.so => not found
    librt.so.1 => /lib64/librt.so.1 (0x7f6ecd209000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x7f6eccfed000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f6eccdd7000)
    libc.so.6 => /lib64/libc.so.6 (0x7f6ecca0a000)
    libm.so.6 => /lib64/libm.so.6 (0x7f6ecc708000)
    /lib64/ld-linux-x86-64.so.2 (0x7f6ecd8b1000)

[centos@_ parquet-test]$ ls -l ~/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/libstd-44988553032616b2.so
-rw-r--r--. 1 centos centos 5623568 May 27 21:46 /home/centos/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/libstd-44988553032616b2.so
```





Re: [DISCUSS] Parquet C++/Rust: Rename Parquet::LogicalType to Parquet::ConvertedType

2019-05-29 Thread Wes McKinney
I'm in favor of making the change -- it's slightly disruptive for
library-users, but the fix is no more complicated than a
search-and-replace. When the C++ project was started, the LogicalType
union didn't exist and "LogicalType" seemed like a more appropriate
name for ConvertedType.

On Wed, May 29, 2019 at 7:11 AM Wes McKinney  wrote:
>
> You all probably want to join d...@parquet.apache.org and have the
> discussion there. From a governance perspective that's where we need
> to talk about making breaking changes to the Parquet C++ library
>
> LogicalType was introduced into the Parquet format in October 2017 to
> be a more flexible and future-proof replacement for the original
> ConvertedType metadata, see
>
> https://github.com/apache/parquet-format/commit/863875e0be3237c6aa4ed71733d54c91a51deabe
>
> Support and forward/backwards compatibility for the new LogicalType
> union was just developed in PARQUET-1411
>
> https://github.com/apache/arrow/commit/38b1ddfb7f5def825ac57c8f27ffe5afa7fcb483
>
> On Wed, May 29, 2019 at 4:44 AM Joris Van den Bossche
>  wrote:
> >
> > Yes, the LogicalType is newer than ConvertedType in the parquet format, and
> > was until recently not implemented in parquet-cpp.
> > The problem is that originally, the parquet thrift::ConvertedType was
> > implemented in parquet-cpp as LogicalType. Now, support is added in
> > parquet-cpp for this newer parquet thrift::LogicalType, but the obvious
> > name for that in parquet-cpp was already taken. Therefore, it was added as
> > parquet::LogicalAnnotation. See this PR for context:
> > https://github.com/apache/arrow/pull/4185
> >
> > So Deepak's question is if we can rename parquet-cpp's parquet::LogicalType
> > to parquet::ConvertedType (to match the thrift format naming), so we can
> > actually use the logical name parquet::LogicalType instead of
> > parquet::LogicalAnnotation for the new implementation.
> > And to avoid the confusion we are having here ..
> >
> > But renaming like that would be hard break in parquet-cpp for libraries
> > depending on that, though. But I can't really assess the impact of that.
> >
> > Best,
> > Joris
> >
> > Op wo 29 mei 2019 om 11:04 schreef Antoine Pitrou :
> >
> > >
> > > Le 29/05/2019 à 10:47, Deepak Majeti a écrit :
> > > > "ConvertedType" term is used by the parquet specification below. This
> > > type
> > > > is used to map client data types to Parquet types.
> > > >
> > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L48
> > >
> > > But apparently there's also "LogicalType":
> > >
> > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L315
> > >
> > > "LogicalType annotations to replace ConvertedType"
> > >
> > > Regards
> > >
> > > Antoine.
> > >


Propose custom_metadata for Footer

2019-05-29 Thread John Muehlhausen
Original write of File:

Schema: custom_metadata: {"value":1}
Message
Message
Footer
  Schema: custom_metadata: {"value":1}


Process appends messages (new data in bold):

Schema: custom_metadata: {"value":1}
Message
Message
*Message*
*Footer*
*  Schema: custom_metadata: {"value":2}*


Re-writing the entire file may not be practical due to the size, so after
this operation custom_metadata at the Schema level is out of sync... the
two Schema copies are no longer the same.

In order to address this I would like to propose adding custom_metadata to
Footer as well.  Any objections?

-John


[jira] [Created] (ARROW-5439) [Java] Utilize stream EOS in File format

2019-05-29 Thread John Muehlhausen (JIRA)
John Muehlhausen created ARROW-5439:
---

 Summary: [Java] Utilize stream EOS in File format
 Key: ARROW-5439
 URL: https://issues.apache.org/jira/browse/ARROW-5439
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: John Muehlhausen


We currently do not write EOS at the end of a Message stream inside the File 
format.  As a result, the file cannot be parsed sequentially.  This change 
prepares for other implementations or future reference features that parse a 
File sequentially... i.e. without access to seek().





[jira] [Created] (ARROW-5438) [JS] Utilize stream EOS in File format

2019-05-29 Thread John Muehlhausen (JIRA)
John Muehlhausen created ARROW-5438:
---

 Summary: [JS] Utilize stream EOS in File format
 Key: ARROW-5438
 URL: https://issues.apache.org/jira/browse/ARROW-5438
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: John Muehlhausen


We currently do not write EOS at the end of a Message stream inside the File 
format.  As a result, the file cannot be parsed sequentially.  This change 
prepares for other implementations or future reference features that parse a 
File sequentially... i.e. without access to seek().





[jira] [Created] (ARROW-5437) [Python] Missing pandas pytest marker from parquet tests

2019-05-29 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-5437:
--

 Summary: [Python] Missing pandas pytest marker from parquet tests
 Key: ARROW-5437
 URL: https://issues.apache.org/jira/browse/ARROW-5437
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


So the test suite fails without pandas installed: 
https://travis-ci.org/ursa-labs/crossbow/builds/538694001





Re: [DISCUSS] Parquet C++/Rust: Rename Parquet::LogicalType to Parquet::ConvertedType

2019-05-29 Thread Wes McKinney
You all probably want to join d...@parquet.apache.org and have the
discussion there. From a governance perspective that's where we need
to talk about making breaking changes to the Parquet C++ library

LogicalType was introduced into the Parquet format in October 2017 to
be a more flexible and future-proof replacement for the original
ConvertedType metadata, see

https://github.com/apache/parquet-format/commit/863875e0be3237c6aa4ed71733d54c91a51deabe

Support and forward/backwards compatibility for the new LogicalType
union was just developed in PARQUET-1411

https://github.com/apache/arrow/commit/38b1ddfb7f5def825ac57c8f27ffe5afa7fcb483

On Wed, May 29, 2019 at 4:44 AM Joris Van den Bossche
 wrote:
>
> Yes, the LogicalType is newer than ConvertedType in the parquet format, and
> was until recently not implemented in parquet-cpp.
> The problem is that originally, the parquet thrift::ConvertedType was
> implemented in parquet-cpp as LogicalType. Now, support is added in
> parquet-cpp for this newer parquet thrift::LogicalType, but the obvious
> name for that in parquet-cpp was already taken. Therefore, it was added as
> parquet::LogicalAnnotation. See this PR for context:
> https://github.com/apache/arrow/pull/4185
>
> So Deepak's question is if we can rename parquet-cpp's parquet::LogicalType
> to parquet::ConvertedType (to match the thrift format naming), so we can
> actually use the logical name parquet::LogicalType instead of
> parquet::LogicalAnnotation for the new implementation.
> And to avoid the confusion we are having here ..
>
> But renaming like that would be hard break in parquet-cpp for libraries
> depending on that, though. But I can't really assess the impact of that.
>
> Best,
> Joris
>
> Op wo 29 mei 2019 om 11:04 schreef Antoine Pitrou :
>
> >
> > Le 29/05/2019 à 10:47, Deepak Majeti a écrit :
> > > "ConvertedType" term is used by the parquet specification below. This
> > type
> > > is used to map client data types to Parquet types.
> > >
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L48
> >
> > But apparently there's also "LogicalType":
> >
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L315
> >
> > "LogicalType annotations to replace ConvertedType"
> >
> > Regards
> >
> > Antoine.
> >


[jira] [Created] (ARROW-5436) [Python] expose filters argument in parquet.read_table

2019-05-29 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5436:


 Summary: [Python] expose filters argument in parquet.read_table
 Key: ARROW-5436
 URL: https://issues.apache.org/jira/browse/ARROW-5436
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.14.0


Currently, the {{parquet.read_table}} function can be used both for reading a 
single file (interface to ParquetFile) and a directory (interface to 
ParquetDataset). 

ParquetDataset has some extra keywords such as {{filters}} that would be nice 
to expose through {{read_table}} as well.

Of course one can always use {{ParquetDataset}} directly if you need its full 
power, but for pandas wrapping pyarrow it is easier to pass keywords straight 
through to {{parquet.read_table}} than to choose between calling 
{{read_table}} or {{ParquetDataset}}. Context: 
https://github.com/pandas-dev/pandas/issues/26551
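To make the request concrete, here is a sketch of the DNF filter semantics 
that {{ParquetDataset}}'s {{filters}} keyword uses (an outer list of OR-ed 
clauses, each clause an AND-ed list of {{(column, op, value)}} tuples). The 
evaluator below is a hypothetical pure-Python illustration of those 
semantics, not pyarrow's implementation:

```python
import operator

# Hypothetical illustration of DNF filter semantics: the outer list is
# OR-ed together, each inner clause is AND-ed.
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def matches(row, filters):
    """Return True if `row` (a plain dict) satisfies the DNF `filters`."""
    return any(all(OPS[op](row[col], val) for col, op, val in clause)
               for clause in filters)

# e.g. keep partitions where year == 2019 and month > 3
filters = [[("year", "=", 2019), ("month", ">", 3)]]
print(matches({"year": 2019, "month": 5}, filters))  # True
print(matches({"year": 2018, "month": 5}, filters))  # False
```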



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Parquet C++/Rust: Rename Parquet::LogicalType to Parquet::ConvertedType

2019-05-29 Thread Joris Van den Bossche
Yes, the LogicalType is newer than ConvertedType in the parquet format, and
was until recently not implemented in parquet-cpp.
The problem is that originally, the parquet thrift::ConvertedType was
implemented in parquet-cpp as LogicalType. Now, support is added in
parquet-cpp for this newer parquet thrift::LogicalType, but the obvious
name for that in parquet-cpp was already taken. Therefore, it was added as
parquet::LogicalAnnotation. See this PR for context:
https://github.com/apache/arrow/pull/4185

So Deepak's question is whether we can rename parquet-cpp's
parquet::LogicalType to parquet::ConvertedType (to match the thrift format
naming), so that we can actually use the name parquet::LogicalType instead
of parquet::LogicalAnnotation for the new implementation, and avoid the
confusion we are having here.

Renaming like that would be a hard break in parquet-cpp for libraries
depending on it, though; I can't really assess the impact of that.

Best,
Joris

On Wed, May 29, 2019 at 11:04 AM Antoine Pitrou  wrote:

>
> On 29/05/2019 at 10:47, Deepak Majeti wrote:
> > "ConvertedType" term is used by the parquet specification below. This
> type
> > is used to map client data types to Parquet types.
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L48
>
> But apparently there's also "LogicalType":
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L315
>
> "LogicalType annotations to replace ConvertedType"
>
> Regards
>
> Antoine.
>
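The naming collision Joris describes can be summarized as a small mapping. 
This is an editorial sketch (Python used purely for illustration; the real 
identifiers are C++), based only on the names discussed in this thread and in 
https://github.com/apache/arrow/pull/4185:

```python
# Current state: the C++ names do not line up with the thrift names,
# which is the source of the confusion in this thread.
current = {
    "thrift::ConvertedType": "parquet::LogicalType",       # historical mismatch
    "thrift::LogicalType":   "parquet::LogicalAnnotation", # workaround name
}

# Deepak's proposal: align each C++ name with its thrift counterpart.
proposed = {
    "thrift::ConvertedType": "parquet::ConvertedType",
    "thrift::LogicalType":   "parquet::LogicalType",
}

# After the rename, every C++ name matches the thrift name it wraps.
assert all(k.split("::")[1] == v.split("::")[1] for k, v in proposed.items())
```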


[jira] [Created] (ARROW-5435) IntervalYearVector#getObject should return Period with both year and month

2019-05-29 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5435:
-

 Summary: IntervalYearVector#getObject should return Period with 
both year and month
 Key: ARROW-5435
 URL: https://issues.apache.org/jira/browse/ARROW-5435
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Ji Liu
Assignee: Ji Liu


IntervalYearVector#getObject currently returns a Period with only the months 
field set. However, this vector stores the interval as a total month count 
(e.g. 2 years and 3 months is stored as 27 months), so it should return a 
Period with both the years and months fields assigned.

As shown in the example above, it now returns Period(27 months); it should 
instead return Period(2 years, 3 months).
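The conversion the report asks for is simple arithmetic, sketched below in 
Python for illustration only (the actual fix belongs in the Java 
IntervalYearVector, e.g. via java.time.Period):

```python
def split_interval(total_months):
    """Split a total-month interval into (years, months), e.g. 27 -> (2, 3)."""
    # divmod gives the whole years and the leftover months in one step.
    years, months = divmod(total_months, 12)
    return years, months

print(split_interval(27))  # (2, 3), i.e. Period(2 years, 3 months)
```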





Re: [DISCUSS] PR Backlog reduction

2019-05-29 Thread Fan Liya
Sounds like a great idea. I am interested in Java PRs.

Best,
Liya Fan

On Wed, May 29, 2019 at 1:28 PM Micah Kornfield 
wrote:

> Sorry for the delay.  I created
>
> https://docs.google.com/spreadsheets/d/146lDg11c5ohgVkrOglrb42a1JB0Gm1qBRbnoDlvB8QY/edit#gid=0
> as a simple way to distribute old PRs. If you are interested in helping,
> please add a comment under the language and I'll add you.
>
> PMC/Committers, I can share edit access if you let me know which e-mail
> account I should grant access to.
>
> Thanks,
> Micah
>
> On Tue, May 21, 2019 at 9:22 PM Micah Kornfield 
> wrote:
>
> > I agree on hand curation for now.
> >
> >  I'll try to set up a sign-up spreadsheet for shepherding old PRs and once
> > that's done, assign reviewers/ping old PRs.  I expect to have something to
> > share by the weekend.
> >
> > On Tuesday, May 21, 2019, Wes McKinney  wrote:
> >
> >> I think maintainers or contributors should be responsible for closing
> >> PRs, it also helps with backlog curation (sometimes when a stale PR is
> >> closed the JIRA may also be closed if it's a Won't Fix)
> >>
> >> On Tue, May 21, 2019 at 1:12 PM Antoine Pitrou 
> >> wrote:
> >> >
> >> >
> >> >
> >> > Le 21/05/2019 à 20:02, Neal Richardson a écrit :
> >> > > Automatically close stale PRs? https://github.com/probot/stale
> >> >
> >> > That doesn't sound like a good idea to me.
> >> >
> >> > Regards
> >> >
> >> > Antoine.
> >>
> >
>


Re: [DISCUSS] Parquet C++/Rust: Rename Parquet::LogicalType to Parquet::ConvertedType

2019-05-29 Thread Antoine Pitrou


On 29/05/2019 at 10:47, Deepak Majeti wrote:
> "ConvertedType" term is used by the parquet specification below. This type
> is used to map client data types to Parquet types.
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L48

But apparently there's also "LogicalType":
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L315

"LogicalType annotations to replace ConvertedType"

Regards

Antoine.


Re: [DISCUSS] Parquet C++/Rust: Rename Parquet::LogicalType to Parquet::ConvertedType

2019-05-29 Thread Joris Van den Bossche
On Wed, May 29, 2019 at 10:00 AM Antoine Pitrou  wrote:

>
> Why "converted"?  Is there a conversion?
>
"Converted" is the terminology used in the parquet format:
https://github.com/apache/parquet-format/blob/b5d34faf47b59b1220a1bbe0fc438be71fed6d90/src/main/thrift/parquet.thrift#L43-L48


>
> On 29/05/2019 at 08:46, Deepak Majeti wrote:
> > Hi Everyone,
> >
> > In the early days of parquet-cpp development, the developers mapped the
> > thrift::ConvertedType to parquet::LogicalType.
> > This now leads to confusion with the recent introduction of
> > thrift::LogicalType.
> >
> > Parquet Rust also adopted this incorrect naming convention.
> >
> > Are there any objections to renaming the Parquet::LogicalType to
> > Parquet::ConvertedType in both C++ and Rust?
> >
>


Re: [DISCUSS] Parquet C++/Rust: Rename Parquet::LogicalType to Parquet::ConvertedType

2019-05-29 Thread Deepak Majeti
The "ConvertedType" term is used by the parquet specification below. This
type is used to map client data types to Parquet types.
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L48

On Wed, May 29, 2019 at 1:30 PM Antoine Pitrou  wrote:

>
> Why "converted"?  Is there a conversion?
>
>
> On 29/05/2019 at 08:46, Deepak Majeti wrote:
> > Hi Everyone,
> >
> > In the early days of parquet-cpp development, the developers mapped the
> > thrift::ConvertedType to parquet::LogicalType.
> > This now leads to confusion with the recent introduction of
> > thrift::LogicalType.
> >
> > Parquet Rust also adopted this incorrect naming convention.
> >
> > Are there any objections to renaming the Parquet::LogicalType to
> > Parquet::ConvertedType in both C++ and Rust?
> >
>


-- 
regards,
Deepak Majeti


Re: [DISCUSS] Parquet C++/Rust: Rename Parquet::LogicalType to Parquet::ConvertedType

2019-05-29 Thread Antoine Pitrou


Why "converted"?  Is there a conversion?


On 29/05/2019 at 08:46, Deepak Majeti wrote:
> Hi Everyone,
> 
> In the early days of parquet-cpp development, the developers mapped the
> thrift::ConvertedType to parquet::LogicalType.
> This now leads to confusion with the recent introduction of
> thrift::LogicalType.
> 
> Parquet Rust also adopted this incorrect naming convention.
> 
> Are there any objections to renaming the Parquet::LogicalType to
> Parquet::ConvertedType in both C++ and Rust?
> 


[DISCUSS] Parquet C++/Rust: Rename Parquet::LogicalType to Parquet::ConvertedType

2019-05-29 Thread Deepak Majeti
Hi Everyone,

In the early days of parquet-cpp development, the developers mapped the
thrift::ConvertedType to parquet::LogicalType.
This now leads to confusion with the recent introduction of
thrift::LogicalType.

Parquet Rust also adopted this incorrect naming convention.

Are there any objections to renaming Parquet::LogicalType to
Parquet::ConvertedType in both C++ and Rust?

-- 
regards,
Deepak Majeti


[jira] [Created] (ARROW-5434) Introduce wrappers for backward compatibility.

2019-05-29 Thread Praveen Kumar Desabandu (JIRA)
Praveen Kumar Desabandu created ARROW-5434:
--

 Summary: Introduce wrappers for backward compatibility.
 Key: ARROW-5434
 URL: https://issues.apache.org/jira/browse/ARROW-5434
 Project: Apache Arrow
  Issue Type: Task
Reporter: Praveen Kumar Desabandu
Assignee: Praveen Kumar Desabandu


* Introduce some wrapper methods in ArrowBuf to reduce the amount of client 
changes.
 * These changes were introduced as part of the patch to support Arrow buffers 
on random memory.
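The backward-compatibility wrapper pattern being proposed might look like the 
following sketch (Python for illustration only; ArrowBuf is Java, and the 
method names get_long/getLong below are hypothetical, not the actual ArrowBuf 
API):

```python
import warnings

class ArrowBufLike:
    """Toy buffer illustrating legacy wrapper methods that delegate
    to a new API so existing clients need no changes."""

    def __init__(self, data):
        self._data = data

    def get_long(self, index):
        # New-style accessor introduced by the refactoring.
        return self._data[index]

    def getLong(self, index):
        # Legacy wrapper kept for backward compatibility: warn, then delegate.
        warnings.warn("getLong is deprecated; use get_long",
                      DeprecationWarning, stacklevel=2)
        return self.get_long(index)

buf = ArrowBufLike([10, 20, 30])
# Old and new entry points return the same value.
assert buf.getLong(1) == buf.get_long(1) == 20
```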


