Re: Allow users to fine-tune parquet writing

2020-02-04 Thread Gabor Szadovszky
Hi Ryan, I wouldn't say they are expensive. But if the column data is more or less random, the column indexes would not help with filtering and would still add a small performance overhead. Why would we save column indexes for such columns, wasting (a small amount of) space and some time at filtering?
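
To illustrate the point about random data, here is a minimal sketch of min/max page pruning, the idea behind column indexes. It is not the Parquet API: `page_stats`, `pages_to_read`, the page size, and the sample columns are all hypothetical names and values chosen for this example.

```python
import random


def page_stats(values, page_size):
    """Return a (min, max) pair for each page of page_size values."""
    return [
        (min(values[i:i + page_size]), max(values[i:i + page_size]))
        for i in range(0, len(values), page_size)
    ]


def pages_to_read(stats, target):
    """Indices of pages whose [min, max] range could contain target."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= target <= hi]


# Hypothetical data: the same 10,000 values, once sorted and once shuffled.
random.seed(0)
sorted_col = list(range(10_000))
random_col = sorted_col[:]
random.shuffle(random_col)

# An equality filter on 4242, evaluated against the per-page statistics.
sorted_hits = pages_to_read(page_stats(sorted_col, 1_000), 4_242)
random_hits = pages_to_read(page_stats(random_col, 1_000), 4_242)

print(len(sorted_hits), "page(s) to read for the sorted column")
print(len(random_hits), "page(s) to read for the shuffled column")
```

On the sorted column only one page can contain the value, so the other nine are skipped. On the shuffled column nearly every page spans almost the whole value range, so the index is stored and consulted without skipping anything.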

[jira] [Commented] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

2020-02-04 Thread Uwe Korn (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029811#comment-17029811 ] Uwe Korn commented on PARQUET-1783: --- The problem is somewhere in the PARQUET C++ code

[jira] [Moved] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

2020-02-04 Thread Uwe Korn (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn moved ARROW-7732 to PARQUET-1783: -- Component/s: (was: C++) parquet-cpp

[jira] [Commented] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

2020-02-04 Thread Francois Saint-Jacques (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029822#comment-17029822 ] Francois Saint-Jacques commented on PARQUET-1783: - There's a [TODO|htt

[jira] [Comment Edited] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

2020-02-04 Thread Francois Saint-Jacques (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029822#comment-17029822 ] Francois Saint-Jacques edited comment on PARQUET-1783 at 2/4/20 12:51 PM: ---

Re: Allow users to fine-tune parquet writing

2020-02-04 Thread Manik Singla
I think making parquet more configurable is a nice idea. We had a similar requirement where we wanted to have different configurations for different columns. (I don't even remember the details now, as it's been 2-3 months.) We already had some kind of optimizations in the system for frequently queried column

Re: Allow users to fine-tune parquet writing

2020-02-04 Thread Radev, Martin
Dear all, in our project, which uses Parquet for streaming fp data of varying entropy, we definitely needed to treat the columns differently. For fp data with low entropy, dictionary encoding provided good results. For fp data with entropy >15 bits per element, the newly added encoding + zstd yield
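
A rough sketch of why a byte-splitting encoding can help a general-purpose compressor on high-entropy fp data. It assumes the "newly added encoding" is BYTE_STREAM_SPLIT (PARQUET-1716, listed elsewhere in this archive); the snippet itself does not name it. The function names are hypothetical, and `zlib` stands in for zstd only because it ships with Python.

```python
import struct
import zlib


def byte_stream_split(values):
    """Encode float32 values so that stream k holds byte k of every value."""
    raw = struct.pack(f"<{len(values)}f", *values)
    # Four streams, concatenated in order: all byte 0s, then byte 1s, ...
    return b"".join(raw[k::4] for k in range(4))


def byte_stream_join(encoded):
    """Invert byte_stream_split and return the float32 values as a list."""
    n = len(encoded) // 4
    streams = [encoded[k * n:(k + 1) * n] for k in range(4)]
    # Re-interleave: take one byte from each stream per value.
    raw = bytes(b for group in zip(*streams) for b in group)
    return list(struct.unpack(f"<{n}f", raw))


def compressed_size(data):
    """Size after zlib compression (a stand-in for zstd in this sketch)."""
    return len(zlib.compress(data))


# Hypothetical sample values that are exactly representable as float32.
values = [1.0, 1.5, 2.25, -3.0]
encoded = byte_stream_split(values)
decoded = byte_stream_join(encoded)
```

The encoding does not shrink the data by itself. It groups sign/exponent bytes, which tend to repeat, apart from the noisier mantissa bytes, which may give the compressor more regular input. Comparing `compressed_size` of the plain and split bytes on real column data is one way to check whether it pays off for a given column.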

Arrow 1404: Adding index for Page-level Skipping

2020-02-04 Thread Lekshmi Narayanan, Arun Balajiee
Hi Parquet dev, Deepak Majeti was my dev lead during my summer internship, and since then I have been trying to add a few changes to the Arrow Parquet project for the ticket below: https://issues.apache.org/jira/browse/PARQUET-1404 (assigned to Deepak). In this regard, I am making a few changes to src/parqu

[jira] [Commented] (PARQUET-1781) [C++] 1.4.0+ reader ignore stats created by 1.3.* writer

2020-02-04 Thread Deepak Majeti (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030160#comment-17030160 ] Deepak Majeti commented on PARQUET-1781: Even though the 1.3 writer wrote the "

[jira] [Commented] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

2020-02-04 Thread Wes McKinney (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030200#comment-17030200 ] Wes McKinney commented on PARQUET-1783: --- Do we need to create a corresponding Arr

[jira] [Commented] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

2020-02-04 Thread Wes McKinney (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030199#comment-17030199 ] Wes McKinney commented on PARQUET-1783: --- I suppose it's good at least that the mi

Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-04 Thread Wes McKinney
Here's a compare link in case others want to have a look https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney wrote: > > hi Arun, > > I took a brief look at y

Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-04 Thread Wes McKinney
hi Arun, I took a brief look at your branch. One thing that is missing is the proposed public APIs that use the index pages -- that would be very helpful for this discussion. I don't think we have any code for doing random access of a particular data page in a column chunk, so having as an initia

[jira] [Assigned] (PARQUET-1716) [C++] Add support for BYTE_STREAM_SPLIT encoding

2020-02-04 Thread Wes McKinney (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned PARQUET-1716: - Assignee: Martin Radev > [C++] Add support for BYTE_STREAM_SPLIT encoding > --

[jira] [Resolved] (PARQUET-1716) [C++] Add support for BYTE_STREAM_SPLIT encoding

2020-02-04 Thread Wes McKinney (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved PARQUET-1716. --- Fix Version/s: cpp-1.6.0 Resolution: Fixed Issue resolved by pull request 6005 [http

RE: Arrow 1404: Adding index for Page-level Skipping

2020-02-04 Thread Lekshmi Narayanan, Arun Balajiee
Actually, I made some changes after the date on the pull request (even this year), which are not reflected in this compare link. Regards, Arun Balajiee From: Wes McKinney Sent: Tuesday, February 4, 2020 6:43 PM To: Parquet Dev

Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-04 Thread Wes McKinney
hi Arun, We can keep the discussion going on here and on GitHub when you have a pull request to discuss. There are a number of different people who can give advice. Thanks On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee wrote: > > Actually I made some changes after the date on