[jira] [Resolved] (PARQUET-724) Test more advanced properties setting
[ https://issues.apache.org/jira/browse/PARQUET-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved PARQUET-724.
    Resolution: Fixed
    Fix Version/s: cpp-0.1

Issue resolved by pull request 166
[https://github.com/apache/parquet-cpp/pull/166]

> Test more advanced properties setting
>
> Key: PARQUET-724
> URL: https://issues.apache.org/jira/browse/PARQUET-724
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Fix For: cpp-0.1
>
> Test that handling of global and column-specific properties behaves
> correctly.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (PARQUET-722) Building with JDK 8 fails over a maven bug
[ https://issues.apache.org/jira/browse/PARQUET-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem closed PARQUET-722.

> Building with JDK 8 fails over a maven bug
>
> Key: PARQUET-722
> URL: https://issues.apache.org/jira/browse/PARQUET-722
> Project: Parquet
> Issue Type: Bug
> Reporter: Niels Basjes
>
> When I build parquet on my system I get this error during the build:
> {quote}
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default)
> on project parquet-generator: Error rendering velocity resource.
> NullPointerException -> [Help 1]
> {quote}
> About a year ago [~julienledem] responded that this is caused by a bug in
> Maven in combination with Java 8:
> At this page
> http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure/33360512#33360512
>
> Now this bug has been solved at the Maven end in maven-filtering 1.2
> https://issues.apache.org/jira/browse/MSHARED-319
> The problem is that this fix has not yet been integrated into the latest
> available Maven versions.
> I'll put up a pull request with a proposed fix for this.
[jira] [Commented] (PARQUET-722) Building with JDK 8 fails over a maven bug
[ https://issues.apache.org/jira/browse/PARQUET-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511173#comment-15511173 ]

Julien Le Dem commented on PARQUET-722:

Thanks for spending the time!

> Building with JDK 8 fails over a maven bug
>
> Key: PARQUET-722
> URL: https://issues.apache.org/jira/browse/PARQUET-722
> Project: Parquet
> Issue Type: Bug
> Reporter: Niels Basjes
>
> When I build parquet on my system I get this error during the build:
> {quote}
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default)
> on project parquet-generator: Error rendering velocity resource.
> NullPointerException -> [Help 1]
> {quote}
> About a year ago [~julienledem] responded that this is caused by a bug in
> Maven in combination with Java 8:
> At this page
> http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure/33360512#33360512
>
> Now this bug has been solved at the Maven end in maven-filtering 1.2
> https://issues.apache.org/jira/browse/MSHARED-319
> The problem is that this fix has not yet been integrated into the latest
> available Maven versions.
> I'll put up a pull request with a proposed fix for this.
Re: Python Parquet package
Sure, I'm happy to do that. Do you want me to take care of refactoring to account for the arrow::io API changes I just made? Then we can go ahead and remove arrow/parquet from the Arrow project.

On Wed, Sep 21, 2016 at 3:47 PM, Uwe Korn wrote:
> Sounds reasonable to me. I will then continue to implement the missing
> interfaces for Parquet in pyarrow.parquet.
>
> @wesm Can you take care that we easily depend on a pinned version of
> parquet-cpp in pyarrow’s travis builds?
>
> Uwe
>
>> On 21.09.2016 at 20:07, Wes McKinney wrote:
>>
>> I don't agree with this approach right now. Here are my reasons:
>>
>> 1. The Parquet Python integration will need to depend both on PyArrow
>> and the Arrow C++ libraries, so these libraries would generally need
>> to be developed together
>>
>> 2. PyArrow would need to define and maintain a C++ or Cython API so
>> that the equivalent of the current pyarrow.parquet library can access
>> C-level data. For example:
>>
>> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31
>>
>> Cython does permit cross-project C API access (we are already doing
>> cross-module Cython API access within pyarrow). This adds additional
>> complexity that I think we should avoid for now.
>>
>> 3. Maintaining a separate C++ build toolchain for a Python package
>> adds additional maintenance and packaging burden on us
>>
>> My inclination is to keep the code where it is and make the Parquet
>> extension optional.
>>
>> - Wes
>>
>> On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn wrote:
>>> Hello,
>>>
>>> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we
>>> still have to decide on how we are going to proceed with the Arrow<->Parquet
>>> Python integration. For the moment, it seems that the best way to go ahead
>>> is to pull the pyarrow.parquet module out into a separate Python package.
>>> From an organisational point, I'm unclear how I should proceed here. Should
>>> we put this in a separate repo? If so, as part of the Apache organisation?
>>>
>>> Uwe
[jira] [Updated] (PARQUET-724) Test more advanced properties setting
[ https://issues.apache.org/jira/browse/PARQUET-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn updated PARQUET-724:
    Description: Test that handling of global and column-specific properties behaves correctly.

> Test more advanced properties setting
>
> Key: PARQUET-724
> URL: https://issues.apache.org/jira/browse/PARQUET-724
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
>
> Test that handling of global and column-specific properties behaves
> correctly.
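As a rough illustration of what such a test might check, the sketch below models properties with one global default and per-column overrides. The class `WriterProps` and its method names are hypothetical stand-ins written for this note, not the parquet-cpp API. The assumed behaviour is that a column-specific setting wins over the global one and that other columns keep the global value.

```python
class WriterProps:
    """Hypothetical model of global plus column-specific settings."""

    def __init__(self, default_compression="UNCOMPRESSED"):
        self._default_compression = default_compression
        self._column_compression = {}

    def set_compression(self, codec, column=None):
        # Without a column name the global default is changed;
        # with one, only that column is overridden.
        if column is None:
            self._default_compression = codec
        else:
            self._column_compression[column] = codec
        return self

    def compression(self, column):
        # Column-specific value if present, otherwise the global default.
        return self._column_compression.get(column, self._default_compression)


props = WriterProps().set_compression("SNAPPY").set_compression("GZIP", column="a.b")
print(props.compression("a.b"), props.compression("other"))
```

A test along these lines would assert both directions: the overridden column returns its own value, and every other column still returns the global one.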
[jira] [Commented] (PARQUET-724) Test more advanced properties setting
[ https://issues.apache.org/jira/browse/PARQUET-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511056#comment-15511056 ]

Uwe L. Korn commented on PARQUET-724:

https://github.com/apache/parquet-cpp/pull/166

> Test more advanced properties setting
>
> Key: PARQUET-724
> URL: https://issues.apache.org/jira/browse/PARQUET-724
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
[jira] [Created] (PARQUET-724) Test more advanced properties setting
Uwe L. Korn created PARQUET-724:

    Summary: Test more advanced properties setting
    Key: PARQUET-724
    URL: https://issues.apache.org/jira/browse/PARQUET-724
    Project: Parquet
    Issue Type: Improvement
    Components: parquet-cpp
    Reporter: Uwe L. Korn
    Assignee: Uwe L. Korn
Re: Python Parquet package
Sounds reasonable to me. I will then continue to implement the missing interfaces for Parquet in pyarrow.parquet.

@wesm Can you take care that we easily depend on a pinned version of parquet-cpp in pyarrow’s travis builds?

Uwe

> On 21.09.2016 at 20:07, Wes McKinney wrote:
>
> I don't agree with this approach right now. Here are my reasons:
>
> 1. The Parquet Python integration will need to depend both on PyArrow
> and the Arrow C++ libraries, so these libraries would generally need
> to be developed together
>
> 2. PyArrow would need to define and maintain a C++ or Cython API so
> that the equivalent of the current pyarrow.parquet library can access
> C-level data. For example:
>
> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31
>
> Cython does permit cross-project C API access (we are already doing
> cross-module Cython API access within pyarrow). This adds additional
> complexity that I think we should avoid for now.
>
> 3. Maintaining a separate C++ build toolchain for a Python package
> adds additional maintenance and packaging burden on us
>
> My inclination is to keep the code where it is and make the Parquet
> extension optional.
>
> - Wes
>
> On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn wrote:
>> Hello,
>>
>> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we
>> still have to decide on how we are going to proceed with the Arrow<->Parquet
>> Python integration. For the moment, it seems that the best way to go ahead
>> is to pull the pyarrow.parquet module out into a separate Python package.
>> From an organisational point, I'm unclear how I should proceed here. Should
>> we put this in a separate repo? If so, as part of the Apache organisation?
>>
>> Uwe
Re: Python Parquet package
I don't agree with this approach right now. Here are my reasons:

1. The Parquet Python integration will need to depend both on PyArrow and the Arrow C++ libraries, so these libraries would generally need to be developed together.

2. PyArrow would need to define and maintain a C++ or Cython API so that the equivalent of the current pyarrow.parquet library can access C-level data. For example:

https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31

Cython does permit cross-project C API access (we are already doing cross-module Cython API access within pyarrow). This adds additional complexity that I think we should avoid for now.

3. Maintaining a separate C++ build toolchain for a Python package adds additional maintenance and packaging burden on us.

My inclination is to keep the code where it is and make the Parquet extension optional.

- Wes

On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn wrote:
> Hello,
>
> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we
> still have to decide on how we are going to proceed with the Arrow<->Parquet
> Python integration. For the moment, it seems that the best way to go ahead
> is to pull the pyarrow.parquet module out into a separate Python package.
> From an organisational point, I'm unclear how I should proceed here. Should
> we put this in a separate repo? If so, as part of the Apache organisation?
>
> Uwe
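One way to read "make the Parquet extension optional" on the Python side is to guard the import, so that the rest of the package still loads when the extension was not built. The following is only a sketch of that general pattern. The helper `optional_import` and the flag `HAVE_PARQUET` are hypothetical names, and the thread does not say how pyarrow would actually implement this.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not available."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# "pyarrow.parquet" is the module discussed in this thread. It may or may
# not be installed where this runs, so both outcomes are handled.
parquet = optional_import("pyarrow.parquet")
HAVE_PARQUET = parquet is not None

if not HAVE_PARQUET:
    print("Parquet support not available; continuing without it")
```

Callers can then check the flag before using Parquet features instead of failing at import time.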
[jira] [Created] (PARQUET-723) parquet is not storing the type for the column.
Narasimha created PARQUET-723:

    Summary: parquet is not storing the type for the column.
    Key: PARQUET-723
    URL: https://issues.apache.org/jira/browse/PARQUET-723
    Project: Parquet
    Issue Type: Bug
    Components: parquet-format
    Reporter: Narasimha

1. Create Text file format table

CREATE EXTERNAL TABLE IF NOT EXISTS emp(
  id INT,
  first_name STRING,
  last_name STRING,
  dateofBirth STRING,
  join_date INT
)
COMMENT 'This is Employee Table Date Of Birth of type String'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/employee/beforePartition';

2. Load the data into table

load data inpath '/user/somupoc_timestamp/employeeData_partitioned.csv' into table emp;
select * from emp;

3. Create Partitioned table with file format as Parquet (dateofBirth STRING))

create external table emp_afterpartition(
  id int,
  first_name STRING,
  last_name STRING,
  dateofBirth STRING)
COMMENT 'Employee partitioned table with dateofBirth of type string'
partitioned by (join_date int)
STORED as parquet
LOCATION '/user/employee/afterpartition';

4. Fetch the data from Partitioned column

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table emp_afterpartition partition (join_date) select * from emp;
select * from emp_afterpartition;

5. Create Partitioned table with file format as Parquet (dateofBirth TIMESTAMP))

CREATE EXTERNAL TABLE IF NOT EXISTS employee_afterpartition_timestamp_parq(
  id INT,
  first_name STRING,
  last_name STRING,
  dateofBirth TIMESTAMP)
COMMENT 'employee partitioned table with dateofBirth of type TIMESTAMP'
PARTITIONED BY (join_date INT)
STORED AS PARQUET
LOCATION '/user/employee/afterpartition';

select * from employee_afterpartition_timestamp_parq;
-- 0 records returned

impala :: alter table employee_afterpartition_timestamp_parq RECOVER PARTITIONS;
Hive :: MSCK REPAIR TABLE employee_afterpartition_timestamp_parq;
-- MSCK works in Hive and RECOVER PARTITIONS works in Impala
-- metastore check command with the repair table option:

select * from employee_afterpartition_timestamp_parq;

Actual Result :: Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable

Expected Result :: Data should display

Note: if file format is text file instead of Parquet then I am able to fetch the data.

Observation : Two tables having different column type pointing to same location(HDFS ).

sample Data =

1,Joyce,Garza,2016-07-17 14:42:18,201607
2,Jerry,Ortiz,2016-08-17 21:36:54,201608
3,Steven,Ryan,2016-09-10 01:32:40,201609
4,Lisa,Black,2015-10-12 15:05:13,201610
5,Jose,Turner,2015-011-10 06:38:40,201611
6,Joyce,Garza,2016-08-02,201608
7,Jerry,Ortiz,2016-01-01,201601
8,Steven,Ryan,2016/08/20,201608
9,Lisa,Black,2016/09/12,201609
10,Jose,Turner,09/19/2016,201609
11,Jose,Turner,20160915,201609
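The sample data mixes several date layouts in the dateofBirth column. As a side check, separate from the ClassCastException reported here, the sketch below tests which of the sample values match a `yyyy-MM-dd HH:mm:ss` layout. Treating that as the layout a TIMESTAMP column would need is an assumption made for illustration, and the helper name is hypothetical.

```python
from datetime import datetime

# dateofBirth values copied from the sample data in the report.
SAMPLES = [
    "2016-07-17 14:42:18",
    "2016-08-17 21:36:54",
    "2016-09-10 01:32:40",
    "2015-10-12 15:05:13",
    "2015-011-10 06:38:40",
    "2016-08-02",
    "2016-01-01",
    "2016/08/20",
    "2016/09/12",
    "09/19/2016",
    "20160915",
]


def matches_timestamp_layout(value, layout="%Y-%m-%d %H:%M:%S"):
    """Return True if value parses with the given strptime layout."""
    try:
        datetime.strptime(value, layout)
        return True
    except ValueError:
        return False


for value in SAMPLES:
    status = "ok" if matches_timestamp_layout(value) else "no match"
    print(f"{value!r}: {status}")
```

Values that do not match would need explicit conversion or cleanup before they could be stored in a timestamp column, whichever file format is used.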
Python Parquet package
Hello,

as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we still have to decide on how we are going to proceed with the Arrow<->Parquet Python integration. For the moment, it seems that the best way to go ahead is to pull the pyarrow.parquet module out into a separate Python package. From an organisational point of view, I'm unclear how I should proceed here. Should we put this in a separate repo? If so, as part of the Apache organisation?

Uwe
[jira] [Commented] (PARQUET-721) Performance benchmarks for reading into Arrow structures
[ https://issues.apache.org/jira/browse/PARQUET-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15509983#comment-15509983 ]

Uwe L. Korn commented on PARQUET-721:

PR: https://github.com/apache/parquet-cpp/pull/165

> Performance benchmarks for reading into Arrow structures
>
> Key: PARQUET-721
> URL: https://issues.apache.org/jira/browse/PARQUET-721
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
>
> Simple benchmarks that show per column and repetition type how fast we can
> read into Arrow memory.
[jira] [Resolved] (PARQUET-722) Building with JDK 8 fails over a maven bug
[ https://issues.apache.org/jira/browse/PARQUET-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niels Basjes resolved PARQUET-722.
    Resolution: Invalid

I was looking at a really old version of the code. This problem has already been fixed.

> Building with JDK 8 fails over a maven bug
>
> Key: PARQUET-722
> URL: https://issues.apache.org/jira/browse/PARQUET-722
> Project: Parquet
> Issue Type: Bug
> Reporter: Niels Basjes
>
> When I build parquet on my system I get this error during the build:
> {quote}
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default)
> on project parquet-generator: Error rendering velocity resource.
> NullPointerException -> [Help 1]
> {quote}
> About a year ago [~julienledem] responded that this is caused by a bug in
> Maven in combination with Java 8:
> At this page
> http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure/33360512#33360512
>
> Now this bug has been solved at the Maven end in maven-filtering 1.2
> https://issues.apache.org/jira/browse/MSHARED-319
> The problem is that this fix has not yet been integrated into the latest
> available Maven versions.
> I'll put up a pull request with a proposed fix for this.
[jira] [Created] (PARQUET-722) Building with JDK 8 fails over a maven bug
Niels Basjes created PARQUET-722:

    Summary: Building with JDK 8 fails over a maven bug
    Key: PARQUET-722
    URL: https://issues.apache.org/jira/browse/PARQUET-722
    Project: Parquet
    Issue Type: Bug
    Reporter: Niels Basjes

When I build parquet on my system I get this error during the build:
{quote}
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) on project parquet-generator: Error rendering velocity resource. NullPointerException -> [Help 1]
{quote}
About a year ago [~julienledem] responded that this is caused by a bug in Maven in combination with Java 8:
At this page http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure/33360512#33360512

Now this bug has been solved at the Maven end in maven-filtering 1.2
https://issues.apache.org/jira/browse/MSHARED-319
The problem is that this fix has not yet been integrated into the latest available Maven versions.
I'll put up a pull request with a proposed fix for this.