[jira] [Commented] (SPARK-10113) Support for unsigned Parquet logical types

2015-11-11 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000203#comment-15000203 ] Cheng Lian commented on SPARK-10113: I think emitting a clear error message is more reasonable since

[jira] [Commented] (SPARK-11089) Add a option for thrift-server to share a single session across all connections

2015-11-11 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000195#comment-15000195 ] Cheng Lian commented on SPARK-11089: OK, I'm taking this. > Add a option for thrift-server to sh

[jira] [Commented] (SPARK-9686) Spark hive jdbc client cannot get table from metadata store

2015-11-11 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000215#comment-15000215 ] Cheng Lian commented on SPARK-9686: --- [~navis] [~bugg_tb] [~pin_zhang] May I ask whether you were all using

[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.

2015-11-11 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11500: --- Fix Version/s: 1.6.0 > Not deterministic order of columns when using merging sche

[jira] [Commented] (SPARK-5968) Parquet warning in spark-shell

2015-11-11 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000200#comment-15000200 ] Cheng Lian commented on SPARK-5968: --- It had once been fixed via a quite hacky trick. Unfortunately

[jira] [Commented] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong

2015-11-11 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000222#comment-15000222 ] Cheng Lian commented on SPARK-10954: Figured out the reason why {{created_by}} is wrong in Spark

Re: Unwanted SysOuts in Spark Parquet

2015-11-10 Thread Cheng Lian
This is because of PARQUET-369, which prevents users or other libraries from overriding Parquet's JUL logging settings via SLF4J. It has been fixed in the most recent parquet-format master (PR #32

[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-09 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996519#comment-14996519 ] Cheng Lian commented on SPARK-11191: One of the problems here is SPARK-11595. However, after fixing

[jira] [Created] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"

2015-11-09 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11595: -- Summary: "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/" Key: SPARK-11595 URL: https://issues.apac

[jira] [Updated] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"

2015-11-09 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11595: --- Description: When handling {{ADD JAR}}, Spark constructs a {{java.io.File}} first using the input
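
A minimal sketch of the kind of failure SPARK-11595 describes, written in Python rather than Spark's own Scala. It assumes the problem is that a string such as `file:/tmp/udf.jar` is treated as a plain local path, so the scheme prefix ends up inside the path. The helper name `local_path_from_uri` and the example paths are hypothetical, and this is not Spark's actual fix.

```python
from urllib.parse import urlparse


def local_path_from_uri(path_or_uri):
    """Return a local filesystem path for plain paths and file: URIs.

    Returns None for non-local schemes (e.g. hdfs:), which a real
    implementation would hand to the matching filesystem client.
    """
    parsed = urlparse(path_or_uri)
    if parsed.scheme == "":
        return path_or_uri      # already a plain path
    if parsed.scheme == "file":
        return parsed.path      # drop the scheme, keep the path
    return None                 # not a local file


# Used as-is, the scheme stays in the "path" and the file is not found.
naive = "file:/tmp/udf.jar"
# Parsing the URI first yields a usable local path.
fixed = local_path_from_uri("file:/tmp/udf.jar")
```

The design point is only that the scheme has to be inspected before the string is used as a file path; how remote schemes are fetched is outside this sketch.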

Re: very slow parquet file write

2015-11-06 Thread Cheng Lian
I'd expect writing Parquet files to be slower than writing JSON files since Parquet involves more complicated encoders, but maybe not that slow. Would you mind profiling one Spark executor with a tool like YJP to see where the hotspot is? Cheng On 11/6/15 7:34 AM, rok wrote: Apologies if

Re: very slow parquet file write

2015-11-06 Thread Cheng Lian
of your responses are there either. I am definitely subscribed to the list though (I get daily digests). Any clue how to fix it? Sorry, no idea :-/ On Nov 6, 2015, at 9:26 AM, Cheng Lian <lian.cs@gmail.com <mailto:lian.cs@gmail.com>> wrote: I'd expect writing Parquet

[jira] [Commented] (SPARK-11500) Not deterministic order of columns when using merging schemas.

2015-11-05 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992834#comment-14992834 ] Cheng Lian commented on SPARK-11500: [~hyukjin.kwon] Thanks for reporting. Would you like to work

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-04 Thread Cheng Lian
Is there any chance that "spark.sql.hive.convertMetastoreParquet" is turned off? Cheng On 11/4/15 5:15 PM, Rex Xiong wrote: Thanks Cheng Lian. I found in 1.5, if I use spark to create this table with partition discovery, the partition pruning can be performed, but for my

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-03 Thread Cheng Lian
SPARK-11153 should be irrelevant because you are filtering on a partition key while SPARK-11153 is about Parquet filter push-down and doesn't affect partition pruning. Cheng On 11/3/15 7:14 PM, Rex Xiong wrote: We found the query performance is very poor due to this issue
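
The distinction drawn in this reply can be sketched as follows. This is a toy model in Python, not Spark code: the file layout, the `stats` dictionaries and the function names are all hypothetical. Partition pruning is decided from directory names alone, while filter push-down relies on min/max statistics stored inside each Parquet file, so disabling the latter for some column types leaves the former untouched.

```python
# Hypothetical layout: partition values live in directory names,
# per-file min/max statistics live inside each Parquet footer.
files = [
    {"path": "tbl/year=2014/part-0.parquet", "stats": {"id": (1, 100)}},
    {"path": "tbl/year=2015/part-0.parquet", "stats": {"id": (101, 200)}},
    {"path": "tbl/year=2015/part-1.parquet", "stats": {"id": (201, 300)}},
]


def partition_value(path, key):
    """Read a partition value from a key=value path segment."""
    for segment in path.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name == key:
            return value
    return None


def prune_partitions(candidates, key, value):
    """Partition pruning: decided from paths alone, no file is opened."""
    return [f for f in candidates if partition_value(f["path"], key) == value]


def push_down(candidates, column, value):
    """Filter push-down: skip files whose min/max range excludes value."""
    kept = []
    for f in candidates:
        low, high = f["stats"][column]
        if low <= value <= high:
            kept.append(f)
    return kept


pruned = prune_partitions(files, "year", "2015")   # two files remain
scanned = push_down(pruned, "id", 250)             # one file remains
```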

[jira] [Resolved] (SPARK-10533) DataFrame filter is not handling float/double with Scientific Notation 'e' / 'E'

2015-11-03 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10533. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9085 [https

[jira] [Updated] (SPARK-10533) DataFrame filter is not handling float/double with Scientific Notation 'e' / 'E'

2015-11-03 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10533: --- Assignee: Adrian Wang > DataFrame filter is not handling float/double with Scientific Notation

[jira] [Updated] (SPARK-10533) DataFrame filter is not handling float/double with Scientific Notation 'e' / 'E'

2015-11-03 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10533: --- Description: In DataFrames filter operation,when giving float comparison with e (2.0e2

[jira] [Resolved] (SPARK-10786) SparkSQLCLIDriver should take the whole statement to generate the CommandProcessor

2015-11-02 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10786. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8895 [https

[jira] [Assigned] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-11-02 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-10978: -- Assignee: Cheng Lian > Allow PrunedFilterScan to eliminate predicates from further evaluat

[jira] [Updated] (SPARK-10786) SparkSQLCLIDriver should take the whole statement to generate the CommandProcessor

2015-11-02 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10786: --- Assignee: SaintBacchus > SparkSQLCLIDriver should take the whole statement to gener

[jira] [Resolved] (SPARK-11311) spark cannot describe temporary functions

2015-11-02 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11311. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9277 [https

[jira] [Updated] (SPARK-7673) DataSourceStrategy's buildPartitionedTableScan always list file status for all data files

2015-10-30 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-7673: -- Summary: DataSourceStrategy's buildPartitionedTableScan always list file status for all data files

[jira] [Resolved] (SPARK-11103) Parquet filters push-down may cause exception when schema merging is turned on

2015-10-30 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11103. Resolution: Fixed Fix Version/s: 1.5.2 1.6.0 Issue resolved by pull

[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-29 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14979950#comment-14979950 ] Cheng Lian commented on SPARK-11103: [~rxin] I think this should be a blocker for 1.5.2 because we

[jira] [Updated] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-29 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11103: --- Priority: Blocker (was: Major) > Filter applied on Merged Parquet shema with new column f

[jira] [Commented] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-29 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14979935#comment-14979935 ] Cheng Lian commented on SPARK-11376: No, {{GenerateColumnAccessor}} only exists in master. > Inva

[jira] [Updated] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-29 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11103: --- Target Version/s: 1.5.2, 1.6.0 (was: 1.5.3, 1.6.0) > Filter applied on Merged Parquet sh

[jira] [Created] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11376: -- Summary: Invalid generated Java code in GenerateColumnAccessor Key: SPARK-11376 URL: https://issues.apache.org/jira/browse/SPARK-11376 Project: Spark Issue Type

[jira] [Updated] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11376: --- Priority: Major (was: Minor) > Invalid generated Java code in GenerateColumnAcces

[jira] [Updated] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11376: --- Description: There are two {{mutableRow}} fields in the generated code within

Re: Filter applied on merged Parquet shemsa with new column fails.

2015-10-28 Thread Cheng Lian
Hey Hyukjin, Sorry that I missed the JIRA ticket. Thanks for bringing this issue up here, and for your detailed investigation. From my side, I think this is a bug in Parquet. Parquet was designed to support schema evolution. When scanning a Parquet file, if a column exists in the requested schema but

[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977961#comment-14977961 ] Cheng Lian commented on SPARK-11103: Quoted from my reply on the user list: For 1: This one

[jira] [Created] (PARQUET-389) Filter predicates should work with missing columns

2015-10-28 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-389: -- Summary: Filter predicates should work with missing columns Key: PARQUET-389 URL: https://issues.apache.org/jira/browse/PARQUET-389 Project: Parquet Issue Type

[jira] [Updated] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11103: --- Assignee: Hyukjin Kwon > Filter applied on Merged Parquet shema with new column f

[jira] [Resolved] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11376. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9335 [https

[jira] [Created] (SPARK-11345) Make HadoopFsRelation always outputs UnsafeRow

2015-10-27 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11345: -- Summary: Make HadoopFsRelation always outputs UnsafeRow Key: SPARK-11345 URL: https://issues.apache.org/jira/browse/SPARK-11345 Project: Spark Issue Type: Bug

[jira] [Updated] (SPARK-10562) .partitionBy() creates the metastore partition columns in all lowercase, but persists the data path as MixedCase resulting in an error when the data is later attempted t

2015-10-25 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10562: --- Description: When using DataFrame.write.partitionBy().saveAsTable() it creates the partiton

[jira] [Resolved] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-20 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11153. Resolution: Fixed Fix Version/s: 1.5.2 1.6.0 Issue resolved by pull

[jira] [Updated] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11153: --- Description: Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be written

[jira] [Created] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11153: -- Summary: Turns off Parquet filter push-down for string and binary columns Key: SPARK-11153 URL: https://issues.apache.org/jira/browse/SPARK-11153 Project: Spark

[jira] [Updated] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11153: --- Priority: Blocker (was: Critical) > Turns off Parquet filter push-down for string and bin

[jira] [Updated] (SPARK-10895) Add pushdown string filters for Parquet

2015-10-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10895: --- Assignee: Liang-Chi Hsieh > Add pushdown string filters for Parq

Re: Projection pushdown with nested data type

2015-10-16 Thread Cheng Lian
ter to start with. --Mohammad On Thursday, October 15, 2015 10:04 AM, Cheng Lian <lian.cs@gmail.com> wrote: At its core, Parquet definitely supports reading selected fields of nested structs, and that's actually one of the initial motivations of Parquet. However, n

[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961281#comment-14961281 ] Cheng Lian commented on SPARK-11153: Yes, it's the statistics information that is corrupted. And yes

[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

2015-10-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961318#comment-14961318 ] Cheng Lian commented on SPARK-6859: --- This issue was left unresolved because Parquet filter push-down
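
The failure named in the SPARK-6859 title can be illustrated with a small Python model. It assumes the statistics collector keeps a reference to the writer's buffer instead of a copy; the class names `NaiveMinMax` and `CopyingMinMax` are hypothetical and do not correspond to Parquet classes.

```python
class NaiveMinMax:
    """Tracks min/max by keeping a reference to the buffer it was given."""

    def __init__(self):
        self.min = None
        self.max = None

    def update(self, buf):
        if self.min is None or bytes(buf) < bytes(self.min):
            self.min = buf
        if self.max is None or bytes(buf) > bytes(self.max):
            self.max = buf


class CopyingMinMax(NaiveMinMax):
    """Same logic, but takes an immutable copy before recording."""

    def update(self, buf):
        super().update(bytes(buf))


naive, copying = NaiveMinMax(), CopyingMinMax()
shared = bytearray(3)                  # one buffer reused for every row
for value in (b"mmm", b"aaa", b"zzz"):
    shared[:] = value                  # the writer overwrites the buffer in place
    naive.update(shared)
    copying.update(shared)

# naive.min and naive.max both alias `shared`, so they now read b"zzz";
# copying.min is b"aaa" and copying.max is b"zzz".
```

With statistics like the naive ones, a reader that skips row groups based on min/max would wrongly skip data, which is why push-down on such columns is risky.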

Re: requestedSchema vs fileSchema

2015-10-15 Thread Cheng Lian
Actually, the requested schema doesn't have to be a subset of the file schema. If a field in the requested schema doesn't exist in the file schema, Parquet fills that field with nulls, as long as the field is optional. Cheng On 10/14/15 6:25 PM, Alex Levenson wrote: It's always a cooperation
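
The null-filling rule in this reply can be modeled in a few lines of Python. This is only a sketch: `project` is a hypothetical helper, a dict stands in for a record read with the file schema, and raising an error for a missing required field is an assumption, since the message only covers the optional case.

```python
def project(record, requested_schema):
    """Project one file record onto the requested schema.

    requested_schema maps field name -> "optional" or "required".
    """
    out = {}
    for name, repetition in requested_schema.items():
        if name in record:
            out[name] = record[name]
        elif repetition == "optional":
            out[name] = None    # missing optional field is filled with null
        else:
            # Assumption: a missing required field cannot be filled.
            raise ValueError(f"required field {name!r} is missing from the file")
    return out


# "email" is not in the file, but it is optional in the requested schema.
row = project({"id": 1, "name": "a"}, {"id": "required", "email": "optional"})
```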

Re: Projection pushdown with nested data type

2015-10-15 Thread Cheng Lian
At its core, Parquet definitely supports reading selected fields of nested structs, and that's actually one of the initial motivations of Parquet. However, not all upper level Parquet data models enable it. For example, parquet-avro and parquet-thrift work fine, while parquet-hive has to

[jira] [Created] (SPARK-11117) PhysicalRDD.outputsUnsafeRows should return true when the underlying data source produces UnsafeRows

2015-10-14 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11117: -- Summary: PhysicalRDD.outputsUnsafeRows should return true when the underlying data source produces UnsafeRows Key: SPARK-11117 URL: https://issues.apache.org/jira/browse/SPARK-11117

[jira] [Resolved] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work

2015-10-14 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10829. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8916 [https

[jira] [Resolved] (SPARK-6561) Add partition support in saveAsParquet

2015-10-13 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6561. --- Resolution: Duplicate Assignee: Cheng Lian Fix Version/s: 1.4.0 > Add partit

[jira] [Created] (SPARK-11088) Optimize DataSourceStrategy.mergeWithPartitionValues

2015-10-13 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11088: -- Summary: Optimize DataSourceStrategy.mergeWithPartitionValues Key: SPARK-11088 URL: https://issues.apache.org/jira/browse/SPARK-11088 Project: Spark Issue Type

[jira] [Resolved] (SPARK-11018) Support UDT in codegen and unsafe projection

2015-10-12 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11018. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9016 [https

[jira] [Resolved] (SPARK-10990) Avoid the serialization multiple times during unrolling of complex types

2015-10-12 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10990. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9016 [https

Re: Fixed writer version as version1 for Parquet as wring a Parquet file.

2015-10-09 Thread Cheng Lian
Hi Hyukjin, Thanks for bringing this up. Could you please make a PR for this one? We didn't use PARQUET_2_0 mostly because it's less mature than PARQUET_1_0, but we should let users choose the writer version, as long as PARQUET_1_0 remains the default option. Cheng On 10/8/15 11:04 PM,

[jira] [Created] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow

2015-10-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10999: -- Summary: Physical plan node Coalesce should be able to handle UnsafeRow Key: SPARK-10999 URL: https://issues.apache.org/jira/browse/SPARK-10999 Project: Spark

[jira] [Resolved] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow

2015-10-08 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10999. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9024 [https

[jira] [Created] (HIVE-12069) Hive cannot read Parquet decimals backed by INT32 or INT64

2015-10-08 Thread Cheng Lian (JIRA)
Cheng Lian created HIVE-12069: - Summary: Hive cannot read Parquet decimals backed by INT32 or INT64 Key: HIVE-12069 URL: https://issues.apache.org/jira/browse/HIVE-12069 Project: Hive Issue Type

Re: Parquet file size

2015-10-08 Thread Cheng Lian
<mailto:younes.nag...@streamtheworld.com>** *From:* odeach...@gmail.com [odeach...@gmail.com] on behalf of Deng Ching-Mallete [och...@apache.org] *Sent:* Wednesday, October 07, 2015 9:14 PM *To:* Younes Naguib *Cc:* Cheng Lian

Re: Question about slight performance regression in 1.8.1 and 1.8.2 release schedule

2015-10-08 Thread Cheng Lian
a good idea to fix a performance regression. I'd really like to find out what caused it and what fixed it, though. Is it possible for you to bisect the Parquet tree and run the test? rb On 10/06/2015 10:09 AM, Cheng Lian wrote: Could anybody help elaborating on 1.8.2 release plan? Thanks

[jira] [Created] (SPARK-11007) Add dictionary support for CatalystDecimalConverter

2015-10-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11007: -- Summary: Add dictionary support for CatalystDecimalConverter Key: SPARK-11007 URL: https://issues.apache.org/jira/browse/SPARK-11007 Project: Spark Issue Type

[jira] [Resolved] (SPARK-6774) Implement Parquet complex types backwards-compatiblity rules

2015-10-08 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6774. --- Resolution: Fixed Finally, fixed all the Parquet compatibility issues after 6 months! > Implem

[jira] [Resolved] (SPARK-8848) Write Parquet LISTs and MAPs conforming to Parquet format spec

2015-10-08 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-8848. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8988 [https://github.com

Re: Parquet file size

2015-10-07 Thread Cheng Lian
, without month and day). Cheng So you want to dump all data into a single large Parquet file? On 10/7/15 1:55 PM, Younes Naguib wrote: The TSV original files is 600GB and generated 40k files of 15-25MB. y *From:*Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* October-07-15 3:18 PM

Re: Parquet file size

2015-10-07 Thread Cheng Lian
Why do you want larger files? Doesn't the result Parquet file contain all the data in the original TSV file? Cheng On 10/7/15 11:07 AM, Younes Naguib wrote: Hi, I’m reading a large tsv file, and creating parquet files using sparksql: insert overwrite table tbl partition(year, month,

[jira] [Created] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong

2015-10-06 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10954: -- Summary: Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong Key: SPARK-10954 URL: https://issues.apache.org/jira/br

[jira] [Commented] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong

2015-10-06 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946032#comment-14946032 ] Cheng Lian commented on SPARK-10954: Thanks, I'm assigning this one to you. > Parquet vers

[jira] [Updated] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong

2015-10-06 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10954: --- Assignee: Gayathri Murali > Parquet version in the "created_by" metadata field of

Re: Metadata in Parquet

2015-09-30 Thread Cheng Lian
Unfortunately this isn't supported at the moment https://issues.apache.org/jira/browse/SPARK-10803 Cheng On 9/30/15 10:54 AM, Philip Weaver wrote: Hi, I am using org.apache.spark.sql.types.Metadata to store extra information along with each of my fields. I'd also like to store Metadata for

[jira] [Resolved] (SPARK-10811) Minimize array copying cost in Parquet converters

2015-09-30 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10811. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8907 [https

[jira] [Comment Edited] (PARQUET-379) PrimitiveType.union erases original type

2015-09-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933630#comment-14933630 ] Cheng Lian edited comment on PARQUET-379 at 9/28/15 5:34 PM: - While trying

[jira] [Created] (PARQUET-385) PrimitiveType.union accepts fixed_len_byte_array fields with different length when strict mode is on

2015-09-28 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-385: -- Summary: PrimitiveType.union accepts fixed_len_byte_array fields with different length when strict mode is on Key: PARQUET-385 URL: https://issues.apache.org/jira/browse/PARQUET-385

[jira] [Commented] (PARQUET-379) PrimitiveType.union erases original type

2015-09-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933630#comment-14933630 ] Cheng Lian commented on PARQUET-379: While trying to fix this issue, I got a problem regarding

[jira] [Commented] (PARQUET-379) PrimitiveType.union erases original type

2015-09-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934084#comment-14934084 ] Cheng Lian commented on PARQUET-379: So deprecating non-strict schema merging seems to be reasonable

[jira] [Updated] (PARQUET-385) PrimitiveType.union accepts fixed_len_byte_array fields with different lengths when strict mode is on

2015-09-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-385: --- Summary: PrimitiveType.union accepts fixed_len_byte_array fields with different lengths when strict
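
A sketch of the check that PARQUET-385 says is missing, modeled in Python. The function `strict_union` and the dict-based field representation are hypothetical and are not the parquet-mr API; the sketch only shows the expectation that, in strict mode, two `fixed_len_byte_array` fields with different lengths should not merge.

```python
def strict_union(a, b):
    """Merge two primitive field descriptions under strict rules.

    Each field is a dict with "name", "type" and, for
    fixed_len_byte_array, a "length" entry.
    """
    if a["name"] != b["name"]:
        raise ValueError("cannot merge fields with different names")
    if a["type"] != b["type"]:
        raise ValueError(f"type mismatch: {a['type']} vs {b['type']}")
    if a["type"] == "fixed_len_byte_array" and a.get("length") != b.get("length"):
        raise ValueError(
            f"length mismatch: {a.get('length')} vs {b.get('length')}"
        )
    return dict(a)


f16 = {"name": "uuid", "type": "fixed_len_byte_array", "length": 16}
f32 = {"name": "uuid", "type": "fixed_len_byte_array", "length": 32}
merged = strict_union(f16, dict(f16))   # same length: merge succeeds
```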

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
I guess you're probably using Spark 1.5? Spark SQL does support schema merging, but we disabled it by default since 1.5 because it introduces extra performance costs (it was turned on by default in 1.3 and 1.4). You may enable schema merging via either the Parquet data source specific option

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
Also, you may find more details in the programming guide: - http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging - http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration Cheng On 9/28/15 3:54 PM, Cheng Lian wrote: I guess you're probably using

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
! The problem now is to filter out bad (miswritten) Parquet files, as they are causing this operation to fail. Any suggestions on detecting them quickly and easily? *From:*Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* Monday, September 28, 2015 5:56 PM *To:* Thomas, Jordan <jordan.

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
nd re-transferred. Thanks, Jordan *From:*Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* Monday, September 28, 2015 6:15 PM *To:* Thomas, Jordan <jordan.tho...@accenture.com>; mich...@databricks.com *Cc:* user@spark.apache.org *Subject:* Re: Performance when iterating over many parquet

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
g very similar this weekend. It works but is very slow. The Spark method I included in my original post is about 5-6 times faster. Just wondering if there is something even faster than that. I see this as being a recurring problem over the next few months. *From:*Cheng Lian [mailto:l

[jira] [Updated] (PARQUET-379) PrimitiveType.union erases original type

2015-09-27 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-379: --- Description: The following ScalaTest test case {code} test("merge primitive types")

[jira] [Created] (SPARK-10845) SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"

2015-09-26 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10845: -- Summary: SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v" Key: SPARK-10845 URL: https://issues.apache.org/jira/browse/SPARK-10845

[jira] [Resolved] (SPARK-10845) SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"

2015-09-26 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10845. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8925 [https

[jira] [Updated] (SPARK-10845) SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"

2015-09-26 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10845: --- Labels: backport-needed (was: ) > SQL option "spark.sql.hive.version" doesn't show up

[jira] [Reopened] (SPARK-10845) SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"

2015-09-26 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reopened SPARK-10845: Reopening it since it's not backported to branch-1.5 yet. > SQL option "spark.sql.hive

Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
Please set the SQL option spark.sql.parquet.binaryAsString to true when reading Parquet files containing strings generated by Hive. This is actually a bug in parquet-hive. When generating Parquet schema for a string field, Parquet requires a "UTF8" annotation, something like: message

Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
BTW, just checked that this bug should have been fixed since Hive 0.14.0. So the SQL option I mentioned is mostly used for reading legacy Parquet files generated by older versions of Hive. Cheng On 9/25/15 2:42 PM, Cheng Lian wrote: Please set the SQL option
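
For reference, a minimal sketch of how the option named in this thread could be set from a Spark SQL session (for example the spark-sql shell or a Thrift server connection). Only the option name and value come from the message; running it through a SET statement is an assumption about how the reader connects to Spark.

```sql
-- Treat Parquet BINARY columns that lack the UTF8 annotation as strings.
-- Mainly useful for legacy files written by Hive versions before 0.14.0.
SET spark.sql.parquet.binaryAsString=true;
```

The setting applies to the current session, so it has to be issued before the query that reads the affected Parquet files.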

Re: Using Map and Basic Operators yield java.lang.ClassCastException (Parquet + Hive + Spark SQL 1.5.0 + Thrift)

2015-09-25 Thread Cheng Lian
ndling INT is all good but float and double are causing the exception. Thanks. Dominic Ricard Triton Digital -Original Message- From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Thursday, September 24, 2015 5:47 PM To: Dominic Ricard; user@spark.apache.org Subject: Re: Using Map a

Re: spark + parquet + schema name and metadata

2015-09-24 Thread Cheng Lian
t out to see if this works ok. I am planning to use "stable" metadata - so those will be same across all parquet files inside directory hierarchy... On Tue, 22 Sep 2015 at 18:54 Cheng Lian <lian.cs@gmail.com> wrote: Mi

[jira] [Created] (SPARK-10803) Allow users to write and query Parquet user-defined key-value metadata directly

2015-09-24 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10803: -- Summary: Allow users to write and query Parquet user-defined key-value metadata directly Key: SPARK-10803 URL: https://issues.apache.org/jira/browse/SPARK-10803 Project

[jira] [Created] (SPARK-10811) Minimize array copying cost in Parquet converters

2015-09-24 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10811: -- Summary: Minimize array copying cost in Parquet converters Key: SPARK-10811 URL: https://issues.apache.org/jira/browse/SPARK-10811 Project: Spark Issue Type

[jira] [Commented] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2015-09-23 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905856#comment-14905856 ] Cheng Lian commented on SPARK-10659: This behavior had once been a hacky way to work around

[jira] [Comment Edited] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2015-09-23 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905856#comment-14905856 ] Cheng Lian edited comment on SPARK-10659 at 9/24/15 5:51 AM: - This behavior

[jira] [Updated] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2015-09-23 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10659: --- Description: DataFrames currently automatically promotes all Parquet schema fields to optional when

[jira] [Updated] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2015-09-23 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10659: --- Description: DataFrames currently automatically promotes all Parquet schema fields to optional when

[jira] [Created] (PARQUET-379) PrimitiveType.union erases original type

2015-09-23 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-379: -- Summary: PrimitiveType.union erases original type Key: PARQUET-379 URL: https://issues.apache.org/jira/browse/PARQUET-379 Project: Parquet Issue Type: Bug

[jira] [Updated] (PARQUET-379) PrimitiveType.union erases original type

2015-09-23 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-379: --- Description: The following ScalaTest test case {code} test("merge primitive types")

Re: spark + parquet + schema name and metadata

2015-09-22 Thread Cheng Lian
" them in some way (giving the schema appropriate name or attaching some key/values) and then it is fairly easy to get basic metadata about parquet files when processing and discovering those later on. On Mon, 21 Sep 2015 at 18:17 Cheng Lian <lian.cs@gmail.com <mailto:lia
