[jira] [Commented] (SPARK-5528) Support schema merging while reading Parquet files

2015-03-10 Thread chirag aggarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354731#comment-14354731
 ] 

chirag aggarwal commented on SPARK-5528:


This feature can have a severe performance impact for tables with a large number of 
partitions when only a few of them are referenced in the query. Earlier, schema 
information was extracted only for the partitions referenced in the query (usually a 
small subset). But now, to build the unified schema, all the partitions in the table 
are inspected for their schema, which can be a huge penalty when the table has a large 
number of partitions.
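
As an illustration of the concern (a minimal sketch only; the Spark 1.3-era API calls 
and the table layout below are assumptions, not taken from this issue):

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext, e.g. from spark-shell

// Hypothetical partitioned layout: /warehouse/events/dt=2015-01-01 ... /warehouse/events/dt=2015-03-10.
// With schema merging, computing the unified schema inspects the Parquet footers of
// every partition directory, not just the partitions the query below touches.
val events = sqlContext.parquetFile("/warehouse/events")

// Partition pruning limits the data that is scanned, but not the schema discovery above.
events.filter(events("dt") === "2015-03-10").select("id").count()
{code}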

 Support schema merging while reading Parquet files
 --

 Key: SPARK-5528
 URL: https://issues.apache.org/jira/browse/SPARK-5528
 Project: Spark
  Issue Type: Improvement
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.3.0


 Spark 1.2.0 and prior versions only read the Parquet schema from {{_metadata}} or a 
 random Parquet part-file, and assume all part-files share exactly the same schema.
 In practice, it's common for users to append new columns to an existing Parquet 
 schema. Parquet has native schema-merging support for such scenarios. Spark SQL 
 should support this as well.
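
A minimal sketch of the append-a-column scenario (assumed Spark 1.3-era API; the paths 
and column names are placeholders, not from this issue):

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
import sqlContext.implicits._

// Older part-files carry (key, value).
sc.parallelize(1 to 3).map(i => (i, s"val_$i")).toDF("key", "value")
  .saveAsParquetFile("/tmp/merge_demo/part1")

// Newer part-files append an extra column.
sc.parallelize(4 to 6).map(i => (i, s"val_$i", i * 2)).toDF("key", "value", "extra")
  .saveAsParquetFile("/tmp/merge_demo/part2")

// A merged read should expose the union of the two schemas (key, value, extra),
// with nulls for 'extra' in rows coming from the older part-files.
val merged = sqlContext.parquetFile("/tmp/merge_demo/part1", "/tmp/merge_demo/part2")
merged.printSchema()
{code}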





[jira] [Created] (SPARK-6242) Support replace (drop) column for parquet table

2015-03-10 Thread chirag aggarwal (JIRA)
chirag aggarwal created SPARK-6242:
--

 Summary: Support replace (drop) column for parquet table
 Key: SPARK-6242
 URL: https://issues.apache.org/jira/browse/SPARK-6242
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0
Reporter: chirag aggarwal








[jira] [Updated] (SPARK-6242) Support replace (drop) column for parquet table

2015-03-10 Thread chirag aggarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chirag aggarwal updated SPARK-6242:
---
Description: 
SPARK-5528 provides an easy way to support adding columns to Parquet tables. This is 
done by using Parquet's native capability of merging the schema from all the 
part-files and _common_metadata files.
But if someone wants to drop a column from a Parquet table, this still does not work. 
The merged schema still shows the dropped column, while the column is no longer 
present in the metastore. So the schemas obtained from the two sources do not match, 
and any subsequent query on this table fails.
Instead of checking for an exact match between the two schemas, Spark should only 
check whether the schema obtained from the metastore is a subset of the merged Parquet 
schema. If this check passes, the columns present in the metastore should be allowed 
to be referenced in the query.
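
A hedged sketch of the proposed check (illustrative only, not the actual Spark code): 
a query is allowed as long as every column recorded in the metastore appears, with the 
same type, in the merged Parquet schema.

{code}
import org.apache.spark.sql.types.StructType

def metastoreSchemaIsSubset(metastoreSchema: StructType, mergedParquetSchema: StructType): Boolean = {
  // Hive/Parquet column names are case-insensitive, so compare lower-cased names.
  val parquetFields = mergedParquetSchema.fields.map(f => f.name.toLowerCase -> f.dataType).toMap
  metastoreSchema.fields.forall { f =>
    parquetFields.get(f.name.toLowerCase).exists(_ == f.dataType)
  }
}
{code}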

 Support replace (drop) column for parquet table
 ---

 Key: SPARK-6242
 URL: https://issues.apache.org/jira/browse/SPARK-6242
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0
Reporter: chirag aggarwal

 SPARK-5528 provides an easy way to support adding columns to Parquet tables. This is 
 done by using Parquet's native capability of merging the schema from all the 
 part-files and _common_metadata files.
 But if someone wants to drop a column from a Parquet table, this still does not work. 
 The merged schema still shows the dropped column, while the column is no longer 
 present in the metastore. So the schemas obtained from the two sources do not match, 
 and any subsequent query on this table fails.
 Instead of checking for an exact match between the two schemas, Spark should only 
 check whether the schema obtained from the metastore is a subset of the merged 
 Parquet schema. If this check passes, the columns present in the metastore should be 
 allowed to be referenced in the query.





[jira] [Updated] (SPARK-3807) SparkSql does not work for tables created using custom serde

2014-10-07 Thread chirag aggarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chirag aggarwal updated SPARK-3807:
---
Description: 
Spark SQL crashes when selecting from tables created using a custom SerDe.

Example:

CREATE EXTERNAL TABLE table_name PARTITIONED BY (a int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
WITH SERDEPROPERTIES ('serialization.format'='org.apache.thrift.protocol.TBinaryProtocol', 'serialization.class'='ser_class')
STORED AS SEQUENCEFILE;

The following exception is seen on running a query like 'select * from table_name limit 1':

ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80)
at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
at org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:100)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:280)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException


After fixing this issue, when some columns of the table were referenced in the query, 
Spark SQL could not resolve those references.

  was:
Spark SQL crashes when selecting from tables created using a custom SerDe.

Example:

CREATE EXTERNAL TABLE table_name PARTITIONED BY (a int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
WITH SERDEPROPERTIES ('serialization.format'='org.apache.thrift.protocol.TBinaryProtocol', 'serialization.class'='ser_class')
STORED AS SEQUENCEFILE;

The following exception is seen on running a query like 'select * from 
table_name limit 1': 

ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80)
at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
at org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:100)

[jira] [Created] (SPARK-3807) SparkSql does not work for tables created using custom serde

2014-10-06 Thread chirag aggarwal (JIRA)
chirag aggarwal created SPARK-3807:
--

 Summary: SparkSql does not work for tables created using custom 
serde
 Key: SPARK-3807
 URL: https://issues.apache.org/jira/browse/SPARK-3807
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: chirag aggarwal
 Fix For: 1.1.1


Spark SQL crashes when selecting from tables created using a custom SerDe.

The following exception is seen on running a query like 'select * from 
table_name limit 1': 

ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80)
at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
at org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:100)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:280)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException





[jira] [Commented] (SPARK-3807) SparkSql does not work for tables created using custom serde

2014-10-06 Thread chirag aggarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160197#comment-14160197
 ] 

chirag aggarwal commented on SPARK-3807:


Pull request for this issue:
https://github.com/apache/spark/pull/2674

 SparkSql does not work for tables created using custom serde
 

 Key: SPARK-3807
 URL: https://issues.apache.org/jira/browse/SPARK-3807
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: chirag aggarwal
 Fix For: 1.1.1


 Spark SQL crashes when selecting from tables created using a custom SerDe.
 The following exception is seen on running a query like 'select * from 
 table_name limit 1': 
 ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException
 at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
 at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80)
 at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
 at org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:100)
 at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
 at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
 at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
 at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
 at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
 at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:280)
 at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
 at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
 at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
 at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
 at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406)
 at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
 at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
 at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
 at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.NullPointerException





[jira] [Updated] (SPARK-3807) SparkSql does not work for tables created using custom serde

2014-10-06 Thread chirag aggarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chirag aggarwal updated SPARK-3807:
---
Description: 
Spark SQL crashes when selecting from tables created using a custom SerDe.

Example:

CREATE EXTERNAL TABLE table_name PARTITIONED BY (a int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
WITH SERDEPROPERTIES ('serialization.format'='org.apache.thrift.protocol.TBinaryProtocol', 'serialization.class'='ser_class')
STORED AS SEQUENCEFILE;

The following exception is seen on running a query like 'select * from 
table_name limit 1': 

ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80)
at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
at org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:100)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:280)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException

  was:
Spark SQL crashes when selecting from tables created using a custom SerDe.

The following exception is seen on running a query like 'select * from 
table_name limit 1': 

ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80)
at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
at org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:100)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)

[jira] [Created] (SPARK-3231) select on a table in parquet format containing smallest as a field type does not work

2014-08-26 Thread chirag aggarwal (JIRA)
chirag aggarwal created SPARK-3231:
--

 Summary: select on a table in parquet format containing smallest 
as a field type does not work
 Key: SPARK-3231
 URL: https://issues.apache.org/jira/browse/SPARK-3231
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: The table is created through Hive-0.13.
SparkSql 1.1 is used.
Reporter: chirag aggarwal


A table is created through Hive. This table has a field of type smallint, and the 
format of the table is Parquet.
A select on this table works perfectly from the Hive shell.
But when the select is run on this table from spark-sql, the query fails.

Steps to reproduce the issue:
--
hive> create table abct (a smallint, b int) row format delimited fields terminated by '|' stored as textfile;
A text file is stored in HDFS for this table.

hive> create table abc (a smallint, b int) stored as parquet;
hive> insert overwrite table abc select * from abct;
hive> select * from abc;
2   1
2   2
2   3

spark-sql> select * from abc;
10:08:46 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 33.0 (TID 2340) had a not serializable result: org.apache.hadoop.io.IntWritable
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1158)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1147)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1146)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1146)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:685)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:685)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:685)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



But if the type of this column is now changed to int, then spark-sql gives the 
correct results.

hive> alter table abc change a a int;
spark-sql> select * from abc;

2   1
2   2
2   3





[jira] [Updated] (SPARK-3231) select on a table in parquet format containing smallint as a field type does not work

2014-08-26 Thread chirag aggarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chirag aggarwal updated SPARK-3231:
---

Summary: select on a table in parquet format containing smallint as a field 
type does not work  (was: select on a table in parquet format containing 
smallest as a field type does not work)

 select on a table in parquet format containing smallint as a field type does 
 not work
 -

 Key: SPARK-3231
 URL: https://issues.apache.org/jira/browse/SPARK-3231
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: The table is created through Hive-0.13.
 SparkSql 1.1 is used.
Reporter: chirag aggarwal

 A table is created through Hive. This table has a field of type smallint, and the 
 format of the table is Parquet.
 A select on this table works perfectly from the Hive shell.
 But when the select is run on this table from spark-sql, the query fails.
 Steps to reproduce the issue:
 --
 hive> create table abct (a smallint, b int) row format delimited fields terminated by '|' stored as textfile;
 A text file is stored in HDFS for this table.
 hive> create table abc (a smallint, b int) stored as parquet;
 hive> insert overwrite table abc select * from abct;
 hive> select * from abc;
 2 1
 2 2
 2 3
 spark-sql> select * from abc;
 10:08:46 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 33.0 (TID 2340) had a not serializable result: org.apache.hadoop.io.IntWritable
   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1158)
   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1147)
   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1146)
   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1146)
   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:685)
   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:685)
   at scala.Option.foreach(Option.scala:236)
   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:685)
   at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 But if the type of this column is now changed to int, then spark-sql gives the 
 correct results.
 hive> alter table abc change a a int;
 spark-sql> select * from abc;
 2 1
 2 2
 2 3


