Re: Column not found in schema when querying partitioned table

2015-03-27 Thread ๏̯͡๏
Hello Jon,
Are you able to connect to the existing Hive metastore and read tables created in Hive?
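
For example, a quick check with a HiveContext would look something like this
(a sketch only; nothing in this thread shows a HiveContext, so the context
and the query are assumptions on my part):

import org.apache.spark.sql.hive.HiveContext;

// Assumes an existing JavaSparkContext "sc"; HiveContext reads the
// hive-site.xml on the classpath to locate the existing metastore.
HiveContext hiveContext = new HiveContext(sc.sc());
hiveContext.sql("show tables").show();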
Regards,
deepak

On Thu, Mar 26, 2015 at 4:16 PM, Jon Chase jon.ch...@gmail.com wrote:

 I've filed this as https://issues.apache.org/jira/browse/SPARK-6554

 On Thu, Mar 26, 2015 at 6:29 AM, Jon Chase jon.ch...@gmail.com wrote:

 Spark 1.3.0, Parquet

 ...

Column not found in schema when querying partitioned table

2015-03-26 Thread Jon Chase
Spark 1.3.0, Parquet

I'm having trouble referencing partition columns in my queries.

In the following example, 'probeTypeId' is a partition column, and the
directory structure looks like this:

/mydata
    /probeTypeId=1
        ...files...
    /probeTypeId=2
        ...files...
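
The DF is loaded from the top-level /mydata directory so that partition
discovery picks up probeTypeId from the directory names. A minimal sketch of
that setup (the actual load call isn't shown in this thread, so
sqlContext.parquetFile() and the temp-table name "df" are assumptions):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ProbeTypeIdExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("probeTypeId-test").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Loading the root directory (not a single probeTypeId=N subdirectory)
        // lets partition discovery parse probeTypeId from the paths and add it
        // to the DataFrame schema.
        DataFrame df = sqlContext.parquetFile("/mydata");
        df.registerTempTable("df");
        df.printSchema();  // includes: |-- probeTypeId: integer (nullable = true)
    }
}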

I see the column when I load a DF using the /mydata directory and call
df.printSchema():

...
 |-- probeTypeId: integer (nullable = true)
...

Parquet is also aware of the column:
 optional int32 probeTypeId;

And this works fine:

sqlContext.sql("select probeTypeId from df limit 1");

...as does df.show() - it shows the correct values for the partition column.
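
For reference, the same query in DataFrame form would be something like this
(a sketch; this exact call isn't taken from my code):

// DataFrame equivalent of the working SQL query above.
df.select("probeTypeId").limit(1).show();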


However, when I try to use a partition column in a where clause, I get an
exception stating that the column was not found in the schema:

sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");



...
...
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage
0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column
[probeTypeId] was not found in schema!
at parquet.Preconditions.checkArgument(Preconditions.java:47)
at
parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
at
parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
at
parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
at
parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
at
parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
at
parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
at
parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
at
parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
at
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
...
...



What am I doing wrong?

Here's the full stack trace:

using local[*] for master
06:05:55,675 |-INFO in
ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute
not set
06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction -
About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction -
Naming appender as [STDOUT]
06:05:55,721 |-INFO in
ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default
type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder]
property
06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction
- Setting level of ROOT logger to INFO
06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction -
Attaching appender named [STDOUT] to Logger[ROOT]
06:05:55,769 |-INFO in
ch.qos.logback.classic.joran.action.ConfigurationAction - End of
configuration.
06:05:55,770 |-INFO in
ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering
current configuration as safe fallback point

INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
INFO  org.apache.spark.SecurityManager Changing view acls to: jon
INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
INFO  org.apache.spark.SecurityManager SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(jon); users
with modify permissions: Set(jon)
INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
INFO  Remoting Starting remoting
INFO  Remoting Remoting started; listening on addresses :[akka.tcp://
sparkDriver@192.168.1.134:62493]
INFO  org.apache.spark.util.Utils Successfully started service
'sparkDriver' on port 62493.
INFO  org.apache.spark.SparkEnv Registering MapOutputTracker
INFO  org.apache.spark.SparkEnv Registering BlockManagerMaster
INFO  o.a.spark.storage.DiskBlockManager Created local directory at
/var/folders/x7/9hdp8kw9569864088tsl4jmmgn/T/spark-150e23b2-ff19-4a51-8cfc-25fb8e1b3f2b/blockmgr-6eea286c-7473-4bda-8886-7250156b68f4
INFO  org.apache.spark.storage.MemoryStore MemoryStore started with
capacity 1966.1 MB
INFO  org.apache.spark.HttpFileServer HTTP File server directory is
/var/folders/x7/9hdp8kw9569864088tsl4jmmgn/T/spark-cf4687bd-1563-4ddf-b697-21c96fd95561/httpd-6343b9c9-bb66-43ac-ac43-6da80c7a1f95
INFO  org.apache.spark.HttpServer Starting HTTP Server
INFO  o.spark-project.jetty.server.Server 

Re: Column not found in schema when querying partitioned table

2015-03-26 Thread Jon Chase
I've filed this as https://issues.apache.org/jira/browse/SPARK-6554

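A possible workaround in the meantime (an assumption based on the Parquet
row-group filtering in the stack trace, not something verified in this
thread): disable Parquet filter push-down, so Spark evaluates the partition
predicate itself instead of handing it to the Parquet reader:

// Hypothetical workaround, relevant only if filter push-down is enabled:
// with push-down off, the probeTypeId predicate never reaches Parquet's
// SchemaCompatibilityValidator.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false");
sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
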
On Thu, Mar 26, 2015 at 6:29 AM, Jon Chase jon.ch...@gmail.com wrote:

 Spark 1.3.0, Parquet

 ...