[jira] [Resolved] (SPARK-6546) Using the wrong code that will make spark compile failed!!

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6546.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5198
[https://github.com/apache/spark/pull/5198]

 Using the wrong code that will make spark compile failed!! 
 ---

 Key: SPARK-6546
 URL: https://issues.apache.org/jira/browse/SPARK-6546
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.0
Reporter: DoingDone9
Assignee: DoingDone9
 Fix For: 1.4.0


 Wrong code: {{val tmpDir = Files.createTempDir()}}
 It should use {{Utils}} instead of {{Files}}.
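
 A minimal sketch of the fix described above, assuming the temp-directory helper in 
 {{org.apache.spark.util.Utils}} (the exact call site is the affected test suite):
 {code}
 // Create the temporary directory through Spark's own Utils helper instead of
 // Guava's com.google.common.io.Files.
 import org.apache.spark.util.Utils

 val tmpDir = Utils.createTempDir()
 {code}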






[jira] [Updated] (SPARK-6546) Using the wrong code that will make spark compile failed!!

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6546:
--
Affects Version/s: 1.4.0

 Using the wrong code that will make spark compile failed!! 
 ---

 Key: SPARK-6546
 URL: https://issues.apache.org/jira/browse/SPARK-6546
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.0
Reporter: DoingDone9
Assignee: DoingDone9
 Fix For: 1.4.0


 Wrong code: {{val tmpDir = Files.createTempDir()}}
 It should use {{Utils}} instead of {{Files}}.






[jira] [Commented] (SPARK-6546) Build failure caused by PR #5029 together with #4289

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381606#comment-14381606
 ] 

Cheng Lian commented on SPARK-6546:
---

Updated ticket title and description to reflect the root cause.

 Build failure caused by PR #5029 together with #4289
 

 Key: SPARK-6546
 URL: https://issues.apache.org/jira/browse/SPARK-6546
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Pei, Zhongshuai
Assignee: Pei, Zhongshuai
 Fix For: 1.4.0


 PR [#4289|https://github.com/apache/spark/pull/4289] was using Guava's 
 {{com.google.common.io.Files}} according to the first commit of that PR, see 
 [here|https://github.com/jeanlyn/spark/blob/3b27af36f82580c2171df965140c9a14e62fd5f0/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L22].
  However, [PR #5029|https://github.com/apache/spark/pull/5029] was merged 
 earlier and deprecated Guava {{Files}} in favor of Spark's own {{Utils}}. These two 
 changes combined caused this build failure. (There are no conflicts in the eyes of 
 Git, but there is a semantic conflict between them.)






[jira] [Created] (SPARK-6555) Override equals and hashCode in MetastoreRelation

2015-03-26 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6555:
-

 Summary: Override equals and hashCode in MetastoreRelation
 Key: SPARK-6555
 URL: https://issues.apache.org/jira/browse/SPARK-6555
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.2.1, 1.1.1, 1.0.2
Reporter: Cheng Lian


This is a follow-up of SPARK-6450.

As explained in [this 
comment|https://issues.apache.org/jira/browse/SPARK-6450?focusedCommentId=14379499&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14379499]
 of SPARK-6450, we resorted to a more surgical fix due to the upcoming 1.3.1 
release. But overriding {{equals}} and {{hashCode}} is the proper fix to that 
problem.
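
A hedged sketch of that proper fix; the class and field names below are placeholders, 
not the actual {{MetastoreRelation}} definition:
{code}
// Base equality on the table's identity (e.g. database and table name) instead of
// the default reference/case-class equality, and keep hashCode consistent with equals.
class MetastoreRelationLike(val databaseName: String, val tableName: String) {

  override def equals(other: Any): Boolean = other match {
    case that: MetastoreRelationLike =>
      databaseName == that.databaseName && tableName == that.tableName
    case _ => false
  }

  override def hashCode(): Int =
    41 * databaseName.hashCode + tableName.hashCode
}
{code}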






[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6554:
--
Target Version/s: 1.3.1, 1.4.0

 Cannot use partition columns in where clause
 

 Key: SPARK-6554
 URL: https://issues.apache.org/jira/browse/SPARK-6554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase
Assignee: Cheng Lian

 I'm having trouble referencing partition columns in my queries with Parquet.  
 In the following example, 'probeTypeId' is a partition column.  For example, 
 the directory structure looks like this:
 {noformat}
 /mydata
 /probeTypeId=1
 ...files...
 /probeTypeId=2
 ...files...
 {noformat}
 I see the column when I load a DF using the /mydata directory and 
 call df.printSchema():
 {noformat}
  |-- probeTypeId: integer (nullable = true)
 {noformat}
 Parquet is also aware of the column:
 {noformat}
  optional int32 probeTypeId;
 {noformat}
 And this works fine:
 {code}
 sqlContext.sql("select probeTypeId from df limit 1");
 {code}
 ...as does {{df.show()}} - it shows the correct values for the partition 
 column.
 However, when I try to use a partition column in a where clause, I get an 
 exception stating that the column was not found in the schema:
 {noformat}
 sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
 ...
 ...
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
 was not found in schema!
   at parquet.Preconditions.checkArgument(Preconditions.java:47)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
   at 
 parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
   at 
 parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
   at 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
 ...
 ...
 {noformat}
 Here's the full stack trace:
 {noformat}
 using local[*] for master
 06:05:55,675 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
 set
 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 Naming appender as [STDOUT]
 06:05:55,721 |-INFO in 
 ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
 type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
 property
 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
 Setting level of ROOT logger to INFO
 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
 Attaching appender named [STDOUT] to Logger[ROOT]
 06:05:55,769 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
 configuration.
 06:05:55,770 |-INFO in 
 ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
 configuration as safe fallback point
 INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
 WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 INFO  org.apache.spark.SecurityManager Changing view acls to: jon
 INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
 INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(jon); users with 
 modify permissions: Set(jon)
 INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
 INFO  Remoting Starting remoting
 INFO  Remoting Remoting started; listening on addresses 
 

[jira] [Assigned] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-6554:
-

Assignee: Cheng Lian

 Cannot use partition columns in where clause
 

 Key: SPARK-6554
 URL: https://issues.apache.org/jira/browse/SPARK-6554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase
Assignee: Cheng Lian

 I'm having trouble referencing partition columns in my queries with Parquet.  
 In the following example, 'probeTypeId' is a partition column.  For example, 
 the directory structure looks like this:
 {noformat}
 /mydata
 /probeTypeId=1
 ...files...
 /probeTypeId=2
 ...files...
 {noformat}
 I see the column when I load a DF using the /mydata directory and 
 call df.printSchema():
 {noformat}
  |-- probeTypeId: integer (nullable = true)
 {noformat}
 Parquet is also aware of the column:
 {noformat}
  optional int32 probeTypeId;
 {noformat}
 And this works fine:
 {code}
 sqlContext.sql("select probeTypeId from df limit 1");
 {code}
 ...as does {{df.show()}} - it shows the correct values for the partition 
 column.
 However, when I try to use a partition column in a where clause, I get an 
 exception stating that the column was not found in the schema:
 {noformat}
 sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
 ...
 ...
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
 was not found in schema!
   at parquet.Preconditions.checkArgument(Preconditions.java:47)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
   at 
 parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
   at 
 parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
   at 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
 ...
 ...
 {noformat}
 Here's the full stack trace:
 {noformat}
 using local[*] for master
 06:05:55,675 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
 set
 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 Naming appender as [STDOUT]
 06:05:55,721 |-INFO in 
 ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
 type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
 property
 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
 Setting level of ROOT logger to INFO
 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
 Attaching appender named [STDOUT] to Logger[ROOT]
 06:05:55,769 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
 configuration.
 06:05:55,770 |-INFO in 
 ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
 configuration as safe fallback point
 INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
 WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 INFO  org.apache.spark.SecurityManager Changing view acls to: jon
 INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
 INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(jon); users with 
 modify permissions: Set(jon)
 INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
 INFO  Remoting Starting remoting
 INFO  Remoting Remoting started; listening on addresses 
 

[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381703#comment-14381703
 ] 

Cheng Lian commented on SPARK-6481:
---

Maybe unrelated to this issue, but I saw a lot of JIRA notifications about 
Assignee updates, jumping between a normal user and Apache Spark. Is this 
behavior a side effect of the In Progress PR? (Seems caused by [this code 
block|https://github.com/databricks/spark-pr-dashboard/pull/49/files#diff-6f3562e8b8a773341837373ab53b5462R34].)

 Set In Progress when a PR is opened for an issue
 --

 Key: SPARK-6481
 URL: https://issues.apache.org/jira/browse/SPARK-6481
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Michael Armbrust
Assignee: Nicholas Chammas

 [~pwendell] and I are not sure if this is possible, but it would be really 
 helpful if the JIRA status was updated to In Progress when we do the 
 linking to an open pull request.






[jira] [Resolved] (SPARK-6465) GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg constructor):

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6465.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Issue resolved by pull request 5191
[https://github.com/apache/spark/pull/5191]

 GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg 
 constructor):
 --

 Key: SPARK-6465
 URL: https://issues.apache.org/jira/browse/SPARK-6465
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Spark 1.3, YARN 2.6.0, CentOS
Reporter: Earthson Lu
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.3.1, 1.4.0

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 I could not find an existing issue for this. 
 The Kryo registration for GenericRowWithSchema is missing in 
 org.apache.spark.sql.execution.SparkSqlSerializer.
 Is adding that registration the only thing we need to do?
 Here is the log:
 {code}
 15/03/23 16:21:00 WARN TaskSetManager: Lost task 9.0 in stage 20.0 (TID 
 31978, datanode06.site): com.esotericsoftware.kryo.KryoException: Class 
 cannot be created (missing no-arg constructor): 
 org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
 at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050)
 at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
 at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 at 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138)
 at 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
 at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at 
 org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:66)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at 
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217)
 at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 {code}
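
 The suggestion above is to register the missing class with Kryo. A hedged sketch of 
 what such a registration could look like through the standard 
 {{spark.kryo.registrator}} hook; the actual fix inside SparkSqlSerializer may differ, 
 and whether registration alone is sufficient is exactly the open question here:
 {code}
 import com.esotericsoftware.kryo.Kryo
 import org.apache.spark.serializer.KryoRegistrator
 import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

 // Registers GenericRowWithSchema so Kryo knows about the class up front.
 class RowRegistrator extends KryoRegistrator {
   override def registerClasses(kryo: Kryo): Unit = {
     kryo.register(classOf[GenericRowWithSchema])
   }
 }

 // Enabled via:
 //   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
 //   --conf spark.kryo.registrator=<fully.qualified.RowRegistrator>
 {code}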






[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6554:
--
Description: 
I'm having trouble referencing partition columns in my queries with Parquet.  

In the following example, 'probeTypeId' is a partition column.  For example, 
the directory structure looks like this:
{noformat}
/mydata
/probeTypeId=1
...files...
/probeTypeId=2
...files...
{noformat}
I see the column when I load a DF using the /mydata directory and 
call df.printSchema():
{noformat}
 |-- probeTypeId: integer (nullable = true)
{noformat}
Parquet is also aware of the column:
{noformat}
 optional int32 probeTypeId;
{noformat}
And this works fine:
{code}
sqlContext.sql("select probeTypeId from df limit 1");
{code}
...as does {{df.show()}} - it shows the correct values for the partition column.

However, when I try to use a partition column in a where clause, I get an 
exception stating that the column was not found in the schema:
{noformat}
sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
...
...
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not 
found in schema!
at parquet.Preconditions.checkArgument(Preconditions.java:47)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
at 
parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
at 
parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
at 
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
...
...
{noformat}
Here's the full stack trace:
{noformat}
using local[*] for master
06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction 
- debug attribute not set
06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About 
to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming 
appender as [STDOUT]
06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA 
- Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] 
for [encoder] property
06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
Setting level of ROOT logger to INFO
06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
Attaching appender named [STDOUT] to Logger[ROOT]
06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction 
- End of configuration.
06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd 
- Registering current configuration as safe fallback point

INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for 
your platform... using builtin-java classes where applicable
INFO  org.apache.spark.SecurityManager Changing view acls to: jon
INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(jon); users with 
modify permissions: Set(jon)
INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
INFO  Remoting Starting remoting
INFO  Remoting Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@192.168.1.134:62493]
INFO  org.apache.spark.util.Utils Successfully started service 'sparkDriver' on 
port 62493.
INFO  org.apache.spark.SparkEnv Registering MapOutputTracker
INFO  org.apache.spark.SparkEnv Registering BlockManagerMaster
INFO  o.a.spark.storage.DiskBlockManager Created local directory at 
/var/folders/x7/9hdp8kw9569864088tsl4jmmgn/T/spark-150e23b2-ff19-4a51-8cfc-25fb8e1b3f2b/blockmgr-6eea286c-7473-4bda-8886-7250156b68f4
INFO  

[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381850#comment-14381850
 ] 

Cheng Lian commented on SPARK-6554:
---

Marked this as critical rather than blocker mostly because Parquet filter 
push-down is not enabled by default in 1.3.0.

 Cannot use partition columns in where clause when Parquet filter push-down is 
 enabled
 -

 Key: SPARK-6554
 URL: https://issues.apache.org/jira/browse/SPARK-6554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase
Assignee: Cheng Lian
Priority: Critical

 I'm having trouble referencing partition columns in my queries with Parquet.  
 In the following example, 'probeTypeId' is a partition column.  For example, 
 the directory structure looks like this:
 {noformat}
 /mydata
 /probeTypeId=1
 ...files...
 /probeTypeId=2
 ...files...
 {noformat}
 I see the column when I load a DF using the /mydata directory and 
 call df.printSchema():
 {noformat}
  |-- probeTypeId: integer (nullable = true)
 {noformat}
 Parquet is also aware of the column:
 {noformat}
  optional int32 probeTypeId;
 {noformat}
 And this works fine:
 {code}
 sqlContext.sql("select probeTypeId from df limit 1");
 {code}
 ...as does {{df.show()}} - it shows the correct values for the partition 
 column.
 However, when I try to use a partition column in a where clause, I get an 
 exception stating that the column was not found in the schema:
 {noformat}
 sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
 ...
 ...
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
 was not found in schema!
   at parquet.Preconditions.checkArgument(Preconditions.java:47)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
   at 
 parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
   at 
 parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
   at 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
 ...
 ...
 {noformat}
 Here's the full stack trace:
 {noformat}
 using local[*] for master
 06:05:55,675 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
 set
 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 Naming appender as [STDOUT]
 06:05:55,721 |-INFO in 
 ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
 type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
 property
 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
 Setting level of ROOT logger to INFO
 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
 Attaching appender named [STDOUT] to Logger[ROOT]
 06:05:55,769 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
 configuration.
 06:05:55,770 |-INFO in 
 ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
 configuration as safe fallback point
 INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
 WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 INFO  org.apache.spark.SecurityManager Changing view acls to: jon
 INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
 INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
 disabled; ui 

[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381862#comment-14381862
 ] 

Cheng Lian commented on SPARK-6554:
---

Parquet filter push-down isn't enabled by default in 1.3.0 because the most 
recent Parquet version available as of the Spark 1.3.0 release (1.6.0rc3) suffers 
from two bugs (PARQUET-136 and PARQUET-173), so it isn't generally recommended for 
production use yet. These two bugs have been fixed in Parquet master, and the 
official 1.6.0 release should be out pretty soon. We will probably upgrade to 
Parquet 1.6.0 in Spark 1.4.0.
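
For reference, a minimal sketch of toggling that flag on a SQLContext; the key is the 
one mentioned elsewhere in this thread, and off is the 1.3.0 default:
{code}
// Opt in to Parquet filter push-down (the setting under which this bug appears).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

// Revert to the 1.3.0 default behaviour.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
{code}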

 Cannot use partition columns in where clause when Parquet filter push-down is 
 enabled
 -

 Key: SPARK-6554
 URL: https://issues.apache.org/jira/browse/SPARK-6554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase
Assignee: Cheng Lian
Priority: Critical

 I'm having trouble referencing partition columns in my queries with Parquet.  
 In the following example, 'probeTypeId' is a partition column.  For example, 
 the directory structure looks like this:
 {noformat}
 /mydata
 /probeTypeId=1
 ...files...
 /probeTypeId=2
 ...files...
 {noformat}
 I see the column when I load a DF using the /mydata directory and 
 call df.printSchema():
 {noformat}
  |-- probeTypeId: integer (nullable = true)
 {noformat}
 Parquet is also aware of the column:
 {noformat}
  optional int32 probeTypeId;
 {noformat}
 And this works fine:
 {code}
 sqlContext.sql("select probeTypeId from df limit 1");
 {code}
 ...as does {{df.show()}} - it shows the correct values for the partition 
 column.
 However, when I try to use a partition column in a where clause, I get an 
 exception stating that the column was not found in the schema:
 {noformat}
 sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
 ...
 ...
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
 was not found in schema!
   at parquet.Preconditions.checkArgument(Preconditions.java:47)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
   at 
 parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
   at 
 parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
   at 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
 ...
 ...
 {noformat}
 Here's the full stack trace:
 {noformat}
 using local[*] for master
 06:05:55,675 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
 set
 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 Naming appender as [STDOUT]
 06:05:55,721 |-INFO in 
 ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
 type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
 property
 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
 Setting level of ROOT logger to INFO
 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
 Attaching appender named [STDOUT] to Logger[ROOT]
 06:05:55,769 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
 configuration.
 06:05:55,770 |-INFO in 
 ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
 configuration as safe fallback point
 INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
 WARN  o.a.hadoop.util.NativeCodeLoader Unable to load 

[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6554:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-5463

 Cannot use partition columns in where clause when Parquet filter push-down is 
 enabled
 -

 Key: SPARK-6554
 URL: https://issues.apache.org/jira/browse/SPARK-6554
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase
Assignee: Cheng Lian
Priority: Critical

 I'm having trouble referencing partition columns in my queries with Parquet.  
 In the following example, 'probeTypeId' is a partition column.  For example, 
 the directory structure looks like this:
 {noformat}
 /mydata
 /probeTypeId=1
 ...files...
 /probeTypeId=2
 ...files...
 {noformat}
 I see the column when I load a DF using the /mydata directory and 
 call df.printSchema():
 {noformat}
  |-- probeTypeId: integer (nullable = true)
 {noformat}
 Parquet is also aware of the column:
 {noformat}
  optional int32 probeTypeId;
 {noformat}
 And this works fine:
 {code}
 sqlContext.sql("select probeTypeId from df limit 1");
 {code}
 ...as does {{df.show()}} - it shows the correct values for the partition 
 column.
 However, when I try to use a partition column in a where clause, I get an 
 exception stating that the column was not found in the schema:
 {noformat}
 sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
 ...
 ...
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
 was not found in schema!
   at parquet.Preconditions.checkArgument(Preconditions.java:47)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
   at 
 parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
   at 
 parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
   at 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
 ...
 ...
 {noformat}
 Here's the full stack trace:
 {noformat}
 using local[*] for master
 06:05:55,675 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
 set
 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 Naming appender as [STDOUT]
 06:05:55,721 |-INFO in 
 ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
 type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
 property
 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
 Setting level of ROOT logger to INFO
 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
 Attaching appender named [STDOUT] to Logger[ROOT]
 06:05:55,769 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
 configuration.
 06:05:55,770 |-INFO in 
 ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
 configuration as safe fallback point
 INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
 WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 INFO  org.apache.spark.SecurityManager Changing view acls to: jon
 INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
 INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(jon); users with 
 modify permissions: Set(jon)
 INFO  

[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-5463:
--
Affects Version/s: 1.3.0

 Fix Parquet filter push-down
 

 Key: SPARK-5463
 URL: https://issues.apache.org/jira/browse/SPARK-5463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker








[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381811#comment-14381811
 ] 

Cheng Lian commented on SPARK-6481:
---

Aha, so I'm not the only one! Although I just started doing this pretty 
recently :P 

 Set In Progress when a PR is opened for an issue
 --

 Key: SPARK-6481
 URL: https://issues.apache.org/jira/browse/SPARK-6481
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Michael Armbrust
Assignee: Nicholas Chammas

 [~pwendell] and I are not sure if this is possible, but it would be really 
 helpful if the JIRA status was updated to In Progress when we do the 
 linking to an open pull request.






[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6554:
--
Summary: Cannot use partition columns in where clause when Parquet filter 
push-down is enabled  (was: Cannot use partition columns in where clause)

 Cannot use partition columns in where clause when Parquet filter push-down is 
 enabled
 -

 Key: SPARK-6554
 URL: https://issues.apache.org/jira/browse/SPARK-6554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase
Assignee: Cheng Lian
Priority: Critical

 I'm having trouble referencing partition columns in my queries with Parquet.  
 In the following example, 'probeTypeId' is a partition column.  For example, 
 the directory structure looks like this:
 {noformat}
 /mydata
 /probeTypeId=1
 ...files...
 /probeTypeId=2
 ...files...
 {noformat}
 I see the column when I load a DF using the /mydata directory and 
 call df.printSchema():
 {noformat}
  |-- probeTypeId: integer (nullable = true)
 {noformat}
 Parquet is also aware of the column:
 {noformat}
  optional int32 probeTypeId;
 {noformat}
 And this works fine:
 {code}
 sqlContext.sql("select probeTypeId from df limit 1");
 {code}
 ...as does {{df.show()}} - it shows the correct values for the partition 
 column.
 However, when I try to use a partition column in a where clause, I get an 
 exception stating that the column was not found in the schema:
 {noformat}
 sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
 ...
 ...
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
 was not found in schema!
   at parquet.Preconditions.checkArgument(Preconditions.java:47)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
   at 
 parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
   at 
 parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
   at 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
 ...
 ...
 {noformat}
 Here's the full stack trace:
 {noformat}
 using local[*] for master
 06:05:55,675 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
 set
 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 Naming appender as [STDOUT]
 06:05:55,721 |-INFO in 
 ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
 type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
 property
 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
 Setting level of ROOT logger to INFO
 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
 Attaching appender named [STDOUT] to Logger[ROOT]
 06:05:55,769 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
 configuration.
 06:05:55,770 |-INFO in 
 ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
 configuration as safe fallback point
 INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
 WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 INFO  org.apache.spark.SecurityManager Changing view acls to: jon
 INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
 INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
 disabled; ui acls disabled; users with 

[jira] [Commented] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381846#comment-14381846
 ] 

Cheng Lian commented on SPARK-6471:
---

Bumped to blocker level since this is actually a regression from 1.2.

 Metastore schema should only be a subset of parquet schema to support 
 dropping of columns using replace columns
 ---

 Key: SPARK-6471
 URL: https://issues.apache.org/jira/browse/SPARK-6471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
Assignee: Yash Datta
Priority: Blocker
 Fix For: 1.3.1, 1.4.0


 Currently, the parquet relation 2 implementation throws an error whenever the 
 merged Parquet schema is not exactly the same as the metastore schema. 
 But to support cases like dropping a column with the replace columns command, we 
 can relax this restriction so that the query still works even when the metastore 
 schema is only a subset of the merged Parquet schema.
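
 A hedged sketch of the relaxed check described above (function name is illustrative; 
 this is not the actual ParquetRelation2 code):
 {code}
 import org.apache.spark.sql.types.StructType

 // Instead of requiring the metastore schema to equal the merged Parquet schema,
 // only require that every metastore column exists (with a matching type) in the
 // merged schema, so columns dropped via REPLACE COLUMNS no longer cause an error.
 def isCompatible(metastoreSchema: StructType, mergedParquetSchema: StructType): Boolean = {
   val parquetFields = mergedParquetSchema.fields.map(f => f.name.toLowerCase -> f.dataType).toMap
   metastoreSchema.fields.forall { f =>
     parquetFields.get(f.name.toLowerCase).exists(_ == f.dataType)
   }
 }
 {code}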






[jira] [Updated] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6471:
--
Priority: Blocker  (was: Major)
Target Version/s: 1.3.1, 1.4.0

 Metastore schema should only be a subset of parquet schema to support 
 dropping of columns using replace columns
 ---

 Key: SPARK-6471
 URL: https://issues.apache.org/jira/browse/SPARK-6471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
Assignee: Yash Datta
Priority: Blocker
 Fix For: 1.3.1, 1.4.0


 Currently, the parquet relation 2 implementation throws an error whenever the 
 merged Parquet schema is not exactly the same as the metastore schema. 
 But to support cases like dropping a column with the replace columns command, we 
 can relax this restriction so that the query still works even when the metastore 
 schema is only a subset of the merged Parquet schema.






[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381788#comment-14381788
 ] 

Cheng Lian commented on SPARK-6554:
---

Hi [~jonchase], did you happen to turn on Parquet filter push-down by setting 
spark.sql.parquet.filterPushdown to true? The reason behind this is that, in 
your case, the partition column doesn't exist in the Parquet data file, thus 
Parquet filter push-down logic sees it as an invalid column. We should remove 
all predicates that touch those partition columns which don't exist in Parquet 
data files before doing the push-down optimization.
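
A hedged illustration of that idea in terms of the public data source Filter API; the 
helper names are made up, and the actual change would live inside the Parquet data source:
{code}
import org.apache.spark.sql.sources._

// Collect the column names referenced by a pushed-down filter.
def referencedColumns(filter: Filter): Seq[String] = filter match {
  case EqualTo(attr, _)            => Seq(attr)
  case GreaterThan(attr, _)        => Seq(attr)
  case GreaterThanOrEqual(attr, _) => Seq(attr)
  case LessThan(attr, _)           => Seq(attr)
  case LessThanOrEqual(attr, _)    => Seq(attr)
  case In(attr, _)                 => Seq(attr)
  case IsNull(attr)                => Seq(attr)
  case IsNotNull(attr)             => Seq(attr)
  case And(left, right)            => referencedColumns(left) ++ referencedColumns(right)
  case Or(left, right)             => referencedColumns(left) ++ referencedColumns(right)
  case Not(child)                  => referencedColumns(child)
  case _                           => Seq.empty
}

// Keep only filters that do not touch partition columns, since those columns are
// absent from the Parquet files themselves.
def pushableFilters(filters: Seq[Filter], partitionColumns: Set[String]): Seq[Filter] =
  filters.filterNot(f => referencedColumns(f).exists(partitionColumns.contains))
{code}
With a step like this in place, only predicates over columns that actually exist in the 
Parquet files would reach the SchemaCompatibilityValidator shown in the stack trace.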

 Cannot use partition columns in where clause
 

 Key: SPARK-6554
 URL: https://issues.apache.org/jira/browse/SPARK-6554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase
Assignee: Cheng Lian

 I'm having trouble referencing partition columns in my queries with Parquet.  
 In the following example, 'probeTypeId' is a partition column.  For example, 
 the directory structure looks like this:
 {noformat}
 /mydata
 /probeTypeId=1
 ...files...
 /probeTypeId=2
 ...files...
 {noformat}
 I see the column when I load a DF using the /mydata directory and 
 call df.printSchema():
 {noformat}
  |-- probeTypeId: integer (nullable = true)
 {noformat}
 Parquet is also aware of the column:
 {noformat}
  optional int32 probeTypeId;
 {noformat}
 And this works fine:
 {code}
 sqlContext.sql("select probeTypeId from df limit 1");
 {code}
 ...as does {{df.show()}} - it shows the correct values for the partition 
 column.
 However, when I try to use a partition column in a where clause, I get an 
 exception stating that the column was not found in the schema:
 {noformat}
 sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
 ...
 ...
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
 was not found in schema!
   at parquet.Preconditions.checkArgument(Preconditions.java:47)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
   at 
 parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
   at 
 parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
   at 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
 ...
 ...
 {noformat}
 Here's the full stack trace:
 {noformat}
 using local[*] for master
 06:05:55,675 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
 set
 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 Naming appender as [STDOUT]
 06:05:55,721 |-INFO in 
 ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
 type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
 property
 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
 Setting level of ROOT logger to INFO
 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
 Attaching appender named [STDOUT] to Logger[ROOT]
 06:05:55,769 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
 configuration.
 06:05:55,770 |-INFO in 
 ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
 configuration as safe fallback point
 INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
 WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 INFO  

[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6554:
--
Priority: Critical  (was: Major)

 Cannot use partition columns in where clause
 

 Key: SPARK-6554
 URL: https://issues.apache.org/jira/browse/SPARK-6554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase
Assignee: Cheng Lian
Priority: Critical

 I'm having trouble referencing partition columns in my queries with Parquet.  
 In the following example, 'probeTypeId' is a partition column.  For example, 
 the directory structure looks like this:
 {noformat}
 /mydata
 /probeTypeId=1
 ...files...
 /probeTypeId=2
 ...files...
 {noformat}
 I see the column when I load a DF using the /mydata directory and 
 call df.printSchema():
 {noformat}
  |-- probeTypeId: integer (nullable = true)
 {noformat}
 Parquet is also aware of the column:
 {noformat}
  optional int32 probeTypeId;
 {noformat}
 And this works fine:
 {code}
 sqlContext.sql("select probeTypeId from df limit 1");
 {code}
 ...as does {{df.show()}} - it shows the correct values for the partition 
 column.
 However, when I try to use a partition column in a where clause, I get an 
 exception stating that the column was not found in the schema:
 {noformat}
 sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
 ...
 ...
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
 was not found in schema!
   at parquet.Preconditions.checkArgument(Preconditions.java:47)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
   at 
 parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
   at 
 parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
   at 
 parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
   at 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
 ...
 ...
 {noformat}
 Here's the full stack trace:
 {noformat}
 using local[*] for master
 06:05:55,675 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
 set
 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
 Naming appender as [STDOUT]
 06:05:55,721 |-INFO in 
 ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
 type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
 property
 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
 Setting level of ROOT logger to INFO
 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
 Attaching appender named [STDOUT] to Logger[ROOT]
 06:05:55,769 |-INFO in 
 ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
 configuration.
 06:05:55,770 |-INFO in 
 ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
 configuration as safe fallback point
 INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
 WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 INFO  org.apache.spark.SecurityManager Changing view acls to: jon
 INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
 INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(jon); users with 
 modify permissions: Set(jon)
 INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
 INFO  Remoting Starting remoting
 INFO  Remoting Remoting started; 

[jira] [Resolved] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6471.
---
   Resolution: Fixed
Fix Version/s: 1.3.1

Issue resolved by pull request 5141
[https://github.com/apache/spark/pull/5141]

 Metastore schema should only be a subset of parquet schema to support 
 dropping of columns using replace columns
 ---

 Key: SPARK-6471
 URL: https://issues.apache.org/jira/browse/SPARK-6471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
 Fix For: 1.3.1, 1.4.0


 Currently, in the parquet relation 2 implementation, an error is thrown when the 
 merged schema is not exactly the same as the metastore schema. 
 But to support cases like dropping a column with the replace columns command, we 
 can relax the restriction so that the query still works as long as the metastore 
 schema is a subset of the merged parquet schema.
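 A rough sketch of the relaxed check (variable names are hypothetical, and only 
 field names are compared; the real fix may also need to reconcile types and 
 ordering):
 {code}
 // Instead of requiring metastoreSchema == mergedParquetSchema, only require
 // that every metastore column exists in the merged Parquet schema.
 val metastoreFields = metastoreSchema.fieldNames.toSet
 val parquetFields = mergedParquetSchema.fieldNames.toSet
 require(
   metastoreFields.subsetOf(parquetFields),
   s"Metastore schema ${metastoreFields.mkString(", ")} is not a subset of " +
     s"the merged Parquet schema ${parquetFields.mkString(", ")}")
 {code}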



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6471:
--
Assignee: Yash Datta

 Metastore schema should only be a subset of parquet schema to support 
 dropping of columns using replace columns
 ---

 Key: SPARK-6471
 URL: https://issues.apache.org/jira/browse/SPARK-6471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
Assignee: Yash Datta
 Fix For: 1.3.1, 1.4.0


 Currently in the parquet relation 2 implementation, error is thrown in case 
 merged schema is not exactly the same as metastore schema. 
 But to support cases like deletion of column using replace column command, we 
 can relax the restriction so that even if metastore schema is a subset of 
 merged parquet schema, the query will work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6575) Add configuration to disable schema merging while converting metastore Parquet tables

2015-03-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6575:
--
Description: 
Consider a metastore Parquet table that
# doesn't have schema evolution issue
# has lots of data files and/or partitions

In this case, driver schema merging can be both slow and unnecessary. It would be 
good to have a configuration that lets the user disable schema merging when 
converting such a metastore Parquet table.

  was:
Consider a metastore Parquet table that
# doesn't have schema evolution issue
# has lots of data files and/or partitions

In this case, driver schema merging can be both slow and unnecessary. Would be 
good to have a configuration to let the use disable schema merging when 
coverting such a metastore Parquet table.


 Add configuration to disable schema merging while converting metastore 
 Parquet tables
 -

 Key: SPARK-6575
 URL: https://issues.apache.org/jira/browse/SPARK-6575
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 Consider a metastore Parquet table that
 # doesn't have schema evolution issue
 # has lots of data files and/or partitions
 In this case, driver schema merging can be both slow and unnecessary. It would 
 be good to have a configuration that lets the user disable schema merging when 
 converting such a metastore Parquet table.
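 A usage sketch of what this could look like (the configuration key below is a 
 placeholder, not a name that exists yet):
 {code}
 // Hypothetical setting: skip schema merging when converting metastore Parquet
 // tables whose data files are known to share a uniform schema.
 sqlContext.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")
 {code}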



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6608) Make DataFrame.rdd a lazy val

2015-03-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-6608:
-

Assignee: Cheng Lian

 Make DataFrame.rdd a lazy val
 -

 Key: SPARK-6608
 URL: https://issues.apache.org/jira/browse/SPARK-6608
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor

 Before 1.3.0, {{SchemaRDD.id}} worked as a unique identifier of each 
 {{SchemaRDD}}. In 1.3.0, unlike {{SchemaRDD}}, {{DataFrame}} is no longer an 
 RDD, and {{DataFrame.rdd}} is actually a method that always returns a new 
 RDD instance. Making {{DataFrame.rdd}} a {{lazy val}} should bring the unique 
 identifier back.
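 A minimal, self-contained illustration of the semantic difference (not the 
 actual {{DataFrame}} code; the counter stands in for RDD creation):
 {code}
 object LazyValVsDef {
   private var counter = 0
   private def build(): Int = { counter += 1; counter }

   def asDef: Int = build()          // a new value (a new RDD) on every access
   lazy val asLazyVal: Int = build() // built once, so its identity is stable
 }

 // LazyValVsDef.asDef != LazyValVsDef.asDef           (ids differ between calls)
 // LazyValVsDef.asLazyVal == LazyValVsDef.asLazyVal   (same instance every time)
 {code}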



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema

2015-03-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6607:
--
Assignee: Liang-Chi Hsieh

 Aggregation attribute name including special chars '(' and ')' should be 
 replaced before generating Parquet schema
 --

 Key: SPARK-6607
 URL: https://issues.apache.org/jira/browse/SPARK-6607
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh

 '(' and ')' are special characters used in the Parquet schema for type 
 annotation. When we run an aggregation query, we obtain attribute names 
 such as MAX(a).
 If we directly store the generated DataFrame as a Parquet file, this causes a 
 failure when reading and parsing the stored schema string.
 Several methods can be adopted to solve this. This PR uses the simplest one: 
 just replace attribute names before generating the Parquet schema based on these 
 attributes.
 Another possible method would be modifying all aggregation expression names 
 from func(column) to func[column].
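 Until this is fixed, a user-side workaround is to alias aggregate columns to 
 Parquet-safe names before saving - a sketch, assuming a DataFrame {{df}} with 
 columns {{key}} and {{a}}:
 {code}
 import org.apache.spark.sql.functions.max

 // Give the aggregate an explicit, Parquet-safe name instead of "MAX(a)".
 val aggregated = df.groupBy("key").agg(max("a").as("max_a"))
 aggregated.saveAsParquetFile("/tmp/agg_output")
 {code}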



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema

2015-03-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6607:
--
 Target Version/s: 1.4.0
Affects Version/s: 1.1.1
   1.2.1
   1.3.0

 Aggregation attribute name including special chars '(' and ')' should be 
 replaced before generating Parquet schema
 --

 Key: SPARK-6607
 URL: https://issues.apache.org/jira/browse/SPARK-6607
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.1, 1.2.1, 1.3.0
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh

 '(' and ')' are special characters used in Parquet schema for type 
 annotation. When we run an aggregation query, we will obtain attribute name 
 such as MAX(a).
 If we directly store the generated DataFrame as Parquet file, it causes 
 failure when reading and parsing the stored schema string.
 Several methods can be adopted to solve this. This pr uses a simplest one to 
 just replace attribute names before generating Parquet schema based on these 
 attributes.
 Another possible method might be modifying all aggregation expression names 
 from func(column) to func[column].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6608) Make DataFrame.rdd a lazy val

2015-03-30 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6608:
-

 Summary: Make DataFrame.rdd a lazy val
 Key: SPARK-6608
 URL: https://issues.apache.org/jira/browse/SPARK-6608
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Priority: Minor


Before 1.3.0, {{SchemaRDD.id}} worked as a unique identifier of each 
{{SchemaRDD}}. In 1.3.0, unlike {{SchemaRDD}}, {{DataFrame}} is no longer an 
RDD, and {{DataFrame.rdd}} is actually a method that always returns a new RDD 
instance. Making {{DataFrame.rdd}} a {{lazy val}} should bring the unique 
identifier back.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6595) DataFrame self joins with MetastoreRelations fail

2015-03-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6595.
---
Resolution: Fixed

Fixed by https://github.com/apache/spark/pull/5251

 DataFrame self joins with MetastoreRelations fail
 -

 Key: SPARK-6595
 URL: https://issues.apache.org/jira/browse/SPARK-6595
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-03-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-4226:
--
Description: 
I have a test table defined in Hive as follows:
{code:sql}
CREATE TABLE sparkbug (
  id INT,
  event STRING
) STORED AS PARQUET;
{code}
and insert some sample data with ids 1, 2, 3.

In a Spark shell, I then create a HiveContext and then execute the following 
HQL to test out subquery predicates:
{code}
val hc = new HiveContext(sc)
hc.hql("select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3))")
{code}
I get the following error:
{noformat}
java.lang.RuntimeException: Unsupported language features in query: select 
customerid from sparkbug where customerid in (select customerid from sparkbug 
where customerid in (2,3))
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
sparkbug
  TOK_INSERT
TOK_DESTINATION
  TOK_DIR
TOK_TMP_FILE
TOK_SELECT
  TOK_SELEXPR
TOK_TABLE_OR_COL
  customerid
TOK_WHERE
  TOK_SUBQUERY_EXPR
TOK_SUBQUERY_OP
  in
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
sparkbug
  TOK_INSERT
TOK_DESTINATION
  TOK_DIR
TOK_TMP_FILE
TOK_SELECT
  TOK_SELEXPR
TOK_TABLE_OR_COL
  customerid
TOK_WHERE
  TOK_FUNCTION
in
TOK_TABLE_OR_COL
  customerid
2
3
TOK_TABLE_OR_COL
  customerid

scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
TOK_SUBQUERY_EXPR :
TOK_SUBQUERY_EXPR
  TOK_SUBQUERY_OP
in
  TOK_QUERY
TOK_FROM
  TOK_TABREF
TOK_TABNAME
  sparkbug
TOK_INSERT
  TOK_DESTINATION
TOK_DIR
  TOK_TMP_FILE
  TOK_SELECT
TOK_SELEXPR
  TOK_TABLE_OR_COL
customerid
  TOK_WHERE
TOK_FUNCTION
  in
  TOK_TABLE_OR_COL
customerid
  2
  3
  TOK_TABLE_OR_COL
customerid
 +
 
org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)

at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
{noformat}
[This 
thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
 also brings up the lack of subquery support in SparkSQL. It would be nice to have 
subquery predicate support in a near-future release (1.3, maybe?).
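Until subquery predicates are supported, such a query can usually be rewritten as 
a join - a sketch using LEFT SEMI JOIN, which HiveQL in Spark SQL does accept:
{code}
// Equivalent formulation of the IN (subquery) predicate as a semi join.
hc.sql("""
  SELECT s.customerid
  FROM sparkbug s
  LEFT SEMI JOIN (
    SELECT customerid FROM sparkbug WHERE customerid IN (2, 3)
  ) t ON s.customerid = t.customerid
""")
{code}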

  was:
I have a test table defined in Hive as follows:

CREATE TABLE sparkbug (
  id INT,
  event STRING
) STORED AS PARQUET;

and insert some sample data with ids 1, 2, 3.

In a Spark shell, I then create a HiveContext and then execute the following 
HQL to test out subquery predicates:

val hc = HiveContext(hc)
hc.hql(select customerid from sparkbug where customerid in (select customerid 
from sparkbug where customerid in (2,3)))

I get the following error:

java.lang.RuntimeException: Unsupported language features in query: select 
customerid from sparkbug where customerid in (select customerid from sparkbug 
where customerid in (2,3))
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
sparkbug
  TOK_INSERT
TOK_DESTINATION
  TOK_DIR
TOK_TMP_FILE
TOK_SELECT
  TOK_SELEXPR
TOK_TABLE_OR_COL
  customerid
TOK_WHERE
  TOK_SUBQUERY_EXPR
TOK_SUBQUERY_OP
  in
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
sparkbug
  TOK_INSERT
TOK_DESTINATION
  TOK_DIR
TOK_TMP_FILE
TOK_SELECT
  TOK_SELEXPR
TOK_TABLE_OR_COL
  customerid
TOK_WHERE
  TOK_FUNCTION
in
TOK_TABLE_OR_COL
  customerid
2
3
TOK_TABLE_OR_COL
  customerid

scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
TOK_SUBQUERY_EXPR :
TOK_SUBQUERY_EXPR
  TOK_SUBQUERY_OP
in
  TOK_QUERY
TOK_FROM
  TOK_TABREF
TOK_TABNAME
  sparkbug
TOK_INSERT
  TOK_DESTINATION
TOK_DIR
  TOK_TMP_FILE
  TOK_SELECT
TOK_SELEXPR
  TOK_TABLE_OR_COL
customerid
  TOK_WHERE
TOK_FUNCTION
  in
  TOK_TABLE_OR_COL

[jira] [Resolved] (SPARK-6369) InsertIntoHiveTable and Parquet Relation should use logic from SparkHadoopWriter

2015-03-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6369.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Issue resolved by pull request 5139
[https://github.com/apache/spark/pull/5139]

 InsertIntoHiveTable and Parquet Relation should use logic from 
 SparkHadoopWriter
 

 Key: SPARK-6369
 URL: https://issues.apache.org/jira/browse/SPARK-6369
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.3.1, 1.4.0


 Right now it is possible that we will corrupt the output if there is a race 
 between competing speculative tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6555) Override equals and hashCode in MetastoreRelation

2015-03-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-6555:
-

Assignee: Cheng Lian

 Override equals and hashCode in MetastoreRelation
 -

 Key: SPARK-6555
 URL: https://issues.apache.org/jira/browse/SPARK-6555
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 This is a follow-up of SPARK-6450.
 As explained in [this 
 comment|https://issues.apache.org/jira/browse/SPARK-6450?focusedCommentId=14379499page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14379499]
  of SPARK-6450, we resorted to a more surgical fix due to the upcoming 1.3.1 
 release. But overriding {{equals}} and {{hashCode}} is the proper fix to that 
 problem.
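 A rough sketch of the intended fix (the constructor fields shown are 
 illustrative, not the exact {{MetastoreRelation}} members):
 {code}
 // Two MetastoreRelation instances pointing at the same table should compare
 // equal, so plan equality (e.g. in self joins and caching) behaves as expected.
 class MetastoreRelation(val databaseName: String, val tableName: String,
     val alias: Option[String]) {

   override def equals(other: Any): Boolean = other match {
     case that: MetastoreRelation =>
       databaseName == that.databaseName &&
         tableName == that.tableName &&
         alias == that.alias
     case _ => false
   }

   override def hashCode(): Int =
     java.util.Objects.hash(databaseName, tableName, alias)
 }
 {code}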



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock

2015-03-31 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6618.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

 HiveMetastoreCatalog.lookupRelation should use fine-grained lock
 

 Key: SPARK-6618
 URL: https://issues.apache.org/jira/browse/SPARK-6618
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker
 Fix For: 1.3.1, 1.4.0


 Right now the entire HiveMetastoreCatalog.lookupRelation method holds a lock 
 (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173)
  and the scope of that lock covers resolving data source tables 
 (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93).
  So lookupRelation can be extremely expensive when we are doing costly 
 operations like Parquet schema discovery, and we should use a fine-grained lock 
 for lookupRelation instead.
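 An illustrative, self-contained sketch of the fine-grained locking pattern (the 
 names are hypothetical, not the actual HiveMetastoreCatalog code): hold the lock 
 only around cache access, and run the expensive resolution outside of it.
 {code}
 import scala.collection.mutable

 class RelationCache[K, V](resolve: K => V) {
   private val cache = mutable.Map.empty[K, V]

   def lookup(key: K): V = {
     val cached = cache.synchronized(cache.get(key)) // fine-grained lock: cache read only
     cached.getOrElse {
       val value = resolve(key)                      // expensive work, not under the lock
       cache.synchronized(cache.getOrElseUpdate(key, value))
     }
   }
 }
 {code}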



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6542) Add CreateStruct as an Expression

2015-03-31 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6542.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5195
[https://github.com/apache/spark/pull/5195]

 Add CreateStruct as an Expression
 -

 Key: SPARK-6542
 URL: https://issues.apache.org/jira/browse/SPARK-6542
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.4.0


 Similar to CreateArray, we can add CreateStruct as an Expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL

2015-04-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6644:
--
Description: 
In Hive, the schema of a partition may differ from the table schema. For 
example, we may add new columns to the table after importing existing 
partitions. When using {{spark-sql}} to query the data in a partition whose 
schema is different from the table schema, problems may arise. Part of them 
have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
However, after adding new column(s) to the table, when inserting data into old 
partitions, values of newly added columns are all {{NULL}}.

The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}

  was:
In hive,the schema of partition may be difference from the table schema. For 
example, we add new column. When we use spark-sql to query the data of 
partition which schema is difference from the table schema.
Some problems have been solved at PR4289 
(https://github.com/apache/spark/pull/4289), 
but if we add new column, and put new data into the old partition schema,new 
column value is NULL

[According to the following steps]:
--
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize((1 to 2).map(i = TestData(i, 
i.toString))).toDF()
  testData.registerTempTable(testData)

 sql(DROP TABLE IF EXISTS table_with_partition )
 sql(sCREATE  TABLE  IF NOT EXISTS  table_with_partition(key int,value string) 
PARTITIONED by (ds string) location '${tmpDir.toURI.toString}' )
 sql(INSERT OVERWRITE TABLE table_with_partition  partition (ds='1') SELECT 
key,value FROM testData)

// add column to table
 sql(ALTER TABLE table_with_partition ADD COLUMNS(key1 string))
 sql(ALTER TABLE table_with_partition ADD COLUMNS(destlng double)) 
 sql(INSERT OVERWRITE TABLE table_with_partition  partition (ds='1') SELECT 
key,value,'test',1.11 FROM testData)

 sql(select * from table_with_partition where ds='1' 
).collect().foreach(println)  
 
-
result: 
[1,1,null,null,1]
[2,2,null,null,1]
 
result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]

This bug will cause the wrong query number ,when we query : 

select  count(1)  from  table_with_partition  where   key1  is not NULL


 [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), 
 new column value is NULL
 --

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu

 In Hive, the schema of a partition may differ from the table schema. For 
 example, we may add new columns to the table after importing existing 
 partitions. When using {{spark-sql}} to query the data in a partition whose 
 schema is different from the table schema, problems may arise. Part of them 
 have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
 However, after adding new column(s) to the table, when inserting data into 
 old partitions, values of newly added columns are all {{NULL}}.
 The following snippet can be used to reproduce this issue:
 {code}
 case class TestData(key: Int, value: String)
 val testData = TestHive.sparkContext.parallelize((1 to 2).map(i = 
 TestData(i, i.toString))).toDF()
 testData.registerTempTable(testData)
 sql(DROP TABLE IF EXISTS table_with_partition )
 sql(sCREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) 
 PARTITIONED by (ds string) location '${tmpDir.toURI.toString}')
 sql(INSERT OVERWRITE TABLE table_with_partition 

[jira] [Updated] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL

2015-04-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6644:
--
Summary: After adding new columns to a partitioned table and inserting data 
to an old partition, data of newly added columns are all NULL  (was: 
[SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), 
new column value is NULL)

 After adding new columns to a partitioned table and inserting data to an old 
 partition, data of newly added columns are all NULL
 

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu

 In Hive, the schema of a partition may differ from the table schema. For 
 example, we may add new columns to the table after importing existing 
 partitions. When using {{spark-sql}} to query the data in a partition whose 
 schema is different from the table schema, problems may arise. Part of them 
 have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
 However, after adding new column(s) to the table, when inserting data into 
 old partitions, values of newly added columns are all {{NULL}}.
 The following snippet can be used to reproduce this issue:
 {code}
 case class TestData(key: Int, value: String)
 val testData = TestHive.sparkContext.parallelize((1 to 2).map(i = 
 TestData(i, i.toString))).toDF()
 testData.registerTempTable(testData)
 sql(DROP TABLE IF EXISTS table_with_partition )
 sql(sCREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) 
 PARTITIONED by (ds string) location '${tmpDir.toURI.toString}')
 sql(INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
 key, value FROM testData)
 // Add new columns to the table
 sql(ALTER TABLE table_with_partition ADD COLUMNS(key1 string))
 sql(ALTER TABLE table_with_partition ADD COLUMNS(destlng double)) 
 sql(INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
 key, value, 'test', 1.11 FROM testData)
 sql(SELECT * FROM table_with_partition WHERE ds = 
 '1').collect().foreach(println)
 {code}
 Actual result:
 {noformat}
 [1,1,null,null,1]
 [2,2,null,null,1]
 {noformat}
 Expected result:
 {noformat}
 [1,1,test,1.11,1]
 [2,2,test,1.11,1]
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL

2015-04-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6644:
--
Description: 
In Hive, the schema of a partition may differ from the table schema. For 
example, we may add new columns to the table after importing existing 
partitions. When using {{spark-sql}} to query the data in a partition whose 
schema is different from the table schema, problems may arise. Part of them 
have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
However, after adding new column(s) to the table, when inserting data into old 
partitions, values of newly added columns are all {{NULL}}.

The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)")
sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}

  was:
In Hive, the schema of a partition may differ from the table schema. For 
example, we may add new columns to the table after importing existing 
partitions. When using {{spark-sql}} to query the data in a partition whose 
schema is different from the table schema, problems may arise. Part of them 
have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
However, after adding new column(s) to the table, when inserting data into old 
partitions, values of newly added columns are all {{NULL}}.

The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize((1 to 2).map(i = TestData(i, 
i.toString))).toDF()
testData.registerTempTable(testData)

sql(DROP TABLE IF EXISTS table_with_partition )
sql(sCREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) 
PARTITIONED by (ds string) location '${tmpDir.toURI.toString}')
sql(INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
key, value FROM testData)

// Add new columns to the table
sql(ALTER TABLE table_with_partition ADD COLUMNS(key1 string))
sql(ALTER TABLE table_with_partition ADD COLUMNS(destlng double)) 
sql(INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
key, value, 'test', 1.11 FROM testData)

sql(SELECT * FROM table_with_partition WHERE ds = 
'1').collect().foreach(println)  
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}


 After adding new columns to a partitioned table and inserting data to an old 
 partition, data of newly added columns are all NULL
 

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu

 In Hive, the schema of a partition may differ from the table schema. For 
 example, we may add new columns to the table after importing existing 
 partitions. When using {{spark-sql}} to query the data in a partition whose 
 schema is different from the table schema, problems may arise. Part of them 
 have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
 However, after adding new column(s) to the table, when inserting data into 
 old partitions, values of newly added columns are all {{NULL}}.
 The following snippet can be used to reproduce this issue:
 {code}
 case class TestData(key: Int, value: String)
 val testData = TestHive.sparkContext.parallelize((1 to 2).map(i = 
 TestData(i, i.toString))).toDF()
 testData.registerTempTable(testData)
 sql(DROP TABLE IF EXISTS table_with_partition )
 sql(sCREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) 
 PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}')
 sql(INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
 key, value FROM testData)
 // Add new columns to the table
 

[jira] [Commented] (SPARK-6566) Update Spark to use the latest version of Parquet libraries

2015-03-27 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383627#comment-14383627
 ] 

Cheng Lian commented on SPARK-6566:
---

Hi [~k.shaposhni...@gmail.com], as described in SPARK-5463, we do want to 
upgrade Parquet. However, currently we have two concerns:
# The most recent Parquet RC release introduces subtle API incompatibilities 
related to filter push-down and Parquet metadata gathering, which I believe 
requires more work than the patch you provided if we want everything to work 
perfectly with the best performance.
# We'd like to wait for the official release of Parquet 1.6.0. This is the 
first release of Parquet as an Apache top-level project, so it is taking more 
time than usual.
We will probably first try to upgrade to the most recent 1.6.0 RC release in 
Spark master, and then switch to the official 1.6.0 release in Spark 1.4.0 (and 
Spark 1.3.2 if there is one).

 Update Spark to use the latest version of Parquet libraries
 ---

 Key: SPARK-6566
 URL: https://issues.apache.org/jira/browse/SPARK-6566
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Konstantin Shaposhnikov

 There are a lot of bug fixes in the latest version of parquet (1.6.0rc7). 
 E.g. PARQUET-136
 It would be good to update Spark to use the latest parquet version.
 The following changes are required:
 {code}
 diff --git a/pom.xml b/pom.xml
 index 5ad39a9..095b519 100644
 --- a/pom.xml
 +++ b/pom.xml
 @@ -132,7 +132,7 @@
  <!-- Version used for internal directory structure -->
  <hive.version.short>0.13.1</hive.version.short>
  <derby.version>10.10.1.1</derby.version>
 -<parquet.version>1.6.0rc3</parquet.version>
 +<parquet.version>1.6.0rc7</parquet.version>
  <jblas.version>1.2.3</jblas.version>
  <jetty.version>8.1.14.v20131031</jetty.version>
  <orbit.version>3.0.0.v201112011016</orbit.version>
 {code}
 and
 {code}
 --- 
 a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 +++ 
 b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 @@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
  globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
mergedMetadata, globalMetaData.getCreatedBy)
  
 -val readContext = getReadSupport(configuration).init(
 +val readContext = 
 ParquetInputFormat.getReadSupportInstance(configuration).init(
new InitContext(configuration,
  globalMetaData.getKeyValueMetaData,
  globalMetaData.getSchema))
 {code}
 I am happy to prepare a pull request if necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6565) Deprecate jsonRDD and replace it by jsonDataFrame / jsonDF

2015-03-27 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6565:
-

 Summary: Deprecate jsonRDD and replace it by jsonDataFrame / jsonDF
 Key: SPARK-6565
 URL: https://issues.apache.org/jira/browse/SPARK-6565
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Priority: Minor


Since 1.3.0, {{SQLContext.jsonRDD}} actually returns a {{DataFrame}}, so the 
original name has become confusing. It would be better to deprecate it and add 
{{jsonDataFrame}} or {{jsonDF}} instead.
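A sketch of the proposed API shape ({{jsonDF}} is the suggested new name and does 
not exist yet; shown here as extension-style helpers for illustration only):
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

class JsonReaderSketch(sqlContext: SQLContext) {

  // New, clearer name: the return type is a DataFrame, so the name should say so.
  def jsonDF(json: RDD[String]): DataFrame = sqlContext.jsonRDD(json)

  // The old name would then be kept but deprecated, forwarding to jsonDF.
  @deprecated("Use jsonDF instead.", "1.4.0")
  def jsonRDD(json: RDD[String]): DataFrame = jsonDF(json)
}
{code}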



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message

2015-03-29 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6587:
--
Description: 
(Don't know if this is a functionality bug, error reporting bug or an RFE ...)

I define the following hierarchy:

{code}
private abstract class MyHolder
private case class StringHolder(s: String) extends MyHolder
private case class IntHolder(i: Int) extends MyHolder
private case class BooleanHolder(b: Boolean) extends MyHolder
{code}

and a top level case class:

{code}
private case class Thing(key: Integer, foo: MyHolder)
{code}

When I try to convert it:

{code}
val things = Seq(
  Thing(1, IntHolder(42)),
  Thing(2, StringHolder("hello")),
  Thing(3, BooleanHolder(false))
)
val thingsDF = sc.parallelize(things, 4).toDF()

thingsDF.registerTempTable("things")

val all = sqlContext.sql("SELECT * from things")
{code}

I get the following stack trace:

{noformat}
Exception in thread main scala.MatchError: 
sql.CaseClassSchemaProblem.MyHolder (of class 
scala.reflect.internal.Types$ClassNoArgsTypeRef)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
at scala.collection.immutable.List.map(List.scala:276)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312)
at 
org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250)
at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35)
at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{noformat}

I wrote this to answer [a question on 
StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql]
 which uses a much simpler approach and suffers the same problem.

Looking at what seems to me to be the [relevant unit test 
suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala]
 I see that this case is not covered.  

  was:
(Don't know if this is a functionality bug, error reporting bug or an RFE ...)

I define the following hierarchy:

{code}
private abstract class MyHolder
private case class StringHolder(s: String) extends MyHolder
private case class IntHolder(i: Int) extends MyHolder
private case class BooleanHolder(b: Boolean) extends MyHolder
{code}

and a top level case class:

{code}
private case class Thing(key: Integer, foo: MyHolder)
{code}

When I try to convert it:

{code}
val things = Seq(
  Thing(1, IntHolder(42)),
  Thing(2, StringHolder(hello)),
  Thing(3, BooleanHolder(false))
)
val thingsDF = sc.parallelize(things, 4).toDF()

thingsDF.registerTempTable(things)

val all = sqlContext.sql(SELECT * from things)
{code}

I get the following stack trace:

{quote}
Exception in thread main scala.MatchError: 
sql.CaseClassSchemaProblem.MyHolder (of class 
scala.reflect.internal.Types$ClassNoArgsTypeRef)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
at scala.collection.immutable.List.map(List.scala:276)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
at 

[jira] [Commented] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message

2015-03-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385695#comment-14385695
 ] 

Cheng Lian commented on SPARK-6587:
---

This behavior is expected. There are two problems in your case:

# Because {{things}} contains instances of all three case classes, the type of 
{{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, 
it can't be recognized by {{ScalaReflection}}.

# You can only use a single concrete case class {{T}} when converting 
{{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out 
what data type the {{foo}} field in the reflected schema should have. A 
workaround is sketched below.
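A possible workaround (just a sketch, one option among several) is to flatten the 
hierarchy into a single concrete case class before converting to a DataFrame, 
e.g. a type tag plus a string payload:
{code}
// Hypothetical flattened row shape; field names are illustrative.
case class ThingRow(key: Int, holderType: String, holderValue: String)

val rows = things.map {
  case Thing(k, IntHolder(i))     => ThingRow(k, "int", i.toString)
  case Thing(k, StringHolder(s))  => ThingRow(k, "string", s)
  case Thing(k, BooleanHolder(b)) => ThingRow(k, "boolean", b.toString)
}

// toDF() still needs `import sqlContext.implicits._` in 1.3.0.
val rowsDF = sc.parallelize(rows, 4).toDF()
{code}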

 Inferring schema for case class hierarchy fails with mysterious message
 ---

 Key: SPARK-6587
 URL: https://issues.apache.org/jira/browse/SPARK-6587
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: At least Windows 8, Scala 2.11.2.  
Reporter: Spiro Michaylov

 (Don't know if this is a functionality bug, error reporting bug or an RFE ...)
 I define the following hierarchy:
 {code}
 private abstract class MyHolder
 private case class StringHolder(s: String) extends MyHolder
 private case class IntHolder(i: Int) extends MyHolder
 private case class BooleanHolder(b: Boolean) extends MyHolder
 {code}
 and a top level case class:
 {code}
 private case class Thing(key: Integer, foo: MyHolder)
 {code}
 When I try to convert it:
 {code}
 val things = Seq(
   Thing(1, IntHolder(42)),
   Thing(2, StringHolder(hello)),
   Thing(3, BooleanHolder(false))
 )
 val thingsDF = sc.parallelize(things, 4).toDF()
 thingsDF.registerTempTable(things)
 val all = sqlContext.sql(SELECT * from things)
 {code}
 I get the following stack trace:
 {noformat}
 Exception in thread main scala.MatchError: 
 sql.CaseClassSchemaProblem.MyHolder (of class 
 scala.reflect.internal.Types$ClassNoArgsTypeRef)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
   at scala.collection.immutable.List.map(List.scala:276)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312)
   at 
 org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250)
   at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35)
   at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 {noformat}
 I wrote this to answer [a question on 
 StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql]
  which uses a much simpler approach and suffers the same problem.
 Looking at what seems to me to be the [relevant unit test 
 suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala]
  I see that this case is not covered.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message

2015-03-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385695#comment-14385695
 ] 

Cheng Lian edited comment on SPARK-6587 at 3/29/15 10:32 AM:
-

This behavior is expected. There are two problems in your case:
# Because {{things}} contains instances of all three case classes, the type of 
{{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, 
it can't be recognized by {{ScalaReflection}}.
# You can only use a single concrete case class {{T}} when converting 
{{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out 
what data type the {{foo}} field in the reflected schema should have.


was (Author: lian cheng):
This behavior is expected. There are two problems in your case:

# Because {{things}} contains instances of all three case classes, the type of 
{{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, 
can't be recognized by {{ScalaReflection}}.

# You can only use a single concrete case class {{T}} when converting 
{{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out 
what data type should the {{foo}} field in the reflected schema have.

 Inferring schema for case class hierarchy fails with mysterious message
 ---

 Key: SPARK-6587
 URL: https://issues.apache.org/jira/browse/SPARK-6587
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: At least Windows 8, Scala 2.11.2.  
Reporter: Spiro Michaylov

 (Don't know if this is a functionality bug, error reporting bug or an RFE ...)
 I define the following hierarchy:
 {code}
 private abstract class MyHolder
 private case class StringHolder(s: String) extends MyHolder
 private case class IntHolder(i: Int) extends MyHolder
 private case class BooleanHolder(b: Boolean) extends MyHolder
 {code}
 and a top level case class:
 {code}
 private case class Thing(key: Integer, foo: MyHolder)
 {code}
 When I try to convert it:
 {code}
 val things = Seq(
   Thing(1, IntHolder(42)),
   Thing(2, StringHolder(hello)),
   Thing(3, BooleanHolder(false))
 )
 val thingsDF = sc.parallelize(things, 4).toDF()
 thingsDF.registerTempTable(things)
 val all = sqlContext.sql(SELECT * from things)
 {code}
 I get the following stack trace:
 {noformat}
 Exception in thread main scala.MatchError: 
 sql.CaseClassSchemaProblem.MyHolder (of class 
 scala.reflect.internal.Types$ClassNoArgsTypeRef)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
   at scala.collection.immutable.List.map(List.scala:276)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312)
   at 
 org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250)
   at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35)
   at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 {noformat}
 I wrote this to answer [a question on 
 StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql]
  which uses a much simpler approach and suffers the same problem.
 Looking at what seems to me to be the [relevant unit test 
 suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala]
  I see that this case is not covered.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Commented] (SPARK-6579) save as parquet with overwrite failed

2015-03-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385694#comment-14385694
 ] 

Cheng Lian commented on SPARK-6579:
---

Here's another Parquet issue with Hadoop 1.0.4: SPARK-6581.

 save as parquet with overwrite failed
 -

 Key: SPARK-6579
 URL: https://issues.apache.org/jira/browse/SPARK-6579
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Davies Liu
Assignee: Michael Armbrust
Priority: Critical

 {code}
 df = sc.parallelize(xrange(n), 4).map(lambda x: (x, str(x) * 2,)).toDF(['int', 'str'])
 df.save('test_data', source='parquet', mode='overwrite')
 df.save('test_data', source='parquet', mode='overwrite')
 {code}
 it failed with:
 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
 stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 
 (TID 6, localhost): java.lang.IllegalArgumentException: You cannot call 
 toBytes() more than once without calling reset()
   at parquet.Preconditions.checkArgument(Preconditions.java:47)
   at 
 parquet.column.values.rle.RunLengthBitPackingHybridEncoder.toBytes(RunLengthBitPackingHybridEncoder.java:254)
   at 
 parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.getBytes(RunLengthBitPackingHybridValuesWriter.java:68)
   at 
 parquet.column.impl.ColumnWriterImpl.writePage(ColumnWriterImpl.java:147)
   at parquet.column.impl.ColumnWriterImpl.flush(ColumnWriterImpl.java:236)
   at 
 parquet.column.impl.ColumnWriteStoreImpl.flush(ColumnWriteStoreImpl.java:113)
   at 
 parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
   at 
 parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
   at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:663)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1200)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1199)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1399)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1360)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 {code}
 run it again, it failed with:
 {code}
 15/03/27 13:26:16 WARN FSInputChecker: Problem opening checksum file: 
 file:/Users/davies/work/spark/tmp/test_data/_temporary/_attempt_201503271324_0011_r_03_0/part-r-4.parquet.
   Ignoring exception: java.io.EOFException
   at java.io.DataInputStream.readFully(DataInputStream.java:197)
   at java.io.DataInputStream.readFully(DataInputStream.java:169)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:134)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
   at 
 parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:402)
   at 
 

[jira] [Updated] (SPARK-6579) save as parquet with overwrite failed when linking with Hadoop 1.0.4

2015-03-29 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6579:
--
Summary: save as parquet with overwrite failed when linking with Hadoop 
1.0.4  (was: save as parquet with overwrite failed)

 save as parquet with overwrite failed when linking with Hadoop 1.0.4
 

 Key: SPARK-6579
 URL: https://issues.apache.org/jira/browse/SPARK-6579
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Davies Liu
Assignee: Michael Armbrust
Priority: Critical

 {code}
 df = sc.parallelize(xrange(n), 4).map(lambda x: (x, str(x) * 2,)).toDF(['int', 'str'])
 df.save('test_data', source='parquet', mode='overwrite')
 df.save('test_data', source='parquet', mode='overwrite')
 {code}
 it failed with:
 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
 stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 
 (TID 6, localhost): java.lang.IllegalArgumentException: You cannot call 
 toBytes() more than once without calling reset()
   at parquet.Preconditions.checkArgument(Preconditions.java:47)
   at 
 parquet.column.values.rle.RunLengthBitPackingHybridEncoder.toBytes(RunLengthBitPackingHybridEncoder.java:254)
   at 
 parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.getBytes(RunLengthBitPackingHybridValuesWriter.java:68)
   at 
 parquet.column.impl.ColumnWriterImpl.writePage(ColumnWriterImpl.java:147)
   at parquet.column.impl.ColumnWriterImpl.flush(ColumnWriterImpl.java:236)
   at 
 parquet.column.impl.ColumnWriteStoreImpl.flush(ColumnWriteStoreImpl.java:113)
   at 
 parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
   at 
 parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
   at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:663)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1200)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1199)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1399)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1360)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 {code}
 Running it again, it failed with:
 {code}
 15/03/27 13:26:16 WARN FSInputChecker: Problem opening checksum file: 
 file:/Users/davies/work/spark/tmp/test_data/_temporary/_attempt_201503271324_0011_r_03_0/part-r-4.parquet.
   Ignoring exception: java.io.EOFException
   at java.io.DataInputStream.readFully(DataInputStream.java:197)
   at java.io.DataInputStream.readFully(DataInputStream.java:169)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:134)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
   at 
 

[jira] [Updated] (SPARK-6581) Metadata is missing when saving parquet file using hadoop 1.0.4

2015-03-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6581:
--
Target Version/s: 1.4.0

 Metadata is missing when saving parquet file using hadoop 1.0.4
 ---

 Key: SPARK-6581
 URL: https://issues.apache.org/jira/browse/SPARK-6581
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: hadoop 1.0.4
Reporter: Pei-Lun Lee

 When saving a parquet file with {code}df.save("foo", "parquet"){code}
 it generates only _common_metadata while _metadata is missing:
 {noformat}
 -rwxrwxrwx  1 peilunlee  staff0 Mar 27 11:29 _SUCCESS*
 -rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
 -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
 {noformat}
 If saving with {code}df.save("foo", "parquet", SaveMode.Overwrite){code}, both 
 _metadata and _common_metadata are missing:
 {noformat}
 -rwxrwxrwx  1 peilunlee  staff0 Mar 27 11:29 _SUCCESS*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
 -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
 {noformat}
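 For reference, a minimal Scala sketch (using the standard Hadoop {{FileSystem}} API; 
 {{foo}} is the same placeholder output path as above) to check which summary files a 
 save actually produced:
 {code}
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, Path}

 // List the output directory and report whether the Parquet summary files exist.
 val outputPath = new Path("foo")
 val fs = FileSystem.get(outputPath.toUri, new Configuration())
 val names = fs.listStatus(outputPath).map(_.getPath.getName).toSet
 val hasMetadata = names.contains("_metadata")
 val hasCommonMetadata = names.contains("_common_metadata")
 println(s"_metadata: $hasMetadata, _common_metadata: $hasCommonMetadata")
 {code}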



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6570) Spark SQL arrays: explode() fails and cannot save array type to Parquet

2015-03-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6570:
--
Target Version/s: 1.4.0

 Spark SQL arrays: explode() fails and cannot save array type to Parquet
 -

 Key: SPARK-6570
 URL: https://issues.apache.org/jira/browse/SPARK-6570
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase

 {code}
 @Rule
 public TemporaryFolder tmp = new TemporaryFolder();

 @Test
 public void testPercentileWithExplode() throws Exception {
     StructType schema = DataTypes.createStructType(Lists.newArrayList(
             DataTypes.createStructField("col1", DataTypes.StringType, false),
             DataTypes.createStructField("col2s",
                     DataTypes.createArrayType(DataTypes.IntegerType, true), true)
     ));

     JavaRDD<Row> rowRDD = sc.parallelize(Lists.newArrayList(
             RowFactory.create("test", new int[]{1, 2, 3})
     ));

     DataFrame df = sql.createDataFrame(rowRDD, schema);
     df.registerTempTable("df");
     df.printSchema();

     List<int[]> ints = sql.sql("select col2s from df").javaRDD()
             .map(row -> (int[]) row.get(0)).collect();
     assertEquals(1, ints.size());
     assertArrayEquals(new int[]{1, 2, 3}, ints.get(0));

     // fails: lateral view explode does not work:
     // java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
     List<Integer> explodedInts = sql.sql("select col2 from df lateral view explode(col2s) splode as col2").javaRDD()
             .map(row -> row.getInt(0)).collect();
     assertEquals(3, explodedInts.size());
     assertEquals(Lists.newArrayList(1, 2, 3), explodedInts);

     // fails: java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
     df.saveAsParquetFile(tmp.getRoot().getAbsolutePath() + "/parquet");
     DataFrame loadedDf = sql.load(tmp.getRoot().getAbsolutePath() + "/parquet");
     loadedDf.registerTempTable("loadedDf");
     List<int[]> moreInts = sql.sql("select col2s from loadedDf").javaRDD()
             .map(row -> (int[]) row.get(0)).collect();
     assertEquals(1, moreInts.size());
     assertArrayEquals(new int[]{1, 2, 3}, moreInts.get(0));
 }
 {code}
 {code}
 root
  |-- col1: string (nullable = false)
  |-- col2s: array (nullable = true)
  |    |-- element: integer (containsNull = true)
 ERROR org.apache.spark.executor.Executor Exception in task 7.0 in stage 1.0 
 (TID 15)
 java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
   at 
 org.apache.spark.sql.catalyst.expressions.Explode.eval(generators.scala:125) 
 ~[spark-catalyst_2.10-1.3.0.jar:1.3.0]
   at 
 org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:70)
  ~[spark-sql_2.10-1.3.0.jar:1.3.0]
   at 
 org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:69)
  ~[spark-sql_2.10-1.3.0.jar:1.3.0]
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 ~[scala-library-2.10.4.jar:na]
 {code}
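 For comparison, a hedged Scala sketch (using a plain {{sqlContext}} in place of the 
 test's {{sql}} context) that builds the same DataFrame with a {{Seq}} instead of a 
 primitive {{int[]}}. Since the ClassCastException above shows Catalyst expecting a 
 {{scala.collection.Seq}} for array columns, this sidesteps the failing conversion 
 rather than fixing the underlying bug:
 {code}
 import org.apache.spark.sql.Row
 import org.apache.spark.sql.types._

 // Same schema as the Java test above, expressed in Scala.
 val schema = StructType(Seq(
   StructField("col1", StringType, nullable = false),
   StructField("col2s", ArrayType(IntegerType, containsNull = true), nullable = true)))

 // Passing a Scala Seq for the array column avoids handing Catalyst a raw int[].
 val rowRDD = sc.parallelize(Seq(Row("test", Seq(1, 2, 3))))
 val df = sqlContext.createDataFrame(rowRDD, schema)
 df.registerTempTable("df")
 {code}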



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6450) Self joining query failure

2015-03-22 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375343#comment-14375343
 ] 

Cheng Lian commented on SPARK-6450:
---

Ah, I see. Will handle this ASAP. [~chinnitv] Thanks for reporting this!

 Self joining query failure
 --

 Key: SPARK-6450
 URL: https://issues.apache.org/jira/browse/SPARK-6450
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Anand Mohan Tumuluri
Assignee: Cheng Lian
Priority: Blocker

 The query below was working fine until 1.3 commit 
 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd (yes, it definitely works at this 
 commit, although the commit itself is completely unrelated).
 It got broken in the 1.3.0 release with an AnalysisException: resolved attributes 
 ... missing from (although this list contains the very fields which it 
 reports as missing).
 {code}
 at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189)
   at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
   at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
   at 
 org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
   at 
 org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
   at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
   at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
   at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source)
   at 
 org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
   at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
   at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
   at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
   at 
 org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
   at 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 {code}
 select Orders.Country, Orders.ProductCategory,count(1) from Orders join 
 (select Orders.Country, count(1) CountryOrderCount from Orders where 
 to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country order by 
 CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = 
 Orders.Country where to_date(Orders.PlacedDate) > '2015-01-01' group by 
 Orders.Country,Orders.ProductCategory;
 {code}
 The temporary workaround is to add explicit alias for the table Orders
 {code}
 select o.Country, o.ProductCategory,count(1) from Orders o join (select 
 r.Country, count(1) CountryOrderCount from Orders r where 
 to_date(r.PlacedDate) > '2015-01-01' group by r.Country order by 
 CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = 
 o.Country where to_date(o.PlacedDate) > '2015-01-01' group by 
 o.Country,o.ProductCategory;
 {code}
 However, this change not only affects self joins; it also seems to affect 
 union queries as well. For example, the query below, which was again working 
 before (commit 9a151ce), got broken:
 {code}
 select Orders.Country,null,count(1) OrderCount from Orders group by 
 Orders.Country,null
 union all
 select null,Orders.ProductCategory,count(1) OrderCount from Orders group by 
 null, Orders.ProductCategory
 {code}
 also fails with an AnalysisException.
 The workaround is to add different aliases for the tables.
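 For illustration, a hedged sketch of the aliased form of the union query (written 
 against {{sqlContext.sql}}; table and column names are taken from the report above):
 {code}
 // Aliasing each occurrence of Orders works around the unresolved-attribute error.
 val workaround = sqlContext.sql("""
   select a.Country, null, count(1) OrderCount from Orders a group by a.Country, null
   union all
   select null, b.ProductCategory, count(1) OrderCount from Orders b group by null, b.ProductCategory
 """)
 {code}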



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (SPARK-4985) Parquet support for date type

2015-03-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-4985.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Issue resolved by pull request 3822
[https://github.com/apache/spark/pull/3822]

 Parquet support for date type
 -

 Key: SPARK-4985
 URL: https://issues.apache.org/jira/browse/SPARK-4985
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Adrian Wang
 Fix For: 1.3.1, 1.4.0


 Parquet serde support for DATE type



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4985) Parquet support for date type

2015-03-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-4985:
--
Assignee: Adrian Wang

 Parquet support for date type
 -

 Key: SPARK-4985
 URL: https://issues.apache.org/jira/browse/SPARK-4985
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Adrian Wang
Assignee: Adrian Wang
 Fix For: 1.3.1, 1.4.0


 Parquet serde support for DATE type



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4985) Parquet support for date type

2015-03-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-4985:
--
Target Version/s: 1.3.1, 1.4.0

 Parquet support for date type
 -

 Key: SPARK-4985
 URL: https://issues.apache.org/jira/browse/SPARK-4985
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Adrian Wang
Assignee: Adrian Wang
 Fix For: 1.3.1, 1.4.0


 Parquet serde support for DATE type



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6020) Flaky test: o.a.s.sql.columnar.PartitionBatchPruningSuite

2015-03-02 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344602#comment-14344602
 ] 

Cheng Lian commented on SPARK-6020:
---

Hey [~andrewor14], I think [PR #4835|https://github.com/apache/spark/pull/4835] 
has already fixed this issue. {{InMemoryColumnarTableScan}} uses accumulators to 
generate debugging information for testing purposes. Test failures related to 
this JIRA ticket showed that accumulator updates got lost nondeterministically, 
which was also exactly what PR #4835 fixed. Also, according to [your amazing 
statistics 
sheets|https://docs.google.com/spreadsheets/d/1VSCTXLBqnglk0XMd0R4IhvUPb2MEDQUaRNwxcHBtSlk/edit#gid=52877182],
 this test suite hadn't been flaky for a week.
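
For context, a minimal sketch of the accumulator pattern referred to above (the names 
are illustrative only, not the actual {{InMemoryColumnarTableScan}} internals):

{code}
// Each partition bumps a counter on the executors; the driver reads the total.
val readBatches = sc.accumulator(0, "readBatches")

sc.parallelize(1 to 100, 10).foreachPartition { _ =>
  readBatches += 1  // one "batch" per partition in this toy example
}

// If executor-side accumulator updates are dropped, this prints less than 10,
// the same kind of mismatch as the "8 did not equal 10" failure quoted below.
println(readBatches.value)
{code}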

 Flaky test: o.a.s.sql.columnar.PartitionBatchPruningSuite
 -

 Key: SPARK-6020
 URL: https://issues.apache.org/jira/browse/SPARK-6020
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.3.0
Reporter: Andrew Or
Assignee: Cheng Lian
Priority: Critical

 Observed in the following builds, only one of which has something to do with 
 SQL:
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27931/
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27930/
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27929/
 org.apache.spark.sql.columnar.PartitionBatchPruningSuite.SELECT key FROM 
 pruningData WHERE NOT (key IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30))
 {code}
 Error Message
 8 did not equal 10 Wrong number of read batches: == Parsed Logical Plan == 
 'Project ['key]  'Filter NOT 'key IN 
 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)
'UnresolvedRelation [pruningData], None  == Analyzed Logical Plan == 
 Project [key#5245]  Filter NOT key#5245 IN 
 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)
LogicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions 
 at ExistingRDD.scala:35  == Optimized Logical Plan == Project [key#5245]  
 Filter NOT key#5245 INSET 
 (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15)
InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, 
 false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] 
 at mapPartitions at ExistingRDD.scala:35), Some(pruningData)  == Physical 
 Plan == Filter NOT key#5245 INSET 
 (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15)
   InMemoryColumnarTableScan [key#5245], [NOT key#5245 INSET 
 (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15)],
  (InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, 
 false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] 
 at mapPartitions at ExistingRDD.scala:35), Some(pruningData))  Code 
 Generation: false == RDD ==
 Stacktrace
 sbt.ForkMain$ForkError: 8 did not equal 10 Wrong number of read batches: == 
 Parsed Logical Plan ==
 'Project ['key]
  'Filter NOT 'key IN 
 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)
   'UnresolvedRelation [pruningData], None
 == Analyzed Logical Plan ==
 Project [key#5245]
  Filter NOT key#5245 IN 
 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)
   LogicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions 
 at ExistingRDD.scala:35
 == Optimized Logical Plan ==
 Project [key#5245]
  Filter NOT key#5245 INSET 
 (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15)
   InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, 
 false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] 
 at mapPartitions at ExistingRDD.scala:35), Some(pruningData)
 == Physical Plan ==
 Filter NOT key#5245 INSET 
 (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15)
  InMemoryColumnarTableScan [key#5245], [NOT key#5245 INSET 
 (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15)],
  (InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, 
 false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] 
 at mapPartitions at ExistingRDD.scala:35), Some(pruningData))
 Code Generation: false
 == RDD ==
   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
   at 
 org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
   at 
 

[jira] [Created] (SPARK-6136) Docker client library introduces Guava 17.0, which causes runtime binary incompatibilities

2015-03-03 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6136:
-

 Summary: Docker client library introduces Guava 17.0, which causes 
runtime binary incompatibilities
 Key: SPARK-6136
 URL: https://issues.apache.org/jira/browse/SPARK-6136
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian


Integration test suites in the JDBC data sources ({{MySQLIntegration}} and 
{{PostgresIntegration}}) depend on docker-client 2.7.5, which transitively 
depends on Guava 17.0. Unfortunately, Guava 17.0 causes runtime binary 
incompatibility issues.
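
One possible mitigation (a hedged, unverified sketch, assuming the artifact in question 
is Spotify's {{com.spotify:docker-client}}) would be to exclude Guava from the test 
dependency so that Spark's own Guava version stays on the classpath:

{code}
// SBT-style exclusion; the Maven build would use an equivalent <exclusion> element.
libraryDependencies +=
  "com.spotify" % "docker-client" % "2.7.5" % "test" exclude("com.google.guava", "guava")
{code}

The failure observed when running the Parquet suites: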
{code}
$ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-0.12.0,scala-2.10 
-Dhadoop.version=2.4.1
...
> sql/test-only *.ParquetDataSourceOffIOSuite
...
[info] ParquetDataSourceOffIOSuite:
[info] Exception encountered when attempting to run a suite with class name: 
org.apache.spark.sql.parquet.ParquetDataSourceOffIOSuite *** ABORTED *** (134 
milliseconds)
[info]   java.lang.IllegalAccessError: tried to access method 
com.google.common.base.Stopwatch.<init>()V from class 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat
[info]   at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:261)
[info]   at 
parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:277)
[info]   at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:437)
[info]   at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]   at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]   at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1525)
[info]   at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
[info]   at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83)
[info]   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:797)
[info]   at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:94)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:92)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$class.withTempPath(ParquetTest.scala:71)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.withTempPath(ParquetIOSuite.scala:67)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$class.withParquetFile(ParquetTest.scala:92)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetFile(ParquetIOSuite.scala:67)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$class.withParquetDataFrame(ParquetTest.scala:105)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetDataFrame(ParquetIOSuite.scala:67)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.checkParquetFile(ParquetIOSuite.scala:76)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply$mcV$sp(ParquetIOSuite.scala:83)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply(ParquetIOSuite.scala:79)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply(ParquetIOSuite.scala:79)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at 

[jira] [Updated] (SPARK-5707) Enabling spark.sql.codegen throws ClassNotFound exception

2015-03-03 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-5707:
--
Description: 
Exception thrown:
{noformat}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in 
stage 133.0 failed 4 times, most recent failure: Lost task 13.3 in stage 133.0 
(TID 3066, cdh52-node2): java.io.IOException: 
com.esotericsoftware.kryo.KryoException: Unable to find class: 
__wrapper$1$81257352e1c844aebf09cb84fe9e7459.__wrapper$1$81257352e1c844aebf09cb84fe9e7459$SpecificRow$1
Serialization trace:
hashTable (org.apache.spark.sql.execution.joins.UniqueKeyHashedRelation)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:62)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:61)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}
SQL:
{code:sql}
INSERT INTO TABLE ${hiveconf:TEMP_TABLE}
SELECT
  s_store_name,
  pr_review_date,
  pr_review_content
FROM (
  --select store_name for stores with flat or declining sales in 3 consecutive 
months.
  SELECT s_store_name
  FROM store s
  JOIN (
-- linear regression part
SELECT
  temp.cat AS cat,
  --SUM(temp.x)as sumX,
  --SUM(temp.y)as sumY,
  --SUM(temp.xy)as sumXY,
  --SUM(temp.xx)as sumXSquared,
  --count(temp.x) as N,
  --N * sumXY - sumX * sumY AS numerator,
  --N * sumXSquared - sumX*sumX AS denom

[jira] [Updated] (SPARK-5707) Enabling spark.sql.codegen throws ClassNotFound exception

2015-03-03 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-5707:
--
Description: 
Exception thrown:
{noformat}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in 
stage 133.0 failed 4 times, most recent failure: Lost task 13.3 in stage 133.0 
(TID 3066, cdh52-node2): java.io.IOException: 
com.esotericsoftware.kryo.KryoException: Unable to find class: 
__wrapper$1$81257352e1c844aebf09cb84fe9e7459.__wrapper$1$81257352e1c844aebf09cb84fe9e7459$SpecificRow$1
Serialization trace:
hashTable (org.apache.spark.sql.execution.joins.UniqueKeyHashedRelation)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:62)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:61)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}
SQL:
{code:sql}
INSERT INTO TABLE ${hiveconf:TEMP_TABLE}
SELECT
  s_store_name,
  pr_review_date,
  pr_review_content
FROM (
  --select store_name for stores with flat or declining sales in 3 consecutive 
months.
  SELECT s_store_name
  FROM store s
  JOIN (
-- linear regression part
SELECT
  temp.cat AS cat,
  --SUM(temp.x)as sumX,
  --SUM(temp.y)as sumY,
  --SUM(temp.xy)as sumXY,
  --SUM(temp.xx)as sumXSquared,
  --count(temp.x) as N,
  --N * sumXY - sumX * sumY AS numerator,
  --N * sumXSquared - sumX*sumX AS 

[jira] [Updated] (SPARK-6136) Docker client library introduces Guava 17.0, which causes runtime binary incompatibilities

2015-03-03 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6136:
--
Description: 
Integration test suites in the JDBC data source ({{MySQLIntegration}} and 
{{PostgresIntegration}}) depend on docker-client 2.7.5, which transitively 
depends on Guava 17.0. Unfortunately, Guava 17.0 is causing runtime binary 
incompatibility issues when Spark is compiled against Hadoop 2.4.
{code}
$ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-0.12.0,scala-2.10 
-Dhadoop.version=2.4.1
...
> sql/test-only *.ParquetDataSourceOffIOSuite
...
[info] ParquetDataSourceOffIOSuite:
[info] Exception encountered when attempting to run a suite with class name: 
org.apache.spark.sql.parquet.ParquetDataSourceOffIOSuite *** ABORTED *** (134 
milliseconds)
[info]   java.lang.IllegalAccessError: tried to access method 
com.google.common.base.Stopwatch.<init>()V from class 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat
[info]   at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:261)
[info]   at 
parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:277)
[info]   at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:437)
[info]   at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]   at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]   at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1525)
[info]   at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
[info]   at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83)
[info]   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:797)
[info]   at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:94)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:92)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$class.withTempPath(ParquetTest.scala:71)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.withTempPath(ParquetIOSuite.scala:67)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$class.withParquetFile(ParquetTest.scala:92)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetFile(ParquetIOSuite.scala:67)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$class.withParquetDataFrame(ParquetTest.scala:105)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetDataFrame(ParquetIOSuite.scala:67)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.checkParquetFile(ParquetIOSuite.scala:76)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply$mcV$sp(ParquetIOSuite.scala:83)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply(ParquetIOSuite.scala:79)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply(ParquetIOSuite.scala:79)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at 

[jira] [Resolved] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table

2015-02-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-5775.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

 GenericRow cannot be cast to SpecificMutableRow when nested data and 
 partitioned table
 --

 Key: SPARK-5775
 URL: https://issues.apache.org/jira/browse/SPARK-5775
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Ayoub Benali
Assignee: Cheng Lian
Priority: Blocker
  Labels: hivecontext, nested, parquet, partition
 Fix For: 1.3.0


 Using the LOAD SQL command in a Hive context to load parquet files into a 
 partitioned table causes exceptions at query time. 
 The bug requires the table to have a column of *type Array of struct* and to 
 be *partitioned*. 
 The example below shows how to reproduce the bug; as you can see, if the 
 table is not partitioned the query works fine. 
 {noformat}
 scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
 scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
 scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
 scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
 scala> schemaRDD.printSchema
 root
  |-- data_array: array (nullable = true)
  |    |-- element: struct (containsNull = false)
  |    |    |-- field1: integer (nullable = true)
  |    |    |-- field2: integer (nullable = true)
 scala> hiveContext.sql("create external table if not exists 
 partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) 
 Partitioned by (date STRING) STORED AS PARQUET Location 
 'hdfs:///partitioned_table'")
 scala> hiveContext.sql("create external table if not exists 
 none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) 
 STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
 scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
 scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
 scala> hiveContext.sql("LOAD DATA INPATH 
 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE 
 partitioned_table PARTITION(date='2015-02-12')")
 scala> hiveContext.sql("LOAD DATA INPATH 
 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE 
 none_partitioned_table")
 scala> hiveContext.sql("select data.field1 from none_partitioned_table 
 LATERAL VIEW explode(data_array) nestedStuff AS data").collect
 res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
 scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL 
 VIEW explode(data_array) nestedStuff AS data").collect
 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from 
 partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
 15/02/12 16:21:03 INFO ParseDriver: Parse Completed
 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with 
 curMem=0, maxMem=280248975
 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in 
 memory (estimated size 254.6 KB, free 267.0 MB)
 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with 
 curMem=260661, maxMem=280248975
 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes 
 in memory (estimated size 27.9 KB, free 267.0 MB)
 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory 
 on *:51990 (size: 27.9 KB, free: 267.2 MB)
 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block 
 broadcast_18_piece0
 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD 
 at ParquetTableOperations.scala:119
 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side 
 Metadata Split Strategy
 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at 
 SparkPlan.scala:84
 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at 
 SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
 15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at 
 SparkPlan.scala:84)
 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at 
 map at SparkPlan.scala:84), which has no missing parents
 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with 
 curMem=289276, maxMem=280248975
 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in 
 memory (estimated size 7.5 KB, free 267.0 MB)
 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with 
 curMem=296908, maxMem=280248975
 15/02/12 16:21:03 INFO MemoryStore: Block 

[jira] [Resolved] (SPARK-6073) Need to refresh metastore cache after append data in CreateMetastoreDataSourceAsSelect

2015-03-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6073.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

 Need to refresh metastore cache after append data in 
 CreateMetastoreDataSourceAsSelect
 --

 Key: SPARK-6073
 URL: https://issues.apache.org/jira/browse/SPARK-6073
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Blocker
 Fix For: 1.3.0


 We should drop the metadata cache in CreateMetastoreDataSourceAsSelect after 
 we append data. Otherwise, users have to manually call 
 HiveContext.refreshTable to drop the cached metadata entry from the catalog.
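 For reference, the manual workaround mentioned above looks like this ({{my_table}} is a 
 placeholder table name):
 {code}
 // Invalidate the cached metadata for the table right after appending to it.
 hiveContext.refreshTable("my_table")
 {code}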



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6109) Unit tests fail when compiled against Hive 0.12.0

2015-03-02 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6109:
-

 Summary: Unit tests fail when compiled against Hive 0.12.0
 Key: SPARK-6109
 URL: https://issues.apache.org/jira/browse/SPARK-6109
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian


Currently, Jenkins doesn't run unit tests against Hive 0.12.0, and several Hive 
0.13.1-specific test cases always fail against Hive 0.12.0. We need to blacklist 
them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6052) In JSON schema inference, we should always set containsNull of an ArrayType to true

2015-03-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6052.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4806
[https://github.com/apache/spark/pull/4806]

 In JSON schema inference, we should always set containsNull of an ArrayType 
 to true
 ---

 Key: SPARK-6052
 URL: https://issues.apache.org/jira/browse/SPARK-6052
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Blocker
 Fix For: 1.3.0


 We should not try to figure out whether an array contains null or not, because we 
 may miss arrays with nulls if we sample the data, and future data may have nulls in 
 the array.
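 Illustration (a minimal sketch of the array type in question):
 {code}
 import org.apache.spark.sql.types._

 // JSON schema inference should always produce array types that allow null elements:
 val inferred = ArrayType(StringType, containsNull = true)
 {code}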



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6153) intellij import from maven cannot debug sparksqlclidriver

2015-03-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6153:
--
Assignee: Adrian Wang

 intellij import from maven cannot debug sparksqlclidriver
 -

 Key: SPARK-6153
 URL: https://issues.apache.org/jira/browse/SPARK-6153
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, SQL
Reporter: Adrian Wang
Assignee: Adrian Wang
Priority: Minor
 Fix For: 1.3.0


 The {{hive-thriftserver}} module depends on Guava indirectly via {{hive}} 
 module. However, the scope of Guava was explicitly set to {{provided}} in the 
 root POM. This makes developers not able to run the Spark SQL CLI tool within 
 IntelliJ IDEA for debugging purposes. Should promote Guava dependency scope 
 to {{runtime}} for {{hive-thriftserver}} module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6153) intellij import from maven cannot debug sparksqlclidriver

2015-03-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6153.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4884
[https://github.com/apache/spark/pull/4884]

 intellij import from maven cannot debug sparksqlclidriver
 -

 Key: SPARK-6153
 URL: https://issues.apache.org/jira/browse/SPARK-6153
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, SQL
Reporter: Adrian Wang
Priority: Minor
 Fix For: 1.3.0


 The {{hive-thriftserver}} module depends on Guava indirectly via {{hive}} 
 module. However, the scope of Guava was explicitly set to {{provided}} in the 
 root POM. This makes developers not able to run the Spark SQL CLI tool within 
 IntelliJ IDEA for debugging purposes. Should promote Guava dependency scope 
 to {{runtime}} for {{hive-thriftserver}} module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6153) intellij import from maven cannot debug sparksqlclidriver

2015-03-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6153:
--
Description: The {{hive-thriftserver}} module depends on Guava indirectly 
via {{hive}} module. However, the scope of Guava was explicitly set to 
{{provided}} in the root POM. This makes developers not able to run the Spark 
SQL CLI tool within IntelliJ IDEA for debugging purposes. Should promote Guava 
dependency scope to {{runtime}} for {{hive-thriftserver}} module.  (was: The 
{{hive-thriftserver}} module depends on Guava indirectly via {{hive]} module. 
However, the scope of Guava was explicitly set to {{provided}} in the root POM. 
This makes developers not able to run the Spark SQL CLI tool within IntelliJ 
IDEA for debugging purposes. Should promote Guava dependency scope to 
{{runtime}} for {{hive-thriftserver}} module.)

 intellij import from maven cannot debug sparksqlclidriver
 -

 Key: SPARK-6153
 URL: https://issues.apache.org/jira/browse/SPARK-6153
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, SQL
Reporter: Adrian Wang
Priority: Minor
 Fix For: 1.3.0


 The {{hive-thriftserver}} module depends on Guava indirectly via {{hive}} 
 module. However, the scope of Guava was explicitly set to {{provided}} in the 
 root POM. This makes developers not able to run the Spark SQL CLI tool within 
 IntelliJ IDEA for debugging purposes. Should promote Guava dependency scope 
 to {{runtime}} for {{hive-thriftserver}} module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6153) intellij import from maven cannot debug sparksqlclidriver

2015-03-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6153:
--
Description: The {{hive-thriftserver}} module depends on Guava indirectly 
via {{hive]} module. However, the scope of Guava was explicitly set to 
{{provided}} in the root POM. This makes developers not able to run the Spark 
SQL CLI tool within IntelliJ IDEA for debugging purposes. Should promote Guava 
dependency scope to {{runtime}} for {{hive-thriftserver}} module.  (was: we 
need to promote guava dependency to a proper level manually.)

 intellij import from maven cannot debug sparksqlclidriver
 -

 Key: SPARK-6153
 URL: https://issues.apache.org/jira/browse/SPARK-6153
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, SQL
Reporter: Adrian Wang
Priority: Minor

 The {{hive-thriftserver}} module depends on Guava indirectly via {{hive]} 
 module. However, the scope of Guava was explicitly set to {{provided}} in the 
 root POM. This makes developers not able to run the Spark SQL CLI tool within 
 IntelliJ IDEA for debugging purposes. Should promote Guava dependency scope 
 to {{runtime}} for {{hive-thriftserver}} module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6153) intellij import from maven cannot debug sparksqlclidriver

2015-03-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348407#comment-14348407
 ] 

Cheng Lian commented on SPARK-6153:
---

Updated JIRA description to reflect the actual change made in PR 4884.

 intellij import from maven cannot debug sparksqlclidriver
 -

 Key: SPARK-6153
 URL: https://issues.apache.org/jira/browse/SPARK-6153
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, SQL
Reporter: Adrian Wang
Assignee: Adrian Wang
Priority: Minor
 Fix For: 1.3.0


 The {{hive-thriftserver}} module depends on Guava indirectly via {{hive}} 
 module. However, the scope of Guava was explicitly set to {{provided}} in the 
 root POM. This makes developers not able to run the Spark SQL CLI tool within 
 IntelliJ IDEA for debugging purposes. Should promote Guava dependency scope 
 to {{runtime}} for {{hive-thriftserver}} module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6147) Move JDBC data source integration tests to the Spark integration tests project

2015-03-03 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6147:
-

 Summary: Move JDBC data source integration tests to the Spark 
integration tests project
 Key: SPARK-6147
 URL: https://issues.apache.org/jira/browse/SPARK-6147
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian


In [PR #4872|https://github.com/apache/spark/pull/4872], we removed JDBC 
integration tests from Spark because of Guava dependency hell. These two test 
suites should be moved to the [Spark integration tests 
project|github.com/databricks/spark-integration-tests], because that's where we 
do complex / time-consuming integration tests, and that project has been well 
dockerized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6149) Spark SQL CLI doesn't work when compiled against Hive 12 because of runtime incompatibility issues caused by Guava (15?)

2015-03-03 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6149:
-

 Summary: Spark SQL CLI doesn't work when compiled against Hive 12 
because of runtime incompatibility issues caused by Guava (15?) 
 Key: SPARK-6149
 URL: https://issues.apache.org/jira/browse/SPARK-6149
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker


The following description is based on [a recent master 
revision|https://github.com/apache/spark/tree/159b24a1e47e4fa8118e4b81049fbc7bc3406433].

{noformat}
$ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-thriftserver,hive-0.12.0,scala-2.10 
-Dhadoop.version=2.4.1 clean assembly/assembly
...
$ ./bin/spark-sql
...
spark-sql> CREATE TABLE hive_test(key INT, value STRING);
15/03/03 21:28:08 ERROR exec.DDLTask: 
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:602)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3661)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:252)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:308)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:280)
at 
org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:37)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)
at 
org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:134)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:117)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:92)
at 
org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1212)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.init(RetryingMetaStoreClient.java:62)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2372)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2383)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:596)
... 34 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 

[jira] [Commented] (SPARK-6149) Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of runtime incompatibility issues caused by Guava 15

2015-03-03 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346414#comment-14346414
 ] 

Cheng Lian commented on SPARK-6149:
---

As pointed out by [~pwendell], this is a Maven-vs-SBT issue. In Maven, Guava 15 is 
properly shaded and Guava 14.0.1 is used. However, SBT still chooses the 
highest version. Verified that the Maven build is not affected.
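
For reference, a minimal sketch of how an SBT build can pin Guava to work around this kind of "highest version wins" eviction, assuming sbt's standard {{dependencyOverrides}} mechanism. This is illustrative only and not the change applied to Spark's own build files:
{code}
// build.sbt fragment (illustrative sketch, not Spark's actual build definition):
// pin Guava so SBT's eviction does not promote a transitive Guava 15
// over the intended/shaded 14.0.1.
dependencyOverrides += "com.google.guava" % "guava" % "14.0.1"
{code}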

 Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of 
 runtime incompatibility issues caused by Guava 15
 --

 Key: SPARK-6149
 URL: https://issues.apache.org/jira/browse/SPARK-6149
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 The following description is based on [a recent master 
 revision|https://github.com/apache/spark/tree/159b24a1e47e4fa8118e4b81049fbc7bc3406433].
 {noformat}
 $ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-thriftserver,hive-0.12.0,scala-2.10 
 -Dhadoop.version=2.4.1 clean assembly/assembly
 ...
 $ ./bin/spark-sql
 ...
 spark-sql CREATE TABLE hive_test(key INT, value STRING);
 15/03/03 21:28:08 ERROR exec.DDLTask: 
 org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
 Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:602)
 at 
 org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3661)
 at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:252)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
 at 
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
 at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
 at 
 org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:308)
 at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:280)
 at 
 org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:37)
 at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)
 at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)
 at 
 org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092)
 at org.apache.spark.sql.DataFrame.init(DataFrame.scala:134)
 at org.apache.spark.sql.DataFrame.init(DataFrame.scala:117)
 at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:92)
 at 
 org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
 at 
 org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
 at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
 at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.RuntimeException: Unable to instantiate 
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 at 
 org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1212)
 at 
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.init(RetryingMetaStoreClient.java:62)
 at 
 

[jira] [Updated] (SPARK-6149) Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of runtime incompatibility issues caused by Guava (15?)

2015-03-03 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6149:
--
Summary: Spark SQL CLI doesn't work when compiled against Hive 12 with SBT 
because of runtime incompatibility issues caused by Guava (15?)   (was: Spark 
SQL CLI doesn't work when compiled against Hive 12 because of runtime 
incompatibility issues caused by Guava (15?) )

 Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of 
 runtime incompatibility issues caused by Guava (15?) 
 --

 Key: SPARK-6149
 URL: https://issues.apache.org/jira/browse/SPARK-6149
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 The following description is based on [a recent master 
 revision|https://github.com/apache/spark/tree/159b24a1e47e4fa8118e4b81049fbc7bc3406433].
 {noformat}
 $ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-thriftserver,hive-0.12.0,scala-2.10 
 -Dhadoop.version=2.4.1 clean assembly/assembly
 ...
 $ ./bin/spark-sql
 ...
 spark-sql CREATE TABLE hive_test(key INT, value STRING);
 15/03/03 21:28:08 ERROR exec.DDLTask: 
 org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
 Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:602)
 at 
 org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3661)
 at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:252)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
 at 
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
 at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
 at 
 org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:308)
 at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:280)
 at 
 org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:37)
 at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)
 at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)
 at 
 org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092)
 at org.apache.spark.sql.DataFrame.init(DataFrame.scala:134)
 at org.apache.spark.sql.DataFrame.init(DataFrame.scala:117)
 at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:92)
 at 
 org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
 at 
 org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
 at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
 at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.RuntimeException: Unable to instantiate 
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 at 
 org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1212)
 at 
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.init(RetryingMetaStoreClient.java:62)
 at 
 

[jira] [Updated] (SPARK-6149) Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of runtime incompatibility issues caused by Guava 15

2015-03-03 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6149:
--
Summary: Spark SQL CLI doesn't work when compiled against Hive 12 with SBT 
because of runtime incompatibility issues caused by Guava 15  (was: Spark SQL 
CLI doesn't work when compiled against Hive 12 with SBT because of runtime 
incompatibility issues caused by Guava (15?) )

 Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of 
 runtime incompatibility issues caused by Guava 15
 --

 Key: SPARK-6149
 URL: https://issues.apache.org/jira/browse/SPARK-6149
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 The following description is based on [a recent master 
 revision|https://github.com/apache/spark/tree/159b24a1e47e4fa8118e4b81049fbc7bc3406433].
 {noformat}
 $ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-thriftserver,hive-0.12.0,scala-2.10 
 -Dhadoop.version=2.4.1 clean assembly/assembly
 ...
 $ ./bin/spark-sql
 ...
 spark-sql CREATE TABLE hive_test(key INT, value STRING);
 15/03/03 21:28:08 ERROR exec.DDLTask: 
 org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
 Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:602)
 at 
 org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3661)
 at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:252)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
 at 
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
 at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
 at 
 org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:308)
 at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:280)
 at 
 org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:37)
 at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)
 at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)
 at 
 org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092)
 at org.apache.spark.sql.DataFrame.init(DataFrame.scala:134)
 at org.apache.spark.sql.DataFrame.init(DataFrame.scala:117)
 at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:92)
 at 
 org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
 at 
 org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
 at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
 at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.RuntimeException: Unable to instantiate 
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 at 
 org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1212)
 at 
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.init(RetryingMetaStoreClient.java:62)
 at 
 

[jira] [Resolved] (SPARK-6136) Docker client library introduces Guava 17.0, which causes runtime binary incompatibilities

2015-03-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6136.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

 Docker client library introduces Guava 17.0, which causes runtime binary 
 incompatibilities
 --

 Key: SPARK-6136
 URL: https://issues.apache.org/jira/browse/SPARK-6136
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.3.0


 Integration test suites in the JDBC data source ({{MySQLIntegration}} and 
 {{PostgresIntegration}}) depend on docker-client 2.7.5, which transitively 
 depends on Guava 17.0. Unfortunately, Guava 17.0 is causing runtime binary 
 incompatibility issues when Spark is compiled against Hadoop 2.4.
 {code}
 $ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-0.12.0,scala-2.10 
 -Dhadoop.version=2.4.1
 ...
  sql/test-only *.ParquetDataSourceOffIOSuite
 ...
 [info] ParquetDataSourceOffIOSuite:
 [info] Exception encountered when attempting to run a suite with class name: 
 org.apache.spark.sql.parquet.ParquetDataSourceOffIOSuite *** ABORTED *** (134 
 milliseconds)
 [info]   java.lang.IllegalAccessError: tried to access method 
 com.google.common.base.Stopwatch.<init>()V from class 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat
 [info]   at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:261)
 [info]   at 
 parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:277)
 [info]   at 
 org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:437)
 [info]   at 
 org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
 [info]   at 
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 [info]   at 
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 [info]   at scala.Option.getOrElse(Option.scala:120)
 [info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 [info]   at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 [info]   at 
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 [info]   at 
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 [info]   at scala.Option.getOrElse(Option.scala:120)
 [info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 [info]   at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 [info]   at 
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 [info]   at 
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 [info]   at scala.Option.getOrElse(Option.scala:120)
 [info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1525)
 [info]   at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
 [info]   at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83)
 [info]   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:797)
 [info]   at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115)
 [info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:94)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:92)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$class.withTempPath(ParquetTest.scala:71)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetIOSuiteBase.withTempPath(ParquetIOSuite.scala:67)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$class.withParquetFile(ParquetTest.scala:92)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetFile(ParquetIOSuite.scala:67)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetTest$class.withParquetDataFrame(ParquetTest.scala:105)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetDataFrame(ParquetIOSuite.scala:67)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetIOSuiteBase.checkParquetFile(ParquetIOSuite.scala:76)
 [info]   at 
 org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply$mcV$sp(ParquetIOSuite.scala:83)
 

[jira] [Resolved] (SPARK-6134) Fix wrong datatype for casting FloatType and default LongType value in defaultPrimitive

2015-03-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6134.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4870
[https://github.com/apache/spark/pull/4870]

 Fix wrong datatype for casting FloatType and default LongType value in 
 defaultPrimitive
 ---

 Key: SPARK-6134
 URL: https://issues.apache.org/jira/browse/SPARK-6134
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh
 Fix For: 1.3.0


 In CodeGenerator, the casting on FloatType should use FloatType instead of 
 IntegerType.
 Besides, defaultPrimitive for LongType should be -1L instead of 1L.
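
A simplified, hypothetical rendering of the two points above (not the actual CodeGenerator source), just to make the intended behaviour concrete; the string values for types other than LongType are assumptions for illustration:
{code}
import org.apache.spark.sql.types._

// Illustrative sketch only -- the real logic lives in Spark SQL's CodeGenerator.
// 1. A cast whose target type is FloatType must be generated as a float cast,
//    not an integer cast.
// 2. The "default primitive" placeholder for LongType is -1L, not 1L.
def defaultPrimitive(dataType: DataType): String = dataType match {
  case FloatType   => "-1.0f"
  case IntegerType => "-1"
  case LongType    => "-1L"   // previously generated as 1L
  case _           => "null"
}
{code}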



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5938) Generate row from json efficiently

2015-02-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-5938:
--
Assignee: Liang-Chi Hsieh

 Generate row from json efficiently
 --

 Key: SPARK-5938
 URL: https://issues.apache.org/jira/browse/SPARK-5938
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
Priority: Minor

 Generate row from json efficiently in JsonRDD object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6136) Docker client library introduces Guava 17.0, which causes runtime binary incompatibilities

2015-03-03 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6136:
--
Description: 
Integration test suites in the JDBC data source ({{MySQLIntegration}} and 
{{PostgresIntegration}}) depend on docker-client 2.7.5, which transitively 
depends on Guava 17.0. Unfortunately, Guava 17.0 is causing runtime binary 
incompatibility issues when Spark is compiled against Hadoop 2.4.
{code}
$ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-0.12.0,scala-2.10 
-Dhadoop.version=2.4.1
...
 sql/test-only *.ParquetDataSourceOffIOSuite
...
[info] ParquetDataSourceOffIOSuite:
[info] Exception encountered when attempting to run a suite with class name: 
org.apache.spark.sql.parquet.ParquetDataSourceOffIOSuite *** ABORTED *** (134 
milliseconds)
[info]   java.lang.IllegalAccessError: tried to access method 
com.google.common.base.Stopwatch.<init>()V from class 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat
[info]   at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:261)
[info]   at 
parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:277)
[info]   at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:437)
[info]   at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]   at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]   at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1525)
[info]   at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
[info]   at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83)
[info]   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:797)
[info]   at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:94)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:92)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$class.withTempPath(ParquetTest.scala:71)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.withTempPath(ParquetIOSuite.scala:67)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$class.withParquetFile(ParquetTest.scala:92)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetFile(ParquetIOSuite.scala:67)
[info]   at 
org.apache.spark.sql.parquet.ParquetTest$class.withParquetDataFrame(ParquetTest.scala:105)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetDataFrame(ParquetIOSuite.scala:67)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.checkParquetFile(ParquetIOSuite.scala:76)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply$mcV$sp(ParquetIOSuite.scala:83)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply(ParquetIOSuite.scala:79)
[info]   at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply(ParquetIOSuite.scala:79)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at 

[jira] [Created] (SPARK-5947) First class partitioning support in data sources API

2015-02-23 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-5947:
-

 Summary: First class partitioning support in data sources API
 Key: SPARK-5947
 URL: https://issues.apache.org/jira/browse/SPARK-5947
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Lian


For file system based data sources, implementing Hive style partitioning 
support can be complex and error prone. To be specific, partitioning support 
includes:

# Partition discovery: given a directory layout similar to Hive partitions, 
discover the directory structure and partitioning information automatically, 
including partition column names, data types, and values.
# Reading from partitioned tables
# Writing to partitioned tables

It would be good to have first class partitioning support in the data sources 
API. For example, add a {{FileBasedScan}} trait with callbacks and default 
implementations for these features.
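
As a rough illustration of item 1 (a hedged sketch, not the API being proposed), Hive style paths already encode the partition columns and values, so discovery boils down to parsing directory names:
{code}
// Hedged sketch: extract (partitionColumn -> value) pairs from a Hive style
// path such as /warehouse/table/pi=1/ps=foo/part-00000.parquet.
// Purely illustrative; inferring the value data types is left out.
def parsePartitionSpec(path: String): Seq[(String, String)] =
  path.split("/").toSeq.filter(_.contains("=")).map { segment =>
    val Array(column, value) = segment.split("=", 2)
    column -> value
  }

// parsePartitionSpec("/warehouse/table/pi=1/ps=foo/part-00000.parquet")
// res: Seq((pi,1), (ps,foo))
{code}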



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5948) Support writing to partitioned table for the Parquet data source

2015-02-23 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-5948:
-

 Summary: Support writing to partitioned table for the Parquet data 
source
 Key: SPARK-5948
 URL: https://issues.apache.org/jira/browse/SPARK-5948
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Lian


In 1.3.0, we added support for reading partitioned tables declared in the Hive 
metastore for the Parquet data source. However, writing to partitioned tables 
is not supported yet. This feature should probably be built upon SPARK-5947.
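
For illustration, a naive, hedged sketch of the behaviour being asked for, written against the 1.3 DataFrame API (this is not the design of the feature itself): each distinct value of the partition column is written into its own Hive style subdirectory.
{code}
import org.apache.spark.sql.DataFrame

// Naive illustrative sketch only -- not the proposed implementation.
// Writes each distinct value of `partitionCol` into its own Hive style
// subdirectory; for simplicity the partition column stays in the data files.
def writePartitioned(df: DataFrame, partitionCol: String, basePath: String): Unit = {
  df.select(partitionCol).distinct().collect().foreach { row =>
    val value = row.get(0)
    df.filter(df(partitionCol) === value)
      .saveAsParquetFile(s"$basePath/$partitionCol=$value")
  }
}
{code}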



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6010) Exception thrown when reading Spark SQL generated Parquet files with different but compatible schemas

2015-02-25 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6010:
-

 Summary: Exception thrown when reading Spark SQL generated Parquet 
files with different but compatible schemas
 Key: SPARK-6010
 URL: https://issues.apache.org/jira/browse/SPARK-6010
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker


The following test case added in {{ParquetPartitionDiscoverySuite}} can be used 
to reproduce this issue:
{code}
  test("read partitioned table - merging compatible schemas") {
    withTempDir { base =>
      makeParquetFile(
        (1 to 10).map(i => Tuple1(i)).toDF("intField"),
        makePartitionDir(base, defaultPartitionName, "pi" -> 1))

      makeParquetFile(
        (1 to 10).map(i => (i, i.toString)).toDF("intField", "stringField"),
        makePartitionDir(base, defaultPartitionName, "pi" -> 2))

      load(base.getCanonicalPath, "org.apache.spark.sql.parquet").registerTempTable("t")

      withTempTable("t") {
        checkAnswer(
          sql("SELECT * FROM t"),
          (1 to 10).map(i => Row(i, null, 1)) ++ (1 to 10).map(i => Row(i, i.toString, 2)))
      }
    }
  }
{code}
Exception thrown:
{code}
[info]   java.lang.RuntimeException: could not merge metadata: key 
org.apache.spark.sql.parquet.row.metadata has conflicting values: 
[{"type":"struct","fields":[{"name":"intField","type":"integer","nullable":false,"metadata":{}},{"name":"stringField","type":"string","nullable":true,"metadata":{}}]},
 {"type":"struct","fields":[{"name":"intField","type":"integer","nullable":false,"metadata":{}}]}]
[info]  at 
parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67)
[info]  at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84)
[info]  at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:484)
[info]  at 
parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245)
[info]  at 
org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461)
[info]  at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]  at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]  at scala.Option.getOrElse(Option.scala:120)
[info]  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]  at 
org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.getPartitions(NewHadoopRDD.scala:239)
[info]  at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]  at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]  at scala.Option.getOrElse(Option.scala:120)
[info]  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
[info]  at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]  at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]  at scala.Option.getOrElse(Option.scala:120)
[info]  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1518)
[info]  at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
[info]  at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83)
[info]  at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:790)
[info]  at 
org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115)
[info]  at 
org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60)
[info]  at 
org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18$$anonfun$apply$8.apply$mcV$sp(ParquetPartitionDiscoverySuite.scala:337)
[info]  at 
org.apache.spark.sql.parquet.ParquetTest$class.withTempTable(ParquetTest.scala:112)
[info]  at 
org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite.withTempTable(ParquetPartitionDiscoverySuite.scala:35)
[info]  at 
org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18.apply(ParquetPartitionDiscoverySuite.scala:336)
[info]  at 
org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18.apply(ParquetPartitionDiscoverySuite.scala:325)
[info]  at 
org.apache.spark.sql.parquet.ParquetTest$class.withTempDir(ParquetTest.scala:82)
[info]  at 
org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite.withTempDir(ParquetPartitionDiscoverySuite.scala:35)
[info]  at 
org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8.apply$mcV$sp(ParquetPartitionDiscoverySuite.scala:325)
[info]  at 

[jira] [Updated] (SPARK-5968) Parquet warning in spark-shell

2015-02-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-5968:
--
Description: 
This may happen in the case of schema evolving, namely appending new Parquet 
data with different but compatible schema to existing Parquet files:
{code}
15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file for 
rankings
parquet.io.ParquetEncodingException: 
file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet invalid: 
all the files must be contained in the root rankings
at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
at 
parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
at 
parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
{code}
The reason is that the Spark SQL schemas stored in Parquet key-value metadata 
differ. Parquet doesn't know how to merge this opaque user-defined metadata, so 
it just throws an exception and gives up writing summary files. Since the 
Parquet data source in Spark 1.3.0 supports schema merging, this is harmless. But 
it is kind of scary for the user. We should try to suppress this through the 
logger. 

  was:
{code}
15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file for 
rankings
parquet.io.ParquetEncodingException: 
file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet invalid: 
all the files must be contained in the root rankings
at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
at 
parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
at 
parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
{code}

it is only a warning, but kind of scary for the user.  We should try to 
suppress this through the logger. 


 Parquet warning in spark-shell
 --

 Key: SPARK-5968
 URL: https://issues.apache.org/jira/browse/SPARK-5968
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Critical

 This may happen in the case of schema evolving, namely appending new Parquet 
 data with different but compatible schema to existing Parquet files:
 {code}
 15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file 
 for rankings
 parquet.io.ParquetEncodingException: 
 file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet 
 invalid: all the files must be contained in the root rankings
 at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
 at 
 parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
 at 
 parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
 {code}
 The reason is that the Spark SQL schemas stored in Parquet key-value metadata 
 differ. Parquet doesn't know how to merge this opaque user-defined metadata, so 
 it just throws an exception and gives up writing summary files. Since the 
 Parquet data source in Spark 1.3.0 supports schema merging, this is harmless. 
 But it is kind of scary for the user. We should try to suppress this through 
 the logger. 
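
Until such suppression is in place, a minimal user-side sketch of the same idea, assuming the log4j 1.x API that Spark ships with; the logger name below is inferred from the stack trace above and is an assumption:
{code}
import org.apache.log4j.{Level, Logger}

// Minimal sketch: raise the level of the Parquet output committer's logger so
// the harmless "could not write summary file" warning is no longer printed.
Logger.getLogger("parquet.hadoop.ParquetOutputCommitter").setLevel(Level.ERROR)
{code}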



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces

2015-02-24 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335082#comment-14335082
 ] 

Cheng Lian commented on SPARK-3850:
---

Actually, that was a Scala source file in which we put query strings and generate 
MD5 hashes from the content of those query strings. But I agree that was a really 
rare case.
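
For context, a tiny illustrative sketch (not the actual test code) of why a trailing space matters when hashing query strings:
{code}
import java.security.MessageDigest

// Illustrative only: an MD5 digest over the raw query string is sensitive to
// trailing whitespace, so "SELECT 1" and "SELECT 1 " produce different hashes.
def md5(query: String): String =
  MessageDigest.getInstance("MD5")
    .digest(query.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString
{code}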

 Scala style: disallow trailing spaces
 -

 Key: SPARK-3850
 URL: https://issues.apache.org/jira/browse/SPARK-3850
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 Background discussions:
 * https://github.com/apache/spark/pull/2619
 * 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html
 If you look at [the PR Cheng 
 opened|https://github.com/apache/spark/pull/2619], you'll see a trailing 
 white space seemed to mess up some SQL test. That's what spurred the creation 
 of this issue.
 [Ted Yu on the dev 
 list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
  suggested using this 
 [{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6016) Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true

2015-02-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6016.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4775
[https://github.com/apache/spark/pull/4775]

 Cannot read the parquet table after overwriting the existing table when 
 spark.sql.parquet.cacheMetadata=true
 

 Key: SPARK-6016
 URL: https://issues.apache.org/jira/browse/SPARK-6016
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Blocker
 Fix For: 1.3.0


 saveAsTable is fine, and it seems we have successfully deleted the old data and 
 written the new data. However, when reading the newly created table, an error 
 will be thrown.
 {code}
 Error in SQL statement: java.lang.RuntimeException: 
 java.lang.RuntimeException: could not merge metadata: key 
 org.apache.spark.sql.parquet.row.metadata has conflicting values: 
 at 
 parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67)
   at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84)
   at 
 org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469)
   at 
 parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461)
   ...
 {code}
 If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the 
 data. 
 Note: the newly created table needs to have more than one file to trigger the 
 bug (if there is only a single file, we will not need to merge metadata). 
 To reproduce it, try...
 {code}
 import org.apache.spark.sql.SaveMode
 import sqlContext._
 sql("drop table if exists test")
 val df1 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i => s"""{"a":$i}"""), 2)) // we will save to 2 parquet files.
 df1.saveAsTable("test", "parquet", SaveMode.Overwrite)
 sql("select * from test").collect.foreach(println) // Warm the FilteringParquetRowInputFormat.footerCache
 val df2 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i => s"""{"b":$i}"""), 4)) // we will save to 4 parquet files.
 df2.saveAsTable("test", "parquet", SaveMode.Overwrite)
 sql("select * from test").collect.foreach(println)
 {code}
 For this example, we have two outdated footers for df1 in footerCache and 
 since we have four parquet files for the new test table, we picked up 2 new 
 footers for df2. Then, we hit the bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6023) ParquetConversions fails to replace the destination MetastoreRelation of an InsertIntoTable node to ParquetRelation2

2015-02-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6023.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4782
[https://github.com/apache/spark/pull/4782]

 ParquetConversions fails to replace the destination MetastoreRelation of an 
 InsertIntoTable node to ParquetRelation2
 

 Key: SPARK-6023
 URL: https://issues.apache.org/jira/browse/SPARK-6023
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Blocker
 Fix For: 1.3.0


 {code}
 import sqlContext._
 sql("drop table if exists test")
 val df1 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i => s"""{"a":$i}""")))
 df1.registerTempTable("jt")
 sql("create table test (a bigint) stored as parquet")
 sql("explain insert into table test select a from jt").collect.foreach(println)
 {code}
 The plan will be
 {code}
 [== Physical Plan ==]
 [InsertIntoHiveTable (MetastoreRelation default, test, None), Map(), false]
 [ PhysicalRDD [a#34L], MapPartitionsRDD[17] at map at JsonRDD.scala:41]
 {code}
 However, the write path should be converted to our own data source path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table

2015-02-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339010#comment-14339010
 ] 

Cheng Lian commented on SPARK-5775:
---

Hey [~avignon], sorry for the delay. I've left comments on the PR page. Thanks 
a lot for working on this!

 GenericRow cannot be cast to SpecificMutableRow when nested data and 
 partitioned table
 --

 Key: SPARK-5775
 URL: https://issues.apache.org/jira/browse/SPARK-5775
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Ayoub Benali
Assignee: Cheng Lian
Priority: Blocker
  Labels: hivecontext, nested, parquet, partition

 Using the LOAD sql command in Hive context to load parquet files into a 
 partitioned table causes exceptions during query time. 
 The bug requires the table to have a column of *type Array of struct* and to 
 be *partitioned*. 
 The example below shows how to reproduce the bug, and you can see that if the 
 table is not partitioned, the query works fine. 
 {noformat}
 scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
 scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
 scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
 scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
 scala> schemaRDD.printSchema
 root
  |-- data_array: array (nullable = true)
  |    |-- element: struct (containsNull = false)
  |    |    |-- field1: integer (nullable = true)
  |    |    |-- field2: integer (nullable = true)
 scala> hiveContext.sql("create external table if not exists partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) Partitioned by (date STRING) STORED AS PARQUET Location 'hdfs:///partitioned_table'")
 scala> hiveContext.sql("create external table if not exists none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
 scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
 scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
 scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE partitioned_table PARTITION(date='2015-02-12')")
 scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE none_partitioned_table")
 scala> hiveContext.sql("select data.field1 from none_partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
 res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
 scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from 
 partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
 15/02/12 16:21:03 INFO ParseDriver: Parse Completed
 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with 
 curMem=0, maxMem=280248975
 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in 
 memory (estimated size 254.6 KB, free 267.0 MB)
 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with 
 curMem=260661, maxMem=280248975
 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes 
 in memory (estimated size 27.9 KB, free 267.0 MB)
 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory 
 on *:51990 (size: 27.9 KB, free: 267.2 MB)
 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block 
 broadcast_18_piece0
 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD 
 at ParquetTableOperations.scala:119
 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side 
 Metadata Split Strategy
 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at 
 SparkPlan.scala:84
 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at 
 SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
 15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at 
 SparkPlan.scala:84)
 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at 
 map at SparkPlan.scala:84), which has no missing parents
 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with 
 curMem=289276, maxMem=280248975
 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in 
 memory (estimated size 7.5 KB, free 267.0 MB)
 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with 
 

[jira] [Resolved] (SPARK-5751) Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes times out

2015-02-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-5751.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4720
[https://github.com/apache/spark/pull/4720]

 Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes 
 times out
 --

 Key: SPARK-5751
 URL: https://issues.apache.org/jira/browse/SPARK-5751
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical
  Labels: flaky-test
 Fix For: 1.3.0


 The Test JDBC query execution test case times out occasionally, while all 
 other test cases are just fine. The failure output only contains the service 
 startup command line without any log output. My guess is that the test case 
 somehow misses the log file path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6037) Avoiding duplicate Parquet schema merging

2015-02-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6037:
--
Assignee: Liang-Chi Hsieh

 Avoiding duplicate Parquet schema merging
 -

 Key: SPARK-6037
 URL: https://issues.apache.org/jira/browse/SPARK-6037
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
Priority: Minor

 FilteringParquetRowInputFormat manually merges Parquet schemas before 
 computing splits. However, this is duplicated work because the schemas are 
 already merged in ParquetRelation2. We don't need to re-merge them in the 
 InputFormat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6037) Avoiding duplicate Parquet schema merging

2015-02-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6037.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4786
[https://github.com/apache/spark/pull/4786]

 Avoiding duplicate Parquet schema merging
 -

 Key: SPARK-6037
 URL: https://issues.apache.org/jira/browse/SPARK-6037
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
Priority: Minor
 Fix For: 1.3.0


 FilteringParquetRowInputFormat manually merges Parquet schemas before 
 computing splits. However, this is duplicated work because the schemas are 
 already merged in ParquetRelation2. We don't need to re-merge them in the 
 InputFormat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6572) When I build Spark 1.3 sbt gives me to following error : unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1

2015-03-27 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383999#comment-14383999
 ] 

Cheng Lian commented on SPARK-6572:
---

Would you please provide the exact command line you used to invoke SBT?

 When I build Spark 1.3 sbt gives me to following error: unresolved 
 dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found  
 org.scalamacros#quasiquotes_2.11;2.0.1: not found [error] Total time: 27 s, 
 completed 27-Mar-2015 14:24:39
 

 Key: SPARK-6572
 URL: https://issues.apache.org/jira/browse/SPARK-6572
 Project: Spark
  Issue Type: Bug
Reporter: Frank Domoney





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6575) Add configuration to disable schema merging while converting metastore Parquet tables

2015-03-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6575:
--
Description: 
Consider a metastore Parquet table that
# doesn't have schema evolution issue
# has lots of data files and/or partitions
In this case, driver schema merging can be both slow and unnecessary. It would be 
good to have a configuration to let the user disable schema merging when 
converting such a metastore Parquet table.

 Add configuration to disable schema merging while converting metastore 
 Parquet tables
 -

 Key: SPARK-6575
 URL: https://issues.apache.org/jira/browse/SPARK-6575
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 Consider a metastore Parquet table that
 # doesn't have schema evolution issue
 # has lots of data files and/or partitions
 In this case, driver schema merging can be both slow and unnecessary. It would 
 be good to have a configuration to let the user disable schema merging when 
 converting such a metastore Parquet table.
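
A hedged sketch of what using such a switch could look like from the user's side; the configuration key below is hypothetical and only illustrates the proposal, and {{sqlContext}} is assumed to be a HiveContext:
{code}
// Hypothetical configuration key -- named here purely for illustration.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")

// Queries against the converted metastore Parquet table would then skip the
// driver-side schema merge over all data files.
sqlContext.table("some_metastore_parquet_table").count()
{code}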



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6575) Add configuration to disable schema merging while converting metastore Parquet tables

2015-03-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6575:
--
Description: 
Consider a metastore Parquet table that
# doesn't have schema evolution issue
# has lots of data files and/or partitions

In this case, driver schema merging can be both slow and unnecessary. It would be 
good to have a configuration to let the user disable schema merging when 
converting such a metastore Parquet table.

  was:
Consider a metastore Parquet table that
# doesn't have schema evolution issue
# has lots of data files and/or partitions
In this case, driver schema merging can be both slow and unnecessary. It would be 
good to have a configuration to let the user disable schema merging when 
converting such a metastore Parquet table.


 Add configuration to disable schema merging while converting metastore 
 Parquet tables
 -

 Key: SPARK-6575
 URL: https://issues.apache.org/jira/browse/SPARK-6575
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 Consider a metastore Parquet table that
 # doesn't have schema evolution issue
 # has lots of data files and/or partitions
 In this case, driver schema merging can be both slow and unnecessary. It would 
 be good to have a configuration to let the user disable schema merging when 
 converting such a metastore Parquet table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6483) Spark SQL udf(ScalaUdf) is very slow

2015-03-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6483.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5154
[https://github.com/apache/spark/pull/5154]

 Spark SQL udf(ScalaUdf) is very slow
 

 Key: SPARK-6483
 URL: https://issues.apache.org/jira/browse/SPARK-6483
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
 Environment: 1. Spark version is 1.3.0 
 2. 3 node per 80G/20C 
 3. read 250G parquet files from hdfs 
Reporter: zzc
Assignee: zzc
 Fix For: 1.4.0


 Test case: 
 1. 
 register floor func with command: 
 sqlContext.udf.register("floor", (ts: Int) => ts - ts % 300), 
 then run with sql "select chan, floor(ts) as tt, sum(size) from qlogbase3 
 group by chan, floor(ts)", 
 *it takes 17 minutes.*
 {quote}
 == Physical Plan ==   
   
 Aggregate false, [chan#23015,PartialGroup#23500], 
 [chan#23015,PartialGroup#23500 AS tt#23494,CombineSum(PartialSum#23499L) AS 
 c2#23495L] 
  Exchange (HashPartitioning [chan#23015,PartialGroup#23500], 54) 
   Aggregate true, [chan#23015,scalaUDF(ts#23016)], 
 [chan#23015,*scalaUDF*(ts#23016) AS PartialGroup#23500,SUM(size#23023L) AS 
 PartialSum#23499L] 
PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[115] at 
 map at newParquet.scala:562 
 {quote}
 2. 
 run the SQL query "select chan, (ts - ts % 300) as tt, sum(size) from qlogbase3 
 group by chan, (ts - ts % 300)"; 
 *it takes only 5 minutes.*
 {quote}
 == Physical Plan == 
 Aggregate false, [chan#23015,PartialGroup#23349], 
 [chan#23015,PartialGroup#23349 AS tt#23343,CombineSum(PartialSum#23348L) AS 
 c2#23344L]   
  Exchange (HashPartitioning [chan#23015,PartialGroup#23349], 54)   
   Aggregate true, [chan#23015,(ts#23016 - (ts#23016 % 300))], 
 [chan#23015,*(ts#23016 - (ts#23016 % 300))* AS 
 PartialGroup#23349,SUM(size#23023L) AS PartialSum#23348L] 
PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[83] at map 
 at newParquet.scala:562 
 {quote}
 3. 
 use *HiveContext* with the SQL query "select chan, floor((ts - ts % 300)) as tt, 
 sum(size) from qlogbase3 group by chan, floor((ts - ts % 300))"; 
 *it takes only 5 minutes too.*
 {quote}
 == Physical Plan == 
 Aggregate false, [chan#23015,PartialGroup#23108L], 
 [chan#23015,PartialGroup#23108L AS tt#23102L,CombineSum(PartialSum#23107L) AS 
 _c2#23103L] 
  Exchange (HashPartitioning [chan#23015,PartialGroup#23108L], 54) 
   Aggregate true, 
 [chan#23015,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016
  - (ts#23016 % 300)))], 
 [chan#23015,*HiveGenericUdf*#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016
  - (ts#23016 % 300))) AS PartialGroup#23108L,SUM(size#23023L) AS 
 PartialSum#23107L] 
PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[28] at map 
 at newParquet.scala:562 
 {quote}
 *Why is ScalaUdf so slow, and how can it be improved?*
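
For reference, a minimal sketch of the three variants being compared, assuming a 
registered table named qlogbase3 with columns chan, ts and size, plus existing 
sqlContext and hiveContext instances (everything else is illustrative):

{code}
// 1. Scala UDF registered through sqlContext.udf (the slow path reported here).
sqlContext.udf.register("floor", (ts: Int) => ts - ts % 300)
sqlContext.sql(
  "SELECT chan, floor(ts) AS tt, SUM(size) FROM qlogbase3 GROUP BY chan, floor(ts)")

// 2. The same bucketing written as a plain arithmetic expression (reported ~3x faster).
sqlContext.sql(
  "SELECT chan, (ts - ts % 300) AS tt, SUM(size) FROM qlogbase3 " +
  "GROUP BY chan, (ts - ts % 300)")

// 3. Hive's built-in floor UDF through HiveContext (also reported ~3x faster).
hiveContext.sql(
  "SELECT chan, floor((ts - ts % 300)) AS tt, SUM(size) FROM qlogbase3 " +
  "GROUP BY chan, floor((ts - ts % 300))")
{code}

The physical plans quoted above can be reproduced for each variant with 
sqlContext.sql(...).queryExecution.executedPlan.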



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6483) Spark SQL udf(ScalaUdf) is very slow

2015-03-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6483:
--
Assignee: zzc

 Spark SQL udf(ScalaUdf) is very slow
 

 Key: SPARK-6483
 URL: https://issues.apache.org/jira/browse/SPARK-6483
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
 Environment: 1. Spark version is 1.3.0 
 2. 3 nodes, 80G memory / 20 cores each 
 3. reads 250G of Parquet files from HDFS 
Reporter: zzc
Assignee: zzc
 Fix For: 1.4.0


 Test case: 
 1. 
 register the floor function with: 
 sqlContext.udf.register("floor", (ts: Int) => ts - ts % 300), 
 then run the SQL query "select chan, floor(ts) as tt, sum(size) from qlogbase3 
 group by chan, floor(ts)"; 
 *it takes 17 minutes.*
 {quote}
 == Physical Plan ==   
   
 Aggregate false, [chan#23015,PartialGroup#23500], 
 [chan#23015,PartialGroup#23500 AS tt#23494,CombineSum(PartialSum#23499L) AS 
 c2#23495L] 
  Exchange (HashPartitioning [chan#23015,PartialGroup#23500], 54) 
   Aggregate true, [chan#23015,scalaUDF(ts#23016)], 
 [chan#23015,*scalaUDF*(ts#23016) AS PartialGroup#23500,SUM(size#23023L) AS 
 PartialSum#23499L] 
PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[115] at 
 map at newParquet.scala:562 
 {quote}
 2. 
 run the SQL query "select chan, (ts - ts % 300) as tt, sum(size) from qlogbase3 
 group by chan, (ts - ts % 300)"; 
 *it takes only 5 minutes.*
 {quote}
 == Physical Plan == 
 Aggregate false, [chan#23015,PartialGroup#23349], 
 [chan#23015,PartialGroup#23349 AS tt#23343,CombineSum(PartialSum#23348L) AS 
 c2#23344L]   
  Exchange (HashPartitioning [chan#23015,PartialGroup#23349], 54)   
   Aggregate true, [chan#23015,(ts#23016 - (ts#23016 % 300))], 
 [chan#23015,*(ts#23016 - (ts#23016 % 300))* AS 
 PartialGroup#23349,SUM(size#23023L) AS PartialSum#23348L] 
PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[83] at map 
 at newParquet.scala:562 
 {quote}
 3. 
 use *HiveContext* with the SQL query "select chan, floor((ts - ts % 300)) as tt, 
 sum(size) from qlogbase3 group by chan, floor((ts - ts % 300))"; 
 *it takes only 5 minutes too.*
 {quote}
 == Physical Plan == 
 Aggregate false, [chan#23015,PartialGroup#23108L], 
 [chan#23015,PartialGroup#23108L AS tt#23102L,CombineSum(PartialSum#23107L) AS 
 _c2#23103L] 
  Exchange (HashPartitioning [chan#23015,PartialGroup#23108L], 54) 
   Aggregate true, 
 [chan#23015,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016
  - (ts#23016 % 300)))], 
 [chan#23015,*HiveGenericUdf*#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016
  - (ts#23016 % 300))) AS PartialGroup#23108L,SUM(size#23023L) AS 
 PartialSum#23107L] 
PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[28] at map 
 at newParquet.scala:562 
 {quote}
 *Why is ScalaUdf so slow, and how can it be improved?*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6538) Add missing nullable Metastore fields when merging a Parquet schema

2015-03-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6538.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Issue resolved by pull request 5214
[https://github.com/apache/spark/pull/5214]

 Add missing nullable Metastore fields when merging a Parquet schema
 ---

 Key: SPARK-6538
 URL: https://issues.apache.org/jira/browse/SPARK-6538
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Adam Budde
 Fix For: 1.3.1, 1.4.0


 When Spark SQL infers a schema for a DataFrame, it will take the union of all 
 field types present in the structured source data (e.g. an RDD of JSON data). 
 When the source data for a row doesn't define a particular field on the 
 DataFrame's schema, a null value will simply be assumed for this field. This 
 workflow makes it very easy to construct tables and query over a set of 
 structured data with a nonuniform schema. However, this behavior is not 
 consistent in some cases when dealing with Parquet files and an external 
 table managed by an external Hive metastore.
 In our particular use case, we use Spark Streaming to parse and transform our 
 input data and then apply a window function to save an arbitrary-sized batch 
 of data as a Parquet file, which itself will be added as a partition to an 
 external Hive table via an ALTER TABLE... ADD PARTITION... statement. Since 
 our input data is nonuniform, it is expected that not every partition batch 
 will contain every field present in the table's schema obtained from the Hive 
 metastore. As such, we expect that the schema of some of our Parquet files 
 may not contain the same set of fields present in the full metastore schema.
 In such cases, it seems natural that Spark SQL would simply assume null 
 values for any missing fields in the partition's Parquet file, assuming these 
 fields are specified as nullable by the metastore schema. This is not the 
 case in the current implementation of ParquetRelation2. The 
 mergeMetastoreParquetSchema() method used to reconcile differences between a 
 Parquet file's schema and a schema retrieved from the Hive metastore will 
 raise an exception if the Parquet file doesn't match the same set of fields 
 specified by the metastore.
 I propose altering this implementation in order to allow for any missing 
 metastore fields marked as nullable to be merged in to the Parquet file's 
 schema before continuing with the checks present in 
 mergeMetastoreParquetSchema().
 Classifying this as a bug as it exposes inconsistent behavior, IMHO. If you 
 feel this should be an improvement or new feature instead, please feel free 
 to reclassify this issue.
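
An illustrative sketch of the proposed reconciliation step (the helper name and its 
placement are assumptions; the real change would live in ParquetRelation2's schema 
handling):

{code}
// Illustrative sketch only, not the actual patch: given the metastore schema and
// the schema read from one Parquet file, append any nullable metastore fields the
// file is missing before the existing reconciliation checks run.
import org.apache.spark.sql.types.{StructField, StructType}

def mergeMissingNullableFields(
    metastoreSchema: StructType,
    parquetSchema: StructType): StructType = {
  val parquetFieldNames = parquetSchema.fieldNames.toSet
  val missingNullable: Seq[StructField] = metastoreSchema.filter { field =>
    !parquetFieldNames.contains(field.name) && field.nullable
  }
  // Fields present in the file keep their Parquet-derived types; the appended
  // nullable fields are simply read as null for every row of this file.
  StructType(parquetSchema.fields ++ missingNullable)
}
{code}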



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6467) Override QueryPlan.missingInput when necessary and rely on it in CheckAnalysis

2015-03-23 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6467:
-

 Summary: Override QueryPlan.missingInput when necessary and rely 
on it in CheckAnalysis
 Key: SPARK-6467
 URL: https://issues.apache.org/jira/browse/SPARK-6467
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Priority: Minor


Currently, some LogicalPlans do not override missingInput, but they should. 
As a result, the lack of proper missingInput implementations leaks into CheckAnalysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput

2015-03-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6397:
--
Description: Virtual columns like GROUPING__ID should never be considered 
as missing input, and thus should be excluded from {{QueryPlan.missingInput}}. 
 (was: Currently, some LogicalPlans do not override missingInput, but they 
should. As a result, the lack of proper missingInput implementations leaks into 
CheckAnalysis.)

 Exclude virtual columns from QueryPlan.missingInput
 ---

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yadong Qi
Assignee: Yadong Qi
Priority: Minor

 Virtual columns like GROUPING__ID should never be considered as missing 
 input, and thus should be excluded from {{QueryPlan.missingInput}}.
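
A rough sketch of the intent, using plain Scala collections rather than Catalyst's 
actual AttributeSet machinery (all names below are illustrative):

{code}
// When computing missingInput, attributes whose names match known virtual columns
// (e.g. GROUPING__ID) should not be reported as missing, because they are produced
// by the grouping machinery itself rather than by the child plan.
case class Attr(name: String)

val virtualColumns = Set("GROUPING__ID")  // hypothetical list of virtual column names

def missingInput(references: Set[Attr], inputSet: Set[Attr]): Set[Attr] =
  (references -- inputSet).filterNot(a => virtualColumns.contains(a.name))

// GROUPING__ID is referenced but not part of the child's output, yet it must not
// trigger a "missing input" analysis error:
missingInput(Set(Attr("chan"), Attr("GROUPING__ID")), Set(Attr("chan")))  // => Set()
{code}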



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6456) Spark Sql throwing exception on large partitioned data

2015-03-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375481#comment-14375481
 ] 

Cheng Lian edited comment on SPARK-6456 at 3/23/15 7:21 AM:


How many partitions are there? Also, what's the version of the Hive metastore? 
For now, Spark SQL only supports Hive 0.12.0 and 0.13.1. Spark 1.1 and prior 
versions only support Hive 0.12.0.


was (Author: lian cheng):
How many partitions are there?

 Spark Sql throwing exception on large partitioned data
 --

 Key: SPARK-6456
 URL: https://issues.apache.org/jira/browse/SPARK-6456
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: pankaj
 Fix For: 1.2.1


 Spark connects to the Hive metastore. I am able to run simple queries like 
 show tables and select, but the exception below is thrown while running a 
 query on a Hive table that has a large number of partitions.
 {noformat}
 Exception in thread main java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
 at 
 org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
 org.apache.thrift.transport.TTransportException: 
 java.net.SocketTimeoutException: Read timed out
 at 
 org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785)
 at 
 org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316)
 at 
 org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86)
 at 
 org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
 at scala.Option.getOrElse(Option.scala:120)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
 at 
 org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput

2015-03-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6397:
--
Summary: Exclude virtual columns from QueryPlan.missingInput  (was: 
Override QueryPlan.missingInput when necessary and rely on CheckAnalysis)

 Exclude virtual columns from QueryPlan.missingInput
 ---

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yadong Qi
Assignee: Yadong Qi
Priority: Minor

 Currently, some LogicalPlans do not override missingInput, but they should. 
 As a result, the lack of proper missingInput implementations leaks into CheckAnalysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6397) Override QueryPlan.missingInput when necessary and rely on CheckAnalysis

2015-03-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6397:
--
Affects Version/s: 1.3.0

 Override QueryPlan.missingInput when necessary and rely on CheckAnalysis
 

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yadong Qi
Assignee: Yadong Qi
Priority: Minor

 Currently, some LogicalPlans do not override missingInput, but they should. 
 As a result, the lack of proper missingInput implementations leaks into CheckAnalysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6397) Override QueryPlan.missingInput when necessary and rely on CheckAnalysis

2015-03-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6397:
--
Assignee: Yadong Qi

 Override QueryPlan.missingInput when necessary and rely on CheckAnalysis
 

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yadong Qi
Assignee: Yadong Qi
Priority: Minor

 Currently, some LogicalPlans do not override missingInput, but they should. 
 As a result, the lack of proper missingInput implementations leaks into CheckAnalysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput

2015-03-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375680#comment-14375680
 ] 

Cheng Lian commented on SPARK-6397:
---

Hey [~smolav], after some discussion with [~waterman] in his PRs, we decided to 
fix the GROUPING__ID virtual column issue first. So I updated the title and 
description of this JIRA ticket, and created SPARK-6467 for the original one. 
You may link your PR to that one. Thanks! I should have created another JIRA 
ticket for the fix introduced in [~waterman]'s PR, but I realized the problem 
too late after merging it.

 Exclude virtual columns from QueryPlan.missingInput
 ---

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yadong Qi
Assignee: Yadong Qi
Priority: Minor

 Virtual columns like GROUPING__ID should never be considered as missing 
 input, and thus should be excluded from {{QueryPlan.missingInput}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput

2015-03-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6397.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Issue resolved by pull request 5132
[https://github.com/apache/spark/pull/5132]

 Exclude virtual columns from QueryPlan.missingInput
 ---

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yadong Qi
Assignee: Yadong Qi
Priority: Minor
 Fix For: 1.3.1, 1.4.0


 Virtual columns like GROUPING__ID should never be considered as missing 
 input, and thus should be excluded from {{QueryPlan.missingInput}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6456) Spark Sql throwing exception on large partitioned data

2015-03-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6456:
--
Description: 
Spark connects to the Hive metastore. I am able to run simple queries like show 
tables and select, but the exception below is thrown while running a query on a 
Hive table that has a large number of partitions.
{noformat}
Exception in thread main java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.thrift.transport.TTransportException: 
java.net.SocketTimeoutException: Read timed out
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785)
at 
org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86)
at 
org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
at 
org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
{noformat}

  was:
Observation:
Spark connects to the Hive metastore. I am able to run simple queries like 
show tables and select,
but the exception below is thrown while running a query on a Hive table that has 
a large number of partitions.

{code}
Exception in thread main java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.thrift.transport.TTransportException: 
java.net.SocketTimeoutException: Read timed out
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785)
at 
org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86)
at 
org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
at 
org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143)
at 

[jira] [Commented] (SPARK-6456) Spark Sql throwing exception on large partitioned data

2015-03-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375481#comment-14375481
 ] 

Cheng Lian commented on SPARK-6456:
---

How many partitions are there?

 Spark Sql throwing exception on large partitioned data
 --

 Key: SPARK-6456
 URL: https://issues.apache.org/jira/browse/SPARK-6456
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: pankaj
 Fix For: 1.2.1


 Spark connects to the Hive metastore. I am able to run simple queries like 
 show tables and select, but the exception below is thrown while running a 
 query on a Hive table that has a large number of partitions.
 {noformat}
 Exception in thread main java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
 at 
 org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
 org.apache.thrift.transport.TTransportException: 
 java.net.SocketTimeoutException: Read timed out
 at 
 org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785)
 at 
 org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316)
 at 
 org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86)
 at 
 org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
 at scala.Option.getOrElse(Option.scala:120)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
 at 
 org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5203) union with different decimal type report error

2015-04-03 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-5203.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4004
[https://github.com/apache/spark/pull/4004]

 union with different decimal type report error
 --

 Key: SPARK-5203
 URL: https://issues.apache.org/jira/browse/SPARK-5203
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: guowei
 Fix For: 1.4.0


 Test case like this:
 {code:sql}
 create table test (a decimal(10,1));
 select a from test union all select a*2 from test;
 {code}
 Exception thrown:
 {noformat}
 15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union 
 all select a*2 from test]
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
 attributes: *, tree:
 'Project [*]
  'Subquery _u1
   'Union 
Project [a#1]
 MetastoreRelation default, test, None
Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), 
 DecimalType())), DecimalType(21,1)) AS _c0#0]
 MetastoreRelation default, test, None
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
   at 
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
   at 
 scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421)
   at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369)
   at 
 org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 {noformat}
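
Until the widening fix is in place, one possible workaround sketch (assuming the test 
table above and a HiveContext named hiveContext; the target decimal type is an 
arbitrary choice) is to cast both branches to the same decimal type explicitly:

{code}
// Workaround sketch only: make both branches of the UNION ALL agree on an explicit
// decimal type, so the analyzer never has to reconcile decimal(10,1) with the
// widened decimal produced by a * 2.
val unioned = hiveContext.sql("""
  SELECT CAST(a AS DECIMAL(21,1)) AS a FROM test
  UNION ALL
  SELECT CAST(a * 2 AS DECIMAL(21,1)) AS a FROM test
""")
unioned.show()
{code}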



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4521) Parquet fails to read columns with spaces in the name

2015-04-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-4521.
---
Resolution: Done

This ticket is covered by SPARK-6607.

 Parquet fails to read columns with spaces in the name
 -

 Key: SPARK-4521
 URL: https://issues.apache.org/jira/browse/SPARK-4521
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Michael Armbrust

 I think this is actually a bug in parquet, but it would be good to track it 
 here as well.  To reproduce:
 {code}
 jsonRDD(sparkContext.parallelize("""{"number of clusters": 1}""" :: Nil)).saveAsParquetFile("test")
 parquetFile("test").collect()
 {code}
 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
 (TID 13, localhost): java.lang.IllegalArgumentException: field ended by ';': 
 expected ';' but got 'of' at line 1:   optional int32 number of
   at parquet.schema.MessageTypeParser.check(MessageTypeParser.java:209)
   at 
 parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:182)
   at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:108)
   at 
 parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:96)
   at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:89)
   at 
 parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:79)
   at 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:189)
   at 
 parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
   at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:135)
   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
 {code}
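
As a possible user-side workaround sketch (illustrative, assuming the Spark 1.3 
DataFrame API; the replacement column name is arbitrary), the offending column can be 
renamed before writing Parquet:

{code}
// Workaround sketch: Parquet's MessageTypeParser rejects field names containing
// spaces, so rename such columns before saving to Parquet.
val df = sqlContext.jsonRDD(
  sparkContext.parallelize("""{"number of clusters": 1}""" :: Nil))

val renamed = df.withColumnRenamed("number of clusters", "number_of_clusters")

renamed.saveAsParquetFile("test")
sqlContext.parquetFile("test").collect()
{code}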



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


