Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread Terry Siu
What version of Spark are you using? Did you compile your Spark version
and if so, what compile options did you use?

On 11/6/14, 9:22 AM, tridib tridib.sama...@live.com wrote:

Help please!





Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread Terry Siu
Those are the same options I used, except I had --tgz to package it and I built 
off of the master branch. Unfortunately, my only guess is that these errors 
stem from your build environment. In your Spark assembly, do you have any 
classes which belong to the org.apache.hadoop.hive package?
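
A quick way to check, as a sketch: from spark-shell, try loading one representative 
class from that package (HiveConf is just an example here). It should return a Class 
object if the Hive classes made it into the assembly, and throw 
ClassNotFoundException otherwise.

// Probes the shell's classpath for a single Hive class; not a full audit of the assembly.
scala> Class.forName("org.apache.hadoop.hive.conf.HiveConf")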


From: Tridib Samanta <tridib.sama...@live.com>
Date: Thursday, November 6, 2014 at 9:49 AM
To: Terry Siu <terry@smartfocus.com>, u...@spark.incubator.apache.org
Subject: RE: Unable to use HiveContext in spark-shell

I am using spark 1.1.0.
I built it using:
./make-distribution.sh -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive 
-DskipTests

My ultimate goal is to execute a query on a Parquet file with a nested structure 
and cast a date string to a Date, which is required to calculate the age of a 
Person entity. But I am unable to get even past this line:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
I made sure that the org.apache.hadoop package is in the Spark assembly jar.
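
For context, the shape of the query I am aiming for is roughly the sketch below. It is 
purely illustrative: the table, struct, and column names are made up, and I have not 
been able to run anything yet since the HiveContext will not even construct. The age 
would then be derived from the casted birth date.

// Illustrative only: `people` and `person.birthdate` are hypothetical names.
val people = sqlContext.hql(
  "SELECT person.name, cast(person.birthdate AS date) AS birth_date FROM people")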

Re-attaching the stack trace for quick reference.

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

error: bad symbolic reference. A signature in HiveContext.class refers to term 
hive
in package org.apache.hadoop which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
HiveContext.class.
error:
 while compiling: console
during phase: erasure
 library version: version 2.10.4
compiler version: version 2.10.4
  reconstructed args:

  last tree to typer: Apply(value $outer)
  symbol: value $outer (flags: method synthetic stable 
expandedname triedcooking)
   symbol definition: val $outer(): $iwC.$iwC.type
 tpe: $iwC.$iwC.type
   symbol owners: value $outer - class $iwC - class $iwC - class $iwC - 
class $read - package $line5
  context owners: class $iwC - class $iwC - class $iwC - class $iwC - 
class $read - package $line5

== Enclosing template or block ==

ClassDef( // class $iwC extends Serializable
  0
  $iwC
  []
  Template( // val local $iwC: notype, tree.tpe=$iwC
java.lang.Object, scala.Serializable // parents
ValDef(
  private
  _
  tpt
  empty
)
// 5 statements
DefDef( // def init(arg$outer: $iwC.$iwC.$iwC.type): $iwC
  method triedcooking
  init
  []
  // 1 parameter list
  ValDef( // $outer: $iwC.$iwC.$iwC.type

$outer
tpt // tree.tpe=$iwC.$iwC.$iwC.type
empty
  )
  tpt // tree.tpe=$iwC
  Block( // tree.tpe=Unit
Apply( // def init(): Object in class Object, tree.tpe=Object
  $iwC.super.init // def init(): Object in class Object, 
tree.tpe=()Object
  Nil
)
()
  )
)
ValDef( // private[this] val sqlContext: 
org.apache.spark.sql.hive.HiveContext
  private local triedcooking
  sqlContext 
  tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext
  Apply( // def init(sc: org.apache.spark.SparkContext): 
org.apache.spark.sql.hive.HiveContext in class HiveContext, 
tree.tpe=org.apache.spark.sql.hive.HiveContext
new org.apache.spark.sql.hive.HiveContext.init // def init(sc: 
org.apache.spark.SparkContext): org.apache.spark.sql.hive.HiveContext in class 
HiveContext, tree.tpe=(sc: 
org.apache.spark.SparkContext)org.apache.spark.sql.hive.HiveContext
Apply( // val sc(): org.apache.spark.SparkContext, 
tree.tpe=org.apache.spark.SparkContext
  
$iwC.this.$line5$$read$$iwC$$iwC$$iwC$$iwC$$$outer().$line5$$read$$iwC$$iwC$$iwC$$$outer().$line5$$read$$iwC$$iwC$$$outer().$VAL1().$iw().$iw().sc
 // val sc(): org.apache.spark.SparkContext, 
tree.tpe=()org.apache.spark.SparkContext
  Nil
)
  )
)
DefDef( // val sqlContext(): org.apache.spark.sql.hive.HiveContext
  method stable accessor
  sqlContext
  []
  List(Nil)
  tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext
  $iwC.this.sqlContext  // private[this] val sqlContext: 
org.apache.spark.sql.hive.HiveContext, 
tree.tpe=org.apache.spark.sql.hive.HiveContext
)
ValDef( // protected val $outer: $iwC.$iwC.$iwC.type
  protected synthetic paramaccessor triedcooking
  $outer 
  tpt // tree.tpe=$iwC.$iwC.$iwC.type
  empty
)
DefDef( // val $outer(): $iwC.$iwC.$iwC.type
  method synthetic stable expandedname triedcooking
  $line5$$read$$iwC$$iwC$$iwC$$iwC$$$outer
  []
  List(Nil)
  tpt // tree.tpe=Any
  $iwC.this.$outer  // protected val $outer: $iwC.$iwC.$iwC.type, 
tree.tpe=$iwC.$iwC.$iwC.type
)
  )
)

== Expanded type of tree ==

ThisType(class $iwC)

uncaught exception during compilation: scala.reflect.internal.Types$TypeError
scala.reflect.internal.Types

NoClassDefFoundError encountered in Spark 1.2-snapshot build with hive-0.13.1 profile

2014-11-03 Thread Terry Siu
I just built the 1.2 snapshot current as of commit 76386e1a23c using:

$ ./make-distribution.sh --tgz --name my-spark --skip-java-test -DskipTests 
-Phadoop-2.4 -Phive -Phive-0.13.1 -Pyarn

I drop my Hive configuration files into the conf directory, launch 
spark-shell, and then create my HiveContext, hc. I then issue a "use db" 
command:

scala> hc.hql("use db")

and receive the following class-not-found error:


java.lang.NoClassDefFoundError: 
com/esotericsoftware/shaded/org/objenesis/strategy/InstantiatorStrategy

at org.apache.hadoop.hive.ql.exec.Utilities.clinit(Utilities.java:925)

at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1224)

at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)

at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)

at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)

at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:315)

at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:286)

at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)

at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)

at 
org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)

at 
org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)

at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424)

at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424)

at 
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)

at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:103)

at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:111)

at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:115)

at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)

at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:36)

at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:38)

at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:40)

at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:42)

at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:44)

at $iwC$$iwC$$iwC$$iwC.init(console:46)

at $iwC$$iwC$$iwC.init(console:48)

at $iwC$$iwC.init(console:50)

at $iwC.init(console:52)

at init(console:54)

at .init(console:58)

at .clinit(console)

at .init(console:7)

at .clinit(console)

at $print(console)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)

at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125

at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)

at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)

at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)

at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)

at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:8

at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)

at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)

at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)

at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)

at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)

at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scal

at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scal

at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)

at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)

at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)

at org.apache.spark.repl.Main$.main(Main.scala:31)

at org.apache.spark.repl.Main.main(Main.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:353)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)

at 

ParquetFilters and StringType support for GT, GTE, LT, LTE

2014-11-03 Thread Terry Siu
Is there any reason why StringType is not a supported type for the GT, GTE, LT, and 
LTE operations? I was previously able to have a predicate where my column type was 
a string and execute a filter with one of the above operators in SparkSQL without 
any problems. However, I synced up to the latest code this morning and now the 
same query gives me a MatchError for this column of string type.
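
For reference, the kind of predicate I mean is a plain string comparison, roughly as 
sketched below (table and column names are illustrative only):

// A string column compared with >= used to work fine; the same shape now hits
// the MatchError when the Parquet filter is being constructed.
val recent = sqlContext.sql("SELECT * FROM events WHERE event_date >= '2014-10-01'")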

Thanks,
-Terry




Re: NoClassDefFoundError encountered in Spark 1.2-snapshot build with hive-0.13.1 profile

2014-11-03 Thread Terry Siu
Thanks, Kousuke. I’ll wait till this pull request makes it into the master 
branch.

-Terry

From: Kousuke Saruta <saru...@oss.nttdata.co.jp>
Date: Monday, November 3, 2014 at 11:11 AM
To: Terry Siu <terry@smartfocus.com>, user@spark.apache.org
Subject: Re: NoClassDefFoundError encountered in Spark 1.2-snapshot build with 
hive-0.13.1 profile

Hi Terry

I think the issue you mentioned will be resolved by following PR.
https://github.com/apache/spark/pull/3072

- Kousuke


Re: ParquetFilters and StringType support for GT, GTE, LT, LTE

2014-11-03 Thread Terry Siu
Done.

https://issues.apache.org/jira/browse/SPARK-4213

Thanks,
-Terry

From: Michael Armbrust <mich...@databricks.com>
Date: Monday, November 3, 2014 at 1:37 PM
To: Terry Siu <terry@smartfocus.com>
Cc: user@spark.apache.org
Subject: Re: ParquetFilters and StringType support for GT, GTE, LT, LTE

That sounds like a regression.  Could you open a JIRA with steps to reproduce 
(https://issues.apache.org/jira/browse/SPARK)?  We'll want to fix this before 
the 1.2 release.






Spark Build

2014-10-31 Thread Terry Siu
I am synced up to the Spark master branch as of commit 23468e7e96. I have Maven 
3.0.5, Scala 2.10.3, and SBT 0.13.1. I have previously built the master branch 
successfully and am trying to rebuild again to take advantage of the new Hive 
0.13.1 profile. I execute the following command:

$ mvn -DskipTests -Phive-0.13-1 -Phadoop-2.4 -Pyarn clean package

The build fails at the following stage:


[INFO] Using incremental compilation

[INFO] compiler plugin: 
BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)

[INFO] Compiling 5 Scala sources to 
/home/terrys/Applications/spark/yarn/stable/target/scala-2.10/test-classes...

[ERROR] 
/home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:20:
 object MemLimitLogger is not a member of package org.apache.spark.deploy.yarn

[ERROR] import org.apache.spark.deploy.yarn.MemLimitLogger._

[ERROR] ^

[ERROR] 
/home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:29:
 not found: value memLimitExceededLogMessage

[ERROR] val vmemMsg = memLimitExceededLogMessage(diagnostics, 
VMEM_EXCEEDED_PATTERN)

[ERROR]   ^

[ERROR] 
/home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:30:
 not found: value memLimitExceededLogMessage

[ERROR] val pmemMsg = memLimitExceededLogMessage(diagnostics, 
PMEM_EXCEEDED_PATTERN)

[ERROR]   ^

[ERROR] three errors found

[INFO] 

[INFO] Reactor Summary:

[INFO]

[INFO] Spark Project Parent POM .. SUCCESS [2.758s]

[INFO] Spark Project Common Network Code . SUCCESS [6.716s]

[INFO] Spark Project Core  SUCCESS [2:46.610s]

[INFO] Spark Project Bagel ... SUCCESS [16.776s]

[INFO] Spark Project GraphX .. SUCCESS [52.159s]

[INFO] Spark Project Streaming ... SUCCESS [1:09.883s]

[INFO] Spark Project ML Library .. SUCCESS [1:18.932s]

[INFO] Spark Project Tools ... SUCCESS [10.210s]

[INFO] Spark Project Catalyst  SUCCESS [1:12.499s]

[INFO] Spark Project SQL . SUCCESS [1:10.561s]

[INFO] Spark Project Hive  SUCCESS [1:08.571s]

[INFO] Spark Project REPL  SUCCESS [32.377s]

[INFO] Spark Project YARN Parent POM . SUCCESS [1.317s]

[INFO] Spark Project YARN Stable API . FAILURE [25.918s]

[INFO] Spark Project Assembly  SKIPPED

[INFO] Spark Project External Twitter  SKIPPED

[INFO] Spark Project External Kafka .. SKIPPED

[INFO] Spark Project External Flume Sink . SKIPPED

[INFO] Spark Project External Flume .. SKIPPED

[INFO] Spark Project External ZeroMQ . SKIPPED

[INFO] Spark Project External MQTT ... SKIPPED

[INFO] Spark Project Examples  SKIPPED

[INFO] 

[INFO] BUILD FAILURE

[INFO] 

[INFO] Total time: 11:15.889s

[INFO] Finished at: Fri Oct 31 12:08:55 PDT 2014

[INFO] Final Memory: 73M/829M

[INFO] 

[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile 
(scala-test-compile-first) on project spark-yarn_2.10: Execution 
scala-test-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile failed. CompileFailed 
- [Help 1]

[ERROR]

[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.

[ERROR] Re-run Maven using the -X switch to enable full debug logging.

[ERROR]

[ERROR] For more information about the errors and possible solutions, please 
read the following articles:

[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException

[ERROR]

[ERROR] After correcting the problems, you can resume the build with the command

[ERROR]   mvn <goals> -rf :spark-yarn_2.10


I could not find MemLimitLogger anywhere in the Spark code. Has anybody else 
seen/encountered this?


Thanks,

-Terry






Re: Spark Build

2014-10-31 Thread Terry Siu
Thanks for the update, Shivaram.

-Terry

On 10/31/14, 12:37 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:

Yeah looks like https://github.com/apache/spark/pull/2744 broke the
build. We will fix it soon


Re: Ambiguous references to id : what does it mean ?

2014-10-30 Thread Terry Siu
Found this as I am having the same issue. I have exactly the same usage as
shown in Michael's join example. I tried executing a SQL statement against
the joined data set with two columns that have the same name and tried to
disambiguate the column name with the table alias, but I would still get an
Unresolved attributes error back. Is there any way around this short of
renaming the columns in the join sources?
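
For concreteness, the renaming-at-the-source workaround I am trying to avoid would 
look roughly like the sketch below (all table and column names are illustrative, and 
it assumes both SchemaRDDs have already been registered as tables under those names):

// Rename the clashing `id` column on one side before the join, so the joined
// schema ends up with unique column names.
val videosRenamed = sqlContext.sql("SELECT id AS video_id, name FROM selectedVideos")
videosRenamed.registerAsTable("videosRenamed")
val joined = sqlContext.sql(
  "SELECT t.videoId, v.video_id, v.name FROM tagCollection t JOIN videosRenamed v ON t.videoId = v.video_id")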

Thanks
-Terry


Michael Armbrust wrote
Yes, but if both tagCollection and selectedVideos have a column named id
then Spark SQL does not know which one you are referring to in the where
clause.  Here's an example with aliases:
 val x = testData2.as('x)
 val y = testData2.as('y)
 val join = x.join(y, Inner, Some(x.a.attr === y.a.attr))
On Wed, Jul 16, 2014 at 2:47 AM, Jaonary Rabarisoa lt;

jaonary@

gt;
wrote:
My query is just a simple query that uses the Spark SQL DSL:

tagCollection.join(selectedVideos).where('videoId === 'id)




On Tue, Jul 15, 2014 at 6:03 PM, Yin Huai lt;

huaiyin.thu@

gt; wrote:

Hi Jao,

Seems the SQL analyzer cannot resolve the references in the Join
condition. What is your query? Did you use the Hive Parser (your query
was
submitted through hql(...)) or the basic SQL Parser (your query was
submitted through sql(...)).

Thanks,

Yin


On Tue, Jul 15, 2014 at 8:52 AM, Jaonary Rabarisoa lt;

jaonary@

gt;
wrote:

Hi all,

When running a join operation with Spark SQL I got the following error
:


Exception in thread main
org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
Ambiguous
references to id: (id#303,List()),(id#0,List()), tree:
Filter ('videoId = 'id)
  Join Inner, None
   ParquetRelation data/tags.parquet
   Filter (name#1 = P1/cam1)
ParquetRelation data/videos.parquet


What does it mean ?


Cheers,


jao








Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-21 Thread Terry Siu
Just to follow up, the queries worked against master and I got my whole flow 
rolling. Thanks for the suggestion! Now if only Spark 1.2 will come out with 
the next release of CDH5  :P

-Terry

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-20 Thread Terry Siu
Hi Yin,

Sorry for the delay. I'll try the code change when I get a chance, but 
Michael's initial response did solve my problem. In the meantime, I'm hitting 
another issue with SparkSQL, which I will probably post in another message if I 
can't figure out a workaround.

Thanks,
-Terry

From: Yin Huai <huaiyin@gmail.com>
Date: Thursday, October 16, 2014 at 7:08 AM
To: Terry Siu <terry@smartfocus.com>
Cc: Michael Armbrust <mich...@databricks.com>, user@spark.apache.org
Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

Hello Terry,

I guess you hit this bug: https://issues.apache.org/jira/browse/SPARK-3559. The 
list of needed column ids was messed up. Can you try the master branch or apply 
the code change 
(https://github.com/apache/spark/commit/e10d71e7e58bf2ec0f1942cb2f0602396ab866b4) 
to your 1.1 and see if the problem is resolved?

Thanks,

Yin


SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Terry Siu
Hi all,

I’m getting a TreeNodeException for unresolved attributes when I do a simple 
select from a SchemaRDD generated by a join in Spark 1.1.0. A little background 
first. I am using a HiveContext (against Hive 0.12) to grab two tables, join 
them, and then perform multiple INSERT-SELECT with GROUP BY to write back out 
to a Hive rollup table that has two partitions. This is an effort to 
simulate the unsupported GROUPING SETS functionality in SparkSQL.

In my first attempt, I got really close using SchemaRDD.groupBy, until I 
realized that the SchemaRDD.insertInto API does not support partitioned tables 
yet. This prompted my second attempt to pass SQL to the HiveContext.sql API 
instead.
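
To make the intent concrete, the rollup writes follow the pattern sketched below: one 
INSERT ... SELECT ... GROUP BY per grouping combination, each targeting its own 
partition of the rollup table. The table, partition, and column names here are 
hypothetical.

// One INSERT per grouping combination, each landing in its own static partition.
hc.sql("""INSERT OVERWRITE TABLE sales_rollup PARTITION (grp='store')
          SELECT storeid, sum(sales), count(*) FROM rupbrand GROUP BY storeid""")
hc.sql("""INSERT OVERWRITE TABLE sales_rollup PARTITION (grp='channel')
          SELECT channel, sum(sales), count(*) FROM rupbrand GROUP BY channel""")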

Here’s a rundown of the commands I executed on the spark-shell:


val hc = new HiveContext(sc)

hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")

hc.setConf("spark.sql.parquet.compression.codec", "snappy")

// For implicit conversions to Expression
val sqlContext = new SQLContext(sc)
import sqlContext._

val segCusts = hc.hql("select …")
val segTxns = hc.hql("select …")

val sc = segCusts.as('sc)
val st = segTxns.as('st)

// Join the segCusts and segTxns tables
val rup = sc.join(st, Inner, Some(sc.segcustomerid.attr === st.customerid.attr))
rup.registerAsTable("rupbrand")



If I do a printSchema on the rup, I get:

root

 |-- segcustomerid: string (nullable = true)

 |-- sales: double (nullable = false)

 |-- tx_count: long (nullable = false)

 |-- storeid: string (nullable = true)

 |-- transdate: long (nullable = true)

 |-- transdate_ts: string (nullable = true)

 |-- transdate_dt: string (nullable = true)

 |-- unitprice: double (nullable = true)

 |-- translineitem: string (nullable = true)

 |-- offerid: string (nullable = true)

 |-- customerid: string (nullable = true)

 |-- customerkey: string (nullable = true)

 |-- sku: string (nullable = true)

 |-- quantity: double (nullable = true)

 |-- returnquantity: double (nullable = true)

 |-- channel: string (nullable = true)

 |-- unitcost: double (nullable = true)

 |-- transid: string (nullable = true)

 |-- productid: string (nullable = true)

 |-- id: string (nullable = true)

 |-- campaign_campaigncost: double (nullable = true)

 |-- campaign_begindate: long (nullable = true)

 |-- campaign_begindate_ts: string (nullable = true)

 |-- campaign_begindate_dt: string (nullable = true)

 |-- campaign_enddate: long (nullable = true)

 |-- campaign_enddate_ts: string (nullable = true)

 |-- campaign_enddate_dt: string (nullable = true)

 |-- campaign_campaigntitle: string (nullable = true)

 |-- campaign_campaignname: string (nullable = true)

 |-- campaign_id: string (nullable = true)

 |-- product_categoryid: string (nullable = true)

 |-- product_company: string (nullable = true)

 |-- product_brandname: string (nullable = true)

 |-- product_vendorid: string (nullable = true)

 |-- product_color: string (nullable = true)

 |-- product_brandid: string (nullable = true)

 |-- product_description: string (nullable = true)

 |-- product_size: string (nullable = true)

 |-- product_subcategoryid: string (nullable = true)

 |-- product_departmentid: string (nullable = true)

 |-- product_productname: string (nullable = true)

 |-- product_categoryname: string (nullable = true)

 |-- product_vendorname: string (nullable = true)

 |-- product_sku: string (nullable = true)

 |-- product_subcategoryname: string (nullable = true)

 |-- product_status: string (nullable = true)

 |-- product_departmentname: string (nullable = true)

 |-- product_style: string (nullable = true)

 |-- product_id: string (nullable = true)

 |-- customer_lastname: string (nullable = true)

 |-- customer_familystatus: string (nullable = true)

 |-- customer_customertype: string (nullable = true)

 |-- customer_city: string (nullable = true)

 |-- customer_country: string (nullable = true)

 |-- customer_state: string (nullable = true)

 |-- customer_region: string (nullable = true)

 |-- customer_customergroup: string (nullable = true)

 |-- customer_maritalstatus: string (nullable = true)

 |-- customer_agerange: string (nullable = true)

 |-- customer_zip: string (nullable = true)

 |-- customer_age: double (nullable = true)

 |-- customer_address2: string (nullable = true)

 |-- customer_incomerange: string (nullable = true)

 |-- customer_gender: string (nullable = true)

 |-- customer_customerkey: string (nullable = true)

 |-- customer_address1: string (nullable = true)

 |-- customer_email: string (nullable = true)

 |-- customer_education: string (nullable = true)

 |-- customer_birthdate: long (nullable = true)

 |-- customer_birthdate_ts: string (nullable = true)

 |-- customer_birthdate_dt: string (nullable = true)

 |-- customer_id: string (nullable = true)

 |-- customer_firstname: string (nullable = true)

 |-- transnum: long (nullable = true)

 |-- transmonth: string (nullable = true)


Nothing but a flat schema with no duplicated column names. I then 

Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Terry Siu
Hi Michael,

Thanks again for the reply. Was hoping it was something I was doing wrong in 
1.1.0, but I’ll try master.

Thanks,
-Terry

From: Michael Armbrust <mich...@databricks.com>
Date: Monday, October 20, 2014 at 12:11 PM
To: Terry Siu <terry@smartfocus.com>
Cc: user@spark.apache.org
Subject: Re: SparkSQL - TreeNodeException for unresolved attributes

Have you tried this on master?  There were several problems with resolution of 
complex queries that were registered as tables in the 1.1.0 release.


Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-15 Thread Terry Siu
Hi Yin,

pqt_rdt_snappy has 76 columns. These two Parquet tables were created via Hive 
0.12 from existing Avro data using CREATE TABLE followed by an INSERT 
OVERWRITE. These are partitioned tables: pqt_rdt_snappy has one partition 
while pqt_segcust_snappy has two partitions. For pqt_segcust_snappy, I noticed 
that when I populated it with a single INSERT OVERWRITE over all the partitions 
and then executed the Spark code, it would report an illegal index value of 29. 
However, if I manually did an INSERT OVERWRITE for every single partition, I 
would get an illegal index value of 21. I don’t know if this will help in 
debugging, but here’s the DESCRIBE output for pqt_segcust_snappy:


OK

col_name            data_type    comment

customer_id         string       from deserializer
age_range           string       from deserializer
gender              string       from deserializer
last_tx_date        bigint       from deserializer
last_tx_date_ts     string       from deserializer
last_tx_date_dt     string       from deserializer
first_tx_date       bigint       from deserializer
first_tx_date_ts    string       from deserializer
first_tx_date_dt    string       from deserializer
second_tx_date      bigint       from deserializer
second_tx_date_ts   string       from deserializer
second_tx_date_dt   string       from deserializer
third_tx_date       bigint       from deserializer
third_tx_date_ts    string       from deserializer
third_tx_date_dt    string       from deserializer
frequency           double       from deserializer
tx_size             double       from deserializer
recency             double       from deserializer
rfm                 double       from deserializer
tx_count            bigint       from deserializer
sales               double       from deserializer
coll_def_id         string       None
seg_def_id          string       None

# Partition Information
# col_name          data_type    comment

coll_def_id         string       None
seg_def_id          string       None

Time taken: 0.788 seconds, Fetched: 29 row(s)


As you can see, I have 21 data columns, followed by the 2 partition columns, 
coll_def_id and seg_def_id. Output shows 29 rows, but that looks like it’s just 
counting the rows in the console output. Let me know if you need more 
information.


Thanks

-Terry


From: Yin Huai <huaiyin@gmail.com>
Date: Tuesday, October 14, 2014 at 6:29 PM
To: Terry Siu <terry@smartfocus.com>
Cc: Michael Armbrust <mich...@databricks.com>, user@spark.apache.org
Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

Hello Terry,

How many columns does pqt_rdt_snappy have?

Thanks,

Yin


Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-14 Thread Terry Siu
Hi Michael,

That worked for me. At least I’m now further than I was. Thanks for the tip!

-Terry

From: Michael Armbrust <mich...@databricks.com>
Date: Monday, October 13, 2014 at 5:05 PM
To: Terry Siu <terry@smartfocus.com>
Cc: user@spark.apache.org
Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

There are some known bugs with the Parquet SerDe and Spark 1.1.

You can try setting spark.sql.hive.convertMetastoreParquet=true to cause Spark 
SQL to use its built-in Parquet support when the SerDe looks like Parquet.
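
From spark-shell, that setting can be applied on the HiveContext itself; a minimal 
sketch, assuming the HiveContext is named hc as in the earlier messages:

// Use Spark SQL's built-in Parquet support instead of the Hive Parquet SerDe path.
hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")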






SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-13 Thread Terry Siu
I am currently using Spark 1.1.0, which has been compiled against Hadoop 2.3. Our 
cluster is CDH 5.1.2, which runs Hive 0.12. I have two external Hive tables 
that point to Parquet (compressed with Snappy), which were converted over from 
Avro, if that matters.

I am trying to perform a join with these two Hive tables, but am encountering 
an exception. In a nutshell, I launch a spark-shell, create my HiveContext 
(pointing to the correct metastore on our cluster), and then proceed to do the 
following:

scala> val hc = new HiveContext(sc)

scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >= 
132537600 and transdate <= 134006399")

scala> val segcust = hc.sql("select * from pqt_segcust_snappy where 
coll_def_id='abcd'")

scala> txn.registerAsTable("segTxns")

scala> segcust.registerAsTable("segCusts")

scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns t join 
segCusts c on t.customerid=c.customer_id")

Straightforward enough, but I get the following exception:


14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 51)

java.lang.IndexOutOfBoundsException: Index: 21, Size: 21

at java.util.ArrayList.rangeCheck(ArrayList.java:635)

at java.util.ArrayList.get(ArrayList.java:411)

at 
org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94)

at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206)

at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.init(ParquetRecordReaderWrapper.java:81)

at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.init(ParquetRecordReaderWrapper.java:67)

at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)

at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:197)

at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)

at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

at org.apache.spark.scheduler.Task.run(Task.scala:54)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)


My table, pqt_segcust_snappy, has 21 columns and two 
partitions defined. Does this error look familiar to anyone? Could my usage of 
SparkSQL with Hive be incorrect, or is support for Hive/Parquet/partitioning 
still buggy at this point in Spark 1.1.0?


Thanks,

-Terry