[jira] [Created] (SPARK-12369) DataFrameReader fails on globbing parquet paths
Yana Kadiyska created SPARK-12369:
-------------------------------------

             Summary: DataFrameReader fails on globbing parquet paths
                 Key: SPARK-12369
                 URL: https://issues.apache.org/jira/browse/SPARK-12369
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.2
            Reporter: Yana Kadiyska

Start with a list of parquet paths where some or all do not exist:

{noformat}
val paths = List("/foo/month=05/*.parquet", "/foo/month=06/*.parquet")
sqlContext.read.parquet(paths: _*)

java.lang.NullPointerException
  at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
  at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
  at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
  at org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
  at org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
{noformat}

It would be better to produce a dataframe from the paths that do exist and log a warning that a path was missing. Not sure about the "all paths are missing" case -- it could return an empty DataFrame with no schema or throw a nicer exception... But I would prefer not to have to pre-validate paths.
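Until that behavior changes, callers can pre-validate the globs themselves. A minimal sketch of that workaround, assuming the default Hadoop configuration resolves the same FileSystem Spark uses ({{validPaths}} and the example paths are illustrative):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val paths = List("/foo/month=05/*.parquet", "/foo/month=06/*.parquet")
val fs = FileSystem.get(new Configuration())

// globStatus may return null (path doesn't exist) or an empty array
// (pattern matched nothing) -- treat both as a missing path.
val validPaths = paths.filter { p =>
  val matches = fs.globStatus(new Path(p))
  matches != null && matches.nonEmpty
}

if (validPaths.isEmpty) sys.error("none of the requested paths matched anything")
else sqlContext.read.parquet(validPaths: _*)
{code}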
[jira] [Updated] (SPARK-12369) DataFrameReader fails on globbing parquet paths
[ https://issues.apache.org/jira/browse/SPARK-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yana Kadiyska updated SPARK-12369:
----------------------------------
    Summary: DataFrameReader fails on globbing parquet paths  (was: ataFrameReader fails on globbing parquet paths)
[jira] [Updated] (SPARK-12369) DataFrameReader fails on globbing parquet paths
[ https://issues.apache.org/jira/browse/SPARK-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yana Kadiyska updated SPARK-12369:
----------------------------------
    Description: Start with a list of parquet paths where some or all do not exist (same repro and stack trace as in the original report above). It would be better to produce a dataframe from the paths that do exist and log a warning that a path was missing. Not sure about the "all paths are missing" case -- probably return an empty DataFrame with no schema, since that method already does so on an empty path list. But I would prefer not to have to pre-validate paths.
[jira] [Commented] (SPARK-4497) HiveThriftServer2 does not exit properly on failure
[ https://issues.apache.org/jira/browse/SPARK-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058324#comment-15058324 ]

Yana Kadiyska commented on SPARK-4497:
--------------------------------------

[~jeffzhang] I have moved on to 1.2 so I cannot comment on this any longer. I'm ok to close.

> HiveThriftServer2 does not exit properly on failure
> ----------------------------------------------------
>
>                 Key: SPARK-4497
>                 URL: https://issues.apache.org/jira/browse/SPARK-4497
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Yana Kadiyska
>            Priority: Critical
>
> Start thriftserver with
> {{sbin/start-thriftserver.sh --master ...}}
> If there is an error (in my case the namenode is in standby mode) the driver shuts down properly:
> {code}
> 14/11/19 16:32:58 ERROR HiveThriftServer2: Error starting HiveThriftServer2
> 14/11/19 16:32:59 INFO SparkUI: Stopped Spark web UI at http://myip:4040
> 14/11/19 16:32:59 INFO DAGScheduler: Stopping DAGScheduler
> 14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Shutting down all executors
> 14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
> 14/11/19 16:33:00 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
> 14/11/19 16:33:00 INFO MemoryStore: MemoryStore cleared
> 14/11/19 16:33:00 INFO BlockManager: BlockManager stopped
> 14/11/19 16:33:00 INFO BlockManagerMaster: BlockManagerMaster stopped
> 14/11/19 16:33:00 INFO SparkContext: Successfully stopped SparkContext
> {code}
> but trying to run {{sbin/start-thriftserver.sh --master ...}} again results in an error that the Thriftserver is already running.
> {{ps -aef|grep }} shows
> {code}
> root 32334 1 0 16:32 ? 00:00:00 /usr/local/bin/java
> org.apache.spark.deploy.SparkSubmitDriverBootstrapper --class
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --master
> spark://myip:7077 --conf -spark.executor.extraJavaOptions=-verbose:gc
> -XX:-PrintGCDetails -XX:+PrintGCTimeStamps spark-internal --hiveconf
> hive.root.logger=INFO,console
> {code}
> This is problematic since we have a process that tries to restart the driver if it dies.
[jira] [Created] (SPARK-9405) approximateCountDistinct does not work with GroupBy
Yana Kadiyska created SPARK-9405:
------------------------------------

             Summary: approximateCountDistinct does not work with GroupBy
                 Key: SPARK-9405
                 URL: https://issues.apache.org/jira/browse/SPARK-9405
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.1
            Reporter: Yana Kadiyska

{code}
case class MockCustomer(customer_id: Int, host: String)
val df = sc.parallelize(1 to 10)
  .map(i => MockCustomer(1234, if (i % 2 == 0) "http://foo.com" else "http://bar.com"))
  .toDF

// this works OK
df.groupBy($"host").agg(count($"*"), sum($"customer_id")).show

// but this doesn't:
df.groupBy($"host").agg(approxCountDistinct($"*"), sum($"customer_id")).show

15/07/28 10:46:14 INFO BlockManagerInfo: Removed broadcast_55_piece0 on localhost:33727 in memory (size: 4.4 KB, free: 265.3 MB)
15/07/28 10:46:14 INFO BlockManagerInfo: Removed broadcast_54_piece0 on localhost:33727 in memory (size: 4.4 KB, free: 265.3 MB)
org.apache.spark.sql.AnalysisException: cannot resolve 'host' given input columns customer_id, host;
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:63)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
{code}
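For what it's worth, a sketch of a possible workaround while this is open, assuming counting a concrete column is acceptable in place of {{$"*"}} (the column choice below is illustrative):

{code}
// approxCountDistinct over a named column resolves fine under groupBy in
// the repro above; only the $"*" form trips the analyzer.
df.groupBy($"host")
  .agg(approxCountDistinct($"customer_id"), sum($"customer_id"))
  .show
{code}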
[jira] [Resolved] (SPARK-8956) Rollup produces incorrect result when group by contains expressions
[ https://issues.apache.org/jira/browse/SPARK-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yana Kadiyska resolved SPARK-8956.
----------------------------------
    Resolution: Duplicate

Duplicate of SPARK-8972.
[jira] [Created] (SPARK-8956) Rollup produces incorrect result when group by contains expressions
Yana Kadiyska created SPARK-8956:
------------------------------------

             Summary: Rollup produces incorrect result when group by contains expressions
                 Key: SPARK-8956
                 URL: https://issues.apache.org/jira/browse/SPARK-8956
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.0
            Reporter: Yana Kadiyska

Rollup produces incorrect results when the group by clause contains an expression:

{code}
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
df.registerTempTable("foo")
sqlContext.sql("select count(*) as cnt, key % 100 as key, GROUPING__ID from foo group by key % 100 with rollup").show(100)
{code}

As a workaround, this works correctly:

{code}
val df1 = df.withColumn("newkey", df("key") % 100)
df1.registerTempTable("foo1")
sqlContext.sql("select count(*) as cnt, newkey as key, GROUPING__ID as grp from foo1 group by newkey with rollup").show(100)
{code}
[jira] [Updated] (SPARK-1403) Spark on Mesos does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yana Kadiyska updated SPARK-1403:
---------------------------------
    Target Version/s: 1.5.0

> Spark on Mesos does not set Thread's context class loader
> ----------------------------------------------------------
>
>                 Key: SPARK-1403
>                 URL: https://issues.apache.org/jira/browse/SPARK-1403
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0, 1.3.0, 1.4.0
>         Environment: ubuntu 12.04 on vagrant
>            Reporter: Bharath Bhushan
>            Priority: Blocker
>             Fix For: 1.0.0
>
> I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark executor on mesos slave throws a java.lang.ClassNotFoundException for org.apache.spark.serializer.JavaSerializer. The lengthy discussion is here:
> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513
[jira] [Commented] (SPARK-1403) Spark on Mesos does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593725#comment-14593725 ]

Yana Kadiyska commented on SPARK-1403:
--------------------------------------

[~pwendell] At your convenience, can you please triage this bug? It was originally opened as a blocker. I reopened it but am not sure if it's a release blocker. I set the target release to 1.5 just so it shows up in triage queries, since it's a reopened bug.
[jira] [Updated] (SPARK-1403) Spark on Mesos does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yana Kadiyska updated SPARK-1403:
---------------------------------
    Affects Version/s: 1.4.0
[jira] [Reopened] (SPARK-1403) Spark on Mesos does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yana Kadiyska reopened SPARK-1403:
----------------------------------

Multiple users are reporting this is occurring again in 1.3.
[jira] [Updated] (SPARK-1403) Spark on Mesos does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yana Kadiyska updated SPARK-1403:
---------------------------------
    Affects Version/s: 1.3.0
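For context on what the title is asking for, here is a conceptual sketch (my illustration, not the actual Spark patch): the executor thread that deserializes tasks needs its context class loader pointed at a loader that can see Spark's classes, otherwise lookups like org.apache.spark.serializer.JavaSerializer fail exactly as reported:

{code}
// Before deserializing tasks, make the current thread resolve classes
// through the loader that loaded the executor (and thus Spark) classes.
val executorLoader: ClassLoader = getClass.getClassLoader
Thread.currentThread().setContextClassLoader(executorLoader)
{code}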
[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567601#comment-14567601 ]

Yana Kadiyska commented on SPARK-5389:
--------------------------------------

FWIW I just tried the 1.4-rc3 build (http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/), cdh4 binary, and it runs without issues. From the exact same command prompt I can run the 1.4 script but not the 1.2 script. So if we can't figure out a consistent repro, maybe other folks can confirm whether the new cmd files work.

> spark-shell.cmd does not run from DOS Windows 7
> ------------------------------------------------
>
>                 Key: SPARK-5389
>                 URL: https://issues.apache.org/jira/browse/SPARK-5389
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Shell, Windows
>    Affects Versions: 1.2.0
>         Environment: Windows 7
>            Reporter: Yana Kadiyska
>         Attachments: SparkShell_Win7.JPG, spark_bug.png
>
> spark-shell.cmd crashes in a DOS prompt on Windows 7. It works fine under PowerShell. spark-shell.cmd works fine for me in v1.1, so this is new in Spark 1.2. Marking as trivial since calling spark-shell2.cmd also works fine.
> Attaching a screenshot since the error isn't very useful:
> {code}
> spark-1.2.0-bin-cdh4>bin\spark-shell.cmd
> else was unexpected at this time.
> {code}
[jira] [Commented] (SPARK-3815) LPAD function does not work in where predicate
[ https://issues.apache.org/jira/browse/SPARK-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559396#comment-14559396 ]

Yana Kadiyska commented on SPARK-3815:
--------------------------------------

I will close -- I have not observed this in 1.2.x versions.

> LPAD function does not work in where predicate
> -----------------------------------------------
>
>                 Key: SPARK-3815
>                 URL: https://issues.apache.org/jira/browse/SPARK-3815
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Yana Kadiyska
>            Priority: Minor
>
> select customer_id from mytable where pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2
> produces:
> {code}
> 14/10/03 14:51:35 ERROR server.SparkSQLOperationManager: Error executing query:
> org.apache.spark.SparkException: Task not serializable
>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
>   at org.apache.spark.sql.execution.Limit.execute(basicOperators.scala:146)
>   at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
>   at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
>   at org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:185)
>   at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
>   at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
>   at org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150)
>   at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207)
>   at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
>   at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
>   at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>   at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
>   at org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
>   at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: java.io.NotSerializableException: java.lang.reflect.Constructor
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
>   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
>   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>   at scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
>   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> The following work fine:
> select concat_ws('-', LPAD(cast(112717 % 1024 AS STRING),4,'0'),'2014-07') from mytable where pkey='0077-2014-07' LIMIT 2
> select customer_id from mytable where pkey=concat_ws('-','0077','2014-07')
[jira] [Resolved] (SPARK-3815) LPAD function does not work in where predicate
[ https://issues.apache.org/jira/browse/SPARK-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yana Kadiyska resolved SPARK-3815.
----------------------------------
    Resolution: Cannot Reproduce
[jira] [Closed] (SPARK-3815) LPAD function does not work in where predicate
[ https://issues.apache.org/jira/browse/SPARK-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yana Kadiyska closed SPARK-3815.
--------------------------------

I have not been able to reproduce this behavior with 1.2.x versions, so I'm assuming some commit in the interim fixed this.
[jira] [Created] (SPARK-7792) HiveContext registerTempTable not thread safe
Yana Kadiyska created SPARK-7792:
------------------------------------

             Summary: HiveContext registerTempTable not thread safe
                 Key: SPARK-7792
                 URL: https://issues.apache.org/jira/browse/SPARK-7792
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.1
            Reporter: Yana Kadiyska

{code:java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import com.google.common.base.Joiner;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

public class ThreadRepro {
    public static void main(String[] args) throws Exception {
        new ThreadRepro().sparkPerfTest();
    }

    public void sparkPerfTest() {
        final AtomicLong counter = new AtomicLong();
        SparkConf conf = new SparkConf();
        conf.setAppName("My Application");
        conf.setMaster("local[7]");
        SparkContext sc = new SparkContext(conf);
        org.apache.spark.sql.hive.HiveContext hc = new org.apache.spark.sql.hive.HiveContext(sc);
        int poolSize = 10;
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < poolSize; i++)
            pool.execute(new QueryJob(hc, i, counter));
        pool.shutdown();
        try {
            pool.awaitTermination(60, TimeUnit.MINUTES);
        } catch (Exception e) {
            System.out.println("Thread interrupted");
        }
        System.out.println("All jobs complete");
        System.out.println(" Counter is " + counter.get());
    }
}

class QueryJob implements Runnable {
    String threadId;
    org.apache.spark.sql.hive.HiveContext sqlContext;
    String key;
    AtomicLong counter;
    final AtomicLong local_counter = new AtomicLong();

    public QueryJob(org.apache.spark.sql.hive.HiveContext _sqlContext, int id, AtomicLong ctr) {
        threadId = "thread_" + id;
        this.sqlContext = _sqlContext;
        this.counter = ctr;
    }

    public void run() {
        for (int i = 0; i < 100; i++) {
            String tblName = threadId + "_" + i;
            DataFrame df = sqlContext.emptyDataFrame();
            df.registerTempTable(tblName);
            String _query = String.format("select count(*) from %s", tblName);
            System.out.println(String.format(" registered table %s; catalog (%s) ", tblName, debugTables()));
            List<Row> res;
            try {
                res = sqlContext.sql(_query).collectAsList();
            } catch (Exception e) {
                System.out.println("*Exception " + debugTables() + "**");
                throw e;
            }
            sqlContext.dropTempTable(tblName);
            System.out.println(" dropped table " + tblName);
            try {
                Thread.sleep(3000); // lets make this a not-so-tight loop
            } catch (Exception e) {
                System.out.println("Thread interrupted");
            }
        }
    }

    private String debugTables() {
        String v = Joiner.on(',').join(sqlContext.tableNames());
        if (v == null) return "";
        else return v;
    }
}
{code}

This will periodically produce the following:

{quote}
registered table thread_0_50; catalog (thread_1_50)
registered table thread_4_50; catalog (thread_4_50,thread_1_50)
registered table thread_1_50; catalog (thread_1_50)
dropped table thread_1_50
dropped table thread_4_50
*Exception **
Exception in thread "pool-6-thread-1" java.lang.Error: org.apache.spark.sql.AnalysisException: no such table thread_0_50; line 1 pos 21
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.sql.AnalysisException: no such table thread_0_50; line 1 pos 21
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:177)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:186)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:181)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:208)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
{quote}
[jira] [Updated] (SPARK-7792) HiveContext registerTempTable not thread safe
[ https://issues.apache.org/jira/browse/SPARK-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yana Kadiyska updated SPARK-7792:
---------------------------------
    Description: (same repro and output as in the original report above, re-tagged as a {code:java} block)
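Not in the ticket, but for readers hitting this: a sketch of a blunt mitigation while the catalog race exists, which simply serializes all temp-table work through one lock (the helper name and lock granularity are my own; this sacrifices the parallelism the repro above is exercising):

{code}
// Serialize register -> query -> drop so no thread can observe another
// thread's half-applied catalog mutation. Coarse, but race-free.
val catalogLock = new Object

def countEmptyTable(hc: org.apache.spark.sql.hive.HiveContext, name: String): Long =
  catalogLock.synchronized {
    hc.emptyDataFrame.registerTempTable(name)
    try hc.sql(s"select count(*) from $name").collect().head.getLong(0)
    finally hc.dropTempTable(name)
  }
{code}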
[jira] [Updated] (SPARK-4412) Parquet logger cannot be configured
[ https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yana Kadiyska updated SPARK-4412:
---------------------------------
    Affects Version/s: 1.3.1

> Parquet logger cannot be configured
> ------------------------------------
>
>                 Key: SPARK-4412
>                 URL: https://issues.apache.org/jira/browse/SPARK-4412
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0, 1.3.1
>            Reporter: Jim Carroll
>
> The Spark ParquetRelation.scala code makes the assumption that the parquet.Log class has already been loaded. If ParquetRelation.enableLogForwarding executes prior to the parquet.Log class being loaded, then the code in enableLogForwarding has no effect.
> ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else with Parquet), the parquet.Log class hasn't been loaded yet. Therefore the code in ParquetRelation.enableLogForwarding has no effect. If you look at the code in parquet.Log there's a static initializer that needs to be called prior to enableLogForwarding, or whatever enableLogForwarding does gets undone by this static initializer.
> The fix would be to force the static initializer to get called in parquet.Log as part of enableLogForwarding. PR will be forthcoming.
[jira] [Comment Edited] (SPARK-4412) Parquet logger cannot be configured
[ https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545677#comment-14545677 ]

Yana Kadiyska edited comment on SPARK-4412 at 5/16/15 5:11 PM:
---------------------------------------------------------------

I would like to reopen as I believe the issue has again regressed in Spark 1.3.0. This SO thread has a lengthy discussion: http://stackoverflow.com/questions/30052889/how-to-suppress-parquet-log-messages-in-spark but the short summary is that a log4j.rootCategory=ERROR, console setting still leaks
{quote}
INFO: parquet.hadoop.InternalParquetRecordReader
{quote}
messages.
[jira] [Reopened] (SPARK-4412) Parquet logger cannot be configured
[ https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yana Kadiyska reopened SPARK-4412:
----------------------------------

Reopening as the issue reappeared in 1.3.0.
[jira] [Commented] (SPARK-4412) Parquet logger cannot be configured
[ https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545677#comment-14545677 ]

Yana Kadiyska commented on SPARK-4412:
--------------------------------------

I would like to reopen as I believe the issue has again regressed in Spark 1.3.0. This SO thread has a lengthy discussion: http://stackoverflow.com/questions/30052889/how-to-suppress-parquet-log-messages-in-spark but the short summary is that a log4j.rootCategory=ERROR, console setting still leaks INFO: parquet.hadoop.InternalParquetRecordReader messages.
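The direction the original report suggests -- force parquet.Log's static initializer to run, then configure the JUL logger -- can be sketched from user code too. A sketch under that assumption (the handler wiring below is illustrative, not the actual Spark fix):

{code}
import java.util.logging.{Level, Logger}

// Load parquet.Log first so its static initializer installs its handler
// now, instead of later undoing whatever we configure below.
Class.forName("parquet.Log")

val parquetLogger = Logger.getLogger("parquet")
parquetLogger.getHandlers.foreach(parquetLogger.removeHandler)
parquetLogger.setLevel(Level.SEVERE)
parquetLogger.setUseParentHandlers(false)
{code}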
[jira] [Commented] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14536114#comment-14536114 ]

Yana Kadiyska commented on SPARK-3928:
--------------------------------------

[~tkyaw] Your suggested workaround does work. One question though -- what are the implications of turning off spark.sql.parquet.useDataSourceApi? My particular concern is with predicate pushdowns into parquet -- am I going to lose these (it's hard to tell from the UI if pushdown is happening correctly)? Also, can you clarify whether you still plan to fix this for 1.4, or whether "New parquet implementation does not contain wild card support yet" means that we'd have to live with spark.sql.parquet.useDataSourceApi until further notice?

> Support wildcard matches on Parquet files
> -------------------------------------------
>
>                 Key: SPARK-3928
>                 URL: https://issues.apache.org/jira/browse/SPARK-3928
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>            Reporter: Nicholas Chammas
>            Assignee: Cheng Lian
>            Priority: Minor
>             Fix For: 1.3.0
>
> {{SparkContext.textFile()}} supports patterns like {{part-*}} and {{2014-\?\?-\?\?}}. It would be nice if {{SparkContext.parquetFile()}} did the same.
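For readers landing here, the workaround under discussion looks roughly like the following sketch (flag name as given in this thread; the glob path is from the comments below, and whether predicate pushdown survives the fallback is exactly the open question):

{code}
// Fall back to the pre-1.3 Parquet code path, which still honors globs.
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
val df = sqlContext.parquetFile("/r/warehouse/hive/pkey=-2015-04/*")
{code}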
[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532680#comment-14532680 ]

Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 2:16 PM:
--------------------------------------------------------------

Marius, are you saying that wildcards are not supported then? In my case, I would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works w/ the textFile method, btw) -- i.e. pass a single path for all April 2015 partitions. Enumerating all paths underneath is pretty crazy; that's a huge list. Are you saying that is the only way? I thought the whole point of this bug is that we _don't_ have to enumerate the paths explicitly. Also, in my case hc is a HiveContext instance, not a dataframe.

As a side note, I am trying to use this feature as a workaround to https://issues.apache.org/jira/browse/SPARK-6910 -- Michael A. suggested a workaround which takes way too long in our case -- I was hoping to be able to create a DF from a subset of partitions... But it would be a pain to build an explicit list... Would like to know for sure if that's the deliberate design though...
[jira] [Commented] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532716#comment-14532716 ] Yana Kadiyska commented on SPARK-3928: -- Then I think they should change the resolved status to "Resolved -- Won't Fix" if that's a conscious decision. I did make an edit to my previous comment as to why I'd love this to be different. Support wildcard matches on Parquet files - Key: SPARK-3928 URL: https://issues.apache.org/jira/browse/SPARK-3928 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Nicholas Chammas Priority: Minor Fix For: 1.3.0 {{SparkContext.textFile()}} supports patterns like {{part-*}} and {{2014-\?\?-\?\?}}. It would be nice if {{SparkContext.parquetFile()}} did the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532624#comment-14532624 ] Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 1:48 PM: -- I am observing the same issue. Downloaded a pre-built CDH4 1.3.1 distro. {quote} scala> sc.textFile("/r/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first res0: String = PAR1... [binary Parquet bytes elided] scala> hc.parquetFile("/r/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet") java.io.FileNotFoundException: File does not exist: /r/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet {quote} was (Author: yanakad): I am observing the same issue. Downloaded a pre-built CDH4 1.3.1 distro. {quote} scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first res0: String = PAR1... [binary Parquet bytes elided] scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet") java.io.FileNotFoundException: File does not exist: /rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet {quote} Support wildcard matches on Parquet files - Key: SPARK-3928 URL: https://issues.apache.org/jira/browse/SPARK-3928 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Nicholas Chammas Priority: Minor Fix For: 1.3.0 {{SparkContext.textFile()}} supports patterns like {{part-*}} and {{2014-\?\?-\?\?}}. It would be nice if {{SparkContext.parquetFile()}} did the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
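A user-side workaround sketch for the behavior above: expand the glob with Hadoop's own FileSystem API and hand {{parquetFile}} concrete paths. This is an illustration under the layout from the comment, not a recommendation made in the thread; note that {{globStatus}} can return null when nothing matches, so the result is guarded:

{code}
// Expand the glob ourselves, then read the concrete files (sketch only).
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// globStatus may return null when nothing matches -- guard before mapping:
val matches = Option(fs.globStatus(
  new Path("/r/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")))
val concrete = matches.toSeq.flatten.map(_.getPath.toString)
val df = hc.parquetFile(concrete: _*)
{code}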
[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532680#comment-14532680 ] Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 2:15 PM: -- Marius, are you saying that wildcards are not supported then? In my case, I would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works w/ the textFile method, btw) -- i.e. pass a single path for all April 2015 partitions. Enumerating all paths underneath is pretty crazy -- that's a huge list. Are you saying that is the only way? I thought the whole point of this bug is that we _don't_ have to enumerate the paths explicitly. Also, in my case hc is a HiveContext instance, not a dataframe. As a side note, I am trying to use this feature as a workaround to https://issues.apache.org/jira/browse/SPARK-6910 -- Mike A. suggested a workaround which takes way too long in our case -- I was hoping to be able to create a DF from a subset of partitions... But it would be a pain to build an explicit list... Would like to know for sure if that's the deliberate design though... was (Author: yanakad): Marius, are you saying that wildcards are not supported then? In my case, I would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works w/ the textFile method, btw) -- i.e. pass a single path for all April 2015 partitions. Enumerating all paths underneath is pretty crazy -- that's a huge list. Are you saying that is the only way? I thought the whole point of this bug is that we _don't_ have to enumerate the paths explicitly. Also, in my case hc is a HiveContext instance, not a dataframe. Support wildcard matches on Parquet files - Key: SPARK-3928 URL: https://issues.apache.org/jira/browse/SPARK-3928 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Nicholas Chammas Priority: Minor Fix For: 1.3.0 {{SparkContext.textFile()}} supports patterns like {{part-*}} and {{2014-\?\?-\?\?}}. It would be nice if {{SparkContext.parquetFile()}} did the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532624#comment-14532624 ] Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 1:38 PM: -- I am observing the same issue. Downloaded a pre-built CDH4 1.3.1 distro. {quote} scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first res0: String = PAR1... [binary Parquet bytes elided] scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet") java.io.FileNotFoundException: File does not exist: /rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet {quote} was (Author: yanakad): I am observing the same issue. Downloaded a pre-built CDH4 1.3.1 distro. {quote} scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first res0: String = PAR1... [binary Parquet bytes elided] scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet") java.io.FileNotFoundException: File does not exist: hdfs://cdh4-21968-nn/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet {quote} Support wildcard matches on Parquet files - Key: SPARK-3928 URL: https://issues.apache.org/jira/browse/SPARK-3928 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Nicholas Chammas Priority: Minor Fix For: 1.3.0 {{SparkContext.textFile()}} supports patterns like {{part-*}} and {{2014-\?\?-\?\?}}. It would be nice if {{SparkContext.parquetFile()}} did the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532624#comment-14532624 ] Yana Kadiyska commented on SPARK-3928: -- I am observing the same issue. Downloaded a pre-built CDH4 1.3.1 distro. {quote} scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first res0: String = PAR1... [binary Parquet bytes elided] scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet") java.io.FileNotFoundException: File does not exist: hdfs://cdh4-21968-nn/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet {quote} Support wildcard matches on Parquet files - Key: SPARK-3928 URL: https://issues.apache.org/jira/browse/SPARK-3928 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Nicholas Chammas Priority: Minor Fix For: 1.3.0 {{SparkContext.textFile()}} supports patterns like {{part-*}} and {{2014-\?\?-\?\?}}. It would be nice if {{SparkContext.parquetFile()}} did the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532680#comment-14532680 ] Yana Kadiyska commented on SPARK-3928: -- Marius, are you saying that wildcards are not supported then? In my case, I would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works w/ the textFile method, btw) -- i.e. pass a single path for all April 2015 partitions. Enumerating all paths underneath is pretty crazy -- that's a huge list. Are you saying that is the only way? I thought the whole point of this bug is that we _don't_ have to enumerate the paths explicitly. Also, in my case hc is a HiveContext instance, not a dataframe. Support wildcard matches on Parquet files - Key: SPARK-3928 URL: https://issues.apache.org/jira/browse/SPARK-3928 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Nicholas Chammas Priority: Minor Fix For: 1.3.0 {{SparkContext.textFile()}} supports patterns like {{part-*}} and {{2014-\?\?-\?\?}}. It would be nice if {{SparkContext.parquetFile()}} did the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529639#comment-14529639 ] Yana Kadiyska commented on SPARK-6910: -- Can you please provide an update on this -- status/maybe a target release? SPARK-6904 was closed as a duplicate of this, but this seems like a critical bug? We have a pretty large metastore (partitions are per month per customer, with a few years of data), and Shark works OK, but I cannot take advantage of the new cool versions of Spark until the metastore interaction improves. Any advice on a workaround would also be great... I opened https://issues.apache.org/jira/browse/SPARK-6984 which is probably a dup of this. Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
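The workaround being asked about here comes up elsewhere in these threads as "create a DF from a subset of partitions": read the needed partition directories directly and keep the metastore out of the hot path. A hypothetical sketch -- the paths and table names are illustrative, and it assumes a 1.3-era HiveContext bound to {{hc}}:

{code}
// Hypothetical workaround sketch: build a DataFrame straight from the
// partition directories of interest, bypassing the metastore entirely
// (paths and names are illustrative):
val wanted = Seq(
  "/warehouse/mytable/pkey=2015-03",
  "/warehouse/mytable/pkey=2015-04")
val recent = hc.parquetFile(wanted: _*)
recent.registerTempTable("mytable_recent")
hc.sql("SELECT count(*) FROM mytable_recent").collect()
{code}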
[jira] [Comment Edited] (SPARK-6984) Operations on tables with many partitions _very_slow
[ https://issues.apache.org/jira/browse/SPARK-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519409#comment-14519409 ] Yana Kadiyska edited comment on SPARK-6984 at 5/6/15 12:26 AM: --- Possibly related to this https://issues.apache.org/jira/browse/SPARK-6910 was (Author: yanakad): Possibly related to this Operations on tables with many partitions _very_slow Key: SPARK-6984 URL: https://issues.apache.org/jira/browse/SPARK-6984 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Environment: External Hive metastore, table with 30K partitions Reporter: Yana Kadiyska Attachments: 7282_partitions_stack.png I have a table with _many_partitions (30K). Users cannot query all of them but they are in the metastore. Querying this table is extremely slow even if we're asking for a single partition. describe sometable also performs _very_ poorly {quote} Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 Whereas Hive over the same metastore shows: Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236 {quote} I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching screenshot). Should this value be lazy -- describe table should be purely a metastore op IMO (i.e. query postgres, return types). The issue is a blocker to me but leaving with default priority until someone can confirm it is a bug. describe table is not so interesting but I think this affects all query paths -- I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6984) Operations on tables with many partitions _very_slow
[ https://issues.apache.org/jira/browse/SPARK-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519409#comment-14519409 ] Yana Kadiyska commented on SPARK-6984: -- Possibly related to this Operations on tables with many partitions _very_slow Key: SPARK-6984 URL: https://issues.apache.org/jira/browse/SPARK-6984 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Environment: External Hive metastore, table with 30K partitions Reporter: Yana Kadiyska Attachments: 7282_partitions_stack.png I have a table with _many_partitions (30K). Users cannot query all of them but they are in the metastore. Querying this table is extremely slow even if we're asking for a single partition. describe sometable also performs _very_ poorly {quote} Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 Whereas Hive over the same metastore shows: Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236 {quote} I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching screenshot). Should this value be lazy -- describe table should be purely a metastore op IMO (i.e. query postgres, return types). The issue is a blocker to me but leaving with default priority until someone can confirm it is a bug. describe table is not so interesting but I think this affects all query paths -- I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5923) Very slow query when using Oracle hive metastore and table has lots of partitions
[ https://issues.apache.org/jira/browse/SPARK-5923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499855#comment-14499855 ] Yana Kadiyska commented on SPARK-5923: -- [~mtaylor] Can you please provide some information on how you debugged this? I just experienced a similar issue -- really poor performance on a large metastore even though I'm only touching a few partitions. I'm using a PostgreSQL metastore. I do not, however, see IN queries logged to Postgres, and according to the Postgres log no individual query took longer than 50ms. So I'm hoping to get some debugging tips. Very slow query when using Oracle hive metastore and table has lots of partitions -- Key: SPARK-5923 URL: https://issues.apache.org/jira/browse/SPARK-5923 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Matthew Taylor This has two aspects * The direct SQL support for Oracle is broken in Hive 0.13.1. It fails when the partition count exceeds 1000, due to an Oracle limitation on the IN clause. This causes a fallback to ORM, which is very slow (20 minutes to even start the query) * Hive itself does not suffer this problem, as it passes down to the metadata query filter terms that restrict the partitions returned. SparkSQL is always asking for all partitions even if they are not all needed. Even when we patched Hive it was still taking 2 minutes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
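For anyone trying to reproduce this kind of investigation, a crude first step is to time the metastore-facing operations from the Spark shell before turning on database-side logging. A sketch with a hypothetical table name and partition key; {{hc}} is assumed to be a HiveContext:

{code}
// Crude timing helper for metastore-facing calls (names hypothetical):
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

timed("describe")(hc.sql("DESCRIBE mytable").collect())
timed("one partition")(
  hc.sql("SELECT count(*) FROM mytable WHERE pkey = '2015-04'").collect())
{code}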
[jira] [Updated] (SPARK-6984) Operations on tables with many partitions _very_slow
[ https://issues.apache.org/jira/browse/SPARK-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yana Kadiyska updated SPARK-6984: - Attachment: 7282_partitions_stack.png Operations on tables with many partitions _very_slow Key: SPARK-6984 URL: https://issues.apache.org/jira/browse/SPARK-6984 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Environment: External Hive metastore, table with 30K partitions Reporter: Yana Kadiyska Attachments: 7282_partitions_stack.png I have a table with _many_partitions (30K). Users cannot query all of them but they are in the metastore. Querying this table is extremely slow even if we're asking for a single partition. describe table also performs _very_ poorly Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 Whereas Hive over the same metastore shows: Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236 I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching screenshot). Should this value be lazy -- describe table should be purely a metastore op IMO (i.e. query postgres, return types). The issue is a blocker to me but leaving with default priority until someone can confirm it is a bug. describe table is not so interesting but I think this affects all query paths -- I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6984) Operations on tables with many partitions _very_slow
[ https://issues.apache.org/jira/browse/SPARK-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yana Kadiyska updated SPARK-6984: - Description: I have a table with _many_partitions (30K). Users cannot query all of them but they are in the metastore. Querying this table is extremely slow even if we're asking for a single partition. describe table also performs _very_ poorly {quote} Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 Whereas Hive over the same metastore shows: Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236 {quote} I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching screenshot). Should this value be lazy -- describe table should be purely a metastore op IMO (i.e. query postgres, return types). The issue is a blocker to me but leaving with default priority until someone can confirm it is a bug. describe table is not so interesting but I think this affects all query paths -- I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html was: I have a table with _many_partitions (30K). Users cannot query all of them but they are in the metastore. Querying this table is extremely slow even if we're asking for a single partition. describe table also performs _very_ poorly Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 Whereas Hive over the same metastore shows: Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236 I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching screenshot). Should this value be lazy -- describe table should be purely a metastore op IMO (i.e. query postgres, return types). The issue is a blocker to me but leaving with default priority until someone can confirm it is a bug. describe table is not so interesting but I think this affects all query paths -- I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html Operations on tables with many partitions _very_slow Key: SPARK-6984 URL: https://issues.apache.org/jira/browse/SPARK-6984 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Environment: External Hive metastore, table with 30K partitions Reporter: Yana Kadiyska Attachments: 7282_partitions_stack.png I have a table with _many_partitions (30K). Users cannot query all of them but they are in the metastore. Querying this table is extremely slow even if we're asking for a single partition. describe table also performs _very_ poorly {quote} Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 Whereas Hive over the same metastore shows: Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236 {quote} I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching screenshot). Should this value be lazy -- describe table should be purely a metastore op IMO (i.e. query postgres, return types). 
The issue is a blocker to me but leaving with default priority until someone can confirm it is a bug. describe table is not so interesting but I think this affects all query paths -- I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6984) Operations on tables with many partitions _very_slow
[ https://issues.apache.org/jira/browse/SPARK-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yana Kadiyska updated SPARK-6984: - Description: I have a table with _many_partitions (30K). Users cannot query all of them but they are in the metastore. Querying this table is extremely slow even if we're asking for a single partition. describe sometable also performs _very_ poorly {quote} Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 Whereas Hive over the same metastore shows: Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236 {quote} I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching screenshot). Should this value be lazy -- describe table should be purely a metastore op IMO (i.e. query postgres, return types). The issue is a blocker to me but leaving with default priority until someone can confirm it is a bug. describe table is not so interesting but I think this affects all query paths -- I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html was: I have a table with _many_partitions (30K). Users cannot query all of them but they are in the metastore. Querying this table is extremely slow even if we're asking for a single partition. describe table also performs _very_ poorly {quote} Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 Whereas Hive over the same metastore shows: Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236 {quote} I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching screenshot). Should this value be lazy -- describe table should be purely a metastore op IMO (i.e. query postgres, return types). The issue is a blocker to me but leaving with default priority until someone can confirm it is a bug. describe table is not so interesting but I think this affects all query paths -- I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html Operations on tables with many partitions _very_slow Key: SPARK-6984 URL: https://issues.apache.org/jira/browse/SPARK-6984 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Environment: External Hive metastore, table with 30K partitions Reporter: Yana Kadiyska Attachments: 7282_partitions_stack.png I have a table with _many_partitions (30K). Users cannot query all of them but they are in the metastore. Querying this table is extremely slow even if we're asking for a single partition. describe sometable also performs _very_ poorly {quote} Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 Whereas Hive over the same metastore shows: Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236 {quote} I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching screenshot). Should this value be lazy -- describe table should be purely a metastore op IMO (i.e. query postgres, return types). 
The issue is a blocker to me but leaving with default priority until someone can confirm it is a bug. describe table is not so interesting but I think this affects all query paths -- I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6984) Operations on tables with many partitions _very_slow
Yana Kadiyska created SPARK-6984: Summary: Operations on tables with many partitions _very_slow Key: SPARK-6984 URL: https://issues.apache.org/jira/browse/SPARK-6984 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Environment: External Hive metastore, table with 30K partitions Reporter: Yana Kadiyska I have a table with _many_partitions (30K). Users cannot query all of them but they are in the metastore. Querying this table is extremely slow even if we're asking for a single partition. describe table also performs _very_ poorly Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 Whereas Hive over the same metastore shows: Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236 I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching screenshot). Should this value be lazy -- describe table should be purely a metastore op IMO (i.e. query postgres, return types). The issue is a blocker to me but leaving with default priority until someone can confirm it is a bug. describe table is not so interesting but I think this affects all query paths -- I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
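The "should this value be lazy" question in the description can be illustrated in miniature. The sketch below is a toy, not Spark internals: a strict field pays its per-partition construction cost on every instantiation, even for describe-style operations that never read it, while a lazy val defers the cost until partitions are actually needed.

{code}
// Toy illustration of the lazy-evaluation point (not Spark internals).
// Stand-in for an expensive per-partition metastore lookup:
def expensiveLookup(p: String): String = { Thread.sleep(1); p }

case class TableMeta(name: String, partitionNames: Seq[String]) {
  // Strict: paid on construction, even by operations that never touch
  // partitions. With 30K partitions this alone is roughly 30 seconds here.
  val eager: Seq[String] = partitionNames.map(expensiveLookup)
  // Deferred: the per-partition cost is paid only on first access.
  lazy val deferred: Seq[String] = partitionNames.map(expensiveLookup)
}
{code}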
[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359614#comment-14359614 ] Yana Kadiyska commented on SPARK-5389: -- C:\Users\ykadiysk\Downloads\spark-1.2.0-bin-cdh4>where find C:\Windows\System32\find.exe C:\Users\ykadiysk\Downloads\spark-1.2.0-bin-cdh4>where findstr C:\Windows\System32\findstr.exe C:\Users\ykadiysk\Downloads\spark-1.2.0-bin-cdh4>echo %PATH% C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\Enterprise Vault\EVClient\;C:\Program Files (x86)\Git\cmd ;C:\Program Files (x86)\Perforce;C:\Program Files\MiKTeX 2.9\miktex\bin\x64\;C:\Program Files\Java\jdk1.7.0_40\bin;C:\Program Files (x86)\sbt\\bin;C:\Program Files (x86)\scala\bin; C:\apache-maven-3.1.0\bin;C:\Program Files\Java\jre7\bin\server;c:\Program Files\R\R-3.0.2\bin;C:\Python27 spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Environment: Windows 7 Reporter: Yana Kadiyska Attachments: SparkShell_Win7.JPG spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful: {code} spark-1.2.0-bin-cdh4>bin\spark-shell.cmd else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yana Kadiyska updated SPARK-5389: - Description: spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful: spark-1.2.0-bin-cdh4>bin\spark-shell.cmd else was unexpected at this time. was: spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Yana Kadiyska Priority: Trivial Attachments: SparkShell_Win7.JPG spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful: spark-1.2.0-bin-cdh4>bin\spark-shell.cmd else was unexpected at this time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
Yana Kadiyska created SPARK-5389: Summary: spark-shell.cmd does not run from DOS Windows 7 Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Yana Kadiyska Priority: Trivial spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yana Kadiyska updated SPARK-5389: - Attachment: SparkShell_Win7.JPG spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Yana Kadiyska Priority: Trivial Attachments: SparkShell_Win7.JPG spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4049) Storage web UI fraction cached shows as 100%
[ https://issues.apache.org/jira/browse/SPARK-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244290#comment-14244290 ] Yana Kadiyska commented on SPARK-4049: -- I'd suggest this be converted to a documentation bug then -- I was just stumped by the same phenomenon. Is the 1x replicated statement still correct in this case -- I would imagine that if the storage level says 1x replicated, each partition would have a unique location? Storage web UI fraction cached shows as 100% Key: SPARK-4049 URL: https://issues.apache.org/jira/browse/SPARK-4049 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Josh Rosen Priority: Minor In the Storage tab of the Spark Web UI, I saw a case where the Fraction Cached was greater than 100%: !http://i.imgur.com/Gm2hEeL.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234307#comment-14234307 ] Yana Kadiyska commented on SPARK-4702: -- Just confirming that https://github.com/apache/spark/pull/3586 does fix the issue. Thanks! Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233031#comment-14233031 ] Yana Kadiyska commented on SPARK-4702: -- I'm investigating the possibility that this is caused by Hive being switched to Hive0.13 by default. Building with 0.12 profile now, will close if this goes away Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233081#comment-14233081 ] Yana Kadiyska commented on SPARK-4702: -- Unfortunately I still see this error after building with ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I do have a working build from master branch from October 24th where this scenario works. We are running CDH4.6 Hive0.10 In particular, the query I tried is select count(*) from mytable where pkey ='some-non-existant-key'; Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
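The failing scenario throughout this thread is just an aggregate over a partition value with no directory behind it. A repro sketch with the query from the comment above; the table name is hypothetical, and {{hc}} is assumed to be a HiveContext (the report itself goes through HiveThriftServer2, but the shell form is equivalent):

{code}
// Repro sketch (table and partition key are hypothetical). On the
// 1.2.0 RC this threw "Can not create a Path from an empty string"
// instead of returning a count of 0:
hc.sql("SELECT count(*) FROM mytable WHERE pkey = 'some-non-existant-key'").collect()
{code}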
[jira] [Comment Edited] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233081#comment-14233081 ] Yana Kadiyska edited comment on SPARK-4702 at 12/3/14 3:53 PM: --- Unfortunately I still see this error after building with ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I do have a working build from master branch from October 24th where this scenario works. We are running CDH4.6 Hive0.10 Beeline output from October: Connected to: Hive (version 0.12.0-protobuf-2.5) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline output from RC: Connected to: Hive (version 1.2.0) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ I am wondering if -Phive-0.12.0 is fully sufficient --not sure why the Connected to: version prints differently? In particular, the query I tried is select count(*) from mytable where pkey ='some-non-existant-key'; was (Author: yanakad): Unfortunately I still see this error after building with ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I do have a working build from master branch from October 24th where this scenario works. We are running CDH4.6 Hive0.10 In particular, the query I tried is select count(*) from mytable where pkey ='some-non-existant-key'; Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. 
Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233081#comment-14233081 ] Yana Kadiyska edited comment on SPARK-4702 at 12/3/14 6:09 PM: --- Unfortunately I still see this error after building with ./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I do have a working build from master branch from October 24th where this scenario works. We are running CDH4.6 Hive0.10 Beeline output from October: Connected to: Hive (version 0.12.0-protobuf-2.5) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline output from RC: Connected to: Hive (version 1.2.0) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ I am wondering if -Phive-0.12.0 is fully sufficient --not sure why the Connected to: version prints differently? In particular, the query I tried is select count(*) from mytable where pkey ='some-non-existant-key'; was (Author: yanakad): Unfortunately I still see this error after building with ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I do have a working build from master branch from October 24th where this scenario works. We are running CDH4.6 Hive0.10 Beeline output from October: Connected to: Hive (version 0.12.0-protobuf-2.5) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline output from RC: Connected to: Hive (version 1.2.0) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ I am wondering if -Phive-0.12.0 is fully sufficient --not sure why the Connected to: version prints differently? In particular, the query I tried is select count(*) from mytable where pkey ='some-non-existant-key'; Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. 
Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233443#comment-14233443 ] Yana Kadiyska commented on SPARK-4702: -- Michael, just wanted to point out that the workaround you suggested did indeed help when the partition is missing -- I get a count of 0. But it did break the otherwise working case when a partition is present: java.lang.IllegalStateException: All the offsets listed in the split should be found in the file. expected: [4, 4] found: {my schema dumped out here} out of: [4, 121017555, 242333553, 363518600] in range 0, 134217728 This may be a narrow corner case -- we've added columns to our schema over time, so the parquet files are likely not symmetric (not quite sure what convertMetastoreParquet does under the hood). But I wanted to point out that in our case the bug is truly a blocker (I'm hoping a fix makes it into 1.2; I don't care whether that's in the next RC or later) Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
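For readers following the thread: the workaround under discussion is disabling Spark SQL's native Parquet conversion via spark.sql.hive.convertMetastoreParquet. A minimal sketch of exercising the flag from a HiveContext (Spark 1.2-era API) follows; the table name and pkey values are hypothetical, and this only illustrates the flag, it is not a fix.

{noformat}
// Sketch of the workaround discussed above; mytable and the pkey values
// are assumptions for illustration.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("convertMetastoreParquet-test"))
val hc = new HiveContext(sc)

// Fall back to Hive's Parquet SerDe instead of Spark SQL's native reader.
hc.sql("SET spark.sql.hive.convertMetastoreParquet=false")

// Missing partition: now returns a count of 0 instead of throwing...
hc.sql("SELECT count(*) FROM mytable WHERE pkey='0077-2099-01'").collect().foreach(println)

// ...but, per the comment above, an existing partition whose files have an
// evolved schema can now fail with the IllegalStateException quoted earlier.
hc.sql("SELECT count(*) FROM mytable WHERE pkey='0077-2014-07'").collect().foreach(println)
{noformat}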
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233514#comment-14233514 ]

Yana Kadiyska commented on SPARK-4702:
--

Michael, I do not have a 1.1 build. In October I built master manually, I believe from commit d2987e8f7a2cb3bf971f381399d8efdccb51d3d2. At that time both types of queries worked without setting spark.sql.hive.convertMetastoreParquet=false (I tested on a smaller cluster; will try on the same one now to make sure there's not some data weirdness). If you meant to say "when convertMetastoreParquet is false, there is currently no support for heterogeneous schemas", then we are saying the same thing -- I didn't have to set this flag, as the missing partitions were handled fine. Now missing partitions are broken, but setting spark.sql.hive.convertMetastoreParquet=false breaks the 99% case, because my files have different numbers of columns.

I have not tried the PR you mentioned; I will try it now. In my case the issue is not an empty file, it's a missing directory -- our query filters on a partition=yyyy-mm value, and the parquet files are laid out under yyyy-mm directories representing partitions. But I will see whether that PR helps. In any case, I am just hoping this works before the final release; there is no particular rush.

Querying non-existent partition produces exception in v1.2.0-rc1
-

Key: SPARK-4702
URL: https://issues.apache.org/jira/browse/SPARK-4702
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska

Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of the master branch where this worked. A build off of the 1.2 RC tag produces the following:

14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement:
org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source)
at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
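Since the missing-directory case is what triggers the exception, one client-side mitigation until a fix lands is to pre-validate the partition directories with the Hadoop FileSystem API and only query months that exist. This is a sketch under assumptions (the /warehouse/mytable path, the partition column name, and the yyyy-mm layout are hypothetical), not anything the ticket prescribes:

{noformat}
// Hedged sketch: check each partition directory before building the query.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val wanted = Seq("2014-06", "2014-07", "2014-08")

// Keep only months whose partition directory actually exists.
val existing = wanted.filter(m => fs.exists(new Path(s"/warehouse/mytable/partition=$m")))

if (existing.nonEmpty) {
  val inList = existing.map(m => s"'$m'").mkString(", ")
  val sql = s"SELECT customer_id FROM mytable WHERE partition IN ($inList)"
  // hc.sql(sql).collect()  // run against a HiveContext as above
} // else: nothing to query -- all requested partitions are missing
{noformat}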
[jira] [Created] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
Yana Kadiyska created SPARK-4702:

Summary: Querying non-existent partition produces exception in v1.2.0-rc1
Key: SPARK-4702
URL: https://issues.apache.org/jira/browse/SPARK-4702
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska

Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of the master branch where this worked. A build off of the 1.2 RC tag produces the following:

14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement:
org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source)
at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
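Because the report goes through HiveThriftServer2, reproducing it means issuing the query over JDBC (or beeline) rather than from a SQL shell. A minimal sketch, assuming a Thrift server on localhost:10000 and the hypothetical mytable layout from above:

{noformat}
// Hedged reproduction sketch over JDBC; host, port, credentials, table,
// and partition value are all assumptions for illustration.
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
try {
  val stmt = conn.createStatement()
  // A pkey that names no existing partition; on the affected build this
  // surfaces as "Can not create a Path from an empty string" instead of
  // an empty result set.
  val rs = stmt.executeQuery("SELECT count(*) FROM mytable WHERE pkey='9999-2099-01'")
  while (rs.next()) println(rs.getLong(1))
} finally {
  conn.close()
}
{noformat}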
[jira] [Created] (SPARK-4497) HiveThriftServer2 does not exit properly on failure
Yana Kadiyska created SPARK-4497:

Summary: HiveThriftServer2 does not exit properly on failure
Key: SPARK-4497
URL: https://issues.apache.org/jira/browse/SPARK-4497
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska

Start the Thrift server with sbin/start-thriftserver.sh --master ... If there is an error (in my case the namenode was in standby mode), the driver shuts down properly:

14/11/19 16:32:58 ERROR HiveThriftServer2: Error starting HiveThriftServer2
14/11/19 16:32:59 INFO SparkUI: Stopped Spark web UI at http://myip:4040
14/11/19 16:32:59 INFO DAGScheduler: Stopping DAGScheduler
14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Shutting down all executors
14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
14/11/19 16:33:00 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/11/19 16:33:00 INFO MemoryStore: MemoryStore cleared
14/11/19 16:33:00 INFO BlockManager: BlockManager stopped
14/11/19 16:33:00 INFO BlockManagerMaster: BlockManagerMaster stopped
14/11/19 16:33:00 INFO SparkContext: Successfully stopped SparkContext

But trying to run sbin/start-thriftserver.sh --master ... again results in an error that the Thrift server is already running. ps -aef | grep offendingPID shows:

root 32334 1 0 16:32 ? 00:00:00 /usr/local/bin/java org.apache.spark.deploy.SparkSubmitDriverBootstrapper --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --master spark://myip:7077 --conf -spark.executor.extraJavaOptions=-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps spark-internal --hiveconf hive.root.logger=INFO,console

This is problematic since we have a process that tries to restart the driver if it dies.
[jira] [Commented] (SPARK-3815) LPAD function does not work in where predicate
[ https://issues.apache.org/jira/browse/SPARK-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177396#comment-14177396 ]

Yana Kadiyska commented on SPARK-3815:
--

Venkata, I am building master and I am still seeing this. Another odd fact:

select customer_id from mytable where pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2

fails, while

select customer_id from mytable where pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07')

works OK. There are more than 2 results, and looking at the executor logs it does seem that they succeed in the computation -- it looks like something goes wrong during cleanup when there is a LIMIT. Feel free to amend the title if you can better figure out what the issue is; let me know if you can't reproduce and I'll make a synthetic dataset. The table is a parquet table partitioned on pkey.

LPAD function does not work in where predicate
--

Key: SPARK-3815
URL: https://issues.apache.org/jira/browse/SPARK-3815
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Priority: Minor

select customer_id from mytable where pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2

produces:

14/10/03 14:51:35 ERROR server.SparkSQLOperationManager: Error executing query:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
at org.apache.spark.sql.execution.Limit.execute(basicOperators.scala:146)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
at org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:185)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
at org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
at org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.NotSerializableException: java.lang.reflect.Constructor
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
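Given that queries with a literal pkey are reported to work, one workaround -- a sketch, not something from the ticket -- is to evaluate the LPAD/concat_ws expression on the driver and send a literal predicate:

{noformat}
// Hedged workaround sketch: compute the padded key in Scala rather than in
// the WHERE clause, since literal-pkey queries are reported to work.
// Assumes the shard string is at most 4 characters.
val shard = "077"
val month = "2014-07"

// Driver-side equivalent of concat_ws('-', LPAD('077', 4, '0'), '2014-07').
val pkey = ("0" * (4 - shard.length)) + shard + "-" + month  // "0077-2014-07"

val sql = s"SELECT customer_id FROM mytable WHERE pkey='$pkey' LIMIT 2"
// hc.sql(sql).collect()  // run against a HiveContext as above
{noformat}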
[jira] [Comment Edited] (SPARK-3815) LPAD function does not work in where predicate
[ https://issues.apache.org/jira/browse/SPARK-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177396#comment-14177396 ]

Yana Kadiyska edited comment on SPARK-3815 at 10/20/14 7:58 PM:
--

[~gvramana], I am building master and I am still seeing this. Another odd fact:

select customer_id from mytable where pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2

fails, while

select customer_id from mytable where pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07')

works OK. There are more than 2 results, and looking at the executor logs it does seem that they succeed in the computation -- it looks like something goes wrong during cleanup when there is a LIMIT. Feel free to amend the title if you can better figure out what the issue is; let me know if you can't reproduce and I'll make a synthetic dataset. The table is a parquet table partitioned on pkey.

was (Author: yanakad):
Venkata, I am building master and I am still seeing this. Another odd fact:

select customer_id from mytable where pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2

fails, while

select customer_id from mytable where pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07')

works OK. There are more than 2 results, and looking at the executor logs it does seem that they succeed in the computation -- it looks like something goes wrong during cleanup when there is a LIMIT. Feel free to amend the title if you can better figure out what the issue is; let me know if you can't reproduce and I'll make a synthetic dataset. The table is a parquet table partitioned on pkey.

LPAD function does not work in where predicate
--

Key: SPARK-3815
URL: https://issues.apache.org/jira/browse/SPARK-3815
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Priority: Minor

select customer_id from mytable where pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2

produces:

14/10/03 14:51:35 ERROR server.SparkSQLOperationManager: Error executing query:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
at org.apache.spark.sql.execution.Limit.execute(basicOperators.scala:146)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
at org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:185)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
at org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
at org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.NotSerializableException: java.lang.reflect.Constructor
[jira] [Created] (SPARK-3814) Bitwise & does not work in Hive
Yana Kadiyska created SPARK-3814:

Summary: Bitwise & does not work in Hive
Key: SPARK-3814
URL: https://issues.apache.org/jira/browse/SPARK-3814
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Priority: Minor

Error:
java.lang.RuntimeException: Unsupported language features in query: select (case when bit_field & 1=1 then r_end - r_start else NULL end) from mytable where pkey='0178-2014-07' LIMIT 2

TOK_QUERY
TOK_FROM TOK_TABREF TOK_TABNAME mytable
TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE
TOK_SELECT TOK_SELEXPR TOK_FUNCTION when = & TOK_TABLE_OR_COL bit_field 1 1 - TOK_TABLE_OR_COL r_end TOK_TABLE_OR_COL r_start TOK_NULL
TOK_WHERE = TOK_TABLE_OR_COL pkey '0178-2014-07'
TOK_LIMIT 2

SQLState: null
ErrorCode: 0
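Until the HiveQL parser supports bitwise operators, this particular lowest-bit test can be rewritten with arithmetic: for non-negative integers, bit_field & 1 = 1 is equivalent to bit_field % 2 = 1, and % is already exercised successfully in the working queries elsewhere in these reports. A hedged sketch:

{noformat}
// Workaround sketch: replace the unsupported bitwise test with modulo.
// Equivalent only for non-negative bit_field values; table and column
// names follow the report above.
val rewritten =
  """SELECT (CASE WHEN bit_field % 2 = 1 THEN r_end - r_start ELSE NULL END)
    |FROM mytable
    |WHERE pkey='0178-2014-07' LIMIT 2""".stripMargin
// hc.sql(rewritten).collect()  // run against a HiveContext as above
{noformat}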
[jira] [Created] (SPARK-3815) LPAD function does not work in where predicate
Yana Kadiyska created SPARK-3815:

Summary: LPAD function does not work in where predicate
Key: SPARK-3815
URL: https://issues.apache.org/jira/browse/SPARK-3815
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Priority: Minor

select customer_id from mytable where pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2

produces:

14/10/03 14:51:35 ERROR server.SparkSQLOperationManager: Error executing query:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
at org.apache.spark.sql.execution.Limit.execute(basicOperators.scala:146)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
at org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:185)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
at org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
at org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.NotSerializableException: java.lang.reflect.Constructor
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

The following work fine:

select concat_ws('-', LPAD(cast(112717 % 1024 AS STRING),4,'0'),'2014-07') from mytable where pkey='0077-2014-07' LIMIT 2

select customer_id from mytable where pkey=concat_ws('-','0077','2014-07') LIMIT 2