[jira] [Created] (SPARK-12369) ataFrameReader fails on globbing parquet paths

2015-12-16 Thread Yana Kadiyska (JIRA)
Yana Kadiyska created SPARK-12369:
-

 Summary: ataFrameReader fails on globbing parquet paths
 Key: SPARK-12369
 URL: https://issues.apache.org/jira/browse/SPARK-12369
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
Reporter: Yana Kadiyska


Start with a list of parquet paths where some or all do not exist:

{noformat}
val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet")

 sqlContext.read.parquet(paths:_*)
java.lang.NullPointerException
at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
at 
org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
at 
org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
{noformat}

It would be better to produce a DataFrame from the paths that do exist and log 
a warning for each path that is missing. I'm not sure what should happen when 
all of the paths are missing -- perhaps return an empty DataFrame with no 
schema, or throw a clearer exception. Either way, I would prefer not to have to 
pre-validate the paths myself.
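
In the meantime I am working around it by pre-validating the globs myself, which 
is exactly the kind of boilerplate I'd like to avoid. A rough sketch only 
(assuming HDFS-style paths and the usual {{sc}}/{{sqlContext}} from the shell; 
not meant as the proposed fix):

{code}
import org.apache.hadoop.fs.Path

// Keep only the glob patterns that actually match at least one file.
val hadoopConf = sc.hadoopConfiguration
val existing = paths.filter { p =>
  val path = new Path(p)
  val fs = path.getFileSystem(hadoopConf)
  val matches = fs.globStatus(path)   // null when the parent directory is missing
  matches != null && matches.nonEmpty
}

val df = if (existing.nonEmpty) sqlContext.read.parquet(existing: _*)
         else sqlContext.emptyDataFrame   // or raise a clearer error
{code}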





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12369) DataFrameReader fails on globbing parquet paths

2015-12-16 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-12369:
--
Summary: DataFrameReader fails on globbing parquet paths  (was: 
ataFrameReader fails on globbing parquet paths)

> DataFrameReader fails on globbing parquet paths
> ---
>
> Key: SPARK-12369
> URL: https://issues.apache.org/jira/browse/SPARK-12369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yana Kadiyska
>
> Start with a list of parquet paths where some or all do not exist:
> {noformat}
> val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet")
>  sqlContext.read.parquet(paths:_*)
> java.lang.NullPointerException
> at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
> {noformat}
> It would be better to produce a DataFrame from the paths that do exist and 
> log a warning for each path that is missing. I'm not sure what should happen 
> when all of the paths are missing -- perhaps return an empty DataFrame with 
> no schema, or throw a clearer exception. Either way, I would prefer not to 
> have to pre-validate the paths myself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12369) DataFrameReader fails on globbing parquet paths

2015-12-16 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-12369:
--
Description: 
Start with a list of parquet paths where some or all do not exist:

{noformat}
val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet")

 sqlContext.read.parquet(paths:_*)
java.lang.NullPointerException
at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
at 
org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
at 
org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
{noformat}

It would be better to produce a DataFrame from the paths that do exist and log 
a warning for each path that is missing. When all of the paths are missing, the 
method could probably return an empty DataFrame with no schema, since it 
already does so for an empty path list. Either way, I would prefer not to have 
to pre-validate the paths myself.



  was:
Start with a list of parquet paths where some or all do not exist:

{noformat}
val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet")

 sqlContext.read.parquet(paths:_*)
java.lang.NullPointerException
at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
at 
org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
at 
org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
{noformat}

It would be better to produce a DataFrame from the paths that do exist and log 
a warning for each path that is missing. I'm not sure what should happen when 
all of the paths are missing -- perhaps return an empty DataFrame with no 
schema, or throw a clearer exception. Either way, I would prefer not to have to 
pre-validate the paths myself.




> DataFrameReader fails on globbing parquet paths
> ---
>
> Key: SPARK-12369
> URL: https://issues.apache.org/jira/browse/SPARK-12369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yana Kadiyska
>
> Start with a list of parquet paths where some or all do not exist:
> {noformat}
> val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet")
>  sqlContext.read.parquet(paths:_*)
> java.lang.NullPointerException
> at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
> {noformat}
> It would be better to 

[jira] [Commented] (SPARK-4497) HiveThriftServer2 does not exit properly on failure

2015-12-15 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058324#comment-15058324
 ] 

Yana Kadiyska commented on SPARK-4497:
--

[~jeffzhang] I have moved on to 1.2 so I cannot comment on this any longer. I'm 
ok to close

> HiveThriftServer2 does not exit properly on failure
> ---
>
> Key: SPARK-4497
> URL: https://issues.apache.org/jira/browse/SPARK-4497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Yana Kadiyska
>Priority: Critical
>
> start thriftserver with 
> {{sbin/start-thriftserver.sh --master ...}}
> If there is an error (in my case namenode is in standby mode) the driver 
> shuts down properly:
> {code}
> 14/11/19 16:32:58 ERROR HiveThriftServer2: Error starting HiveThriftServer2
> 
> 14/11/19 16:32:59 INFO SparkUI: Stopped Spark web UI at http://myip:4040
> 14/11/19 16:32:59 INFO DAGScheduler: Stopping DAGScheduler
> 14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Shutting down all 
> executors
> 14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Asking each executor to 
> shut down
> 14/11/19 16:33:00 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor 
> stopped!
> 14/11/19 16:33:00 INFO MemoryStore: MemoryStore cleared
> 14/11/19 16:33:00 INFO BlockManager: BlockManager stopped
> 14/11/19 16:33:00 INFO BlockManagerMaster: BlockManagerMaster stopped
> 14/11/19 16:33:00 INFO SparkContext: Successfully stopped SparkContext
> {code}
> but trying to run {{sbin/start-thriftserver.sh --master ... }} again results 
> in an error that the Thriftserver is already running.
> {{ps -aef|grep }} shows
> {code}
> root 32334 1  0 16:32 ?00:00:00 /usr/local/bin/java 
> org.apache.spark.deploy.SparkSubmitDriverBootstrapper --class 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --master 
> spark://myip:7077 --conf -spark.executor.extraJavaOptions=-verbose:gc 
> -XX:-PrintGCDetails -XX:+PrintGCTimeStamps spark-internal --hiveconf 
> hive.root.logger=INFO,console
> {code}
> This is problematic since we have a process that tries to restart the driver 
> if it dies



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9405) approxCountDistinct does not work with GroupBy

2015-07-28 Thread Yana Kadiyska (JIRA)
Yana Kadiyska created SPARK-9405:


 Summary: approxCountDistinct does not work with GroupBy
 Key: SPARK-9405
 URL: https://issues.apache.org/jira/browse/SPARK-9405
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: Yana Kadiyska


{code}
case class MockCustomer(val customer_id: Int, val host: String)

val df = sc.parallelize(1 to 10).map(i => MockCustomer(1234,
  if (i % 2 == 0) "http://foo.com" else "http://bar.com")).toDF

// this works OK
df.groupBy($"host").agg(count($"*"), sum($"customer_id")).show

// but this doesn't:
df.groupBy($"host").agg(approxCountDistinct($"*"), sum($"customer_id")).show
15/07/28 10:46:14 INFO BlockManagerInfo: Removed broadcast_55_piece0 on 
localhost:33727 in memory (size: 4.4 KB, free: 265.3 MB)
15/07/28 10:46:14 INFO BlockManagerInfo: Removed broadcast_54_piece0 on 
localhost:33727 in memory (size: 4.4 KB, free: 265.3 MB)
org.apache.spark.sql.AnalysisException: cannot resolve 'host' given input 
columns customer_id, host;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:63)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
{code}
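
One possible workaround, sketched below but not verified against 1.4.1: 
approximate-count a concrete column instead of {{$"*"}} (here {{customer_id}}, 
reusing the MockCustomer frame above):

{code}
import org.apache.spark.sql.functions.{approxCountDistinct, sum}
// assumes import sqlContext.implicits._ (automatic in the shell) for the $"..." syntax

df.groupBy($"host")
  .agg(approxCountDistinct($"customer_id"), sum($"customer_id"))
  .show()
{code}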



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8956) Rollup produces incorrect result when group by contains expressions

2015-07-10 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska resolved SPARK-8956.
--
Resolution: Duplicate

Duplicate of SPARK-8972

 Rollup produces incorrect result when group by contains expressions
 ---

 Key: SPARK-8956
 URL: https://issues.apache.org/jira/browse/SPARK-8956
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yana Kadiyska

 Rollup produces incorrect results when the GROUP BY clause contains an expression:
 {code}
 case class KeyValue(key: Int, value: String)
 val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
 df.registerTempTable("foo")
 sqlContext.sql("select count(*) as cnt, key % 100 as key, GROUPING__ID from foo group by key % 100 with rollup").show(100)
 {code}
 As a workaround, this works correctly:
 {code}
 val df1 = df.withColumn("newkey", df("key") % 100)
 df1.registerTempTable("foo1")
 sqlContext.sql("select count(*) as cnt, newkey as key, GROUPING__ID as grp from foo1 group by newkey with rollup").show(100)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8956) Rollup produces incorrect result when group by contains expressions

2015-07-09 Thread Yana Kadiyska (JIRA)
Yana Kadiyska created SPARK-8956:


 Summary: Rollup produces incorrect result when group by contains 
expressions
 Key: SPARK-8956
 URL: https://issues.apache.org/jira/browse/SPARK-8956
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yana Kadiyska


Rollup produces incorrect results when the GROUP BY clause contains an expression:
{code}
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF

df.registerTempTable("foo")

sqlContext.sql("select count(*) as cnt, key % 100 as key, GROUPING__ID from foo group by key % 100 with rollup").show(100)
{code}

As a workaround, this works correctly:
{code}
val df1 = df.withColumn("newkey", df("key") % 100)
df1.registerTempTable("foo1")
sqlContext.sql("select count(*) as cnt, newkey as key, GROUPING__ID as grp from foo1 group by newkey with rollup").show(100)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1403) Spark on Mesos does not set Thread's context class loader

2015-06-19 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-1403:
-
Target Version/s: 1.5.0

 Spark on Mesos does not set Thread's context class loader
 -

 Key: SPARK-1403
 URL: https://issues.apache.org/jira/browse/SPARK-1403
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.3.0, 1.4.0
 Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker
 Fix For: 1.0.0


 I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark 
 executor on mesos slave throws a  java.lang.ClassNotFoundException for 
 org.apache.spark.serializer.JavaSerializer.
 The lengthy discussion is here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513
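
 For anyone else debugging this, a minimal sketch of the general pattern the 
 title refers to (an illustration only, not the actual fix in the Mesos 
 executor backend): a thread that performs reflective lookups needs its context 
 class loader pointed at a loader that can actually see the Spark classes.
 {code}
 // Illustration only: make a worker thread's reflective lookups resolve against
 // the loader that contains the Spark classes.
 val sparkLoader = classOf[org.apache.spark.serializer.JavaSerializer].getClassLoader

 val t = new Thread(new Runnable {
   override def run(): Unit = {
     Thread.currentThread().setContextClassLoader(sparkLoader)
     // Without the line above, a lookup like this can fail with
     // ClassNotFoundException when the thread's default loader is too narrow.
     Class.forName("org.apache.spark.serializer.JavaSerializer", true,
       Thread.currentThread().getContextClassLoader)
   }
 })
 t.start()
 t.join()
 {code}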



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1403) Spark on Mesos does not set Thread's context class loader

2015-06-19 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593725#comment-14593725
 ] 

Yana Kadiyska commented on SPARK-1403:
--

[~pwendell] at your convenience can you please triage this bug. It was 
originally opened as a blocker. I reopened but am not sure if it's a release 
blocker. I set the target release to 1.5 just so it shows up in triage queries 
since it's a reopened bug...

 Spark on Mesos does not set Thread's context class loader
 -

 Key: SPARK-1403
 URL: https://issues.apache.org/jira/browse/SPARK-1403
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.3.0, 1.4.0
 Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker
 Fix For: 1.0.0


 I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark 
 executor on mesos slave throws a  java.lang.ClassNotFoundException for 
 org.apache.spark.serializer.JavaSerializer.
 The lengthy discussion is here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1403) Spark on Mesos does not set Thread's context class loader

2015-06-16 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-1403:
-
Affects Version/s: 1.4.0

 Spark on Mesos does not set Thread's context class loader
 -

 Key: SPARK-1403
 URL: https://issues.apache.org/jira/browse/SPARK-1403
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.3.0, 1.4.0
 Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker
 Fix For: 1.0.0


 I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark 
 executor on mesos slave throws a  java.lang.ClassNotFoundException for 
 org.apache.spark.serializer.JavaSerializer.
 The lengthy discussion is here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-1403) Spark on Mesos does not set Thread's context class loader

2015-06-11 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska reopened SPARK-1403:
--

Multiple users are reporting that this is occurring again in 1.3

 Spark on Mesos does not set Thread's context class loader
 -

 Key: SPARK-1403
 URL: https://issues.apache.org/jira/browse/SPARK-1403
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker
 Fix For: 1.0.0


 I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark 
 executor on mesos slave throws a  java.lang.ClassNotFoundException for 
 org.apache.spark.serializer.JavaSerializer.
 The lengthy discussion is here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1403) Spark on Mesos does not set Thread's context class loader

2015-06-11 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-1403:
-
Affects Version/s: 1.3.0

 Spark on Mesos does not set Thread's context class loader
 -

 Key: SPARK-1403
 URL: https://issues.apache.org/jira/browse/SPARK-1403
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.3.0
 Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker
 Fix For: 1.0.0


 I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark 
 executor on mesos slave throws a  java.lang.ClassNotFoundException for 
 org.apache.spark.serializer.JavaSerializer.
 The lengthy discussion is here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-06-01 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567601#comment-14567601
 ] 

Yana Kadiyska commented on SPARK-5389:
--

FWIW I just tried the 1.4-rc3 build 
(http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/) cdh4 
binary and it runs without issues. From the exact same command prompt I can run 
the 1.4 script but not the 1.2 script. So if we can't figure out a consistent 
repro, maybe other folks can confirm if the new cmd files work... 

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Shell, Windows
Affects Versions: 1.2.0
 Environment: Windows 7
Reporter: Yana Kadiyska
 Attachments: SparkShell_Win7.JPG, spark_bug.png


 spark-shell.cmd crashes in the Windows 7 DOS prompt but works fine under 
 PowerShell. spark-shell.cmd works fine for me in v1.1, so this is new in 
 Spark 1.2.
 Marking as trivial since calling spark-shell2.cmd also works fine.
 Attaching a screenshot since the error isn't very useful:
 {code}
 spark-1.2.0-bin-cdh4>bin\spark-shell.cmd
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3815) LPAD function does not work in where predicate

2015-05-26 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559396#comment-14559396
 ] 

Yana Kadiyska commented on SPARK-3815:
--

I will close -- I have not observed this in 1.2.x versions

 LPAD function does not work in where predicate
 --

 Key: SPARK-3815
 URL: https://issues.apache.org/jira/browse/SPARK-3815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Priority: Minor

 select customer_id from mytable where 
 pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2
 produces:
 14/10/03 14:51:35 ERROR server.SparkSQLOperationManager: Error executing 
 query:
 org.apache.spark.SparkException: Task not serializable
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
 at 
 org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
 at 
 org.apache.spark.sql.execution.Limit.execute(basicOperators.scala:146)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
 at 
 org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:185)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
 at 
 org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150)
 at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
 at 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.io.NotSerializableException: java.lang.reflect.Constructor
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at 
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at 
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at 
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
 at 
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at 
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at 
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
 at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 The following work fine:
 select concat_ws('-', LPAD(cast(112717 % 1024 AS STRING),4,'0'),'2014-07') 
 from mytable where pkey='0077-2014-07' LIMIT 2

[jira] [Resolved] (SPARK-3815) LPAD function does not work in where predicate

2015-05-26 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska resolved SPARK-3815.
--
Resolution: Cannot Reproduce

 LPAD function does not work in where predicate
 --

 Key: SPARK-3815
 URL: https://issues.apache.org/jira/browse/SPARK-3815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Priority: Minor

 select customer_id from mytable where 
 pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2
 produces:
 14/10/03 14:51:35 ERROR server.SparkSQLOperationManager: Error executing 
 query:
 org.apache.spark.SparkException: Task not serializable
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
 at 
 org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
 at 
 org.apache.spark.sql.execution.Limit.execute(basicOperators.scala:146)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
 at 
 org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:185)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
 at 
 org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150)
 at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
 at 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.io.NotSerializableException: java.lang.reflect.Constructor
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at 
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at 
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at 
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
 at 
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at 
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at 
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
 at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 The following work fine:
 select concat_ws('-', LPAD(cast(112717 % 1024 AS STRING),4,'0'),'2014-07') 
 from mytable where pkey='0077-2014-07' LIMIT 2
 select customer_id from mytable  where pkey=concat_ws('-','0077','2014-07') 
 

[jira] [Closed] (SPARK-3815) LPAD function does not work in where predicate

2015-05-26 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska closed SPARK-3815.


I have not been able to reproduce this behavior with 1.2.x versions so I'm 
assuming some commit in the interim fixed this.

 LPAD function does not work in where predicate
 --

 Key: SPARK-3815
 URL: https://issues.apache.org/jira/browse/SPARK-3815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Priority: Minor

 select customer_id from mytable where 
 pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2
 produces:
 14/10/03 14:51:35 ERROR server.SparkSQLOperationManager: Error executing 
 query:
 org.apache.spark.SparkException: Task not serializable
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
 at 
 org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
 at 
 org.apache.spark.sql.execution.Limit.execute(basicOperators.scala:146)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
 at 
 org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:185)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
 at 
 org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150)
 at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
 at 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.io.NotSerializableException: java.lang.reflect.Constructor
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at 
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at 
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at 
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
 at 
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at 
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at 
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
 at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 The following work fine:
 select concat_ws('-', LPAD(cast(112717 % 1024 AS STRING),4,'0'),'2014-07') 
 from mytable where pkey='0077-2014-07' 

[jira] [Created] (SPARK-7792) HiveContext registerTempTable not thread safe

2015-05-21 Thread Yana Kadiyska (JIRA)
Yana Kadiyska created SPARK-7792:


 Summary: HiveContext registerTempTable not thread safe
 Key: SPARK-7792
 URL: https://issues.apache.org/jira/browse/SPARK-7792
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Yana Kadiyska


{quote}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import com.google.common.base.Joiner;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

public class ThreadRepro {
    public static void main(String[] args) throws Exception {
        new ThreadRepro().sparkPerfTest();
    }

    public void sparkPerfTest() {
        final AtomicLong counter = new AtomicLong();
        SparkConf conf = new SparkConf();
        conf.setAppName("My Application");
        conf.setMaster("local[7]");
        SparkContext sc = new SparkContext(conf);

        org.apache.spark.sql.hive.HiveContext hc =
            new org.apache.spark.sql.hive.HiveContext(sc);
        int poolSize = 10;
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < poolSize; i++)
            pool.execute(new QueryJob(hc, i, counter));

        pool.shutdown();
        try {
            pool.awaitTermination(60, TimeUnit.MINUTES);
        } catch (Exception e) {
            System.out.println("Thread interrupted");
        }
        System.out.println("All jobs complete");
        System.out.println(" Counter is " + counter.get());
    }
}

class QueryJob implements Runnable {
    String threadId;
    org.apache.spark.sql.hive.HiveContext sqlContext;
    String key;
    AtomicLong counter;
    final AtomicLong local_counter = new AtomicLong();

    public QueryJob(org.apache.spark.sql.hive.HiveContext _sqlContext, int id, AtomicLong ctr) {
        threadId = "thread_" + id;
        this.sqlContext = _sqlContext;
        this.counter = ctr;
    }

    public void run() {
        for (int i = 0; i < 100; i++) {
            String tblName = threadId + "_" + i;
            DataFrame df = sqlContext.emptyDataFrame();
            df.registerTempTable(tblName);
            String _query = String.format("select count(*) from %s", tblName);
            System.out.println(String.format(" registered table %s; catalog (%s) ",
                tblName, debugTables()));
            List<Row> res;
            try {
                res = sqlContext.sql(_query).collectAsList();
            } catch (Exception e) {
                System.out.println("*Exception " + debugTables() + "**");
                throw e;
            }
            sqlContext.dropTempTable(tblName);
            System.out.println(" dropped table " + tblName);
            try {
                Thread.sleep(3000); // lets make this a not-so-tight loop
            } catch (Exception e) {
                System.out.println("Thread interrupted");
            }
        }
    }

    private String debugTables() {
        String v = Joiner.on(',').join(sqlContext.tableNames());
        if (v == null) return ""; else return v;
    }
}
{quote}

this will periodically produce the following:

{quote}
 registered table thread_0_50; catalog (thread_1_50)
 registered table thread_4_50; catalog (thread_4_50,thread_1_50)
 registered table thread_1_50; catalog (thread_1_50)
 dropped table thread_1_50
 dropped table thread_4_50
*Exception **
Exception in thread "pool-6-thread-1" java.lang.Error: 
org.apache.spark.sql.AnalysisException: no such table thread_0_50; line 1 pos 21
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.sql.AnalysisException: no such table thread_0_50; 
line 1 pos 21
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:177)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:186)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:181)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:208)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at 

[jira] [Updated] (SPARK-7792) HiveContext registerTempTable not thread safe

2015-05-21 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-7792:
-
Description: 
{code:java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import com.google.common.base.Joiner;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

public class ThreadRepro {
    public static void main(String[] args) throws Exception {
        new ThreadRepro().sparkPerfTest();
    }

    public void sparkPerfTest() {
        final AtomicLong counter = new AtomicLong();
        SparkConf conf = new SparkConf();
        conf.setAppName("My Application");
        conf.setMaster("local[7]");
        SparkContext sc = new SparkContext(conf);

        org.apache.spark.sql.hive.HiveContext hc =
            new org.apache.spark.sql.hive.HiveContext(sc);
        int poolSize = 10;
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < poolSize; i++)
            pool.execute(new QueryJob(hc, i, counter));

        pool.shutdown();
        try {
            pool.awaitTermination(60, TimeUnit.MINUTES);
        } catch (Exception e) {
            System.out.println("Thread interrupted");
        }
        System.out.println("All jobs complete");
        System.out.println(" Counter is " + counter.get());
    }
}

class QueryJob implements Runnable {
    String threadId;
    org.apache.spark.sql.hive.HiveContext sqlContext;
    String key;
    AtomicLong counter;
    final AtomicLong local_counter = new AtomicLong();

    public QueryJob(org.apache.spark.sql.hive.HiveContext _sqlContext, int id, AtomicLong ctr) {
        threadId = "thread_" + id;
        this.sqlContext = _sqlContext;
        this.counter = ctr;
    }

    public void run() {
        for (int i = 0; i < 100; i++) {
            String tblName = threadId + "_" + i;
            DataFrame df = sqlContext.emptyDataFrame();
            df.registerTempTable(tblName);
            String _query = String.format("select count(*) from %s", tblName);
            System.out.println(String.format(" registered table %s; catalog (%s) ",
                tblName, debugTables()));
            List<Row> res;
            try {
                res = sqlContext.sql(_query).collectAsList();
            } catch (Exception e) {
                System.out.println("*Exception " + debugTables() + "**");
                throw e;
            }
            sqlContext.dropTempTable(tblName);
            System.out.println(" dropped table " + tblName);
            try {
                Thread.sleep(3000); // lets make this a not-so-tight loop
            } catch (Exception e) {
                System.out.println("Thread interrupted");
            }
        }
    }

    private String debugTables() {
        String v = Joiner.on(',').join(sqlContext.tableNames());
        if (v == null) return ""; else return v;
    }
}
{code}

this will periodically produce the following:

{quote}
 registered table thread_0_50; catalog (thread_1_50)
 registered table thread_4_50; catalog (thread_4_50,thread_1_50)
 registered table thread_1_50; catalog (thread_1_50)
 dropped table thread_1_50
 dropped table thread_4_50
*Exception **
Exception in thread "pool-6-thread-1" java.lang.Error: 
org.apache.spark.sql.AnalysisException: no such table thread_0_50; line 1 pos 21
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.sql.AnalysisException: no such table thread_0_50; 
line 1 pos 21
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:177)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:186)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:181)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:208)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  at 

[jira] [Updated] (SPARK-4412) Parquet logger cannot be configured

2015-05-19 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-4412:
-
Affects Version/s: 1.3.1

 Parquet logger cannot be configured
 ---

 Key: SPARK-4412
 URL: https://issues.apache.org/jira/browse/SPARK-4412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.1
Reporter: Jim Carroll

 The Spark ParquetRelation.scala code assumes that the parquet.Log class has 
 already been loaded. If ParquetRelation.enableLogForwarding executes before 
 the parquet.Log class is loaded, then the code in enableLogForwarding has no 
 effect.
 ParquetRelation.scala attempts to override the parquet logger but, at least 
 currently (and if your application simply reads a parquet file before it does 
 anything else with Parquet), the parquet.Log class hasn't been loaded yet. 
 Therefore the code in ParquetRelation.enableLogForwarding has no effect. If 
 you look at the code in parquet.Log, there's a static initializer that needs 
 to run before enableLogForwarding, or whatever enableLogForwarding does gets 
 undone by that static initializer.
 The fix would be to force the static initializer to be called in parquet.Log 
 as part of enableLogForwarding.
 PR will be forthcoming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4412) Parquet logger cannot be configured

2015-05-16 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545677#comment-14545677
 ] 

Yana Kadiyska edited comment on SPARK-4412 at 5/16/15 5:11 PM:
---

I would like to reopen as I believe the issue has again regressed in Spark 
1.3.0. This SO thread has a lengthy discussion 
http://stackoverflow.com/questions/30052889/how-to-suppress-parquet-log-messages-in-spark
 but the short summary is that the log4j.rootCategory=ERROR, console setting 
still leaks
{quote}
 INFO: parquet.hadoop.InternalParquetRecordReader 
{quote} messages


was (Author: yanakad):
I would like to reopen as I believe the issue has again regressed in Spark 
1.3.0. This SO thread has a lengthy discussion 
http://stackoverflow.com/questions/30052889/how-to-suppress-parquet-log-messages-in-spark
 but the short summary is that the log4j.rootCategory=ERROR, console setting 
still leaks INFO: parquet.hadoop.InternalParquetRecordReader messages

 Parquet logger cannot be configured
 ---

 Key: SPARK-4412
 URL: https://issues.apache.org/jira/browse/SPARK-4412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jim Carroll

 The Spark ParquetRelation.scala code assumes that the parquet.Log class has 
 already been loaded. If ParquetRelation.enableLogForwarding executes before 
 the parquet.Log class is loaded, then the code in enableLogForwarding has no 
 effect.
 ParquetRelation.scala attempts to override the parquet logger but, at least 
 currently (and if your application simply reads a parquet file before it does 
 anything else with Parquet), the parquet.Log class hasn't been loaded yet. 
 Therefore the code in ParquetRelation.enableLogForwarding has no effect. If 
 you look at the code in parquet.Log, there's a static initializer that needs 
 to run before enableLogForwarding, or whatever enableLogForwarding does gets 
 undone by that static initializer.
 The fix would be to force the static initializer to be called in parquet.Log 
 as part of enableLogForwarding.
 PR will be forthcoming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-4412) Parquet logger cannot be configured

2015-05-15 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska reopened SPARK-4412:
--

Reopening as the issue reappeared in 1.3.0

 Parquet logger cannot be configured
 ---

 Key: SPARK-4412
 URL: https://issues.apache.org/jira/browse/SPARK-4412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jim Carroll
 Fix For: 1.2.0


 The Spark ParquetRelation.scala code assumes that the parquet.Log class has 
 already been loaded. If ParquetRelation.enableLogForwarding executes before 
 the parquet.Log class is loaded, then the code in enableLogForwarding has no 
 effect.
 ParquetRelation.scala attempts to override the parquet logger but, at least 
 currently (and if your application simply reads a parquet file before it does 
 anything else with Parquet), the parquet.Log class hasn't been loaded yet. 
 Therefore the code in ParquetRelation.enableLogForwarding has no effect. If 
 you look at the code in parquet.Log, there's a static initializer that needs 
 to run before enableLogForwarding, or whatever enableLogForwarding does gets 
 undone by that static initializer.
 The fix would be to force the static initializer to be called in parquet.Log 
 as part of enableLogForwarding.
 PR will be forthcoming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4412) Parquet logger cannot be configured

2015-05-15 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545677#comment-14545677
 ] 

Yana Kadiyska commented on SPARK-4412:
--

I would like to reopen as I believe the issue has again regressed in Spark 
1.3.0. This SO thread has a lengthy discussion 
http://stackoverflow.com/questions/30052889/how-to-suppress-parquet-log-messages-in-spark
 but the short summary is that the log4j.rootCategory=ERROR, console setting 
still leaks INFO: parquet.hadoop.InternalParquetRecordReader messages
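
In case it helps, the user-side mitigation I'm experimenting with -- a sketch 
only, untested; it assumes the pre-Apache {{parquet.Log}} class name and that 
parquet-mr logs through java.util.logging:

{code}
import java.util.logging.{Level, Logger}

// Force parquet.Log's static initializer to run first, then silence the
// java.util.logging logger it installs. Class/package names may differ
// between parquet-mr versions.
Class.forName("parquet.Log")
val parquetLogger = Logger.getLogger("parquet")
parquetLogger.getHandlers.foreach(parquetLogger.removeHandler)
parquetLogger.setLevel(Level.OFF)
{code}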

 Parquet logger cannot be configured
 ---

 Key: SPARK-4412
 URL: https://issues.apache.org/jira/browse/SPARK-4412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jim Carroll
 Fix For: 1.2.0


 The Spark ParquetRelation.scala code assumes that the parquet.Log class has 
 already been loaded. If ParquetRelation.enableLogForwarding executes before 
 the parquet.Log class is loaded, then the code in enableLogForwarding has no 
 effect.
 ParquetRelation.scala attempts to override the parquet logger but, at least 
 currently (and if your application simply reads a parquet file before it does 
 anything else with Parquet), the parquet.Log class hasn't been loaded yet. 
 Therefore the code in ParquetRelation.enableLogForwarding has no effect. If 
 you look at the code in parquet.Log, there's a static initializer that needs 
 to run before enableLogForwarding, or whatever enableLogForwarding does gets 
 undone by that static initializer.
 The fix would be to force the static initializer to be called in parquet.Log 
 as part of enableLogForwarding.
 PR will be forthcoming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-08 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14536114#comment-14536114
 ] 

Yana Kadiyska commented on SPARK-3928:
--

[~tkyaw] Your suggested workaround does work. One question though -- what are 
the implications of turning off spark.sql.parquet.useDataSourceApi? My 
particular concern is with predicate pushdowns into parquet -- am I going to 
lose those? (It's hard to tell from the UI whether pushdown is happening 
correctly.)

Also, can you clarify whether you still plan to fix this for 1.4, or whether 
"New parquet implementation does not contain wild card support yet" means that 
we'd have to live with spark.sql.parquet.useDataSourceApi until further notice?
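
For the record, the stopgap I'm considering while wildcard support is out: 
expand the glob myself with the Hadoop FileSystem API and pass the concrete 
directories to {{parquetFile}}. A sketch only -- the path pattern below is 
hypothetical and I haven't verified this against 1.3.0:

{code}
import org.apache.hadoop.fs.Path

// Hypothetical partition layout: expand the wildcard by hand and pass the
// matching directories to parquetFile explicitly.
val pattern = new Path("/r/warehouse/hive/pkey=*-2015-04")
val fs = pattern.getFileSystem(sc.hadoopConfiguration)
val statuses = fs.globStatus(pattern)   // null when nothing matches
val dirs = if (statuses == null) Seq.empty[String]
           else statuses.map(_.getPath.toString).toSeq
val df = sqlContext.parquetFile(dirs: _*)
{code}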

 Support wildcard matches on Parquet files
 -

 Key: SPARK-3928
 URL: https://issues.apache.org/jira/browse/SPARK-3928
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Nicholas Chammas
Assignee: Cheng Lian
Priority: Minor
 Fix For: 1.3.0


 {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
 {{2014-\?\?-\?\?}}. 
 It would be nice if {{SparkContext.parquetFile()}} did the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-07 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532680#comment-14532680
 ] 

Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 2:16 PM:
--

Marius, are you saying that wildcards are not supported then? In my case, I 
would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works with the 
textFile method, btw) -- i.e. pass a single path for all April 2015 partitions. 
Enumerating all the paths underneath is pretty crazy; that's a huge list.
 Are you saying that is the only way? I thought the whole point of this bug is 
that we _don't_ have to enumerate the paths explicitly. Also, in my case hc is a 
HiveContext instance, not a DataFrame.

As a side note, I am trying to use this feature as a workaround to 
https://issues.apache.org/jira/browse/SPARK-6910 -- Michael A. suggested a 
workaround which takes way too long in our case -- I was hoping to be able to 
create a DF from a subset of partitions... But it would be a pain to build an 
explicit list... Would like to know for sure whether that's the deliberate design 
though...
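
(A minimal sketch of that explicit-list fallback, assuming hc is a HiveContext 
and sc the SparkContext; apart from the path quoted above, everything here is 
illustrative.)

{code}
// Hedged sketch: expand the month's partition directories ourselves with the
// Hadoop FileSystem API and hand the resulting list to parquetFile.
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val aprilPaths = fs
  .globStatus(new Path("/r/warehouse/hive/pkey=-2015-04/*"))
  .map(_.getPath.toString)

// parquetFile is variadic, so the expanded list can be splatted in:
val aprilDf = hc.parquetFile(aprilPaths: _*)
{code}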


was (Author: yanakad):
Marius, are you saying that wildcards are not supported then? in my case, I 
would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works w/ 
textFile method btw) -- i.e. pass a single path for all April 2015 partitions. 
Enumerating all paths underneath is pretty crazy, that's a huge list.
 Are you saying that is the only way? I thought the whole point of this bug is 
that we _don't_ have to enumerate the paths explicitly. Also in my case hc is a 
HiveContext instance, not a dataframe.

As a side note I am trying to use this feature as a workaround to 
https://issues.apache.org/jira/browse/SPARK-6910 -- Mike A. suggested a work 
around which takes way too long in our case -- I was hoping to be able to 
create a DF from a subset of partitions...But it would be a pain to build an 
explicit listWould like to know for sure if that's the deliberate design 
though...

 Support wildcard matches on Parquet files
 -

 Key: SPARK-3928
 URL: https://issues.apache.org/jira/browse/SPARK-3928
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0


 {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
 {{2014-\?\?-\?\?}}. 
 It would be nice if {{SparkContext.parquetFile()}} did the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-07 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532716#comment-14532716
 ] 

Yana Kadiyska commented on SPARK-3928:
--

Then I think they should change the resolved status to Resolved - Won't Fix if 
that's a conscious decision. I did make an edit to my previous comment 
explaining why I'd love this to be different.

 Support wildcard matches on Parquet files
 -

 Key: SPARK-3928
 URL: https://issues.apache.org/jira/browse/SPARK-3928
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0


 {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
 {{2014-\?\?-\?\?}}. 
 It would be nice if {{SparkContext.parquetFile()}} did the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-07 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532624#comment-14532624
 ] 

Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 1:48 PM:
--

I am observing the same issue.

Downloaded a pre-built CDH4 1.3.1 distro.

{quote}
scala> sc.textFile("/r/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first
res0: String = PAR1... (binary Parquet file contents rendered as a String)

scala> hc.parquetFile("/r/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")
java.io.FileNotFoundException: File does not exist: 
/r/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet
{quote}


was (Author: yanakad):
I am observing the same issue.

Downloaded a pre-built CDH4 1.3.1 distro.

{quote}
scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first
res0: String = PAR1... (binary Parquet file contents rendered as a String)

scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")
java.io.FileNotFoundException: File does not exist: 
/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet
{quote}

 Support wildcard matches on Parquet files
 -

 Key: SPARK-3928
 URL: https://issues.apache.org/jira/browse/SPARK-3928
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0


 {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
 {{2014-\?\?-\?\?}}. 
 It would be nice if {{SparkContext.parquetFile()}} did the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-07 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532680#comment-14532680
 ] 

Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 2:15 PM:
--

Marius, are you saying that wildcards are not supported then? in my case, I 
would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works w/ 
textFile method btw) -- i.e. pass a single path for all April 2015 partitions. 
Enumerating all paths underneath is pretty crazy, that's a huge list.
 Are you saying that is the only way? I thought the whole point of this bug is 
that we _don't_ have to enumerate the paths explicitly. Also in my case hc is a 
HiveContext instance, not a dataframe.

As a side note I am trying to use this feature as a workaround to 
https://issues.apache.org/jira/browse/SPARK-6910 -- Mike A. suggested a work 
around which takes way too long in our case -- I was hoping to be able to 
create a DF from a subset of partitions...But it would be a pain to build an 
explicit listWould like to know for sure if that's the deliberate design 
though...


was (Author: yanakad):
Marius, are you saying that wildcards are not supported then? in my case, I 
would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works w/ 
textFile method btw) -- i.e. pass a single path for all April 2015 partitions. 
Enumerating all paths underneath is pretty crazy, that's a huge list.
 Are you saying that is the only way? I thought the whole point of this bug is 
that we _don't_ have to enumerate the paths explicitly. Also in my case hc is a 
HiveContext instance, not a dataframe.

 Support wildcard matches on Parquet files
 -

 Key: SPARK-3928
 URL: https://issues.apache.org/jira/browse/SPARK-3928
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0


 {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
 {{2014-\?\?-\?\?}}. 
 It would be nice if {{SparkContext.parquetFile()}} did the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-07 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532624#comment-14532624
 ] 

Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 1:38 PM:
--

I am observing the same issue.

Downloaded a pre-built CDH4 1.3.1 distro.

{quote}
scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first
res0: String = PAR1... (binary Parquet file contents rendered as a String)

scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")
java.io.FileNotFoundException: File does not exist: 
/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet
{quote}


was (Author: yanakad):
I am observing the same issue.

Downloaded a pre-built CDH4 1.3.1 distro.

{quote}
scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first
res0: String = PAR1... (binary Parquet file contents rendered as a String)

scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")
java.io.FileNotFoundException: File does not exist: 
hdfs://cdh4-21968-nn/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet
{quote}

 Support wildcard matches on Parquet files
 -

 Key: SPARK-3928
 URL: https://issues.apache.org/jira/browse/SPARK-3928
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0


 {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
 {{2014-\?\?-\?\?}}. 
 It would be nice if {{SparkContext.parquetFile()}} did the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-07 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532624#comment-14532624
 ] 

Yana Kadiyska commented on SPARK-3928:
--

I am observing the same issue.

Downloaded a pre-built CDH4 1.3.1 distro.

{quote}
scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first
res0: String = PAR1... (binary Parquet file contents rendered as a String)

scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")
java.io.FileNotFoundException: File does not exist: 
hdfs://cdh4-21968-nn/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet
{quote}

 Support wildcard matches on Parquet files
 -

 Key: SPARK-3928
 URL: https://issues.apache.org/jira/browse/SPARK-3928
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0


 {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
 {{2014-\?\?-\?\?}}. 
 It would be nice if {{SparkContext.parquetFile()}} did the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-07 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532680#comment-14532680
 ] 

Yana Kadiyska commented on SPARK-3928:
--

Marius, are you saying that wildcards are not supported then? in my case, I 
would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works w/ 
textFile method btw) -- i.e. pass a single path for all April 2015 partitions. 
Enumerating all paths underneath is pretty crazy, that's a huge list.
 Are you saying that is the only way? I thought the whole point of this bug is 
that we _don't_ have to enumerate the paths explicitly. Also in my case hc is a 
HiveContext instance, not a dataframe.

 Support wildcard matches on Parquet files
 -

 Key: SPARK-3928
 URL: https://issues.apache.org/jira/browse/SPARK-3928
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0


 {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
 {{2014-\?\?-\?\?}}. 
 It would be nice if {{SparkContext.parquetFile()}} did the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning

2015-05-05 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529639#comment-14529639
 ] 

Yana Kadiyska commented on SPARK-6910:
--

Can you please provide an update on this -- status, and maybe a target release? 
SPARK-6904 was closed as a duplicate of this, but this seems like a critical 
bug. We have a pretty large metastore (partitions are per month per customer, 
with a few years of data) and Shark works OK, but I cannot take advantage of the 
new cool versions of Spark until the metastore interaction improves.

Any advice on a workaround would also be great...
I opened https://issues.apache.org/jira/browse/SPARK-6984 which is probably a 
dup of this.

 Support for pushing predicates down to metastore for partition pruning
 --

 Key: SPARK-6910
 URL: https://issues.apache.org/jira/browse/SPARK-6910
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6984) Operations on tables with many partitions _very_slow

2015-05-05 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519409#comment-14519409
 ] 

Yana Kadiyska edited comment on SPARK-6984 at 5/6/15 12:26 AM:
---

Possibly related to this https://issues.apache.org/jira/browse/SPARK-6910


was (Author: yanakad):
Possibly related to this

 Operations on tables with many partitions _very_slow
 

 Key: SPARK-6984
 URL: https://issues.apache.org/jira/browse/SPARK-6984
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
 Environment: External Hive metastore, table with 30K partitions
Reporter: Yana Kadiyska
 Attachments: 7282_partitions_stack.png


 I have a table with _many_partitions (30K). Users cannot query all of them 
 but they are in the metastore. Querying this table is extremely slow even if 
 we're asking for a single partition. 
 describe sometable also performs _very_ poorly
 {quote}
 Spark produces the following times:
 Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL 
 query: 72.831, Reading results: 0.189
 Whereas Hive over the same metastore shows:
 Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 
 0.204, Reading results: 0.236
 {quote}
 I attempted to debug this and noticed that HiveMetastoreCatalog constructs an 
 object for each partition, which is puzzling to me (attaching screenshot). 
 Should this value be lazy -- describe table should be purely a metastore op 
 IMO (i.e. query postgres, return types).
 The issue is a blocker to me but leaving with default priority until someone 
 can confirm it is a bug. describe table is not so interesting but I think 
 this affects all query paths -- I sent an inquiry earlier here: 
 https://www.mail-archive.com/user@spark.apache.org/msg26242.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6984) Operations on tables with many partitions _very_slow

2015-04-29 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519409#comment-14519409
 ] 

Yana Kadiyska commented on SPARK-6984:
--

Possibly related to this

 Operations on tables with many partitions _very_slow
 

 Key: SPARK-6984
 URL: https://issues.apache.org/jira/browse/SPARK-6984
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
 Environment: External Hive metastore, table with 30K partitions
Reporter: Yana Kadiyska
 Attachments: 7282_partitions_stack.png


 I have a table with _many_partitions (30K). Users cannot query all of them 
 but they are in the metastore. Querying this table is extremely slow even if 
 we're asking for a single partition. 
 describe sometable also performs _very_ poorly
 {quote}
 Spark produces the following times:
 Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL 
 query: 72.831, Reading results: 0.189
 Whereas Hive over the same metastore shows:
 Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 
 0.204, Reading results: 0.236
 {quote}
 I attempted to debug this and noticed that HiveMetastoreCatalog constructs an 
 object for each partition, which is puzzling to me (attaching screenshot). 
 Should this value be lazy -- describe table should be purely a metastore op 
 IMO (i.e. query postgres, return types).
 The issue is a blocker to me but leaving with default priority until someone 
 can confirm it is a bug. describe table is not so interesting but I think 
 this affects all query paths -- I sent an inquiry earlier here: 
 https://www.mail-archive.com/user@spark.apache.org/msg26242.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5923) Very slow query when using Oracle hive metastore and table has lots of partitions

2015-04-17 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499855#comment-14499855
 ] 

Yana Kadiyska commented on SPARK-5923:
--

[~mtaylor] Can you please provide some information on how you debugged this? I 
just experienced a similar issue -- really poor performance on a large metastore 
even though I'm only touching a few partitions. I'm using a PostgreSQL 
metastore. I do not, however, see IN queries logged to Postgres, and according 
to the Postgres log no individual query took longer than 50ms. So I'm hoping to 
get some debugging tips.
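
(One low-tech way to get that visibility, sketched below; the logger names are 
guesses at where the Hive metastore client and its DataNucleus ORM layer log, so 
treat them as assumptions rather than a recipe.)

{code}
// Hedged debugging sketch: raise driver-side log levels for the metastore client
// and DataNucleus so individual metastore calls (and their timing) show up.
import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.apache.hadoop.hive.metastore").setLevel(Level.DEBUG)
Logger.getLogger("DataNucleus").setLevel(Level.DEBUG)
{code}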

 Very slow query when using Oracle  hive metastore and table has lots of 
 partitions
 --

 Key: SPARK-5923
 URL: https://issues.apache.org/jira/browse/SPARK-5923
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Matthew Taylor

 This has two aspects
 * The direct SQL support for Oracle is broken in Hive 0.13.1. It fails when 
 partitions get bigger than 1000 due to an Oracle limitation on the IN clause. 
 This causes a fall back to the ORM, which is very slow (20 minutes to even start the query)
 * Hive itself does not suffer this problem as it passes down to the metadata 
 query filter terms that restrict the partitions returned. SparkSQL is always 
 asking for all partitions even if they are not all needed. Even when we 
 patched Hive it was still taking 2 minutes 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6984) Operations on tables with many partitions _very_slow

2015-04-17 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-6984:
-
Attachment: 7282_partitions_stack.png

 Operations on tables with many partitions _very_slow
 

 Key: SPARK-6984
 URL: https://issues.apache.org/jira/browse/SPARK-6984
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
 Environment: External Hive metastore, table with 30K partitions
Reporter: Yana Kadiyska
 Attachments: 7282_partitions_stack.png


 I have a table with _many_partitions (30K). Users cannot query all of them 
 but they are in the metastore. Querying this table is extremely slow even if 
 we're asking for a single partition. 
 describe table also performs _very_ poorly
 Spark produces the following times:
 Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL 
 query: 72.831, Reading results: 0.189
 Whereas Hive over the same metastore shows:
 Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 
 0.204, Reading results: 0.236
 I attempted to debug this and noticed that HiveMetastoreCatalog constructs an 
 object for each partition, which is puzzling to me (attaching screenshot). 
 Should this value be lazy -- describe table should be purely a metastore op 
 IMO (i.e. query postgres, return types).
 The issue is a blocker to me but leaving with default priority until someone 
 can confirm it is a bug. describe table is not so interesting but I think 
 this affects all query paths -- I sent an inquiry earlier here: 
 https://www.mail-archive.com/user@spark.apache.org/msg26242.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6984) Operations on tables with many partitions _very_slow

2015-04-17 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-6984:
-
Description: 
I have a table with _many_partitions (30K). Users cannot query all of them but 
they are in the metastore. Querying this table is extremely slow even if we're 
asking for a single partition. 
describe table also performs _very_ poorly
{quote}
Spark produces the following times:
Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 
72.831, Reading results: 0.189

Whereas Hive over the same metastore shows:
Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 
0.204, Reading results: 0.236
{quote}
I attempted to debug this and noticed that HiveMetastoreCatalog constructs an 
object for each partition, which is puzzling to me (attaching screenshot). 
Should this value be lazy -- describe table should be purely a metastore op IMO 
(i.e. query postgres, return types).

The issue is a blocker to me but leaving with default priority until someone 
can confirm it is a bug. describe table is not so interesting but I think 
this affects all query paths -- I sent an inquiry earlier here: 
https://www.mail-archive.com/user@spark.apache.org/msg26242.html


  was:
I have a table with _many_partitions (30K). Users cannot query all of them but 
they are in the metastore. Querying this table is extremely slow even if we're 
asking for a single partition. 
describe table also performs _very_ poorly

Spark produces the following times:
Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 
72.831, Reading results: 0.189

Whereas Hive over the same metastore shows:
Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 
0.204, Reading results: 0.236

I attempted to debug this and noticed that HiveMetastoreCatalog constructs an 
object for each partition, which is puzzling to me (attaching screenshot). 
Should this value be lazy -- describe table should be purely a metastore op IMO 
(i.e. query postgres, return types).

The issue is a blocker to me but leaving with default priority until someone 
can confirm it is a bug. describe table is not so interesting but I think 
this affects all query paths -- I sent an inquiry earlier here: 
https://www.mail-archive.com/user@spark.apache.org/msg26242.html



 Operations on tables with many partitions _very_slow
 

 Key: SPARK-6984
 URL: https://issues.apache.org/jira/browse/SPARK-6984
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
 Environment: External Hive metastore, table with 30K partitions
Reporter: Yana Kadiyska
 Attachments: 7282_partitions_stack.png


 I have a table with _many_partitions (30K). Users cannot query all of them 
 but they are in the metastore. Querying this table is extremely slow even if 
 we're asking for a single partition. 
 describe table also performs _very_ poorly
 {quote}
 Spark produces the following times:
 Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL 
 query: 72.831, Reading results: 0.189
 Whereas Hive over the same metastore shows:
 Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 
 0.204, Reading results: 0.236
 {quote}
 I attempted to debug this and noticed that HiveMetastoreCatalog constructs an 
 object for each partition, which is puzzling to me (attaching screenshot). 
 Should this value be lazy -- describe table should be purely a metastore op 
 IMO (i.e. query postgres, return types).
 The issue is a blocker to me but leaving with default priority until someone 
 can confirm it is a bug. describe table is not so interesting but I think 
 this affects all query paths -- I sent an inquiry earlier here: 
 https://www.mail-archive.com/user@spark.apache.org/msg26242.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6984) Operations on tables with many partitions _very_slow

2015-04-17 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-6984:
-
Description: 
I have a table with _many_partitions (30K). Users cannot query all of them but 
they are in the metastore. Querying this table is extremely slow even if we're 
asking for a single partition. 
describe sometable also performs _very_ poorly
{quote}
Spark produces the following times:
Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 
72.831, Reading results: 0.189

Whereas Hive over the same metastore shows:
Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 
0.204, Reading results: 0.236
{quote}
I attempted to debug this and noticed that HiveMetastoreCatalog constructs an 
object for each partition, which is puzzling to me (attaching screenshot). 
Should this value be lazy -- describe table should be purely a metastore op IMO 
(i.e. query postgres, return types).

The issue is a blocker to me but leaving with default priority until someone 
can confirm it is a bug. describe table is not so interesting but I think 
this affects all query paths -- I sent an inquiry earlier here: 
https://www.mail-archive.com/user@spark.apache.org/msg26242.html


  was:
I have a table with _many_partitions (30K). Users cannot query all of them but 
they are in the metastore. Querying this table is extremely slow even if we're 
asking for a single partition. 
describe table also performs _very_ poorly
{quote}
Spark produces the following times:
Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 
72.831, Reading results: 0.189

Whereas Hive over the same metastore shows:
Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 
0.204, Reading results: 0.236
{quote}
I attempted to debug this and noticed that HiveMetastoreCatalog constructs an 
object for each partition, which is puzzling to me (attaching screenshot). 
Should this value be lazy -- describe table should be purely a metastore op IMO 
(i.e. query postgres, return types).

The issue is a blocker to me but leaving with default priority until someone 
can confirm it is a bug. describe table is not so interesting but I think 
this affects all query paths -- I sent an inquiry earlier here: 
https://www.mail-archive.com/user@spark.apache.org/msg26242.html



 Operations on tables with many partitions _very_slow
 

 Key: SPARK-6984
 URL: https://issues.apache.org/jira/browse/SPARK-6984
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
 Environment: External Hive metastore, table with 30K partitions
Reporter: Yana Kadiyska
 Attachments: 7282_partitions_stack.png


 I have a table with _many_partitions (30K). Users cannot query all of them 
 but they are in the metastore. Querying this table is extremely slow even if 
 we're asking for a single partition. 
 describe sometable also performs _very_ poorly
 {quote}
 Spark produces the following times:
 Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL 
 query: 72.831, Reading results: 0.189
 Whereas Hive over the same metastore shows:
 Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 
 0.204, Reading results: 0.236
 {quote}
 I attempted to debug this and noticed that HiveMetastoreCatalog constructs an 
 object for each partition, which is puzzling to me (attaching screenshot). 
 Should this value be lazy -- describe table should be purely a metastore op 
 IMO (i.e. query postgres, return types).
 The issue is a blocker to me but leaving with default priority until someone 
 can confirm it is a bug. describe table is not so interesting but I think 
 this affects all query paths -- I sent an inquiry earlier here: 
 https://www.mail-archive.com/user@spark.apache.org/msg26242.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6984) Operations on tables with many partitions _very_slow

2015-04-17 Thread Yana Kadiyska (JIRA)
Yana Kadiyska created SPARK-6984:


 Summary: Operations on tables with many partitions _very_slow
 Key: SPARK-6984
 URL: https://issues.apache.org/jira/browse/SPARK-6984
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
 Environment: External Hive metastore, table with 30K partitions
Reporter: Yana Kadiyska


I have a table with _many_partitions (30K). Users cannot query all of them but 
they are in the metastore. Querying this table is extremely slow even if we're 
asking for a single partition. 
describe table also performs _very_ poorly

Spark produces the following times:
Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 
72.831, Reading results: 0.189

Whereas Hive over the same metastore shows:
Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 
0.204, Reading results: 0.236

I attempted to debug this and noticed that HiveMetastoreCatalog constructs an 
object for each partition, which is puzzling to me (attaching screenshot). 
Should this value be lazy -- describe table should be purely a metastore op IMO 
(i.e. query postgres, return types).

The issue is a blocker to me but leaving with default priority until someone 
can confirm it is a bug. describe table is not so interesting but I think 
this affects all query paths -- I sent an inquiry earlier here: 
https://www.mail-archive.com/user@spark.apache.org/msg26242.html
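
(A small repro sketch for collecting timings like the ones above from a Spark 
shell; hc is assumed to be a HiveContext and the table/partition names are 
placeholders.)

{code}
// Hedged timing sketch: wrap a query in a simple stopwatch to compare
// "describe" and a single-partition count against the Hive CLI numbers.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

timed("describe")(hc.sql("describe sometable").collect())
timed("one partition")(hc.sql("select count(*) from sometable where pkey = 'p1'").collect())
{code}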




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-03-12 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359614#comment-14359614
 ] 

Yana Kadiyska commented on SPARK-5389:
--

C:\Users\ykadiysk\Downloads\spark-1.2.0-bin-cdh4>where find
C:\Windows\System32\find.exe

C:\Users\ykadiysk\Downloads\spark-1.2.0-bin-cdh4>where findstr
C:\Windows\System32\findstr.exe

C:\Users\ykadiysk\Downloads\spark-1.2.0-bin-cdh4>echo %PATH%
C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program
 Files (x86)\Enterprise Vault\EVClient\;C:\Program Files (x86)\Git\cmd
;C:\Program Files (x86)\Perforce;C:\Program Files\MiKTeX 
2.9\miktex\bin\x64\;C:\Program Files\Java\jdk1.7.0_40\bin;C:\Program Files 
(x86)\sbt\\bin;C:\Program Files (x86)\scala\bin;
C:\apache-maven-3.1.0\bin;C:\Program Files\Java\jre7\bin\server;c:\Program 
Files\R\R-3.0.2\bin;C:\Python27

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0
 Environment: Windows 7
Reporter: Yana Kadiyska
 Attachments: SparkShell_Win7.JPG


 spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 
 spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2
 Marking as trivial since calling spark-shell2.cmd also works fine
 Attaching a screenshot since the error isn't very useful:
 {code}
 spark-1.2.0-bin-cdh4>bin\spark-shell.cmd
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-01-23 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-5389:
-
Description: 
spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 

spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2

Marking as trivial since calling spark-shell2.cmd also works fine

Attaching a screenshot since the error isn't very useful:

spark-1.2.0-bin-cdh4>bin\spark-shell.cmd
else was unexpected at this time.

  was:
spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 

spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2

Marking as trivial since calling spark-shell2.cmd also works fine

Attaching a screenshot since the error isn't very useful


 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0
Reporter: Yana Kadiyska
Priority: Trivial
 Attachments: SparkShell_Win7.JPG


 spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 
 spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2
 Marking as trivial since calling spark-shell2.cmd also works fine
 Attaching a screenshot since the error isn't very useful:
 spark-1.2.0-bin-cdh4>bin\spark-shell.cmd
 else was unexpected at this time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-01-23 Thread Yana Kadiyska (JIRA)
Yana Kadiyska created SPARK-5389:


 Summary: spark-shell.cmd does not run from DOS Windows 7
 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0
Reporter: Yana Kadiyska
Priority: Trivial


spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 

spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2

Marking as trivial since calling spark-shell2.cmd also works fine

Attaching a screenshot since the error isn't very useful



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-01-23 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-5389:
-
Attachment: SparkShell_Win7.JPG

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0
Reporter: Yana Kadiyska
Priority: Trivial
 Attachments: SparkShell_Win7.JPG


 spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 
 spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2
 Marking as trivial since calling spark-shell2.cmd also works fine
 Attaching a screenshot since the error isn't very useful



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4049) Storage web UI fraction cached shows as > 100%

2014-12-12 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244290#comment-14244290
 ] 

Yana Kadiyska commented on SPARK-4049:
--

I'd suggest this be converted to a documentation bug then -- I was just stumped 
by the same phenomenon. Is the 1x replicated statement still correct in this 
case? I would imagine that if the storage level says 1x replicated, each 
partition would have a unique location.

 Storage web UI fraction cached shows as > 100%
 

 Key: SPARK-4049
 URL: https://issues.apache.org/jira/browse/SPARK-4049
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.2.0
Reporter: Josh Rosen
Priority: Minor

 In the Storage tab of the Spark Web UI, I saw a case where the Fraction 
 Cached was greater than 100%:
 !http://i.imgur.com/Gm2hEeL.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1

2014-12-04 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234307#comment-14234307
 ] 

Yana Kadiyska commented on SPARK-4702:
--

Just confirming that https://github.com/apache/spark/pull/3586 does fix the 
issue. Thanks!

 Querying  non-existent partition produces exception in v1.2.0-rc1
 -

 Key: SPARK-4702
 URL: https://issues.apache.org/jira/browse/SPARK-4702
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska

 Using HiveThriftServer2, when querying a non-existent partition I get an 
 exception rather than an empty result set. This seems to be a regression -- I 
 had an older build of master branch where this works. Build off of RC1.2 tag 
 produces the following:
 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement:
 org.apache.hive.service.cli.HiveSQLException: 
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
 at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source)
 at 
 org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
 at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
 at 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1

2014-12-03 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233031#comment-14233031
 ] 

Yana Kadiyska commented on SPARK-4702:
--

I'm investigating the possibility that this is caused by Hive being switched to 
Hive 0.13 by default. Building with the 0.12 profile now; will close if this 
goes away.

 Querying  non-existent partition produces exception in v1.2.0-rc1
 -

 Key: SPARK-4702
 URL: https://issues.apache.org/jira/browse/SPARK-4702
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska

 Using HiveThriftServer2, when querying a non-existent partition I get an 
 exception rather than an empty result set. This seems to be a regression -- I 
 had an older build of master branch where this works. Build off of RC1.2 tag 
 produces the following:
 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement:
 org.apache.hive.service.cli.HiveSQLException: 
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
 at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source)
 at 
 org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
 at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
 at 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1

2014-12-03 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233081#comment-14233081
 ] 

Yana Kadiyska commented on SPARK-4702:
--

Unfortunately I still see this error after building with 
./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 
-Phive-thriftserver -Phive-0.12.0

I do have a working build from the master branch from October 24th where this 
scenario works. We are running CDH4.6 with Hive 0.10.

In particular, the query I tried is select count(*) from mytable where pkey 
='some-non-existant-key';
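
(A hedged repro sketch of that query over JDBC against HiveThriftServer2; the 
host, port, and table name are placeholders, and the Hive JDBC driver is assumed 
to be on the classpath.)

{code}
// Hedged repro sketch: run the single-partition count through the thrift server,
// the same path beeline takes, to check whether the empty-partition case throws.
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery(
  "select count(*) from mytable where pkey = 'some-non-existant-key'")
while (rs.next()) println(rs.getLong(1))
conn.close()
{code}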

 Querying  non-existent partition produces exception in v1.2.0-rc1
 -

 Key: SPARK-4702
 URL: https://issues.apache.org/jira/browse/SPARK-4702
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska

 Using HiveThriftServer2, when querying a non-existent partition I get an 
 exception rather than an empty result set. This seems to be a regression -- I 
 had an older build of master branch where this works. Build off of RC1.2 tag 
 produces the following:
 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement:
 org.apache.hive.service.cli.HiveSQLException: 
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
 at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source)
 at 
 org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
 at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
 at 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1

2014-12-03 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233081#comment-14233081
 ] 

Yana Kadiyska edited comment on SPARK-4702 at 12/3/14 3:53 PM:
---

Unfortunately I still see this error after building with  
./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 
-Phive-thriftserver -Phive-0.12.0

I do have a working build from master branch from October 24th where this 
scenario works. We are running CDH4.6 Hive0.10 
Beeline output from October:
Connected to: Hive (version 0.12.0-protobuf-2.5)
Driver: null (version null)
Transaction isolation: TRANSACTION_REPEATABLE_READ

Beeline output from RC:
Connected to: Hive (version 1.2.0)
Driver: null (version null)
Transaction isolation: TRANSACTION_REPEATABLE_READ

I am wondering if  -Phive-0.12.0 is fully sufficient --not sure why the 
Connected to: version prints differently?

In particular, the query I tried is select count(*) from mytable where pkey 
='some-non-existant-key';


was (Author: yanakad):
Unfortunately I still see this error after building with  
./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 
-Phive-thriftserver -Phive-0.12.0

I do have a working build from master branch from October 24th where this 
scenario works. We are running CDH4.6 Hive0.10 

In particular, the query I tried is select count(*) from mytable where pkey 
='some-non-existant-key';

 Querying  non-existent partition produces exception in v1.2.0-rc1
 -

 Key: SPARK-4702
 URL: https://issues.apache.org/jira/browse/SPARK-4702
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska

 Using HiveThriftServer2, when querying a non-existent partition I get an 
 exception rather than an empty result set. This seems to be a regression -- I 
 had an older build of master branch where this works. Build off of RC1.2 tag 
 produces the following:
 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement:
 org.apache.hive.service.cli.HiveSQLException: 
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
 at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source)
 at 
 org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
 at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
 at 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1

2014-12-03 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233081#comment-14233081
 ] 

Yana Kadiyska edited comment on SPARK-4702 at 12/3/14 6:09 PM:
---

Unfortunately I still see this error after building with  
./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0 
-Phive-thriftserver -Phive-0.12.0

I do have a working build from master branch from October 24th where this 
scenario works. We are running CDH4.6 Hive0.10 
Beeline output from October:
Connected to: Hive (version 0.12.0-protobuf-2.5)
Driver: null (version null)
Transaction isolation: TRANSACTION_REPEATABLE_READ

Beeline output from RC:
Connected to: Hive (version 1.2.0)
Driver: null (version null)
Transaction isolation: TRANSACTION_REPEATABLE_READ

I am wondering if  -Phive-0.12.0 is fully sufficient --not sure why the 
Connected to: version prints differently?

In particular, the query I tried is select count(*) from mytable where pkey 
='some-non-existant-key';


was (Author: yanakad):
Unfortunately I still see this error after building with  
./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 
-Phive-thriftserver -Phive-0.12.0

I do have a working build from master branch from October 24th where this 
scenario works. We are running CDH4.6 Hive0.10 
Beeline output from October:
Connected to: Hive (version 0.12.0-protobuf-2.5)
Driver: null (version null)
Transaction isolation: TRANSACTION_REPEATABLE_READ

Beeline output from RC:
Connected to: Hive (version 1.2.0)
Driver: null (version null)
Transaction isolation: TRANSACTION_REPEATABLE_READ

I am wondering if  -Phive-0.12.0 is fully sufficient --not sure why the 
Connected to: version prints differently?

In particular, the query I tried is select count(*) from mytable where pkey 
='some-non-existant-key';

 Querying  non-existent partition produces exception in v1.2.0-rc1
 -

 Key: SPARK-4702
 URL: https://issues.apache.org/jira/browse/SPARK-4702
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska

 Using HiveThriftServer2, when querying a non-existent partition I get an 
 exception rather than an empty result set. This seems to be a regression -- I 
 had an older build of master branch where this works. Build off of RC1.2 tag 
 produces the following:
 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement:
 org.apache.hive.service.cli.HiveSQLException: 
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
 at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source)
 at 
 org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
 at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
 at 
 

[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1

2014-12-03 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233443#comment-14233443
 ] 

Yana Kadiyska commented on SPARK-4702:
--

Michael, just wanted to point out that the workaround you suggested did indeed 
help when the partition is missing -- I get a count of 0. But it did break the 
otherwise working case when a partition is present:

java.lang.IllegalStateException: All the offsets listed in the split should be found in the file.
expected: [4, 4]
found: {my schema dumped out here}
out of: [4, 121017555, 242333553, 363518600] in range 0, 134217728

It's possible that this is a corner case -- we've added columns to our 
schema, so the parquet files are likely not symmetric (not quite sure what 
convertMetastoreParquet does under the hood). But I wanted to point out that in 
our case the bug is truly a blocker (I'm hoping the fix makes it into 1.2; I 
don't care whether it lands in the next RC or a later one).
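
For context, here is a minimal spark-shell sketch of the scenario being described (a hedged reconstruction, not the exact commands from this thread; it assumes a Hive-partitioned table named mytable with partition column pkey, as in the count(*) query mentioned earlier, and the specific partition values are illustrative):

{noformat}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: the running SparkContext

// Suggested workaround: fall back to Hive's Parquet SerDe instead of the
// native Parquet path.
hiveContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")

// With the flag set, a missing partition now returns a count of 0 ...
hiveContext.sql("select count(*) from mytable where pkey = 'some-non-existant-key'").collect()

// ... but a query against an existing partition can hit the
// "All the offsets listed in the split should be found in the file"
// IllegalStateException quoted above when the parquet files' schemas differ
// across partitions (the hypothesis in this comment).
hiveContext.sql("select count(*) from mytable where pkey = '0077-2014-07'").collect()
{noformat}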

 Querying  non-existent partition produces exception in v1.2.0-rc1
 -

 Key: SPARK-4702
 URL: https://issues.apache.org/jira/browse/SPARK-4702
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska

 Using HiveThriftServer2, when querying a non-existent partition I get an 
 exception rather than an empty result set. This seems to be a regression -- I 
 had an older build of master branch where this works. Build off of RC1.2 tag 
 produces the following:
 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement:
 org.apache.hive.service.cli.HiveSQLException: 
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
 at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source)
 at 
 org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
 at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
 at 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1

2014-12-03 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233514#comment-14233514
 ] 

Yana Kadiyska commented on SPARK-4702:
--

Michael, I do not have a 1.1 build. In October I built master manually, I believe 
from commit d2987e8f7a2cb3bf971f381399d8efdccb51d3d2.
At that time both types of queries worked without setting 
spark.sql.hive.convertMetastoreParquet=false (I tested on a smaller cluster; I will 
deploy this build on the same one now to make sure there's not some data weirdness).

If you meant to say "When convertMetastoreParquet is FALSE, there is not 
currently support for heterogeneous schemas," then we are saying the same thing 
-- I didn't have to set this flag, as the missing partitions were handled fine. 
Now missing partitions are broken, but setting 
spark.sql.hive.convertMetastoreParquet=false breaks the 99% case because my files 
have different numbers of columns.

I have not tried the PR you mentioned; I will try it now. In my case the issue 
is not an empty file, it's a missing directory -- our query filters on a 
partition key of the form yyyy-mm, and the parquet files are laid out under yyyy-mm 
directories representing partitions. But I will see if the issue is helped by 
that PR.

In any case, I am just hoping this works before the final release; there is no 
particular rush.
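
To make the layout described above concrete, a hedged sketch (the warehouse path and table name are illustrative, not taken from the actual cluster):

{noformat}
// Hypothetical on-disk layout for a table partitioned on a yyyy-mm style key:
//   /warehouse/mytable/pkey=0077-2014-07/part-*.parquet   <- partition present on disk
//   /warehouse/mytable/pkey=0077-2014-08/                 <- partition known to the metastore, directory missing
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: the running SparkContext

// Expected: an empty result (count of 0) for the missing partition.
// Observed on the 1.2.0-rc1 build:
//   java.lang.IllegalArgumentException: Can not create a Path from an empty string
hiveContext.sql("select count(*) from mytable where pkey = '0077-2014-08'").collect()
{noformat}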

 Querying  non-existent partition produces exception in v1.2.0-rc1
 -

 Key: SPARK-4702
 URL: https://issues.apache.org/jira/browse/SPARK-4702
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska

 Using HiveThriftServer2, when querying a non-existent partition I get an 
 exception rather than an empty result set. This seems to be a regression -- I 
 had an older build of master branch where this works. Build off of RC1.2 tag 
 produces the following:
 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement:
 org.apache.hive.service.cli.HiveSQLException: 
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
 at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
 at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source)
 at 
 org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
 at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
 at 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Created] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1

2014-12-02 Thread Yana Kadiyska (JIRA)
Yana Kadiyska created SPARK-4702:


 Summary: Querying  non-existent partition produces exception in 
v1.2.0-rc1
 Key: SPARK-4702
 URL: https://issues.apache.org/jira/browse/SPARK-4702
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska


Using HiveThriftServer2, when querying a non-existent partition I get an 
exception rather than an empty result set. This seems to be a regression -- I 
had an older build of master branch where this works. Build off of RC1.2 tag 
produces the following:

14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement:
org.apache.hive.service.cli.HiveSQLException: 
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source)
at 
org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4497) HiveThriftServer2 does not exit properly on failure

2014-11-19 Thread Yana Kadiyska (JIRA)
Yana Kadiyska created SPARK-4497:


 Summary: HiveThriftServer2 does not exit properly on failure
 Key: SPARK-4497
 URL: https://issues.apache.org/jira/browse/SPARK-4497
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska


start thriftserver with 
 sbin/start-thriftserver.sh --master ...

If there is an error (in my case namenode is in standby mode) the driver shuts 
down properly:

14/11/19 16:32:58 ERROR HiveThriftServer2: Error starting HiveThriftServer2

14/11/19 16:32:59 INFO SparkUI: Stopped Spark web UI at http://myip:4040
14/11/19 16:32:59 INFO DAGScheduler: Stopping DAGScheduler
14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Shutting down all executors
14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Asking each executor to 
shut down
14/11/19 16:33:00 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor 
stopped!
14/11/19 16:33:00 INFO MemoryStore: MemoryStore cleared
14/11/19 16:33:00 INFO BlockManager: BlockManager stopped
14/11/19 16:33:00 INFO BlockManagerMaster: BlockManagerMaster stopped
14/11/19 16:33:00 INFO SparkContext: Successfully stopped SparkContext


but trying to run sbin/start-thriftserver.sh --master ... again results in an 
error that the Thriftserver is already running.

ps -aef|grep offendingPID shows

root 32334 1  0 16:32 ?00:00:00 /usr/local/bin/java 
org.apache.spark.deploy.SparkSubmitDriverBootstrapper --class 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --master 
spark://myip:7077 --conf -spark.executor.extraJavaOptions=-verbose:gc 
-XX:-PrintGCDetails -XX:+PrintGCTimeStamps spark-internal --hiveconf 
hive.root.logger=INFO,console

This is problematic, since we have a process that tries to restart the driver if 
it dies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3815) LPAD function does not work in where predicate

2014-10-20 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177396#comment-14177396
 ] 

Yana Kadiyska commented on SPARK-3815:
--

Venkata, I am building master and I am still seeing this. Another odd fact:

select customer_id from mytable where 
pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2 
fails while 
select customer_id from mytable where 
pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') works OK.

 There are more than 2 results and looking at the executor logs it does seem 
that they succeed in the computation -- looks like something goes wrong during 
cleanup when there is a LIMIT. Feel free to augment the title if you can better 
figure out what the issue is -- let me know if you can't reproduce -- I'll make 
a synthetic dataset.

The table is a parquet table partitioned on pkey.
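
A minimal spark-shell sketch of the two statements being contrasted (table and column names as above; running them through a plain HiveContext rather than the thrift server is an assumption of this sketch):

{noformat}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: the running SparkContext

// Works: LPAD inside concat_ws in the WHERE clause, no LIMIT.
hiveContext.sql(
  "select customer_id from mytable where pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07')"
).collect()

// Fails with "Task not serializable"
// (Caused by: java.io.NotSerializableException: java.lang.reflect.Constructor)
// once a LIMIT is added.
hiveContext.sql(
  "select customer_id from mytable where pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2"
).collect()
{noformat}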



 LPAD function does not work in where predicate
 --

 Key: SPARK-3815
 URL: https://issues.apache.org/jira/browse/SPARK-3815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Priority: Minor

 select customer_id from mytable where 
 pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2
 produces:
 14/10/03 14:51:35 ERROR server.SparkSQLOperationManager: Error executing 
 query:
 org.apache.spark.SparkException: Task not serializable
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
 at 
 org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
 at 
 org.apache.spark.sql.execution.Limit.execute(basicOperators.scala:146)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
 at 
 org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:185)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
 at 
 org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150)
 at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
 at 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.io.NotSerializableException: java.lang.reflect.Constructor
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at 
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at 
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at 
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
 at 
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at 
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at 
 

[jira] [Comment Edited] (SPARK-3815) LPAD function does not work in where predicate

2014-10-20 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177396#comment-14177396
 ] 

Yana Kadiyska edited comment on SPARK-3815 at 10/20/14 7:58 PM:


[~gvramana], I am building master and I am still seeing this. Another odd fact:

select customer_id from mytable where 
pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2 
fails while 
select customer_id from mytable where 
pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') works OK.

 There are more than 2 results and looking at the executor logs it does seem 
that they succeed in the computation -- looks like something goes wrong during 
cleanup when there is a LIMIT. Feel free to augment the title if you can better 
figure out what the issue is -- let me know if you can't reproduce -- I'll make 
a synthetic dataset.

The table is a parquet table partitioned on pkey.




was (Author: yanakad):
Venkata, I am building master and I am still seeing this. Another odd fact:

select customer_id from mytable where 
pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2 
fails while 
select customer_id from mytable where 
pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') works OK.

 There are more than 2 results and looking at the executor logs it does seem 
that they succeed in the computation -- looks like something goes wrong during 
cleanup when there is a LIMIT. Feel free to augment the title if you can better 
figure out what the issue is -- let me know if you can't reproduce -- I'll make 
a synthetic dataset.

The table is a parquet table partitioned on pkey.



 LPAD function does not work in where predicate
 --

 Key: SPARK-3815
 URL: https://issues.apache.org/jira/browse/SPARK-3815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Priority: Minor

 select customer_id from mytable where 
 pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2
 produces:
 14/10/03 14:51:35 ERROR server.SparkSQLOperationManager: Error executing 
 query:
 org.apache.spark.SparkException: Task not serializable
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
 at 
 org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
 at 
 org.apache.spark.sql.execution.Limit.execute(basicOperators.scala:146)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
 at 
 org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:185)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
 at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
 at 
 org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150)
 at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
 at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
 at 
 org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
 at 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.io.NotSerializableException: java.lang.reflect.Constructor
 at 
 

[jira] [Created] (SPARK-3814) Bitwise & does not work in Hive

2014-10-06 Thread Yana Kadiyska (JIRA)
Yana Kadiyska created SPARK-3814:


 Summary: Bitwise & does not work in Hive
 Key: SPARK-3814
 URL: https://issues.apache.org/jira/browse/SPARK-3814
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Priority: Minor


Error: java.lang.RuntimeException: 
Unsupported language features in query: select (case when bit_field & 1=1 then 
r_end - r_start else NULL end) from mytable where pkey='0178-2014-07' LIMIT 2
TOK_QUERY
  TOK_FROM
    TOK_TABREF
      TOK_TABNAME
        mytable
  TOK_INSERT
    TOK_DESTINATION
      TOK_DIR
        TOK_TMP_FILE
    TOK_SELECT
      TOK_SELEXPR
        TOK_FUNCTION
          when
          =
            &
              TOK_TABLE_OR_COL
                bit_field
              1
            1
          -
            TOK_TABLE_OR_COL
              r_end
            TOK_TABLE_OR_COL
              r_start
          TOK_NULL
    TOK_WHERE
      =
        TOK_TABLE_OR_COL
          pkey
        '0178-2014-07'
    TOK_LIMIT
      2


SQLState:  null
ErrorCode: 0
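
For reference, a small sketch of the failing predicate and one possible rewrite (the pmod equivalence is only a workaround idea under the assumption that bit_field is a non-negative integer; it is not something verified in this thread):

{noformat}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: the running SparkContext

// Fails: the HiveQL parser rejects the bitwise & operator
// ("Unsupported language features in query").
hiveContext.sql(
  "select (case when bit_field & 1=1 then r_end - r_start else NULL end) " +
  "from mytable where pkey='0178-2014-07' LIMIT 2")

// Possible rewrite while & is unsupported: for a non-negative integer column,
// (bit_field & 1) = 1 is equivalent to pmod(bit_field, 2) = 1.
hiveContext.sql(
  "select (case when pmod(bit_field, 2)=1 then r_end - r_start else NULL end) " +
  "from mytable where pkey='0178-2014-07' LIMIT 2")
{noformat}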



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3815) LPAD function does not work in where predicate

2014-10-06 Thread Yana Kadiyska (JIRA)
Yana Kadiyska created SPARK-3815:


 Summary: LPAD function does not work in where predicate
 Key: SPARK-3815
 URL: https://issues.apache.org/jira/browse/SPARK-3815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Priority: Minor


select customer_id from mytable where 
pkey=concat_ws('-',LPAD('077',4,'0'),'2014-07') LIMIT 2

produces:

14/10/03 14:51:35 ERROR server.SparkSQLOperationManager: Error executing query:
org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
at 
org.apache.spark.sql.execution.Limit.execute(basicOperators.scala:146)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
at 
org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:185)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
at 
org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
at 
org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
at 
org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.NotSerializableException: java.lang.reflect.Constructor
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

The following work fine:

select concat_ws('-', LPAD(cast(112717 % 1024 AS STRING),4,'0'),'2014-07') from 
mytable where pkey='0077-2014-07' LIMIT 2

select customer_id from mytable  where pkey=concat_ws('-','0077','2014-07') 
LIMIT 2




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org