[jira] [Commented] (SPARK-16311) Improve metadata refresh

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358456#comment-15358456
 ] 

Apache Spark commented on SPARK-16311:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14009

> Improve metadata refresh
> 
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> When the underlying file changes, it can be very confusing to users when they 
> see a FileNotFoundException. It would be great to do the following:
> (1) Append a message to the FileNotFoundException explaining that a workaround 
> is to do an explicit metadata refresh.
> (2) Make metadata refresh work on temporary tables/views.
> (3) Make metadata refresh work on Datasets/DataFrames, by introducing a 
> Dataset.refresh() method.
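
A minimal sketch of the current workaround and of the API proposed in item (3), assuming a hypothetical table or view named "logs" whose underlying files were replaced out-of-band:
{code}
// Today's workaround: explicitly refresh the cached file metadata for the table/view.
spark.catalog.refreshTable("logs")

// Proposed by this ticket: a refresh handle directly on the Dataset/DataFrame.
val df = spark.table("logs")
df.refresh()   // hypothetical Dataset.refresh() from item (3), not an existing API
{code}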






[jira] [Commented] (SPARK-16281) Implement parse_url SQL function

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358423#comment-15358423
 ] 

Apache Spark commented on SPARK-16281:
--

User 'janplus' has created a pull request for this issue:
https://github.com/apache/spark/pull/14008

> Implement parse_url SQL function
> 
>
> Key: SPARK-16281
> URL: https://issues.apache.org/jira/browse/SPARK-16281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException

2016-06-30 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358424#comment-15358424
 ] 

Liwei Lin commented on SPARK-16334:
---

Hi [~epahomov], which tool wrote your Parquet files: Spark SQL or something else? 
In addition, what is the {{WriterVersion}}: {{PARQUET_1_0 ("v1")}} or 
{{PARQUET_2_0 ("v2")}}?

> [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
> -
>
> Key: SPARK-16334
> URL: https://issues.apache.org/jira/browse/SPARK-16334
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Critical
>  Labels: sql
>
> Query:
> {code}
> select * from blabla where user_id = 415706251
> {code}
> Error:
> {code}
> 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 
> (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934
> at 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The same query works on 1.6.1.






[jira] [Resolved] (SPARK-16331) [SQL] Reduce code generation time

2016-06-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16331.
-
   Resolution: Fixed
 Assignee: Hiroshi Inoue
Fix Version/s: 2.1.0

> [SQL] Reduce code generation time 
> --
>
> Key: SPARK-16331
> URL: https://issues.apache.org/jira/browse/SPARK-16331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Hiroshi Inoue
>Assignee: Hiroshi Inoue
> Fix For: 2.1.0
>
>
> During the code generation, a {{LocalRelation}} often has a huge {{Vector}} 
> object as {{data}}. In the simple example below, a {{LocalRelation}} has a 
> Vector with 100 elements of {{UnsafeRow}}. 
> {quote}
> val numRows = 100
> val ds = (1 to numRows).toDS().persist()
> benchmark.addCase("filter+reduce") { iter =>
>   ds.filter(a => (a & 1) == 0).reduce(_ + _)
> }
> {quote}
> At {{TreeNode.transformChildren}}, all elements of the vector are 
> unnecessarily iterated to check whether any children exist in the vector, 
> since {{Vector}} is Traversable. This significantly increases code 
> generation time.
> This patch avoids the overhead by checking the number of children before 
> iterating over all elements; {{LocalRelation}} has no children since it 
> extends {{LeafNode}}.
> Performance of the above example:
> {quote}
> without this patch
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.11.5
> Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
> compilationTime:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
> -----------------------------------------------------------------------------
> filter+reduce            4426 / 4533          0.2         4426.0        1.0X
>
> with this patch
> compilationTime:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
> -----------------------------------------------------------------------------
> filter+reduce            3117 / 3391          0.3         3116.6        1.0X
> {quote}
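
As a rough analogy for why checking the number of children first helps (simplified types, not Catalyst's actual {{TreeNode}} code):
{code}
// Rough analogy only: a leaf node carries a large internal collection (like
// LocalRelation's `data`), and checking for children up front avoids scanning it.
sealed trait Node { def children: Seq[Node] }
case class Leaf(payload: Vector[Int]) extends Node { val children: Seq[Node] = Nil }
case class Branch(children: Seq[Node]) extends Node

def transformChildren(node: Node, rule: Node => Node): Node = node match {
  case leaf: Leaf => leaf                    // no children: skip traversal entirely
  case Branch(cs) => Branch(cs.map(rule))    // only real children are visited
}
{code}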






[jira] [Closed] (SPARK-16247) Using pyspark dataframe with pipeline and cross validator

2016-06-30 Thread Edward Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Ma closed SPARK-16247.
-

Misusage. Resolved.

> Using pyspark dataframe with pipeline and cross validator
> -
>
> Key: SPARK-16247
> URL: https://issues.apache.org/jira/browse/SPARK-16247
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
>Reporter: Edward Ma
>
> I am using PySpark with DataFrames and a Pipeline to train and predict; that 
> works fine for a single run.
> However, I hit an issue when combining the Pipeline with CrossValidator. I 
> expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label and 
> feature columns. Those columns are produced by StringIndexer and VectorIndexer, 
> so they should exist after the pipeline has been executed.
> Digging into the PySpark library [python/pyspark/ml/tuning.py] (line 222, _fit 
> function, and line 239, est.fit), I found that the pipeline stages are not 
> executed, so I cannot get "indexedLabel" and "indexedMsg".
> Would you mind advising whether my usage is correct or not?
> Thanks.
> Here is the code snippet:
> {noformat}
> # Indexing
> labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(extracted_data)
> featureIndexer = VectorIndexer(inputCol="extracted_msg", outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)
> # Training
> classification_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedMsg", numTrees=50, maxDepth=20)
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model])
> # Cross Validation
> paramGrid = ParamGridBuilder().addGrid(classification_model.maxDepth, (10, 20, 30)).build()
> cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
> cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=cvEvaluator, numFolds=10)
> cvModel = cv.fit(trainingData)
> {noformat}






[jira] [Assigned] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16329:


Assignee: Apache Spark

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>Assignee: Apache Spark
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC$$iwC.<init>(<console>:41)
> at $iwC$$iwC.<init>(<console>:43)
> at $iwC.<init>(<console>:45)
> at <init>(<console>:47)
> at .<init>(<console>:51)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> 

[jira] [Assigned] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16329:


Assignee: (was: Apache Spark)

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC$$iwC.<init>(<console>:41)
> at $iwC$$iwC.<init>(<console>:43)
> at $iwC.<init>(<console>:45)
> at <init>(<console>:47)
> at .<init>(<console>:51)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
>

[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358306#comment-15358306
 ] 

Apache Spark commented on SPARK-16329:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14007

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC$$iwC.<init>(<console>:41)
> at $iwC$$iwC.<init>(<console>:43)
> at $iwC.<init>(<console>:45)
> at <init>(<console>:47)

[jira] [Resolved] (SPARK-14608) transformSchema needs better documentation

2016-06-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14608.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12384
[https://github.com/apache/spark/pull/12384]

> transformSchema needs better documentation
> --
>
> Key: SPARK-14608
> URL: https://issues.apache.org/jira/browse/SPARK-14608
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> {{PipelineStage.transformSchema}} currently has minimal documentation. It 
> should have more to explain that it can:
> * check the schema
> * check parameter interactions
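
To make those two points concrete, a hypothetical stage's {{transformSchema}} might look like this (illustrative sketch only, not code from this issue):
{code}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// A typical transformSchema: validate the input schema and parameter
// interactions, then declare the output schema, without touching any data.
def transformSchema(schema: StructType): StructType = {
  // check schema: the configured input column must exist
  require(schema.fieldNames.contains("features"), "Input column 'features' is missing")
  // check parameter interactions: the output column must not collide with an existing one
  require(!schema.fieldNames.contains("prediction"), "Output column 'prediction' already exists")
  StructType(schema.fields :+ StructField("prediction", DoubleType, nullable = false))
}
{code}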






[jira] [Updated] (SPARK-15820) Add Catalog.refreshTable into python API

2016-06-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-15820:
---
Fix Version/s: (was: 2.0.0)
   2.1.0
   2.0.1

> Add Catalog.refreshTable into python API
> 
>
> Key: SPARK-15820
> URL: https://issues.apache.org/jira/browse/SPARK-15820
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.0.1, 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Catalog.refreshTable API is missing from the Python interface for 
> Spark SQL; add it.
> see also: https://issues.apache.org/jira/browse/SPARK-15367
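
For reference, the Scala side already exposes this call; the Python binding would simply mirror it (sketch, with a hypothetical table name):
{code}
// Existing Scala API that PySpark's Catalog should expose as well: invalidate the
// cached metadata for a table after its underlying files have changed.
spark.catalog.refreshTable("events")
spark.table("events").count()   // subsequent reads pick up the refreshed file listing
{code}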






[jira] [Updated] (SPARK-15820) Add Catalog.refreshTable into python API

2016-06-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-15820:
---
Assignee: Weichen Xu

> Add Catalog.refreshTable into python API
> 
>
> Key: SPARK-15820
> URL: https://issues.apache.org/jira/browse/SPARK-15820
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Catalog.refreshTable API is missing from the Python interface for 
> Spark SQL; add it.
> see also: https://issues.apache.org/jira/browse/SPARK-15367






[jira] [Resolved] (SPARK-15820) Add Catalog.refreshTable into python API

2016-06-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-15820.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13558
[https://github.com/apache/spark/pull/13558]

> Add Catalog.refreshTable into python API
> 
>
> Key: SPARK-15820
> URL: https://issues.apache.org/jira/browse/SPARK-15820
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Catalog.refreshTable API is missing from the Python interface for 
> Spark SQL; add it.
> see also: https://issues.apache.org/jira/browse/SPARK-15367






[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358236#comment-15358236
 ] 

Takeshi Yamamuro commented on SPARK-16329:
--

okay, thanks!

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC$$iwC.<init>(<console>:41)
> at $iwC$$iwC.<init>(<console>:43)
> at $iwC.<init>(<console>:45)
> at <init>(<console>:47)
> at .<init>(<console>:51)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
>

[jira] [Commented] (SPARK-16317) Add file filtering interface for FileFormat

2016-06-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358235#comment-15358235
 ] 

Takeshi Yamamuro commented on SPARK-16317:
--

Do you intend a Hadoop PathFilter-like interface?
How about adding the code below in DataSource#inferFileFormatSchema?
{code}
val passFilter = format.getPassFilter
format.inferSchema(
  sparkSession,
  caseInsensitiveOptions,
  fileCatalog.allFiles(passFilter))
{code}

> Add file filtering interface for FileFormat
> ---
>
> Key: SPARK-16317
> URL: https://issues.apache.org/jira/browse/SPARK-16317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Priority: Minor
>
> {{FileFormat}} data sources like Parquet and Avro (provided by spark-avro) 
> have customized file filtering logics. For example, Parquet needs to filter 
> out summary files, while Avro provides a Hadoop configuration option to 
> filter out all files whose names don't end with ".avro".
> It would be nice to have a general file filtering interface in {{FileFormat}} 
> to handle similar requirements.
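
For illustration, the kind of per-format filtering this interface would generalize can be written as a plain Hadoop {{PathFilter}} (sketch only; the actual hook on {{FileFormat}} is what this ticket is about and is not decided here):
{code}
import org.apache.hadoop.fs.{Path, PathFilter}

// Avro-style filtering: keep only files ending in ".avro".
val avroFilter: PathFilter = new PathFilter {
  override def accept(path: Path): Boolean = path.getName.endsWith(".avro")
}

// Parquet-style filtering: drop summary/marker files such as _metadata and _SUCCESS.
val parquetFilter: PathFilter = new PathFilter {
  override def accept(path: Path): Boolean = !path.getName.startsWith("_")
}
{code}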






[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358230#comment-15358230
 ] 

Xiao Li commented on SPARK-16329:
-

If we support DataFrames with zero columns, I think we should also support them 
in the SQL interface. So far, the exposed issues are in star expansion. Let me 
fix that first; you can continue with the remaining issues. Thanks!

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
> at 

[jira] [Resolved] (SPARK-15954) TestHive has issues being used in PySpark

2016-06-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15954.
-
   Resolution: Fixed
 Assignee: Reynold Xin
Fix Version/s: 2.0.0

> TestHive has issues being used in PySpark
> -
>
> Key: SPARK-15954
> URL: https://issues.apache.org/jira/browse/SPARK-15954
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: holdenk
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> SPARK-15745 made TestHive unreliable from PySpark test cases; to support it, 
> we should allow both resource-based and system-property-based lookup for 
> loading the Hive file.






[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358226#comment-15358226
 ] 

Xiao Li commented on SPARK-16329:
-

We might hit multiple issues when supporting tables with zero columns. 

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC$$iwC.<init>(<console>:41)
> at $iwC$$iwC.<init>(<console>:43)
> at $iwC.<init>(<console>:45)
> at <init>(<console>:47)
> at .<init>(<console>:51)
> at .<clinit>(<console>)
>  

[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358221#comment-15358221
 ] 

Takeshi Yamamuro commented on SPARK-16329:
--

I found a similar issue in `Dataset#drop`:
{code}
case class DATA(a: Int)
val df1 = Seq(DATA(1)).toDF
val df2 = df1.drop($"a")
df2.select($"*").show
{code}
This also throws the same exception.

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
>

[jira] [Updated] (SPARK-14138) Generated SpecificColumnarIterator code can exceed JVM size limit for cached DataFrames

2016-06-30 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-14138:
-
Fix Version/s: 1.6.2

> Generated SpecificColumnarIterator code can exceed JVM size limit for cached 
> DataFrames
> ---
>
> Key: SPARK-14138
> URL: https://issues.apache.org/jira/browse/SPARK-14138
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Sven Krasser
>Assignee: Kazuaki Ishizaki
> Fix For: 1.6.2, 2.0.0
>
>
> The generated {{SpecificColumnarIterator}} code for wide DataFrames can 
> exceed the JVM 64k limit under certain circumstances. This snippet reproduces 
> the error in spark-shell (with 5G driver memory) by creating a new DataFrame 
> with >2000 aggregation-based columns:
> {code}
> val df = sc.parallelize(1 to 10).toDF()
> val aggr = {1 to 2260}.map(colnum => avg(df.col("_1")).as(s"col_$colnum"))
> val res = df.groupBy("_1").agg(count("_1"), aggr: _*).cache()
> res.show() // this will break
> {code}
> The following error is produced (pruned for brevity):
> {noformat}
> /* 001 */
> /* 002 */ import java.nio.ByteBuffer;
> /* 003 */ import java.nio.ByteOrder;
> /* 004 */ import scala.collection.Iterator;
> /* 005 */ import org.apache.spark.sql.types.DataType;
> /* 006 */ import 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder;
> /* 007 */ import 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;
> /* 008 */ import org.apache.spark.sql.execution.columnar.MutableUnsafeRow;
> /* 009 */
> /* 010 */ public SpecificColumnarIterator 
> generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
> /* 011 */   return new SpecificColumnarIterator();
> /* 012 */ }
> /* 013 */
> ...
> /* 9113 */ accessor2261.extractTo(mutableRow, 2261);
> /* 9114 */ unsafeRow.pointTo(bufferHolder.buffer, 2262, 
> bufferHolder.totalSize());
> /* 9115 */ return unsafeRow;
> /* 9116 */   }
> /* 9117 */ }
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:555)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:575)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:572)
>   at 
> org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   ... 28 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method "()Z" 
> of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:836)
>   at org.codehaus.janino.UnitCompiler.writeOpcode(UnitCompiler.java:10251)
>   at org.codehaus.janino.UnitCompiler.invoke(UnitCompiler.java:10050)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4008)
>   at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3927)
>   at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
>   at 
> org.codehaus.janino.UnitCompiler.invokeConstructor(UnitCompiler.java:6681)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4126)
>   at org.codehaus.janino.UnitCompiler.access$7600(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitNewClassInstance(UnitCompiler.java:3275)
>   at org.codehaus.janino.Java$NewClassInstance.accept(Java.java:4085)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2669)
>

[jira] [Comment Edited] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358212#comment-15358212
 ] 

Takeshi Yamamuro edited comment on SPARK-16329 at 7/1/16 1:49 AM:
--

FYI: I also checked in mysql;
{code}
mysql> create table test_rel();
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that 
corresponds to your MySQL server version for the right syntax to use near ')' 
at line 1
{code}


was (Author: maropu):
I also checked in mysql;
{code}
mysql> create table test_rel();
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that 
corresponds to your MySQL server version for the right syntax to use near ')' 
at line 1
{code}

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at 

[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358212#comment-15358212
 ] 

Takeshi Yamamuro commented on SPARK-16329:
--

I also checked in mysql;
{code}
mysql> create table test_rel();
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that 
corresponds to your MySQL server version for the right syntax to use near ')' 
at line 1
{code}

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)

[jira] [Commented] (SPARK-15643) ML 2.0 QA: migration guide update

2016-06-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358118#comment-15358118
 ] 

Joseph K. Bradley commented on SPARK-15643:
---

I just resolved this, but let me know if there are other pending items I've 
missed/forgotten.  Thanks!

> ML 2.0 QA: migration guide update
> -
>
> Key: SPARK-15643
> URL: https://issues.apache.org/jira/browse/SPARK-15643
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Update spark.ml and spark.mllib migration guide from 1.6 to 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15643) ML 2.0 QA: migration guide update

2016-06-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-15643.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13924
[https://github.com/apache/spark/pull/13924]

> ML 2.0 QA: migration guide update
> -
>
> Key: SPARK-15643
> URL: https://issues.apache.org/jira/browse/SPARK-15643
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Update spark.ml and spark.mllib migration guide from 1.6 to 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16328) Implement conversion utility functions for single instances in Python

2016-06-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-16328.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13997
[https://github.com/apache/spark/pull/13997]

> Implement conversion utility functions for single instances in Python
> -
>
> Key: SPARK-16328
> URL: https://issues.apache.org/jira/browse/SPARK-16328
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
> Fix For: 2.0.0
>
>
> We have {{asML}}/{{fromML}} utility methods in Scala/Java to convert between 
> the old and new linalg types. These are missing in Python.
> For dense vectors it's easy to do without them, e.g. {{mlDenseVector = 
> Vectors.dense(mllibDenseVector)}}, but for sparse vectors there is no equally 
> easy workaround. So it would be good to have utility methods available for users.
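For reference, here is what the existing Scala-side conversions look like (a minimal sketch against the Spark 2.0 {{asML}}/{{fromML}} API; the Python method names are whatever the eventual patch chooses, so treat those as open):

{code}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.ml.linalg.{Vectors => NewVectors}

// mllib -> ml: asML is defined on the old vector classes
val oldSparse = OldVectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
val newSparse = oldSparse.asML

// ml -> mllib: fromML is defined on the old companion object
val newDense = NewVectors.dense(1.0, 2.0, 3.0)
val oldDense = OldVectors.fromML(newDense)
{code}

The Python utilities requested here are the single-instance analogue of these calls, so sparse vectors would get the same one-liner treatment as dense ones.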



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16276) Implement elt SQL function

2016-06-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16276.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 13966
[https://github.com/apache/spark/pull/13966]

> Implement elt SQL function
> --
>
> Key: SPARK-16276
> URL: https://issues.apache.org/jira/browse/SPARK-16276
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16276) Implement elt SQL function

2016-06-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-16276:

Assignee: Peter Lee

> Implement elt SQL function
> --
>
> Key: SPARK-16276
> URL: https://issues.apache.org/jira/browse/SPARK-16276
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Peter Lee
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16313) Spark should not silently drop exceptions in file listing

2016-06-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16313.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Spark should not silently drop exceptions in file listing
> -
>
> Key: SPARK-16313
> URL: https://issues.apache.org/jira/browse/SPARK-16313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16336) Suggest doing table refresh when encountering FileNotFoundException at runtime

2016-06-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16336.
-
   Resolution: Fixed
 Assignee: Peter Lee
Fix Version/s: 2.0.0

> Suggest doing table refresh when encountering FileNotFoundException at runtime
> --
>
> Key: SPARK-16336
> URL: https://issues.apache.org/jira/browse/SPARK-16336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Peter Lee
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13015) Replace example code in mllib-data-types.md using include_example

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358028#comment-15358028
 ] 

Apache Spark commented on SPARK-13015:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14006

> Replace example code in mllib-data-types.md using include_example
> -
>
> Key: SPARK-13015
> URL: https://issues.apache.org/jira/browse/SPARK-13015
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> The example code in the user guide is embedded in the markdown and hence is 
> not easy to test. It would be nice to test the examples automatically. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> The goal is to move the actual example code to spark/examples and test compilation 
> in Jenkins builds. Then, in the markdown, we can reference the part of the code to 
> show in the user guide. This requires adding a Jekyll tag similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example 
> scala/org/apache/spark/examples/mllib/LocalVectorExample.scala %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/mllib/LocalVectorExample.scala`, 
> pick the code blocks marked "example", and use them to replace the 
> {code}{% highlight %}{code}
>  block in the markdown.
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337
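For concreteness, a sketch of how an example file is marked up so the tag can lift just the interesting block. The marker syntax is as I understand the include_example plugin under spark/docs; worth double-checking against docs/_plugins/include_example.rb:

{code}
// examples/src/main/scala/org/apache/spark/examples/mllib/LocalVectorExample.scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object LocalVectorExample {
  def main(args: Array[String]): Unit = {
    // $example on$
    // Only the code between the markers is rendered into the user guide.
    val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
    val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
    // $example off$
    println(s"dense: $dv, sparse: $sv")
  }
}
{code}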



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15954) TestHive has issues being used in PySpark

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358008#comment-15358008
 ] 

Apache Spark commented on SPARK-15954:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14005

> TestHive has issues being used in PySpark
> -
>
> Key: SPARK-15954
> URL: https://issues.apache.org/jira/browse/SPARK-15954
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: holdenk
>
> SPARK-15745 made TestHive unreliable from PySpark test cases; to support it 
> we should allow both resource-based and system-property-based lookup for 
> loading the Hive file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16286) Implement stack table generating function

2016-06-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357983#comment-15357983
 ] 

Dongjoon Hyun commented on SPARK-16286:
---

Thank you! :)

> Implement stack table generating function
> -
>
> Key: SPARK-16286
> URL: https://issues.apache.org/jira/browse/SPARK-16286
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16338) Streaming driver running on standalone cluster mode with supervise goes into bad state when application is killed from the UI

2016-06-30 Thread Rohit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohit Agarwal updated SPARK-16338:
--
Attachment: error

Attached the file with the driver log for a couple of batch durations.

> Streaming driver running on standalone cluster mode with supervise goes into 
> bad state when application is killed from the UI
> -
>
> Key: SPARK-16338
> URL: https://issues.apache.org/jira/browse/SPARK-16338
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Streaming, Web UI
>Affects Versions: 1.6.1
>Reporter: Rohit Agarwal
> Attachments: error
>
>
> We are going to start using Spark Streaming in production and I was testing 
> various failure scenarios. I noticed one case where the Spark Streaming 
> driver got into a bad state.
> Steps to reproduce:
> 1. Create a Spark Streaming application with a direct Kafka stream and 
> checkpointing enabled.
> 2. Deploy the application to a Spark standalone cluster, in cluster mode and 
> with --supervise.
> 3. Let it run for some time.
> 4. Kill the application (but not the driver) from the Spark Master UI.
> 5. The driver keeps running but doesn't restart the application. Worse, it 
> keeps updating the checkpoint every batch duration, so when you do restart 
> the driver, it starts at a later point and you have lost data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16338) Streaming driver running on standalone cluster mode with supervise goes into bad state when application is killed from the UI

2016-06-30 Thread Rohit Agarwal (JIRA)
Rohit Agarwal created SPARK-16338:
-

 Summary: Streaming driver running on standalone cluster mode with 
supervise goes into bad state when application is killed from the UI
 Key: SPARK-16338
 URL: https://issues.apache.org/jira/browse/SPARK-16338
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Streaming, Web UI
Affects Versions: 1.6.1
Reporter: Rohit Agarwal


We are going to start using Spark Streaming in production and I was testing 
various failure scenarios. I noticed one case where the Spark Streaming driver 
got into a bad state.

Steps to reproduce (a minimal sketch of such an application follows below):
1. Create a Spark Streaming application with a direct Kafka stream and 
checkpointing enabled.
2. Deploy the application to a Spark standalone cluster, in cluster mode and with 
--supervise.
3. Let it run for some time.
4. Kill the application (but not the driver) from the Spark Master UI.
5. The driver keeps running but doesn't restart the application. Worse, it keeps 
updating the checkpoint every batch duration, so when you do restart the driver, 
it starts at a later point and you have lost data.
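For reference, a minimal sketch of the kind of application used in these steps (broker address, topic, checkpoint path, and batch interval are placeholders, not taken from the report), submitted with --deploy-mode cluster --supervise:

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SuperviseRepro {
  // Placeholder checkpoint location; the repro only needs checkpointing enabled.
  val checkpointDir = "hdfs:///tmp/supervise-repro-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("SuperviseRepro")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // Direct (receiver-less) Kafka stream, as in step 1.
    val kafkaParams = Map("metadata.broker.list" -> "broker-1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))
    stream.map(_._2).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if one exists, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}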



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16286) Implement stack table generating function

2016-06-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357972#comment-15357972
 ] 

Reynold Xin commented on SPARK-16286:
-

Go for it!


> Implement stack table generating function
> -
>
> Key: SPARK-16286
> URL: https://issues.apache.org/jira/browse/SPARK-16286
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16286) Implement stack table generating function

2016-06-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357970#comment-15357970
 ] 

Dongjoon Hyun commented on SPARK-16286:
---

Hi, [~petermaxlee] and [~rxin].

If you don't mind, I'll start to work on this issue tonight.

> Implement stack table generating function
> -
>
> Key: SPARK-16286
> URL: https://issues.apache.org/jira/browse/SPARK-16286
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16208) Add `PropagateEmptyRelation` optimizer

2016-06-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16208:
--
   Priority: Major  (was: Minor)
Description: 
This issue adds a new logical optimizer, `PropagateEmptyRelation`, to collapse 
logical plans consisting of only empty LocalRelations.

**Optimizer Targets**

1. Binary(or Higher)-node Logical Plans
   - Union with all empty children.
   - Join with one or two empty children (including Intersect/Except).
2. Unary-node Logical Plans
   - Project/Filter/Sample/Join/Limit/Repartition with all empty children.
   - Aggregate with all empty children and without AggregateFunction 
expressions (e.g., COUNT).
   - Generate with Explode, because other UserDefinedGenerators such as Hive 
UDTFs may return results.

**Sample Query**
{code}
WITH t1 AS (SELECT a FROM VALUES 1 t(a)),
 t2 AS (SELECT b FROM VALUES 1 t(b) WHERE 1=2)
SELECT a,b
FROM t1, t2
WHERE a=b
GROUP BY a,b
HAVING a>1
ORDER BY a,b
{code}

**Before**
{code}
scala> sql("with t1 as (select a from values 1 t(a)), t2 as (select b from 
values 1 t(b) where 1=2) select a,b from t1, t2 where a=b group by a,b having 
a>1 order by a,b").explain
== Physical Plan ==
*Sort [a#0 ASC, b#1 ASC], true, 0
+- Exchange rangepartitioning(a#0 ASC, b#1 ASC, 200)
   +- *HashAggregate(keys=[a#0, b#1], functions=[])
  +- Exchange hashpartitioning(a#0, b#1, 200)
 +- *HashAggregate(keys=[a#0, b#1], functions=[])
+- *BroadcastHashJoin [a#0], [b#1], Inner, BuildRight
   :- *Filter (isnotnull(a#0) && (a#0 > 1))
   :  +- LocalTableScan [a#0]
   +- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
  +- *Filter (isnotnull(b#1) && (b#1 > 1))
 +- LocalTableScan <empty>, [b#1]
{code}

**After**
{code}
scala> sql("with t1 as (select a from values 1 t(a)), t2 as (select b from 
values 1 t(b) where 1=2) select a,b from t1, t2 where a=b group by a,b having 
a>1 order by a,b").explain
== Physical Plan ==
LocalTableScan <empty>, [a#0, b#1]
{code}
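For illustration, a rule of this shape can be written against Catalyst's rule API roughly as follows. This is a simplified sketch, not the actual patch: it only handles the all-empty Union case, whereas the real optimizer covers the join, unary-node, and aggregate cases listed above.

{code}
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan, Union}
import org.apache.spark.sql.catalyst.rules.Rule

object PropagateEmptyRelationSketch extends Rule[LogicalPlan] {
  // A relation is "empty" when it is a LocalRelation carrying no rows.
  private def isEmpty(p: LogicalPlan): Boolean = p match {
    case LocalRelation(_, data) => data.isEmpty
    case _ => false
  }

  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // A Union whose children are all empty collapses to an empty relation
    // with the same output attributes.
    case u @ Union(children) if children.nonEmpty && children.forall(isEmpty) =>
      LocalRelation(u.output)
  }
}
{code}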

  was:
This PR adds a new logical optimizer, `CollapseEmptyPlan`, to collapse 
logical plans consisting of only empty LocalRelations. The only exception is 
aggregation: for aggregate plans, only simple cases are considered for this 
optimization.

**Before**
{code}
scala> sql("select a from values (1,2) T(a,b) where 1=0 group by a,b having a>1 
order by a,b").explain
== Physical Plan ==
*Project [a#11]
+- *Sort [a#11 ASC, b#12 ASC], true, 0
   +- Exchange rangepartitioning(a#11 ASC, b#12 ASC, 200)
  +- *HashAggregate(keys=[a#11, b#12], functions=[])
 +- Exchange hashpartitioning(a#11, b#12, 200)
+- *HashAggregate(keys=[a#11, b#12], functions=[])
   +- LocalTableScan <empty>, [a#11, b#12]
{code}

**After**
{code}
scala> sql("select a from values (1,2) T(a,b) where 1=0 group by a,b having a>1 
order by a,b").explain
== Physical Plan ==
LocalTableScan <empty>, [a#0]
{code}

Summary: Add `PropagateEmptyRelation` optimizer  (was: Add 
`CollapseEmptyPlan` optimizer)

> Add `PropagateEmptyRelation` optimizer
> --
>
> Key: SPARK-16208
> URL: https://issues.apache.org/jira/browse/SPARK-16208
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue adds a new logical optimizer, `PropagateEmptyRelation`, to 
> collapse logical plans consisting of only empty LocalRelations.
> **Optimizer Targets**
> 1. Binary(or Higher)-node Logical Plans
>- Union with all empty children.
>- Join with one or two empty children (including Intersect/Except).
> 2. Unary-node Logical Plans
>- Project/Filter/Sample/Join/Limit/Repartition with all empty children.
>    - Aggregate with all empty children and without AggregateFunction 
> expressions (e.g., COUNT).
>    - Generate with Explode, because other UserDefinedGenerators such as Hive 
> UDTFs may return results.
> **Sample Query**
> {code}
> WITH t1 AS (SELECT a FROM VALUES 1 t(a)),
>  t2 AS (SELECT b FROM VALUES 1 t(b) WHERE 1=2)
> SELECT a,b
> FROM t1, t2
> WHERE a=b
> GROUP BY a,b
> HAVING a>1
> ORDER BY a,b
> {code}
> **Before**
> {code}
> scala> sql("with t1 as (select a from values 1 t(a)), t2 as (select b from 
> values 1 t(b) where 1=2) select a,b from t1, t2 where a=b group by a,b having 
> a>1 order by a,b").explain
> == Physical Plan ==
> *Sort [a#0 ASC, b#1 ASC], true, 0
> +- Exchange rangepartitioning(a#0 ASC, b#1 ASC, 200)
>+- *HashAggregate(keys=[a#0, b#1], functions=[])
>   +- Exchange hashpartitioning(a#0, b#1, 200)
>  +- *HashAggregate(keys=[a#0, b#1], functions=[])
> +- *BroadcastHashJoin [a#0], [b#1], Inner, BuildRight
>:- *Filter (isnotnull(a#0) && (a#0 > 1))
>   

[jira] [Assigned] (SPARK-16285) Implement sentences SQL function

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16285:


Assignee: (was: Apache Spark)

> Implement sentences SQL function
> 
>
> Key: SPARK-16285
> URL: https://issues.apache.org/jira/browse/SPARK-16285
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16285) Implement sentences SQL function

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357958#comment-15357958
 ] 

Apache Spark commented on SPARK-16285:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/14004

> Implement sentences SQL function
> 
>
> Key: SPARK-16285
> URL: https://issues.apache.org/jira/browse/SPARK-16285
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16285) Implement sentences SQL function

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16285:


Assignee: Apache Spark

> Implement sentences SQL function
> 
>
> Key: SPARK-16285
> URL: https://issues.apache.org/jira/browse/SPARK-16285
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16336) Suggest doing table refresh when encountering FileNotFoundException at runtime

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16336:


Assignee: Apache Spark

> Suggest doing table refresh when encountering FileNotFoundException at runtime
> --
>
> Key: SPARK-16336
> URL: https://issues.apache.org/jira/browse/SPARK-16336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16336) Suggest doing table refresh when encountering FileNotFoundException at runtime

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357926#comment-15357926
 ] 

Apache Spark commented on SPARK-16336:
--

User 'petermaxlee' has created a pull request for this issue:
https://github.com/apache/spark/pull/14003

> Suggest doing table refresh when encountering FileNotFoundException at runtime
> --
>
> Key: SPARK-16336
> URL: https://issues.apache.org/jira/browse/SPARK-16336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16336) Suggest doing table refresh when encountering FileNotFoundException at runtime

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16336:


Assignee: (was: Apache Spark)

> Suggest doing table refresh when encountering FileNotFoundException at runtime
> --
>
> Key: SPARK-16336
> URL: https://issues.apache.org/jira/browse/SPARK-16336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16337) Metadata refresh should work on temporary views

2016-06-30 Thread Peter Lee (JIRA)
Peter Lee created SPARK-16337:
-

 Summary: Metadata refresh should work on temporary views
 Key: SPARK-16337
 URL: https://issues.apache.org/jira/browse/SPARK-16337
 Project: Spark
  Issue Type: Sub-task
Reporter: Peter Lee






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16336) Suggest doing table refresh when encountering FileNotFoundException at runtime

2016-06-30 Thread Peter Lee (JIRA)
Peter Lee created SPARK-16336:
-

 Summary: Suggest doing table refresh when encountering 
FileNotFoundException at runtime
 Key: SPARK-16336
 URL: https://issues.apache.org/jira/browse/SPARK-16336
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Peter Lee






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357910#comment-15357910
 ] 

Xiao Li edited comment on SPARK-16329 at 6/30/16 9:47 PM:
--

I see. Will try to do it.

Just FYI, I tried it in DB2. 

db2 => create table t2()
DB21034E  The command was processed as an SQL statement because it was not a 
valid Command Line Processor command.  During SQL processing it returned:
SQL0104N  An unexpected token ")" was found following "create table t2(".  
Expected tokens may include:  "".  SQLSTATE=42601



was (Author: smilegator):
I see.

Just FYI, I tried it in DB2. 

db2 => create table t2()
DB21034E  The command was processed as an SQL statement because it was not a 
valid Command Line Processor command.  During SQL processing it returned:
SQL0104N  An unexpected token ")" was found following "create table t2(".  
Expected tokens may include:  "".  SQLSTATE=42601


> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> 

[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357910#comment-15357910
 ] 

Xiao Li commented on SPARK-16329:
-

I see.

Just FYI, I tried it in DB2. 

db2 => create table t2()
DB21034E  The command was processed as an SQL statement because it was not a 
valid Command Line Processor command.  During SQL processing it returned:
SQL0104N  An unexpected token ")" was found following "create table t2(".  
Expected tokens may include:  "".  SQLSTATE=42601


> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
>

[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357897#comment-15357897
 ] 

Reynold Xin commented on SPARK-16329:
-

Hmmm I tend to like Postgres more :)

It's a real database. 

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC$$iwC.<init>(<console>:41)
> at $iwC$$iwC.<init>(<console>:43)
> at $iwC.<init>(<console>:45)
> at <init>(<console>:47)
> at .<init>(<console>:51)
> at .<clinit>(<console>)
>   

[jira] [Updated] (SPARK-16335) Structured streaming should fail if source directory does not exist

2016-06-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16335:

Summary: Structured streaming should fail if source directory does not 
exist  (was: Streaming source should fail if file does not exist)

> Structured streaming should fail if source directory does not exist
> ---
>
> Key: SPARK-16335
> URL: https://issues.apache.org/jira/browse/SPARK-16335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> In structured streaming, Spark does not report errors when the specified 
> directory does not exist. This is a behavior different from the batch mode. 
> This patch changes the behavior to fail if the directory does not exist (when 
> the path is not a glob pattern).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16335) Streaming source should fail if file does not exist

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16335:


Assignee: Apache Spark  (was: Reynold Xin)

> Streaming source should fail if file does not exist
> ---
>
> Key: SPARK-16335
> URL: https://issues.apache.org/jira/browse/SPARK-16335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16335) Streaming source should fail if file does not exist

2016-06-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16335:

Description: 
In structured streaming, Spark does not report errors when the specified 
directory does not exist. This is a behavior different from the batch mode. 
This patch changes the behavior to fail if the directory does not exist (when 
the path is not a glob pattern).
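For illustration only (path and format are placeholders): with this change, a file-based streaming source pointed at a non-existent, non-glob directory should fail at load time rather than silently producing an empty stream.

{code}
// Expected to throw once the directory check is in place; previously this
// would start a query that simply never produced data.
val missing = spark.readStream
  .format("text")
  .load("/data/path/that/does/not/exist")
{code}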




> Streaming source should fail if file does not exist
> ---
>
> Key: SPARK-16335
> URL: https://issues.apache.org/jira/browse/SPARK-16335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> In structured streaming, Spark does not report errors when the specified 
> directory does not exist. This is a behavior different from the batch mode. 
> This patch changes the behavior to fail if the directory does not exist (when 
> the path is not a glob pattern).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16335) Streaming source should fail if file does not exist

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16335:


Assignee: Reynold Xin  (was: Apache Spark)

> Streaming source should fail if file does not exist
> ---
>
> Key: SPARK-16335
> URL: https://issues.apache.org/jira/browse/SPARK-16335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16335) Streaming source should fail if file does not exist

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357876#comment-15357876
 ] 

Apache Spark commented on SPARK-16335:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14002

> Streaming source should fail if file does not exist
> ---
>
> Key: SPARK-16335
> URL: https://issues.apache.org/jira/browse/SPARK-16335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16335) Streaming source should fail if file does not exist

2016-06-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-16335:
---

 Summary: Streaming source should fail if file does not exist
 Key: SPARK-16335
 URL: https://issues.apache.org/jira/browse/SPARK-16335
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2016-06-30 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357868#comment-15357868
 ] 

Vladimir Feinberg commented on SPARK-4240:
--

[~sethah] Hi Seth, it seems like your comment is outdated now that GBT is 
indeed in ML. Are you currently working on this?


> Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
> 
>
> Key: SPARK-4240
> URL: https://issues.apache.org/jira/browse/SPARK-4240
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sung Chung
>
> The gradient boosting as currently implemented estimates the loss-gradient in 
> each iteration using regression trees. At every iteration, the regression 
> trees are trained/split to minimize predicted gradient variance. 
> Additionally, the terminal node predictions are computed to minimize the 
> prediction variance.
> However, such predictions won't be optimal for loss functions other than the 
> mean-squared error. The TreeBoosting refinement can help mitigate this issue 
> by modifying terminal node prediction values so that those predictions would 
> directly minimize the actual loss function. Although this still doesn't 
> change the fact that the tree splits were done through variance reduction, it 
> should still lead to improvement in gradient estimations, and thus better 
> performance.
> The details of this can be found in the R vignette. This paper also shows how 
> to refine the terminal node predictions.
> http://www.saedsayad.com/docs/gbm2.pdf
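To make the refinement concrete (this is a summary of the standard TreeBoost formulation, not code from Spark): with absolute-error loss, the squared-error-optimal leaf value, the mean of the residuals in that leaf, is replaced by the loss-optimal value, their median.

{code}
// Toy illustration: residuals routed to a single terminal node.
val residuals = Seq(-3.0, -0.5, 0.2, 0.4, 9.0)

// Leaf value minimizing squared error: the mean (pulled up by the outlier) = 1.22
val meanLeaf = residuals.sum / residuals.size

// Leaf value minimizing absolute error: the median (TreeBoost refinement for L1 loss) = 0.2
val medianLeaf = residuals.sorted.apply(residuals.size / 2)
{code}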



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException

2016-06-30 Thread Egor Pahomov (JIRA)
Egor Pahomov created SPARK-16334:


 Summary: [SQL] SQL query on parquet table 
java.lang.ArrayIndexOutOfBoundsException
 Key: SPARK-16334
 URL: https://issues.apache.org/jira/browse/SPARK-16334
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Egor Pahomov
Priority: Critical


Query:

{code}
select * from blabla where user_id = 415706251
{code}

Error:

{code}
16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 
(TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934
at 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}


Works on 1.6.1.
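One way to narrow this down (a diagnostic suggestion, not a confirmed fix) is to rerun the query with the vectorized Parquet reader disabled, since the dictionary decoding in the stack trace goes through that code path:

{code}
// Falls back to the parquet-mr record reader; if the query then succeeds,
// the problem is isolated to the vectorized dictionary-decoding path above.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("select * from blabla where user_id = 415706251").show()
{code}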



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException

2016-06-30 Thread Egor Pahomov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Egor Pahomov updated SPARK-16334:
-
Labels: sql  (was: )

> [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
> -
>
> Key: SPARK-16334
> URL: https://issues.apache.org/jira/browse/SPARK-16334
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Critical
>  Labels: sql
>
> Query:
> {code}
> select * from blabla where user_id = 415706251
> {code}
> Error:
> {code}
> 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 
> (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934
> at 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Works on 1.6.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16333) Excessive Spark history event/json data size (5GB each)

2016-06-30 Thread Peter Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357855#comment-15357855
 ] 

Peter Liu commented on SPARK-16333:
---

Is there any way to upload the file (gzip of the 5GB)?

Otherwise, I can give the first and last parts of the json file as follows, 
which may not be very helpful:

(a) head:
[root@sparkhab1lab config-files]# more  
/hdd10/spark-event-log/app-20160630101556-
{"Event":"SparkListenerLogStart","Spark Version":"2.0.0-preview"}
{"Event":"SparkListenerBlockManagerAdded","Block Manager ID":{"Executor 
ID":"driver","Host":"10.10.14.1","Port":55879},"Maximum Memory":142929690624
,"Timestamp":1467299756102}
{"Event":"SparkListenerEnvironmentUpdate","JVM Information":{"Java 
Home":"/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.ppc64le/jre","Java 
Vers
ion":"1.8.0_65 (Oracle Corporation)","Scala Version":"version 2.11.8"},"Spark 
Properties":{"spark.serializer":"org.apache.spark.serializer.KryoSeria
lizer","spark.executor.extraJavaOptions":"-XX:+AlwaysTenure 
-XX:ParallelGCThreads=20","spark.driver.host":"10.10.14.1","spark.history.fs.logDirector
y":"/hdd10/spark-event-log","spark.eventLog.enabled":"true","spark.driver.maxResultSize":"0","spark.driver.port":"35591","spark.sql.tungsten.enabled
":"true","spark.jars":"file:/home/SparkBench_working_sql_gl/spark_app/sql/target/scala-2.10/sqlapp_2.11-1.0.jar","spark.app.name":"Spark
 RDDRelation
 Application 
Query5","spark.network.timeout":"600","spark.scheduler.mode":"FIFO","spark.driver.memory":"200g","spark.executor.instances":"1","spark.
history.fs.cleaner.enabled":"false","spark.executor.id":"driver","spark.submit.deployMode":"client","spark.master":"spark://sparkhab1lab.austin.ibm.
com:7077","spark.executor.memory":"480g","spark.local.dir":"/hdd1/spark-tmp,/hdd2/spark-tmp,/hdd3/spark-tmp,/hdd4/spark-tmp,/hdd5/spark-tmp,/hdd6/sp
ark-tmp,/hdd7/spark-tmp,/hdd8/spark-tmp,/hdd9/spark-tmp,/hdd10/spark-tmp","spark.eventLog.dir":"/hdd10/spark-event-log","spark.executor.cores":"20",
"spark.core.connection.ack.wait.timeout":"600","spark.app.id":"app-20160630101556-"},"System
 Properties":{"java.io.tmpdir":"/tmp","line.separato
r":"\n","path.separator":":","sun.management.compiler":"HotSpot 64-Bit Server 
Compiler","SPARK_SUBMIT":"true","sun.cpu.endian":"little","java.specif
ication.version":"1.8","java.vm.specification.name":"Java Virtual Machine 
Specification","java.vendor":"Oracle Corporation","java.vm.specification.v
ersion":"1.8","user.home":"/root","file.encoding.pkg":"sun.io","sun.nio.ch.bugLevel":"","sun.arch.data.model":"64","sun.boot.library.path":"/usr/lib
/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.ppc64le/jre/lib/ppc64le","user.dir":"/home/pl/SQL-RUN-DIR/rundir-spk2-may24","java.library.path":"/usr/ja
va/packages/lib/ppc64le:/usr/lib64:/lib64:/lib:/usr/lib","sun.cpu.isalist":"","os.arch":"ppc64le","java.vm.version":"25.65-b01","java.endorsed.dirs"
:"/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.ppc64le/jre/lib/endorsed","java.runtime.version":"1.8.0_65-b17","java.vm.info":"mixed
 mode","ja
va.ext.dirs":"/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.ppc64le/jre/lib/ext:/usr/java/packages/lib/ext","java.runtime.name":"OpenJDK
 Runtim
e 
Environment","file.separator":"/","java.class.version":"52.0","java.specification.name":"Java
 Platform API Specification","sun.boot.class.path":"/
usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.ppc64le/jre/lib/resources.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.ppc64le/jre/lib/r
t.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.ppc64le/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.ppc64le/j
re/lib/jsse.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.ppc64le/jre/lib/jce.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.ppc64l
e/jre/lib/charsets.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.ppc64le/jre/lib/jfr.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7
.ppc64le/jre/classes","file.encoding":"UTF-8","user.timezone":"US/Central","java.specification.vendor":"Oracle
 Corporation","sun.java.launcher":"SUN
_STANDARD","os.version":"3.10.0-327.el7.ppc64le","sun.os.patch.level":"unknown","java.vm.specification.vendor":"Oracle
 Corporation","user.country":"
US","sun.jnu.encoding":"UTF-8","user.language":"en","java.vendor.url":"http://java.oracle.com/","java.awt.printerjob":"sun.print.PSPrinterJob","java
.awt.graphicsenv":"sun.awt.X11GraphicsEnvironment","awt.toolkit":"sun.awt.X11.XToolkit","os.name":"Linux","java.vm.vendor":"Oracle
 Corporation","jav
a.vendor.url.bug":"http://bugreport.sun.com/bugreport/","user.name":"root","java.vm.name":"OpenJDK
 64-Bit Server VM","sun.java.command":"org.apache.
spark.deploy.SparkSubmit --master spark://sparkhab1lab.austin.ibm.com:7077 
--conf spark.executor.memory=480g --conf spark.executor.cores=20 --class
src.main.scala.RDDRelationQueryZero5 

[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357845#comment-15357845
 ] 

Takeshi Yamamuro commented on SPARK-16329:
--

Tables with no columns don't make much sense, so the Hive way seems more 
reasonable to me.

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:37)
> at $iwC$$iwC$$iwC$$iwC.(:39)
> at $iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC.(:43)
> at $iwC.(:45)
> at (:47)
> 

[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357831#comment-15357831
 ] 

Xiao Li commented on SPARK-16329:
-

In Hive, we are unable to create a table with 0 columns. That is the decision we 
have to make.
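
Worth noting: even if SQL DDL disallows it, a zero-column DataFrame can still be 
produced from the API side, so the star-expansion path arguably has to cope with 
that case either way. A small illustration (a sketch against a 1.6-style 
spark-shell; names are only for illustration):
{code}
// selecting no columns yields a DataFrame with an empty schema
val empty = sqlContext.range(3).select()
empty.printSchema()   // prints just "root"
empty.count()         // 3 rows, no columns
{code}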

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:37)
> at $iwC$$iwC$$iwC$$iwC.(:39)
> at $iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC.(:43)
> at $iwC.(:45)
> at (:47)
> at 

[jira] [Commented] (SPARK-16333) Excessive Spark history event/json data size (5GB each)

2016-06-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357829#comment-15357829
 ] 

Sean Owen commented on SPARK-16333:
---

Likely related. Are you certain it's the same program and the same output?
What I meant was: what is in the new vs. old logs? Does that give a hint about 
why it's bigger?
Yes, it was already clear that the size is different; I'm asking what is in 
them.
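
One way to get at that (a rough sketch, assuming the event log is plain 
JSON-lines and a 2.0 spark-shell; replace the path/app id placeholder with one 
of the listed runs) is to tabulate the event types in the big file:
{code}
import org.apache.spark.sql.functions.desc

// which SparkListener event types dominate the 5GB log?
val events = spark.read.json("/hdd10/spark-event-log/<app-id>")
events.groupBy("Event").count().orderBy(desc("count")).show(50, false)
{code}
That would at least show whether it is task-level events or something else 
blowing up the size.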

> Excessive Spark history event/json data size (5GB each)
> ---
>
> Key: SPARK-16333
> URL: https://issues.apache.org/jira/browse/SPARK-16333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server 
> release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build)
>Reporter: Peter Liu
>  Labels: performance, spark2.0.0
>
> With Spark2.0.0-preview (May-24 build), the history event data (the json 
> file), that is generated for each Spark application run (see below), can be 
> as big as 5GB (instead of 14 MB for exactly the same application run and the 
> same input data of 1TB under Spark1.6.1)
> -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959-
> -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556-
> The test is done with Sparkbench V2, SQL RDD (see github: 
> https://github.com/SparkTC/spark-bench)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357826#comment-15357826
 ] 

Takeshi Yamamuro commented on SPARK-16329:
--

One idea to fix this is to follow the behaviour of other databases; e.g., 
postgresql behaves as follows:
{code}
postgres=# create table test_rel();
CREATE TABLE
postgres=# select * from test_rel;
--
(0 rows)
{code}
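
For comparison, under that behaviour the original repro would be expected to 
simply succeed and return ten empty rows, i.e. (expected outcome, not what 
1.6/2.0 currently do):
{code}
sqlContext.sql("select * from temp_table_no_cols").count()   // expected: 10
{code}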

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:37)
> at 

[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357824#comment-15357824
 ] 

Xiao Li commented on SPARK-16329:
-

nvm, thank you for your confirmation!

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:37)
> at $iwC$$iwC$$iwC$$iwC.(:39)
> at $iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC.(:43)
> at $iwC.(:45)
> at (:47)
> at .(:51)
> at .()
> at .(:7)
> at .()
>  

[jira] [Commented] (SPARK-16256) Add Structured Streaming Programming Guide

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357825#comment-15357825
 ] 

Apache Spark commented on SPARK-16256:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/14001

> Add Structured Streaming Programming Guide
> --
>
> Key: SPARK-16256
> URL: https://issues.apache.org/jira/browse/SPARK-16256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357817#comment-15357817
 ] 

Takeshi Yamamuro commented on SPARK-16329:
--

Oh, my bad.
{code}
val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => Row.empty)
val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
dfNoCols.show
{code}
The snippet above passed, though; the original one threw the exception.


> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at 

[jira] [Comment Edited] (SPARK-16332) the history server of spark2.0-preview (may-24 build) consumes more than 1000% cpu

2016-06-30 Thread Peter Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357811#comment-15357811
 ] 

Peter Liu edited comment on SPARK-16332 at 6/30/16 8:41 PM:


I think this is likely related to issue "SPARK-16333", due to the huge amount of 
data per run (5GB per run). When the cluster is idle, it does the 
initialization and consumes 1000% cpu; when under load, the cpu consumption 
goes above 2000% (per linux top command).

The net is: for the same scenario (exactly the same amount of data and the 
same source code), when running spark 1.6.1, the jvm of the spark history server 
only consumes ~30% under load (where the spark event data amount is also much 
smaller, as said in the other ticket).

The fix of this issue can be made dependent on the fix of "SPARK-16333", I 
think.
thanks ...
Peter


was (Author: petergangliu):
I think this is likely related to issue "SPARK-16333" due the huge amount of 
data per each run (5Gb per run). when the cluster is idle, it does the 
initialization and consumes 1000% cpu; when under load, the cpu consumption 
goes above 2000% (per linux top command);

the net is: for the same scenario (exactly the same amount of data and the 
source code), when running spark 1.6.1, the jvm of the spark history server 
only consumes ~30% under load (where is data amount is also much smaller). 

the fix of this issue can be made dependent of the fix of "SPARK-16333", I 
think.
thanks ...
Peter

> the history server of spark2.0-preview (may-24 build) consumes more than 
> 1000% cpu
> --
>
> Key: SPARK-16332
> URL: https://issues.apache.org/jira/browse/SPARK-16332
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform IBM Power8 Habanero (Model: 8348-21C), Red Hat Enterprise 
> Linux Server release 7.2 (Maipo), Spark2.0.0-preview (May-24, 2016)
>Reporter: Peter Liu
>
> the JVM instance of the Spark history server of spark2.0-preview (may-24 
> build) consumes more than 1000% cpu without the spark standalone cluster (of 
> 6 nodes) being under load. 
> When under load (1TB input data for a SQL query scenario), the JVM instance 
> of the Spark history server of spark2.0-preview consumes 2000% cpu (as seen 
> with "top" on linux 3.10)
> Note: can't see a proper Component selection here, surely not a Web GUI issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16332) the history server of spark2.0-preview (may-24 build) consumes more than 1000% cpu

2016-06-30 Thread Peter Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357811#comment-15357811
 ] 

Peter Liu commented on SPARK-16332:
---

I think this is likely related to issue "SPARK-16333", due to the huge amount of 
data per run (5GB per run). When the cluster is idle, it does the 
initialization and consumes 1000% cpu; when under load, the cpu consumption 
goes above 2000% (per linux top command).

The net is: for the same scenario (exactly the same amount of data and the 
same source code), when running spark 1.6.1, the jvm of the spark history server 
only consumes ~30% under load (where the data amount is also much smaller).

The fix of this issue can be made dependent on the fix of "SPARK-16333", I 
think.
thanks ...
Peter

> the history server of spark2.0-preview (may-24 build) consumes more than 
> 1000% cpu
> --
>
> Key: SPARK-16332
> URL: https://issues.apache.org/jira/browse/SPARK-16332
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform IBM Power8 Habanero (Model: 8348-21C), Red Hat Enterprise 
> Linux Server release 7.2 (Maipo), Spark2.0.0-preview (May-24, 2016)
>Reporter: Peter Liu
>
> the JVM instance of the Spark history server of spark2.0-preview (may-24 
> build) consumes more than 1000% cpu without the spark standalone cluster (of 
> 6 nodes) being under load. 
> When under load (1TB input data for a SQL query scenario), the JVM instance 
> of the Spark history server of spark2.0-preview consumes 2000% cpu (as seen 
> with "top" on linux 3.10)
> Note: can't see a proper Component selection here, surely not a Web GUI issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357804#comment-15357804
 ] 

Xiao Li commented on SPARK-16329:
-

[~maropu] I can reproduce it in the master. It reports a misleading exception.

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:37)
> at $iwC$$iwC$$iwC$$iwC.(:39)
> at $iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC.(:43)
> at $iwC.(:45)
> at (:47)
> at .(:51)
> at 

[jira] [Issue Comment Deleted] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-16329:

Comment: was deleted

(was: [~maropu]I can reproduce it in the master. )

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:37)
> at $iwC$$iwC$$iwC$$iwC.(:39)
> at $iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC.(:43)
> at $iwC.(:45)
> at (:47)
> at .(:51)
> at .()
> at .(:7)
> at .()
>

[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357803#comment-15357803
 ] 

Xiao Li commented on SPARK-16329:
-

Which behavior is preferred? 

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:37)
> at $iwC$$iwC$$iwC$$iwC.(:39)
> at $iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC.(:43)
> at $iwC.(:45)
> at (:47)
> at .(:51)
> at .()
> at .(:7)
> at .()
> 

[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357800#comment-15357800
 ] 

Xiao Li commented on SPARK-16329:
-

[~maropu] I can reproduce it in the master.

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:37)
> at $iwC$$iwC$$iwC$$iwC.(:39)
> at $iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC.(:43)
> at $iwC.(:45)
> at (:47)
> at .(:51)
> at .()
> at .(:7)
> at 

[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357798#comment-15357798
 ] 

Reynold Xin commented on SPARK-16329:
-

We can fix 1.6.


> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC$$iwC.<init>(<console>:41)
> at $iwC$$iwC.<init>(<console>:43)
> at $iwC.<init>(<console>:45)
> at <init>(<console>:47)
> at .<init>(<console>:51)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at 

[jira] [Commented] (SPARK-16333) Excessive Spark history event/json data size (5GB each)

2016-06-30 Thread Peter Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357797#comment-15357797
 ] 

Peter Liu commented on SPARK-16333:
---

please see if this answers your question:

the history data file is the one whose location is defined in the 
spark-defaults.conf, such as:
spark.eventLog.dir  /hdd10/spark-event-log
spark.history.fs.logDirectory   /hdd10/spark-event-log
  
"app-20160622195029-" would be one file example.

The size becomes excessive in Spark 2.0.0-preview (5.3G each), as seen below, compared 
to 51M in Spark 1.6.1:

(a) after - using Spark 2.0.0-preview (May-24):
-rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959-
-rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213-

(b) before - using Spark 1.6.1:
-rwxrwx--- 1 root root  51M Jun 22 20:00 app-20160622195029-
-rwxrwx--- 1 root root  51M Jun 22 20:13 app-20160622200258-

Impact: such a file cannot be loaded into the Spark history UI on port 18080 (the 
history server hits an OOM error), so no Spark analysis in the UI is possible after 
the run is over.
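For reference, a minimal Scala sketch (assuming the application is configured in code rather than via spark-defaults.conf) of the event-log settings involved; spark.eventLog.compress is only a possible mitigation for file size and does not explain the 1.6.1 to 2.0.0-preview difference itself:

{code}
import org.apache.spark.SparkConf

// Sketch only: the same event-log locations as in the spark-defaults.conf above.
// spark.history.fs.logDirectory is read by the history server, not the application.
val conf = new SparkConf()
  .setAppName("event-log-example")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/hdd10/spark-event-log")
  .set("spark.history.fs.logDirectory", "/hdd10/spark-event-log")
  .set("spark.eventLog.compress", "true") // optional: compress the per-app event file
{code}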

> Excessive Spark history event/json data size (5GB each)
> ---
>
> Key: SPARK-16333
> URL: https://issues.apache.org/jira/browse/SPARK-16333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server 
> release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build)
>Reporter: Peter Liu
>  Labels: performance, spark2.0.0
>
> With Spark2.0.0-preview (May-24 build), the history event data (the json 
> file), that is generated for each Spark application run (see below), can be 
> as big as 5GB (instead of 14 MB for exactly the same application run and the 
> same input data of 1TB under Spark1.6.1)
> -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959-
> -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556-
> The test is done with Sparkbench V2, SQL RDD (see github: 
> https://github.com/SparkTC/spark-bench)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357792#comment-15357792
 ] 

Takeshi Yamamuro commented on SPARK-16329:
--

Additional info: the result on the current master is the same as that of 
v1.5.1; that is, it just returns an empty table.

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC$$iwC.<init>(<console>:41)
> at $iwC$$iwC.<init>(<console>:43)
> at 

[jira] [Resolved] (SPARK-16212) code cleanup of kafka-0-8 to match review feedback on 0-10

2016-06-30 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-16212.
---
Resolution: Fixed
  Assignee: Cody Koeninger

> code cleanup of kafka-0-8 to match review feedback on 0-10
> --
>
> Key: SPARK-16212
> URL: https://issues.apache.org/jira/browse/SPARK-16212
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16332) the history server of spark2.0-preview (may-24 build) consumes more than 1000% cpu

2016-06-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357759#comment-15357759
 ] 

Sean Owen commented on SPARK-16332:
---

I'm not sure that's an issue per se. It might consume a lot of CPU at startup, 
when processing lots of events, or when under GC load. Without more detail it's 
hard to say whether this is a problem.

> the history server of spark2.0-preview (may-24 build) consumes more than 
> 1000% cpu
> --
>
> Key: SPARK-16332
> URL: https://issues.apache.org/jira/browse/SPARK-16332
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform IBM Power8 Habanero (Model: 8348-21C), Red Hat Enterprise 
> Linux Server release 7.2 (Maipo), Spark2.0.0-preview (May-24, 2016)
>Reporter: Peter Liu
>
> the JVM instance of the Spark history server of spark2.0-preview (may-24 
> build) consumes more than 1000% cpu without the spark standalone cluster (of 
> 6 nodes) being under load. 
> When under load (1TB input data for a SQL query scenario), the JVM instance 
> of the Spark history server of spark2.0-preview consumes 2000% cpu (as seen 
> with "top" on linux 3.10)
> Note: can't see a proper Component selection here, surely not a Web GUI issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16333) Excessive Spark history event/json data size (5GB each)

2016-06-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357757#comment-15357757
 ] 

Sean Owen commented on SPARK-16333:
---

Can you comment on what the data is before and after? It may give a sense of 
whether this is really what it seems to be, and why so much more is being 
logged.

> Excessive Spark history event/json data size (5GB each)
> ---
>
> Key: SPARK-16333
> URL: https://issues.apache.org/jira/browse/SPARK-16333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server 
> release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build)
>Reporter: Peter Liu
>  Labels: performance, spark2.0.0
>
> With Spark2.0.0-preview (May-24 build), the history event data (the json 
> file), that is generated for each Spark application run (see below), can be 
> as big as 5GB (instead of 14 MB for exactly the same application run and the 
> same input data of 1TB under Spark1.6.1)
> -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959-
> -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556-
> The test is done with Sparkbench V2, SQL RDD (see github: 
> https://github.com/SparkTC/spark-bench)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357758#comment-15357758
 ] 

Xiao Li commented on SPARK-16329:
-

[~rxin] What do you think about this? Should we just issue an error message in 
this case? Thanks!

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC$$iwC.<init>(<console>:41)
> at $iwC$$iwC.<init>(<console>:43)
> at $iwC.<init>(<console>:45)
> at <init>(<console>:47)
> at 

[jira] [Resolved] (SPARK-16247) Using pyspark dataframe with pipeline and cross validator

2016-06-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16247.
---
Resolution: Not A Problem

> Using pyspark dataframe with pipeline and cross validator
> -
>
> Key: SPARK-16247
> URL: https://issues.apache.org/jira/browse/SPARK-16247
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
>Reporter: Edward Ma
>
> I am using pyspark with DataFrames, using a pipeline to train and 
> predict the result. It works fine for a single run.
> However, I hit an issue when using the pipeline together with CrossValidator. The issue is 
> that I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label and 
> feature columns. Those fields are built by StringIndexer and VectorIndexer, so they 
> should exist after the pipeline is executed. 
> When I dug into the pyspark library [python/pyspark/ml/tuning.py] (line 222, _fit 
> function and line 239, est.fit), I found that it does not execute the pipeline 
> stages. Therefore, I cannot get "indexedLabel" and "indexedMsg". 
> Would you mind advising whether my usage is correct or not?
> Thanks.
> Here is code snippet
> {noformat}
> // # Indexing
> labelIndexer = StringIndexer(inputCol="label", 
> outputCol="indexedLabel").fit(extracted_data)
> featureIndexer = VectorIndexer(inputCol="extracted_msg", 
> outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)
> // # Training
> classification_model = RandomForestClassifier(labelCol="indexedLabel", 
> featuresCol="indexedMsg", numTrees=50, maxDepth=20)
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, 
> classification_model])
> // # Cross Validation
> paramGrid = ParamGridBuilder().addGrid(classification_model.maxDepth, (10, 
> 20, 30)).build()
> cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
> cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, 
> evaluator=cvEvaluator, numFolds=10)
> cvModel = cv.fit(trainingData)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15352) Topology aware block replication

2016-06-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15352:

Assignee: Shubham Chopra

> Topology aware block replication
> 
>
> Key: SPARK-15352
> URL: https://issues.apache.org/jira/browse/SPARK-15352
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Mesos, Spark Core, YARN
>Reporter: Shubham Chopra
>Assignee: Shubham Chopra
>
> With cached RDDs, Spark can be used for online analytics where it is used to 
> respond to online queries. But loss of RDD partitions due to node/executor 
> failures can cause huge delays in such use cases as the data would have to be 
> regenerated.
> Cached RDDs, even when using multiple replicas per block, are not currently 
> resilient to node failures when multiple executors are started on the same 
> node. Block replication currently chooses a peer at random, and this peer 
> could also exist on the same host. 
> This effort would add topology aware replication to Spark that can be enabled 
> with pluggable strategies. For ease of development/review, this is being 
> broken down to three major work-efforts:
> 1. Making peer selection for replication pluggable
> 2. Providing pluggable implementations for providing topology and topology-aware 
> replication
> 3. Pro-active replenishment of lost blocks
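A rough Scala sketch of what item 1 above (pluggable peer selection) could look like; the trait, the simplified BlockManagerId, and both policies below are hypothetical illustrations, not the actual Spark API:

{code}
import scala.util.Random

// Simplified stand-in for Spark's block manager identifier, for the sketch only.
case class BlockManagerId(executorId: String, host: String, port: Int)

// Hypothetical pluggable strategy for ordering candidate replication peers.
trait BlockReplicationPolicy {
  def prioritize(peers: Seq[BlockManagerId],
                 self: BlockManagerId,
                 numReplicas: Int): Seq[BlockManagerId]
}

// Current (topology-unaware) behavior: random peers, possibly on the same host.
object RandomPolicy extends BlockReplicationPolicy {
  override def prioritize(peers: Seq[BlockManagerId],
                          self: BlockManagerId,
                          numReplicas: Int): Seq[BlockManagerId] =
    Random.shuffle(peers).take(numReplicas)
}

// Topology-aware variant: prefer peers on a different host than our own.
object OtherHostFirstPolicy extends BlockReplicationPolicy {
  override def prioritize(peers: Seq[BlockManagerId],
                          self: BlockManagerId,
                          numReplicas: Int): Seq[BlockManagerId] = {
    val (otherHosts, sameHost) = peers.partition(_.host != self.host)
    (Random.shuffle(otherHosts) ++ Random.shuffle(sameHost)).take(numReplicas)
  }
}
{code}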



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Adrian Ionescu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357729#comment-15357729
 ] 

Adrian Ionescu commented on SPARK-16329:


Well, this is a simplified example. In reality we assemble the spark-sql query 
text at run-time, based on user input.

Sure, working with the Dataframe directly, as you suggest, is possible and it's 
what we're now doing as a workaround, but it requires special casing that would 
be nice to avoid...

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at 

[jira] [Comment Edited] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Adrian Ionescu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357729#comment-15357729
 ] 

Adrian Ionescu edited comment on SPARK-16329 at 6/30/16 7:42 PM:
-

Well, this is a simplified example. In reality we assemble the SparkSql query 
text at run-time, based on user input.

Sure, working with the Dataframe directly, as you suggest, is possible and it's 
what we're now doing as a workaround, but it requires special casing that would 
be nice to avoid...


was (Author: i.adri):
Well, this is a simplified example. In reality we assemble the spark-sql query 
text at run-time, based on user input.

Sure, working with the Dataframe directly, as you suggest, is possible and it's 
what we're now doing as a workaround, but it requires special casing that would 
be nice to avoid...

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> 

[jira] [Updated] (SPARK-16333) Excessive Spark history event/json data size (5GB each)

2016-06-30 Thread Peter Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Liu updated SPARK-16333:
--
Summary: Excessive Spark history event/json data size (5GB each)  (was: 
Excessive Spark history event/json data (5GB!))

> Excessive Spark history event/json data size (5GB each)
> ---
>
> Key: SPARK-16333
> URL: https://issues.apache.org/jira/browse/SPARK-16333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server 
> release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build)
>Reporter: Peter Liu
>  Labels: performance, spark2.0.0
>
> With Spark2.0.0-preview (May-24 build), the history event data (the json 
> file), that is generated for each Spark application run (see below), can be 
> as big as 5GB (instead of 14 MB for exactly the same application run and the 
> same input data of 1TB under Spark1.6.1)
> -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959-
> -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556-
> The test is done with Sparkbench V2, SQL RDD (see github: 
> https://github.com/SparkTC/spark-bench)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16332) the history server of spark2.0-preview (may-24 build) consumes more than 1000% cpu

2016-06-30 Thread Peter Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Liu updated SPARK-16332:
--
Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) and ppc 
platform IBM Power8 Habanero (Model: 8348-21C), Red Hat Enterprise Linux Server 
release 7.2 (Maipo), Spark2.0.0-preview (May-24, 2016)  (was: IBM Power8 
Habanero (Model: 8348-21C), Red Hat Enterprise Linux Server release 7.2 
(Maipo), Spark2.0.0-preview (May-24, 2016))

> the history server of spark2.0-preview (may-24 build) consumes more than 
> 1000% cpu
> --
>
> Key: SPARK-16332
> URL: https://issues.apache.org/jira/browse/SPARK-16332
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform IBM Power8 Habanero (Model: 8348-21C), Red Hat Enterprise 
> Linux Server release 7.2 (Maipo), Spark2.0.0-preview (May-24, 2016)
>Reporter: Peter Liu
>
> the JVM instance of the Spark history server of spark2.0-preview (may-24 
> build) consumes more than 1000% cpu without the spark standalone cluster (of 
> 6 nodes) being under load. 
> When under load (1TB input data for a SQL query scenario), the JVM instance 
> of the Spark history server of spark2.0-preview consumes 2000% cpu (as seen 
> with "top" on linux 3.10)
> Note: can't see a proper Component selection here, surely not a Web GUI issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16333) Excessive Spark history event/json data (5GB!)

2016-06-30 Thread Peter Liu (JIRA)
Peter Liu created SPARK-16333:
-

 Summary: Excessive Spark history event/json data (5GB!)
 Key: SPARK-16333
 URL: https://issues.apache.org/jira/browse/SPARK-16333
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
 Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) and 
ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server 
release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build)
Reporter: Peter Liu


With Spark2.0.0-preview (May-24 build), the history event data (the json file), 
that is generated for each Spark application run (see below), can be as big as 
5GB (instead of 14 MB for exactly the same application run and the same input 
data of 1TB under Spark1.6.1)

-rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959-
-rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213-
-rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856-
-rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556-

The test is done with Sparkbench V2, SQL RDD (see github: 
https://github.com/SparkTC/spark-bench)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16332) the history server of spark2.0-preview (may-24 build) consumes more than 1000% cpu

2016-06-30 Thread Peter Liu (JIRA)
Peter Liu created SPARK-16332:
-

 Summary: the history server of spark2.0-preview (may-24 build) 
consumes more than 1000% cpu
 Key: SPARK-16332
 URL: https://issues.apache.org/jira/browse/SPARK-16332
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
 Environment: IBM Power8 Habanero (Model: 8348-21C), Red Hat Enterprise 
Linux Server release 7.2 (Maipo), Spark2.0.0-preview (May-24, 2016)
Reporter: Peter Liu


the JVM instance of the Spark history server of spark2.0-preview (may-24 build) 
consumes more than 1000% cpu without the spark standalone cluster (of 6 nodes) 
being under load. 
When under load (1TB input data for a SQL query scenario), the JVM instance of 
the Spark history server of spark2.0-preview consumes 2000% cpu (as seen with 
"top" on linux 3.10)
Note: can't see a proper Component selection here, surely not a Web GUI issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16289) Implement posexplode table generating function

2016-06-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16289.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> Implement posexplode table generating function
> --
>
> Key: SPARK-16289
> URL: https://issues.apache.org/jira/browse/SPARK-16289
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15069) GSoC 2016: Exposing more R and Python APIs for MLlib

2016-06-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-15069.
-
Resolution: Done

> GSoC 2016: Exposing more R and Python APIs for MLlib
> 
>
> Key: SPARK-15069
> URL: https://issues.apache.org/jira/browse/SPARK-15069
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Kai Jiang
>  Labels: gsoc2016, mentor
> Attachments: 1458791046_[GSoC2016]ApacheSpark_KaiJiang_Proposal.pdf
>
>
> This issue is for tracking the Google Summer of Code 2016 project for Kai 
> Jiang: "Apache Spark: Exposing more R and Python APIs for MLlib"
> See attached proposal for details.  Note that the tasks listed in the 
> proposal are tentative and can adapt as the community works on these various 
> parts of MLlib.
> This umbrella will contain links for tasks included in this project, to be 
> added as each task begins.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15865) Blacklist should not result in job hanging with less than 4 executors

2016-06-30 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-15865.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 13603
[https://github.com/apache/spark/pull/13603]

> Blacklist should not result in job hanging with less than 4 executors
> -
>
> Key: SPARK-15865
> URL: https://issues.apache.org/jira/browse/SPARK-15865
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Fix For: 2.1.0
>
>
> Currently when you turn on blacklisting with 
> {{spark.scheduler.executorTaskBlacklistTime}}, but you have fewer than 
> {{spark.task.maxFailures}} executors, you can end with a job "hung" after 
> some task failures.
> If some task fails regularly (say, due to error in user code), then the task 
> will be blacklisted from the given executor.  It will then try another 
> executor, and fail there as well.  However, after it has tried all available 
> executors, the scheduler will simply stop trying to schedule the task 
> anywhere.  The job doesn't fail, nor does it succeed -- it simply waits.  
> Eventually, when the blacklist expires, the task will be scheduled again.  
> But that can be quite far in the future, and in the meantime the user just 
> observes a stuck job.
> Instead we should abort the stage (and fail any dependent jobs) as soon as we 
> detect tasks that cannot be scheduled.
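As a minimal sketch (the values are illustrative only), the configuration combination described above looks like this; with, say, 2 executors and a task that always throws, the stage appears to hang until the one-hour blacklist expires:

{code}
import org.apache.spark.SparkConf

// Illustrative settings only: fewer executors than spark.task.maxFailures plus a
// long per-executor blacklist means a repeatedly failing task has nowhere to run.
val conf = new SparkConf()
  .setAppName("blacklist-hang-sketch")
  .set("spark.task.maxFailures", "4")
  .set("spark.scheduler.executorTaskBlacklistTime", "3600000") // 1 hour, in ms
{code}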



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16241) model loading backward compatibility for ml NaiveBayes

2016-06-30 Thread Li Ping Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357606#comment-15357606
 ] 

Li Ping Zhang commented on SPARK-16241:
---

Thanks Sean and Yanbo.

> model loading backward compatibility for ml NaiveBayes
> --
>
> Key: SPARK-16241
> URL: https://issues.apache.org/jira/browse/SPARK-16241
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Assignee: Li Ping Zhang
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> To help users migrate from Spark 1.6 to 2.0, we should make model loading 
> backward compatible with models saved in 1.6. Please manually verify the fix.
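A minimal Scala sketch of the compatibility check being asked for; the model path is hypothetical:

{code}
import org.apache.spark.ml.classification.NaiveBayesModel

// Hypothetical path: a model written by spark.ml NaiveBayes under Spark 1.6.
val model = NaiveBayesModel.load("/tmp/nb-model-saved-with-spark-1.6")

// If loading succeeds, the learned parameters should be readable in 2.0:
println(model.pi)     // log prior probabilities per class
println(model.theta)  // log conditional probabilities
{code}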



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16316) dataframe except API returning wrong result in spark 1.5.0

2016-06-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357553#comment-15357553
 ] 

Xiao Li commented on SPARK-16316:
-

How about 1.6 and 2.0?

> dataframe except API returning wrong result in spark 1.5.0
> --
>
> Key: SPARK-16316
> URL: https://issues.apache.org/jira/browse/SPARK-16316
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Jacky Li
>
> Version: spark 1.5.0
> Use case:  use except API to do subtract between two dataframe
> scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
> dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]
> scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
> dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]
> scala> dfa.except(dfb).count
> res13: Long = 0
> It should return 90 instead of 0
> While following statement works fine
> scala> dfa.except(dfb).rdd.count
> res13: Long = 90
> I guess the bug maybe somewhere in dataframe.count



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16247) Using pyspark dataframe with pipeline and cross validator

2016-06-30 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357548#comment-15357548
 ] 

Bryan Cutler commented on SPARK-16247:
--

Great, glad that solved the problem!  A cross-validation example for Python has 
already been added for the upcoming 2.0 release, but there are plenty of other 
ways to contribute.  Make sure to check out 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark.  Could 
you please mark this issue as resolved?

> Using pyspark dataframe with pipeline and cross validator
> -
>
> Key: SPARK-16247
> URL: https://issues.apache.org/jira/browse/SPARK-16247
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
>Reporter: Edward Ma
>
> I am using pyspark with DataFrames, using a pipeline to train and 
> predict the result. It works fine for a single run.
> However, I hit an issue when using the pipeline together with CrossValidator. The issue is 
> that I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label and 
> feature columns. Those fields are built by StringIndexer and VectorIndexer, so they 
> should exist after the pipeline is executed. 
> When I dug into the pyspark library [python/pyspark/ml/tuning.py] (line 222, _fit 
> function and line 239, est.fit), I found that it does not execute the pipeline 
> stages. Therefore, I cannot get "indexedLabel" and "indexedMsg". 
> Would you mind advising whether my usage is correct or not?
> Thanks.
> Here is code snippet
> {noformat}
> // # Indexing
> labelIndexer = StringIndexer(inputCol="label", 
> outputCol="indexedLabel").fit(extracted_data)
> featureIndexer = VectorIndexer(inputCol="extracted_msg", 
> outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)
> // # Training
> classification_model = RandomForestClassifier(labelCol="indexedLabel", 
> featuresCol="indexedMsg", numTrees=50, maxDepth=20)
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, 
> classification_model])
> // # Cross Validation
> paramGrid = ParamGridBuilder().addGrid(classification_model.maxDepth, (10, 
> 20, 30)).build()
> cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
> cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, 
> evaluator=cvEvaluator, numFolds=10)
> cvModel = cv.fit(trainingData)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15896) Clean shuffle files after finish the SQL query

2016-06-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357555#comment-15357555
 ] 

Takeshi Yamamuro commented on SPARK-15896:
--

Anybody here? If not, I'll take this on: 
https://github.com/apache/spark/compare/master...maropu:SPARK-15896

> Clean shuffle files after finish the SQL query 
> ---
>
> Key: SPARK-15896
> URL: https://issues.apache.org/jira/browse/SPARK-15896
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> The ShuffleRDD in a SQL query cannot be reused later, so we could remove the 
> shuffle files after a query finishes to free the disk space as soon as possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16316) dataframe except API returning wrong result in spark 1.5.0

2016-06-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357569#comment-15357569
 ] 

Sean Owen commented on SPARK-16316:
---

I don't think we'd fix this unless it affected 1.6.

> dataframe except API returning wrong result in spark 1.5.0
> --
>
> Key: SPARK-16316
> URL: https://issues.apache.org/jira/browse/SPARK-16316
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Jacky Li
>
> Version: spark 1.5.0
> Use case:  use except API to do subtract between two dataframe
> scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
> dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]
> scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
> dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]
> scala> dfa.except(dfb).count
> res13: Long = 0
> It should return 90 instead of 0
> While following statement works fine
> scala> dfa.except(dfb).rdd.count
> res13: Long = 90
> I guess the bug maybe somewhere in dataframe.count



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357545#comment-15357545
 ] 

Xiao Li commented on SPARK-16329:
-

You can still get what you need with:
{noformat}
val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => Row.empty)
val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
dfNoCols.show
{noformat}

Is there any use case for Spark SQL to support such a scenario? Otherwise, I agree 
that we should issue a better error message.

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
>  

[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2016-06-30 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357535#comment-15357535
 ] 

Nicholas Chammas commented on SPARK-:
-

{quote}
Python itself has no compile time type safety.
{quote}

Practically speaking, this is no longer true. You can get a decent measure of 
"compile" time type safety using recent additions to Python (both the language 
itself and the ecosystem).

Specifically, optional static type checking has been a focus in Python since 
3.5+, and according to Python's BDFL both Google and Dropbox are updating large 
parts of their codebases to use Python's new typing features. Static type 
checkers for Python like [mypy|http://mypy-lang.org/] are already in use and 
are backed by several core Python developers, including Guido van Rossum 
(Python's creator/BDFL).

So I don't think Datasets are a critical feature for PySpark just yet, and it 
will take some time for the general Python community to learn and take 
advantage of Python's new optional static typing features and tools, but I 
would keep this on the radar.

> Dataset API on top of Catalyst/DataFrame
> 
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>
> The RDD API is very flexible, but as a result its execution is harder to 
> optimize in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but it lacks some of the nice perks of the RDD API (e.g. it is 
> harder to use UDFs, and there are no strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags, should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without writing conversion 
> boilerplate.  When names used in the input schema line up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However, 
> it will take a few more releases to flesh everything out.
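
(Illustrative only: a minimal sketch of the typed API and the Dataset/DataFrame 
interop described above, assuming a Spark 2.0 shell with a SparkSession 
{{spark}} and {{import spark.implicits._}} in scope; {{Order}} is a placeholder 
domain class.)

{code}
import org.apache.spark.sql.{DataFrame, Dataset}

case class Order(id: Long, amount: Double)

// Encoders for case classes come from spark.implicits._
val orders: Dataset[Order] = Seq(Order(1L, 10.5), Order(2L, 7.0)).toDS()

// Typed, domain-object transformations still run through the Spark SQL engine.
val bigOrderIds: Dataset[Long] = orders.filter(_.amount > 8.0).map(_.id)

// Seamless transition in both directions: column names line up with fields.
val df: DataFrame = orders.toDF()
val back: Dataset[Order] = df.as[Order]
{code}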



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16331) [SQL] Reduce code generation time

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357519#comment-15357519
 ] 

Apache Spark commented on SPARK-16331:
--

User 'inouehrs' has created a pull request for this issue:
https://github.com/apache/spark/pull/14000

> [SQL] Reduce code generation time 
> --
>
> Key: SPARK-16331
> URL: https://issues.apache.org/jira/browse/SPARK-16331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Hiroshi Inoue
>
> During the code generation, a {{LocalRelation}} often has a huge {{Vector}} 
> object as {{data}}. In the simple example below, a {{LocalRelation}} has a 
> Vector with 100 elements of {{UnsafeRow}}. 
> {quote}
> val numRows = 100
> val ds = (1 to numRows).toDS().persist()
> benchmark.addCase("filter+reduce") { iter =>
>   ds.filter(a => (a & 1) == 0).reduce(_ + _)
> }
> {quote}
> At {{TreeNode.transformChildren}}, all elements of the vector are 
> unnecessarily iterated to check whether any children exist in the vector 
> since {{Vector}} is Traversable. This part significantly increases code 
> generation time.
> This patch avoids this overhead by checking the number of children before 
> iterating all elements; {{LocalRelation}} does not have children since it 
> extends {{LeafNode}}.
> The performance of the above example:
> {quote}
> without this patch
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.11.5
> Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
> compilationTime:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------
> filter+reduce                          4426 / 4533          0.2        4426.0       1.0X
>
> with this patch
> compilationTime:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------
> filter+reduce                          3117 / 3391          0.3        3116.6       1.0X
> {quote}
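
(A self-contained, simplified sketch of the short-circuit described above; it 
is not the actual Catalyst {{TreeNode}} code. The point is to consult 
{{children}}, which is empty and cheap for a leaf, before scanning a node's 
constructor arguments, so a leaf that happens to hold a huge {{Vector}} is 
never traversed in search of child nodes.)

{code}
object TransformChildrenSketch {
  sealed trait Node { def children: Seq[Node] }

  // A leaf carries bulky data but, by definition, no child nodes.
  final case class Leaf(data: Vector[Int]) extends Node {
    override val children: Seq[Node] = Nil
  }

  final case class Branch(left: Node, right: Node) extends Node {
    override val children: Seq[Node] = Seq(left, right)
  }

  def transformChildren(node: Node, rule: Node => Node): Node =
    if (node.children.isEmpty) node        // short-circuit: nothing to transform
    else node match {
      case Branch(l, r) => Branch(rule(l), rule(r))
      case other        => other
    }

  def main(args: Array[String]): Unit = {
    val bulky = Leaf(Vector.range(0, 1000000))   // large payload, zero children
    val tree  = Branch(bulky, Leaf(Vector(1, 2, 3)))
    val out   = transformChildren(tree, identity)
    println(out.children.size)                   // 2; bulky.data was never iterated
  }
}
{code}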



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16331) [SQL] Reduce code generation time

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16331:


Assignee: (was: Apache Spark)

> [SQL] Reduce code generation time 
> --
>
> Key: SPARK-16331
> URL: https://issues.apache.org/jira/browse/SPARK-16331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Hiroshi Inoue
>
> During the code generation, a {{LocalRelation}} often has a huge {{Vector}} 
> object as {{data}}. In the simple example below, a {{LocalRelation}} has a 
> Vector with 100 elements of {{UnsafeRow}}. 
> {quote}
> val numRows = 100
> val ds = (1 to numRows).toDS().persist()
> benchmark.addCase("filter+reduce") { iter =>
>   ds.filter(a => (a & 1) == 0).reduce(_ + _)
> }
> {quote}
> At {{TreeNode.transformChildren}}, all elements of the vector are 
> unnecessarily iterated to check whether any children exist in the vector 
> since {{Vector}} is Traversable. This part significantly increases code 
> generation time.
> This patch avoids this overhead by checking the number of children before 
> iterating all elements; {{LocalRelation}} does not have children since it 
> extends {{LeafNode}}.
> The performance of the above example:
> {quote}
> without this patch
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.11.5
> Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
> compilationTime:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------
> filter+reduce                          4426 / 4533          0.2        4426.0       1.0X
>
> with this patch
> compilationTime:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------
> filter+reduce                          3117 / 3391          0.3        3116.6       1.0X
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16331) [SQL] Reduce code generation time

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16331:


Assignee: Apache Spark

> [SQL] Reduce code generation time 
> --
>
> Key: SPARK-16331
> URL: https://issues.apache.org/jira/browse/SPARK-16331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Hiroshi Inoue
>Assignee: Apache Spark
>
> During the code generation, a {{LocalRelation}} often has a huge {{Vector}} 
> object as {{data}}. In the simple example below, a {{LocalRelation}} has a 
> Vector with 100 elements of {{UnsafeRow}}. 
> {quote}
> val numRows = 100
> val ds = (1 to numRows).toDS().persist()
> benchmark.addCase("filter+reduce") { iter =>
>   ds.filter(a => (a & 1) == 0).reduce(_ + _)
> }
> {quote}
> At {{TreeNode.transformChildren}}, all elements of the vector are 
> unnecessarily iterated to check whether any children exist in the vector 
> since {{Vector}} is Traversable. This part significantly increases code 
> generation time.
> This patch avoids this overhead by checking the number of children before 
> iterating all elements; {{LocalRelation}} does not have children since it 
> extends {{LeafNode}}.
> The performance of the above example:
> {quote}
> without this patch
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.11.5
> Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
> compilationTime:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------
> filter+reduce                          4426 / 4533          0.2        4426.0       1.0X
>
> with this patch
> compilationTime:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------
> filter+reduce                          3117 / 3391          0.3        3116.6       1.0X
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16331) [SQL] Reduce code generation time

2016-06-30 Thread Hiroshi Inoue (JIRA)
Hiroshi Inoue created SPARK-16331:
-

 Summary: [SQL] Reduce code generation time 
 Key: SPARK-16331
 URL: https://issues.apache.org/jira/browse/SPARK-16331
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0, 2.1.0
Reporter: Hiroshi Inoue


During the code generation, a {{LocalRelation}} often has a huge {{Vector}} 
object as {{data}}. In the simple example below, a {{LocalRelation}} has a 
Vector with 100 elements of {{UnsafeRow}}. 

{quote}
val numRows = 100
val ds = (1 to numRows).toDS().persist()
benchmark.addCase("filter+reduce") { iter =>
  ds.filter(a => (a & 1) == 0).reduce(_ + _)
}
{quote}

At {{TreeNode.transformChildren}}, all elements of the vector are unnecessarily 
iterated to check whether any children exist in the vector since {{Vector}} is 
Traversable. This part significantly increases code generation time.

This patch avoids this overhead by checking the number of children before 
iterating all elements; {{LocalRelation}} does not have children since it 
extends {{LeafNode}}.

The performance of the above example:
{quote}
without this patch
Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.11.5
Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
compilationTime:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------
filter+reduce                          4426 / 4533          0.2        4426.0       1.0X

with this patch
compilationTime:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------
filter+reduce                          3117 / 3391          0.3        3116.6       1.0X
{quote}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-16281) Implement parse_url SQL function

2016-06-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16281:

Comment: was deleted

(was: I can work on this one too.
)

> Implement parse_url SQL function
> 
>
> Key: SPARK-16281
> URL: https://issues.apache.org/jira/browse/SPARK-16281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
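
(The ticket carries no description. For context, a hypothetical usage sketch of 
what a Spark SQL {{parse_url}} call would look like once implemented, assuming 
it mirrors Hive's {{parse_url}} semantics (part names such as HOST, PATH, 
QUERY). Written against a SparkSession {{spark}}; the URL is a placeholder.)

{code}
val url = "http://spark.apache.org/docs/latest/?x=1#frag"
spark.sql(s"SELECT parse_url('$url', 'HOST') AS host, " +
          s"parse_url('$url', 'PATH') AS path, " +
          s"parse_url('$url', 'QUERY', 'x') AS x").show()
// Expected (per Hive's semantics): host = spark.apache.org, path = /docs/latest/, x = 1
{code}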




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


