[jira] [Commented] (SPARK-9135) Filter fails when filtering with a method reference to overloaded method

2017-01-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822742#comment-15822742
 ] 

Hyukjin Kwon commented on SPARK-9135:
-

It still happens in the master branch.

> Filter fails when filtering with a method reference to overloaded method
> 
>
> Key: SPARK-9135
> URL: https://issues.apache.org/jira/browse/SPARK-9135
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.4.0
>Reporter: Mateusz Michalowski
>
> Filter fails when filtering with a method reference to overloaded method.
> In the example below we filter by Fruit::isRed, which is overloaded by 
> Apple::isRed and Banana::isRed. 
> {code}
> apples.filter(Fruit::isRed)
> bananas.filter(Fruit::isRed) //throws!
> {code}
> Spark will try to cast Apple::isRed to Banana::isRed - and then throw as a 
> result.
> However, if we filter the more generic RDD first, all works fine:
> {code}
> fruit.filter(Fruit::isRed)
> bananas.filter(Fruit::isRed) //works fine!
> {code}
> It also works well if we use a lambda instead of the method reference:
> {code}
> apples.filter(f -> f.isRed())
> bananas.filter(f -> f.isRed()) //works fine!
> {code} 
> I attach a test setup below:
> {code:java}
> package com.doggybites;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.junit.After;
> import org.junit.Before;
> import org.junit.Test;
> import java.io.Serializable;
> import java.util.Arrays;
> import static org.hamcrest.CoreMatchers.equalTo;
> import static org.junit.Assert.assertThat;
>
> public class SparkTest {
>     static abstract class Fruit implements Serializable {
>         abstract boolean isRed();
>     }
>
>     static class Banana extends Fruit {
>         @Override
>         boolean isRed() {
>             return false;
>         }
>     }
>
>     static class Apple extends Fruit {
>         @Override
>         boolean isRed() {
>             return true;
>         }
>     }
>
>     private JavaSparkContext sparkContext;
>
>     @Before
>     public void setUp() throws Exception {
>         SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local[2]");
>         sparkContext = new JavaSparkContext(sparkConf);
>     }
>
>     @After
>     public void tearDown() throws Exception {
>         sparkContext.stop();
>     }
>
>     private <T> JavaRDD<T> toRdd(T... array) {
>         return sparkContext.parallelize(Arrays.asList(array));
>     }
>
>     @Test
>     public void filters_apples_and_bananas_with_method_reference() {
>         JavaRDD<Apple> appleRdd = toRdd(new Apple());
>         JavaRDD<Banana> bananaRdd = toRdd(new Banana());
>
>         long redAppleCount = appleRdd.filter(Fruit::isRed).count();
>         long redBananaCount = bananaRdd.filter(Fruit::isRed).count();
>         assertThat(redAppleCount, equalTo(1L));
>         assertThat(redBananaCount, equalTo(0L));
>     }
> }
> {code}
> The test above throws:
> {code}
> 15/07/17 14:10:04 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
> java.lang.ClassCastException: com.doggybites.SparkTest$Banana cannot be cast 
> to com.doggybites.SparkTest$Apple
>   at com.doggybites.SparkTest$$Lambda$2/976119300.call(Unknown Source)
>   at 
> org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.apply(JavaRDD.scala:78)
>   at 
> org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.apply(JavaRDD.scala:78)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/07/17 14:10:04 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 3, 
> localhost): java.lang.ClassCastException: com.doggybites.SparkTest$Banana 
> cannot be cast to com.doggybites.SparkTest$Apple
>   at com.doggybites.SparkTest$$Lambda$2/976119300.call(Unknown Source)
>   at 
> org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.apply(JavaRDD.scala:78)
>   at 
> 

[jira] [Resolved] (SPARK-6645) StructField/StructType and related classes are not in the Scaladoc

2017-01-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-6645.
-
Resolution: Not A Problem

It seems these are already documented in the Scaladoc/Javadoc.


> StructField/StructType and related classes are not in the Scaladoc
> --
>
> Key: SPARK-6645
> URL: https://issues.apache.org/jira/browse/SPARK-6645
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Aaron Defazio
>Priority: Minor
>
> The current programming guide uses StructField in the Scala examples, yet it 
> doesn't appear to exist in the Scaladoc. This is related to SPARK-6592, in 
> that several classes that a user might use do not appear in the Scaladoc. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19222) Limit Query Performance issue

2017-01-13 Thread Sujith (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith updated SPARK-19222:
---
Description: 
A performance/memory bottleneck occurs in the queries below.
case 1:
create table t1 as select * from dest1 limit 1000;
case 2:
create table t1 as select * from dest1 limit 1000;
pre-condition: partition count >= 1

In the above cases the limit is added at the end of the physical plan:

== Physical Plan ==
ExecutedCommand
   +- CreateHiveTableAsSelectCommand [Database:spark, TableName: t2, InsertIntoHiveTable]
         +- GlobalLimit 1000
            +- LocalLimit 1000
               +- Project [imei#101, age#102, task#103L, num#104, level#105, productdate#106, name#107, point#108]
                  +- SubqueryAlias hive
                     +- Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108] csv

Issue hints:

Possible bottleneck snippet in limit.scala under the spark-sql package:
  protected override def doExecute(): RDD[InternalRow] = {
    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
    val shuffled = new ShuffledRowRDD(
      ShuffleExchange.prepareShuffleDependency(
        locallyLimited, child.output, SinglePartition, serializer))
    shuffled.mapPartitionsInternal(_.take(limit))
  }

As mentioned above, in case 1 (where the limit value is 1000 or the partition count is > 1) and case 2 (where the limit value is small, around 1000), the snippet creates the ShuffledRowRDD by grouping the locally limited data from all partitions into a single partition on an executor. A memory issue occurs because all of the per-partition limit data is collected and processed in that single partition; in both the former and the latter case the row count can get very high, which creates the memory bottleneck.

Proposed solution for case 2:
An accumulator can be sent to all partitions, and every executor updates the accumulator based on the data it fetches.
e.g.: number of partitions = 100, number of cores = 10
Ideally tasks will be launched in groups of 10 tasks per core; once the first group finishes, the driver checks whether the accumulator value has reached the limit. If it has, no further tasks are launched to the executors, and the result after applying the limit is returned (see the sketch below).
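A rough Scala sketch of this idea (illustrative only; the helper name, the batching logic, and the accumulator name are assumptions, not Spark's actual limit code):

{code}
import scala.reflect.ClassTag
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative driver-side helper: launch tasks batch by batch and stop
// scheduling new batches once the accumulator shows that `limit` rows have
// already been produced, instead of shuffling every partition's local limit
// into a single partition.
def takeWithEarlyStop[T: ClassTag](sc: SparkContext, rdd: RDD[T],
                                   limit: Int, batchSize: Int): Seq[T] = {
  val produced = sc.longAccumulator("rowsProduced")    // updated by every task
  val collected = ArrayBuffer.empty[T]
  var next = 0
  while (next < rdd.getNumPartitions && produced.value < limit) {
    val batch = next until math.min(next + batchSize, rdd.getNumPartitions)
    val results = sc.runJob(rdd, (it: Iterator[T]) => {
      val rows = it.take(limit).toArray                // local limit per partition
      produced.add(rows.length)
      rows
    }, batch)
    results.foreach(collected ++= _)
    next += batchSize
  }
  collected.take(limit)                                // final global limit
}
{code}

The driver still pulls each batch's results back, so this only helps when the limit is much smaller than the full data set.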

Please let me know if you have any suggestions or solutions for the problems mentioned above.

Thanks,
Sujith

  was:
A performance/memory bottleneck occurs in the queries below.
case 1:
create table t1 as select * from dest1 limit 1000;
case 2:
create table t1 as select * from dest1 limit 1000;
pre-condition: partition count >= 1

In the above cases the limit is added at the end of the physical plan:

== Physical Plan ==
ExecutedCommand
   +- CreateHiveTableAsSelectCommand [Database:spark, TableName: t2, InsertIntoHiveTable]
         +- GlobalLimit 1000
            +- LocalLimit 1000
               +- Project [imei#101, age#102, task#103L, num#104, level#105, productdate#106, name#107, point#108]
                  +- SubqueryAlias hive
                     +- Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108] csv

Issue hints:

Possible bottleneck snippet in limit.scala under the spark-sql package:
  protected override def doExecute(): RDD[InternalRow] = {
    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
    val shuffled = new ShuffledRowRDD(
      ShuffleExchange.prepareShuffleDependency(
        locallyLimited, child.output, SinglePartition, serializer))
    shuffled.mapPartitionsInternal(_.take(limit))
  }

As mentioned above, in case 1 (where the limit value is 1000 or the partition count is > 1) and case 2 (where the limit value is small, around 1000), the snippet creates the ShuffledRowRDD by grouping the locally limited data from all partitions into a single partition on an executor. A memory issue occurs because all of the per-partition limit data is collected and processed in that single partition; in both the former and the latter case the row count can get very high, which creates the memory bottleneck.

Proposed solution for case 2:
An accumulator can be sent to all partitions, and every executor updates the accumulator based on the data it fetches.
e.g.: number of partitions = 100, number of cores = 10
Ideally tasks will be launched in groups of 10 tasks per core; once the first group finishes, the driver checks whether the accumulator value has reached the limit. If it has, no further tasks are launched to the executors, and the result after applying the limit is returned.

Please let me know if you have any suggestions or solutions for these problems.


> Limit Query Performance issue
> -
>
> Key: 

[jira] [Commented] (SPARK-19223) InputFileBlockHolder doesn't work with Python UDF for datasource other than FileFormat

2017-01-13 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822733#comment-15822733
 ] 

Liang-Chi Hsieh commented on SPARK-19223:
-

Hi [~someonehere15],

For the issue with the spark-xml package where applying a UDF to the 
input_file_name column returns an empty result, I created this JIRA and 
submitted a PR to fix it.

> InputFileBlockHolder doesn't work with Python UDF for datasource other than 
> FileFormat
> --
>
> Key: SPARK-19223
> URL: https://issues.apache.org/jira/browse/SPARK-19223
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Liang-Chi Hsieh
>
> For datasources other than FileFormat, such as spark-xml, which is based on 
> BaseRelation and uses HadoopRDD/NewHadoopRDD, InputFileBlockHolder doesn't 
> work with Python UDFs.
> To reproduce it, run the following code with {{bin/pyspark 
> --packages com.databricks:spark-xml_2.11:0.4.1}}:
> {code}
> from pyspark.sql.functions import udf,input_file_name
> from pyspark.sql.types import StringType
> from pyspark.sql import SparkSession
> def filename(path):
> return path
> session = SparkSession.builder.appName('APP').getOrCreate()
> session.udf.register('sameText',filename)
> sameText = udf(filename, StringType())
> df = session.read.format('xml').load('a.xml', 
> rowTag='root').select('*',input_file_name().alias('file'))
> df.select('file').show()  # works
> df.select(sameText(df['file'])).show()  # returns empty content
> {code}
> a.xml:
> {code}
> 
>   TEXT
>   TEXT2
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4862) Streaming | Setting checkpoint as a local directory results in Checkpoint RDD has different partitions error

2017-01-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822731#comment-15822731
 ] 

Hyukjin Kwon commented on SPARK-4862:
-

[~aniket] Would you be able to try this in 2.x or the current master, please? I 
think this has been fixed at some point.

> Streaming | Setting checkpoint as a local directory results in Checkpoint RDD 
> has different partitions error
> 
>
> Key: SPARK-4862
> URL: https://issues.apache.org/jira/browse/SPARK-4862
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Aniket Bhatnagar
>Priority: Minor
>
> If the checkpoint is set as a local filesystem directory, it results in weird 
> error messages like the following:
> org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[467] at apply 
> at List.scala:318(0) has different number of partitions than original RDD 
> MapPartitionsRDD[461] at mapPartitions at StateDStream.scala:71(56)
> It would be great if Spark could output a better error message that hints 
> at what could have gone wrong.
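
A minimal sketch of the kind of streaming job that hits this (an assumed example, not taken from the report; the stateful transformation is what forces RDD checkpointing):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumed reproduction sketch, submitted to a cluster via spark-submit: a
// stateful DStream whose checkpoint directory is a plain local path. The
// checkpoint files written on other nodes are then presumably not all visible
// when the RDD is read back, which surfaces as the partition-count error above.
val conf = new SparkConf().setAppName("checkpoint-repro")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/streaming-checkpoint")            // local directory, not HDFS

val counts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .updateStateByKey[Int]((values: Seq[Int], state: Option[Int]) =>
    Some(values.sum + state.getOrElse(0)))             // state => checkpointed RDDs

counts.print()
ssc.start()
ssc.awaitTermination()
{code}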



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19223) InputFileBlockHolder doesn't work with Python UDF for datasource other than FileFormat

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19223:


Assignee: (was: Apache Spark)

> InputFileBlockHolder doesn't work with Python UDF for datasource other than 
> FileFormat
> --
>
> Key: SPARK-19223
> URL: https://issues.apache.org/jira/browse/SPARK-19223
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Liang-Chi Hsieh
>
> For datasources other than FileFormat, such as spark-xml, which is based on 
> BaseRelation and uses HadoopRDD/NewHadoopRDD, InputFileBlockHolder doesn't 
> work with Python UDFs.
> To reproduce it, run the following code with {{bin/pyspark 
> --packages com.databricks:spark-xml_2.11:0.4.1}}:
> {code}
> from pyspark.sql.functions import udf,input_file_name
> from pyspark.sql.types import StringType
> from pyspark.sql import SparkSession
> def filename(path):
> return path
> session = SparkSession.builder.appName('APP').getOrCreate()
> session.udf.register('sameText',filename)
> sameText = udf(filename, StringType())
> df = session.read.format('xml').load('a.xml', 
> rowTag='root').select('*',input_file_name().alias('file'))
> df.select('file').show()  # works
> df.select(sameText(df['file'])).show()  # returns empty content
> {code}
> a.xml:
> {code}
> 
>   TEXT
>   TEXT2
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19223) InputFileBlockHolder doesn't work with Python UDF for datasource other than FileFormat

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822729#comment-15822729
 ] 

Apache Spark commented on SPARK-19223:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/16585

> InputFileBlockHolder doesn't work with Python UDF for datasource other than 
> FileFormat
> --
>
> Key: SPARK-19223
> URL: https://issues.apache.org/jira/browse/SPARK-19223
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Liang-Chi Hsieh
>
> For datasources other than FileFormat, such as spark-xml, which is based on 
> BaseRelation and uses HadoopRDD/NewHadoopRDD, InputFileBlockHolder doesn't 
> work with Python UDFs.
> To reproduce it, run the following code with {{bin/pyspark 
> --packages com.databricks:spark-xml_2.11:0.4.1}}:
> {code}
> from pyspark.sql.functions import udf,input_file_name
> from pyspark.sql.types import StringType
> from pyspark.sql import SparkSession
> def filename(path):
> return path
> session = SparkSession.builder.appName('APP').getOrCreate()
> session.udf.register('sameText',filename)
> sameText = udf(filename, StringType())
> df = session.read.format('xml').load('a.xml', 
> rowTag='root').select('*',input_file_name().alias('file'))
> df.select('file').show()  # works
> df.select(sameText(df['file'])).show()  # returns empty content
> {code}
> a.xml:
> {code}
> 
>   TEXT
>   TEXT2
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19223) InputFileBlockHolder doesn't work with Python UDF for datasource other than FileFormat

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19223:


Assignee: Apache Spark

> InputFileBlockHolder doesn't work with Python UDF for datasource other than 
> FileFormat
> --
>
> Key: SPARK-19223
> URL: https://issues.apache.org/jira/browse/SPARK-19223
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> For datasources other than FileFormat, such as spark-xml, which is based on 
> BaseRelation and uses HadoopRDD/NewHadoopRDD, InputFileBlockHolder doesn't 
> work with Python UDFs.
> To reproduce it, run the following code with {{bin/pyspark 
> --packages com.databricks:spark-xml_2.11:0.4.1}}:
> {code}
> from pyspark.sql.functions import udf,input_file_name
> from pyspark.sql.types import StringType
> from pyspark.sql import SparkSession
> def filename(path):
> return path
> session = SparkSession.builder.appName('APP').getOrCreate()
> session.udf.register('sameText',filename)
> sameText = udf(filename, StringType())
> df = session.read.format('xml').load('a.xml', 
> rowTag='root').select('*',input_file_name().alias('file'))
> df.select('file').show()  # works
> df.select(sameText(df['file'])).show()  # returns empty content
> {code}
> a.xml:
> {code}
> 
>   TEXT
>   TEXT2
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2017-01-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822726#comment-15822726
 ] 

Hyukjin Kwon commented on SPARK-2620:
-

^ I can still reproduce this.

> case class cannot be used as key for reduce
> ---
>
> Key: SPARK-2620
> URL: https://issues.apache.org/jira/browse/SPARK-2620
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.0.0, 1.1.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.1.0
> Environment: reproduced on spark-shell local[4]
>Reporter: Gerard Maas
>Assignee: Tobias Schlatter
>Priority: Critical
>  Labels: case-class, core
>
> Using a case class as a key doesn't seem to work properly on Spark 1.0.0
> A minimal example:
> case class P(name:String)
> val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
> sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect
> [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), 
> (P(bob),1), (P(abe),1), (P(charly),1))
> This contrasts with the expected behavior, which should be equivalent to:
> sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x,y) => x+y).collect
> Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
> groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3249) Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc`

2017-01-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822722#comment-15822722
 ] 

Hyukjin Kwon commented on SPARK-3249:
-

It prints the warnings below after building the docs via {{jekyll build}}:

{code}
[warn] .../spark/core/src/main/scala/org/apache/spark/Accumulator.scala:20: The 
link target "SparkContext#accumulator" is ambiguous. Several members fit the 
target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala:281:
 The link target "runMiniBatchSGD" is ambiguous. Several members fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala:83:
 The link target "run" is ambiguous. Several members fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala:221: 
The link target "run" is ambiguous. Several members fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala:148: 
The link target "org.apache.spark.mllib.util.MLUtils#loadLibSVMFile" is 
ambiguous. Several members fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:239:
 The link target "RandomRDDs#exponentialJavaRDD" is ambiguous. Several members 
fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:227:
 The link target "RandomRDDs#exponentialJavaRDD" is ambiguous. Several members 
fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:750:
 The link target "RandomRDDs#exponentialJavaVectorRDD" is ambiguous. Several 
members fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:737:
 The link target "RandomRDDs#exponentialJavaVectorRDD" is ambiguous. Several 
members fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:298:
 The link target "RandomRDDs#gammaJavaRDD" is ambiguous. Several members fit 
the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:285:
 The link target "RandomRDDs#gammaJavaRDD" is ambiguous. Several members fit 
the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:819:
 The link target "RandomRDDs#gammaJavaVectorRDD" is ambiguous. Several members 
fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:805:
 The link target "RandomRDDs#gammaJavaVectorRDD" is ambiguous. Several members 
fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:361:
 The link target "RandomRDDs#logNormalJavaRDD" is ambiguous. Several members 
fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:348:
 The link target "RandomRDDs#logNormalJavaRDD" is ambiguous. Several members 
fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:621:
 The link target "RandomRDDs#logNormalJavaVectorRDD" is ambiguous. Several 
members fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:607:
 The link target "RandomRDDs#logNormalJavaVectorRDD" is ambiguous. Several 
members fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:129:
 The link target "RandomRDDs#normalJavaRDD" is ambiguous. Several members fit 
the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:121:
 The link target "RandomRDDs#normalJavaRDD" is ambiguous. Several members fit 
the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:554:
 The link target "RandomRDDs#normalJavaVectorRDD" is ambiguous. Several members 
fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:542:
 The link target "RandomRDDs#normalJavaVectorRDD" is ambiguous. Several members 
fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:184:
 The link target "RandomRDDs#poissonJavaRDD" is ambiguous. Several members fit 
the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:172:
 The link target "RandomRDDs#poissonJavaRDD" is ambiguous. Several members fit 
the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:686:
 The link target "RandomRDDs#poissonJavaVectorRDD" is ambiguous. Several 
members fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:673:
 The link target "RandomRDDs#poissonJavaVectorRDD" is ambiguous. Several 
members fit the target:
[warn] 
.../spark/mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala:434:
 

[jira] [Created] (SPARK-19223) InputFileBlockHolder doesn't work with Python UDF for datasource other than FileFormat

2017-01-13 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-19223:
---

 Summary: InputFileBlockHolder doesn't work with Python UDF for 
datasource other than FileFormat
 Key: SPARK-19223
 URL: https://issues.apache.org/jira/browse/SPARK-19223
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Liang-Chi Hsieh


For datasources other than FileFormat, such as spark-xml, which is based on 
BaseRelation and uses HadoopRDD/NewHadoopRDD, InputFileBlockHolder doesn't 
work with Python UDFs.

To reproduce it, run the following code with {{bin/pyspark 
--packages com.databricks:spark-xml_2.11:0.4.1}}:


{code}
from pyspark.sql.functions import udf,input_file_name
from pyspark.sql.types import StringType
from pyspark.sql import SparkSession

def filename(path):
return path

session = SparkSession.builder.appName('APP').getOrCreate()

session.udf.register('sameText',filename)
sameText = udf(filename, StringType())

df = session.read.format('xml').load('a.xml', 
rowTag='root').select('*',input_file_name().alias('file'))
df.select('file').show()  # works
df.select(sameText(df['file'])).show()  # returns empty content
{code}

a.xml:
{code}

  TEXT
  TEXT2

{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19222) Limit Query Performance issue

2017-01-13 Thread Sujith (JIRA)
Sujith created SPARK-19222:
--

 Summary: Limit Query Performance issue
 Key: SPARK-19222
 URL: https://issues.apache.org/jira/browse/SPARK-19222
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
 Environment: Linux/Windows
Reporter: Sujith
Priority: Minor


A performance/memory bottleneck occurs in the queries below.
case 1:
create table t1 as select * from dest1 limit 1000;
case 2:
create table t1 as select * from dest1 limit 1000;
pre-condition: partition count >= 1

In the above cases the limit is added at the end of the physical plan:

== Physical Plan ==
ExecutedCommand
   +- CreateHiveTableAsSelectCommand [Database:spark, TableName: t2, InsertIntoHiveTable]
         +- GlobalLimit 1000
            +- LocalLimit 1000
               +- Project [imei#101, age#102, task#103L, num#104, level#105, productdate#106, name#107, point#108]
                  +- SubqueryAlias hive
                     +- Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108] csv

Issue hints:

Possible bottleneck snippet in limit.scala under the spark-sql package:
  protected override def doExecute(): RDD[InternalRow] = {
    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
    val shuffled = new ShuffledRowRDD(
      ShuffleExchange.prepareShuffleDependency(
        locallyLimited, child.output, SinglePartition, serializer))
    shuffled.mapPartitionsInternal(_.take(limit))
  }

As mentioned above, in case 1 (where the limit value is 1000 or the partition count is > 1) and case 2 (where the limit value is small, around 1000), the snippet creates the ShuffledRowRDD by grouping the locally limited data from all partitions into a single partition on an executor. A memory issue occurs because all of the per-partition limit data is collected and processed in that single partition; in both the former and the latter case the row count can get very high, which creates the memory bottleneck.

Proposed solution for case 2:
An accumulator can be sent to all partitions, and every executor updates the accumulator based on the data it fetches.
e.g.: number of partitions = 100, number of cores = 10
Ideally tasks will be launched in groups of 10 tasks per core; once the first group finishes, the driver checks whether the accumulator value has reached the limit. If it has, no further tasks are launched to the executors, and the result after applying the limit is returned.

Please let me know if you have any suggestions or solutions for the problems described above.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2017-01-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822717#comment-15822717
 ] 

Hyukjin Kwon commented on SPARK-2356:
-

Is this really a Spark-related issue?

> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
> ---
>
> Key: SPARK-2356
> URL: https://issues.apache.org/jira/browse/SPARK-2356
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.0.0, 1.1.1, 1.2.1, 1.2.2, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 
> 1.5.1, 1.5.2
>Reporter: Kostiantyn Kudriavtsev
>Priority: Critical
>
> I'm trying to run some transformations on Spark. It works fine on a cluster 
> (YARN, Linux machines). However, when I try to run it on a local machine 
> (Windows 7) under a unit test, I get errors (I don't use Hadoop, I read files 
> from the local filesystem):
> {code}
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
>   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>   at org.apache.hadoop.util.Shell.(Shell.java:326)
>   at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>   at org.apache.hadoop.security.Groups.(Groups.java:77)
>   at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
>   at org.apache.spark.SparkContext.(SparkContext.scala:228)
>   at org.apache.spark.SparkContext.(SparkContext.scala:97)
> {code}
> This happens because the Hadoop config is initialized each time a Spark 
> context is created, regardless of whether Hadoop is required or not.
> I propose adding a special flag to indicate whether the Hadoop config is 
> required (or starting this configuration manually).
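
For reference, the usual workaround on Windows is to make {{winutils.exe}} resolvable before the SparkContext is created; a minimal Scala sketch (the path is an assumed example, and this is a workaround rather than the flag proposed above):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Assumed workaround sketch: point Hadoop at a directory containing
// bin\winutils.exe before creating the context, so Shell resolves a real path
// instead of "null\bin\winutils.exe" (setting the HADOOP_HOME environment
// variable to the same directory works as well).
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val sc = new SparkContext(
  new SparkConf().setAppName("local-test").setMaster("local[2]"))
{code}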



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2153) CassandraTest fails for newer Cassandra due to case insensitive key space

2017-01-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-2153.
-
Resolution: Not A Problem

It seems we no longer have this test in the master branch. It seems related to 
SPARK-14744.

> CassandraTest fails for newer Cassandra due to case insensitive key space
> -
>
> Key: SPARK-2153
> URL: https://issues.apache.org/jira/browse/SPARK-2153
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.0.0
>Reporter: vishnu
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> The Spark example CassandraTest.scala cannot be built on newer versions 
> of Cassandra. I tried it on Cassandra 2.0.8.
> This is because Cassandra is case sensitive about key spaces and stores 
> all keyspaces in lowercase, and in the example the keyspace is "casDemo", 
> so the program fails with an error stating that the keyspace was not found.
> The new Cassandra jars do not have org.apache.cassandra.db.IColumn, so 
> instead we have to use org.apache.cassandra.db.Column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19221) Add winutils binaries to Path in AppVeyor for Hadoop libraries to call native libraries properly

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822693#comment-15822693
 ] 

Apache Spark commented on SPARK-19221:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16584

> Add winutils binaries to Path in AppVeyor for Hadoop libraries to call native 
> libraries properly
> 
>
> Key: SPARK-19221
> URL: https://issues.apache.org/jira/browse/SPARK-19221
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, SparkR
>Reporter: Hyukjin Kwon
>
> It seems Hadoop libraries need {{hadoop.dll}} in the Path for native library 
> calls. It is not a problem in tests for now, because we are only testing SparkR 
> on Windows via AppVeyor, but it can become a problem if we run Scala tests via 
> AppVeyor, as below:
> {code}
>  - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 
> seconds, 937 milliseconds)
>org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
> Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. 
> org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:272)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:609)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:599)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:159)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>at org.scalatest.Transformer.apply(Transformer.scala:22)
>at org.scalatest.Transformer.apply(Transformer.scala:20)
>at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>at scala.collection.immutable.List.foreach(List.scala:381)
>at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>at org.scalatest.Suite$class.run(Suite.scala:1424)
>at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>at 

[jira] [Assigned] (SPARK-19221) Add winutils binaries to Path in AppVeyor for Hadoop libraries to call native libraries properly

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19221:


Assignee: (was: Apache Spark)

> Add winutils binaries to Path in AppVeyor for Hadoop libraries to call native 
> libraries properly
> 
>
> Key: SPARK-19221
> URL: https://issues.apache.org/jira/browse/SPARK-19221
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, SparkR
>Reporter: Hyukjin Kwon
>
> It seems Hadoop libraries need {{hadoop.dll}} in the Path for native library 
> calls. It is not a problem in tests for now, because we are only testing SparkR 
> on Windows via AppVeyor, but it can become a problem if we run Scala tests via 
> AppVeyor, as below:
> {code}
>  - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 
> seconds, 937 milliseconds)
>org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
> Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. 
> org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:272)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:609)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:599)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:159)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>at org.scalatest.Transformer.apply(Transformer.scala:22)
>at org.scalatest.Transformer.apply(Transformer.scala:20)
>at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>at scala.collection.immutable.List.foreach(List.scala:381)
>at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>at org.scalatest.Suite$class.run(Suite.scala:1424)
>at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
>at 
> 

[jira] [Assigned] (SPARK-19221) Add winutils binaries to Path in AppVeyor for Hadoop libraries to call native libraries properly

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19221:


Assignee: Apache Spark

> Add winutils binaries to Path in AppVeyor for Hadoop libraries to call native 
> libraries properly
> 
>
> Key: SPARK-19221
> URL: https://issues.apache.org/jira/browse/SPARK-19221
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, SparkR
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> It seems Hadoop libraries need {{hadoop.dll}} in the Path for native library 
> calls. It is not a problem in tests for now, because we are only testing SparkR 
> on Windows via AppVeyor, but it can become a problem if we run Scala tests via 
> AppVeyor, as below:
> {code}
>  - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 
> seconds, 937 milliseconds)
>org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
> Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. 
> org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:272)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:609)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:599)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:159)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>at org.scalatest.Transformer.apply(Transformer.scala:22)
>at org.scalatest.Transformer.apply(Transformer.scala:20)
>at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>at scala.collection.immutable.List.foreach(List.scala:381)
>at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>at org.scalatest.Suite$class.run(Suite.scala:1424)
>at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
>at 
> 

[jira] [Created] (SPARK-19221) Add winutils binaries to Path in AppVeyor for Hadoop libraries to call native libraries properly

2017-01-13 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-19221:


 Summary: Add winutils binaries to Path in AppVeyor for Hadoop 
libraries to call native libraries properly
 Key: SPARK-19221
 URL: https://issues.apache.org/jira/browse/SPARK-19221
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, SparkR
Reporter: Hyukjin Kwon


It seems Hadoop libraries need {{hadoop.dll}} in the Path for native library calls. 
It is not a problem in tests for now, because we are only testing SparkR on 
Windows via AppVeyor, but it can become a problem if we run Scala tests via 
AppVeyor, as below:

{code}
 - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 
seconds, 937 milliseconds)
   org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. 
org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
   at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
   at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609)
   at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
   at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230)
   at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229)
   at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:272)
   at 
org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:609)
   at 
org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:599)
   at 
org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:159)
   at 
org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
   at 
org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
   at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:381)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
   at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
   at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 

[jira] [Updated] (SPARK-19178) convert string of large numbers to int should return null

2017-01-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19178:

Fix Version/s: 2.1.1
   2.0.3

> convert string of large numbers to int should return null
> -
>
> Key: SPARK-19178
> URL: https://issues.apache.org/jira/browse/SPARK-19178
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
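
For context, a minimal Scala illustration of the behaviour in the title (the session setup and the literal are assumed examples):

{code}
import org.apache.spark.sql.SparkSession

// Assumed example: casting a numeric string that overflows Int should yield
// NULL (per the title) rather than an incorrect wrapped value.
val spark = SparkSession.builder().appName("cast-example").getOrCreate()
spark.sql("SELECT CAST('123456789012345678901234567890' AS INT)").show()
{code}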




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-13 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822661#comment-15822661
 ] 

Shuai Lin commented on SPARK-19153:
---

I'm working on this ticket, thanks.

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>
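
For context, a Scala sketch of the call this sub-task says should work (the table, column, and session names are assumed examples):

{code}
import org.apache.spark.sql.SparkSession

// Assumed example: create a partitioned Hive-format table through the
// DataFrameWriter API rather than through a CREATE TABLE statement.
val spark = SparkSession.builder()
  .appName("hive-saveAsTable-example")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.range(10).selectExpr("id", "id % 4 AS part")

df.write
  .format("hive")              // Hive format, not a data source table
  .partitionBy("part")
  .saveAsTable("default.partitioned_example")
{code}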




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18667) input_file_name function does not work with UDF

2017-01-13 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822631#comment-15822631
 ] 

Liang-Chi Hsieh edited comment on SPARK-18667 at 1/14/17 2:00 AM:
--

[~someonehere15],

Yeah, I can reproduce that the last line {{df3.selectExpr('file as FILE','x AS 
COL1','sameText( y ) AS COL2').show()}} costs more time to run. At the first 
look, I think they are different problems. I will take a look too. Thank you.


was (Author: viirya):
[~someonehere15],

Yeah, I can reproduce that the last line {{df3.selectExpr('file as FILE','x AS 
COL1','sameText(y) AS COL2').show()}} costs more time to run. At the first 
look, I think they are different problems. I will take a look too. Thank you.

> input_file_name function does not work with UDF
> ---
>
> Key: SPARK-18667
> URL: https://issues.apache.org/jira/browse/SPARK-18667
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> {{input_file_name()}} does not return the file name but an empty string 
> when it is used as input for a UDF in PySpark, as below:
> with the data below:
> {code}
> {"a": 1}
> {code}
> with the code below:
> {code}
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def filename(path):
> return path
> sourceFile = udf(filename, StringType())
> spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
> {code}
> prints as below:
> {code}
> +---------------------------+
> |filename(input_file_name())|
> +---------------------------+
> |                           |
> +---------------------------+
> {code}
> but the code below:
> {code}
> spark.read.json("tmp.json").select(input_file_name()).show()
> {code}
> prints correctly as below:
> {code}
> +--------------------+
> |   input_file_name()|
> +--------------------+
> |file:///Users/hyu...|
> +--------------------+
> {code}
> This seems to be a PySpark-specific issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18667) input_file_name function does not work with UDF

2017-01-13 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822631#comment-15822631
 ] 

Liang-Chi Hsieh commented on SPARK-18667:
-

[~someonehere15],

Yeah, I can reproduce that the last line {{df3.selectExpr('file as FILE','x AS 
COL1','sameText(y) AS COL2').show()}} takes more time to run. At first glance, 
I think they are different problems. I will take a look too. Thank you.

> input_file_name function does not work with UDF
> ---
>
> Key: SPARK-18667
> URL: https://issues.apache.org/jira/browse/SPARK-18667
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> {{input_file_name()}} does not return the file name but an empty string when it 
> is used as input to a UDF in PySpark, as below: 
> with the data as below:
> {code}
> {"a": 1}
> {code}
> with the code below:
> {code}
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def filename(path):
> return path
> sourceFile = udf(filename, StringType())
> spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
> {code}
> prints as below:
> {code}
> +---------------------------+
> |filename(input_file_name())|
> +---------------------------+
> |                           |
> +---------------------------+
> {code}
> but the code below:
> {code}
> spark.read.json("tmp.json").select(input_file_name()).show()
> {code}
> prints correctly as below:
> {code}
> +--------------------+
> |   input_file_name()|
> +--------------------+
> |file:///Users/hyu...|
> +--------------------+
> {code}
> This seems to be a PySpark-specific issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19129) alter table table_name drop partition with a empty string will drop the whole table

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19129:


Assignee: Xiao Li  (was: Apache Spark)

> alter table table_name drop partition with a empty string will drop the whole 
> table
> ---
>
> Key: SPARK-19129
> URL: https://issues.apache.org/jira/browse/SPARK-19129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: lichenglin
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
>
> {code}
> val spark = SparkSession
>   .builder
>   .appName("PartitionDropTest")
>   .master("local[2]").enableHiveSupport()
>   .getOrCreate()
> val sentenceData = spark.createDataFrame(Seq(
>   (0, "a"),
>   (1, "b"),
>   (2, "c")))
>   .toDF("id", "name")
> spark.sql("drop table if exists licllocal.partition_table")
> 
> sentenceData.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("licllocal.partition_table")
> spark.sql("alter table licllocal.partition_table drop partition(id='')")
> spark.table("licllocal.partition_table").show()
> {code}
> the result is 
> {code}
> +----+---+
> |name| id|
> +----+---+
> +----+---+
> {code}
> Maybe the partition matching has something wrong when the partition value is 
> set to an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19129) alter table table_name drop partition with a empty string will drop the whole table

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822575#comment-15822575
 ] 

Apache Spark commented on SPARK-19129:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16583

> alter table table_name drop partition with a empty string will drop the whole 
> table
> ---
>
> Key: SPARK-19129
> URL: https://issues.apache.org/jira/browse/SPARK-19129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: lichenglin
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
>
> {code}
> val spark = SparkSession
>   .builder
>   .appName("PartitionDropTest")
>   .master("local[2]").enableHiveSupport()
>   .getOrCreate()
> val sentenceData = spark.createDataFrame(Seq(
>   (0, "a"),
>   (1, "b"),
>   (2, "c")))
>   .toDF("id", "name")
> spark.sql("drop table if exists licllocal.partition_table")
> 
> sentenceData.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("licllocal.partition_table")
> spark.sql("alter table licllocal.partition_table drop partition(id='')")
> spark.table("licllocal.partition_table").show()
> {code}
> the result is 
> {code}
> +----+---+
> |name| id|
> +----+---+
> +----+---+
> {code}
> Maybe the partition matching has something wrong when the partition value is 
> set to an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19129) alter table table_name drop partition with a empty string will drop the whole table

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19129:


Assignee: Apache Spark  (was: Xiao Li)

> alter table table_name drop partition with a empty string will drop the whole 
> table
> ---
>
> Key: SPARK-19129
> URL: https://issues.apache.org/jira/browse/SPARK-19129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: lichenglin
>Assignee: Apache Spark
>Priority: Critical
>  Labels: correctness
>
> {code}
> val spark = SparkSession
>   .builder
>   .appName("PartitionDropTest")
>   .master("local[2]").enableHiveSupport()
>   .getOrCreate()
> val sentenceData = spark.createDataFrame(Seq(
>   (0, "a"),
>   (1, "b"),
>   (2, "c")))
>   .toDF("id", "name")
> spark.sql("drop table if exists licllocal.partition_table")
> 
> sentenceData.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("licllocal.partition_table")
> spark.sql("alter table licllocal.partition_table drop partition(id='')")
> spark.table("licllocal.partition_table").show()
> {code}
> the result is 
> {code}
> +----+---+
> |name| id|
> +----+---+
> +----+---+
> {code}
> Maybe the partition matching has something wrong when the partition value is 
> set to an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11520) RegressionMetrics should support instance weights

2017-01-13 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822556#comment-15822556
 ] 

Ilya Matiach commented on SPARK-11520:
--

I've sent a pull request that covers both this JIRA and SPARK-18693 (and 
incorporates the earlier pull request, which was closed without being checked in):

https://github.com/apache/spark/pull/16557

> RegressionMetrics should support instance weights
> -
>
> Key: SPARK-11520
> URL: https://issues.apache.org/jira/browse/SPARK-11520
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This will be important to improve LinearRegressionSummary, which currently 
> has a mix of weighted and unweighted metrics.
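
To make the request concrete, here is a hand-rolled sketch of the kind of weighted 
metric this ticket asks for; the function name and the (prediction, label, weight) 
layout are assumptions for illustration, not the RegressionMetrics API:

{code}
// Sketch only: weighted MSE over (prediction, label, weight) triples, the kind
// of statistic RegressionMetrics would expose once instance weights are supported.
import org.apache.spark.rdd.RDD

def weightedMse(rows: RDD[(Double, Double, Double)]): Double = {
  val (weightedSqErr, weightSum) = rows
    .map { case (pred, label, w) => (w * math.pow(pred - label, 2), w) }
    .reduce { case ((e1, w1), (e2, w2)) => (e1 + e2, w1 + w2) }
  weightedSqErr / weightSum
}
{code}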



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-13 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822549#comment-15822549
 ] 

Ilya Matiach commented on SPARK-19208:
--

[~srowen] Isn't feature hashing to large bit sizes (e.g. HashingTF with 2^20 
features, followed by SVMs or linear learners) a common scenario for large text 
data? In addition, even with just a few columns but a ton of data, we waste a lot 
of performance computing the other metrics. This seems like a good change, but 
maybe it would be better to modify the summarizer's API so callers can request 
only a subset of statistics - that way you would both 1) avoid duplicating the 
code and 2) offer more flexibility to users. Otherwise I am fine with the code 
change, but it would be good to see more numbers on how the performance has 
improved, since this is a significant code change and we need to be convinced 
that it is a positive change (no perf regressions) for all types of data.

> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
> Attachments: WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} use 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} also computes extra, unused 
> statistics. This slows down the task and makes it more prone to OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.
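
A minimal sketch of the idea behind the change, assuming a plain treeAggregate 
over the feature vectors (this illustrates why only one array is needed for 
MaxAbsScaler; it is not the actual patch):

{code}
// Sketch only: MaxAbsScaler needs just the per-feature max absolute value, which
// a single array aggregation provides without the summarizer's other statistics.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

def maxAbsPerFeature(vectors: RDD[Vector], numFeatures: Int): Array[Double] = {
  vectors.treeAggregate(new Array[Double](numFeatures))(
    seqOp = (maxAbs, v) => {
      v.foreachActive { (i, value) =>
        maxAbs(i) = math.max(maxAbs(i), math.abs(value))
      }
      maxAbs
    },
    combOp = (a, b) => {
      var i = 0
      while (i < numFeatures) { a(i) = math.max(a(i), b(i)); i += 1 }
      a
    })
}
{code}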



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18821) Bisecting k-means wrapper in SparkR

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822530#comment-15822530
 ] 

Apache Spark commented on SPARK-18821:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/16566

> Bisecting k-means wrapper in SparkR
> ---
>
> Key: SPARK-18821
> URL: https://issues.apache.org/jira/browse/SPARK-18821
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Felix Cheung
>
> Implement a wrapper in SparkR to support bisecting k-means



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18821) Bisecting k-means wrapper in SparkR

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18821:


Assignee: (was: Apache Spark)

> Bisecting k-means wrapper in SparkR
> ---
>
> Key: SPARK-18821
> URL: https://issues.apache.org/jira/browse/SPARK-18821
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Felix Cheung
>
> Implement a wrapper in SparkR to support bisecting k-means



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18821) Bisecting k-means wrapper in SparkR

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18821:


Assignee: Apache Spark

> Bisecting k-means wrapper in SparkR
> ---
>
> Key: SPARK-18821
> URL: https://issues.apache.org/jira/browse/SPARK-18821
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Assignee: Apache Spark
>
> Implement a wrapper in SparkR to support bisecting k-means



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18739) Models in pyspark.classification and regression support setXXXCol methods

2017-01-13 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822495#comment-15822495
 ] 

Bryan Cutler commented on SPARK-18739:
--

What about other missing methods from models, like param getters?  If we are 
changing the class hierarchy like this, shouldn't it mirror the Scala classes?  
For example, {{LogisticRegressionModel}} inherits from 
{{ProbabilisticClassificationModel}} and {{LogisticRegressionParams}}.  
Otherwise, issues like SPARK-19216 will still exist.

> Models in pyspark.classification and regression support setXXXCol methods
> -
>
> Key: SPARK-18739
> URL: https://issues.apache.org/jira/browse/SPARK-18739
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>
> Now, models in PySpark don't support {{setXXXCol}} methods at all.
> I updated the models in {{classification.py}} according to the hierarchy on the 
> Scala side:
> 1, add {{setFeaturesCol}} and {{setPredictionCol}} in class 
> {{JavaPredictionModel}}
> 2, add {{setRawPredictionCol}} in class {{JavaClassificationModel}}
> 3, create class {{JavaProbabilisticClassificationModel}} inherit 
> {{JavaClassificationModel}}, and add {{setProbabilityCol}} in it
> 4, {{LogisticRegressionModel}}, {{DecisionTreeClassificationModel}}, 
> {{RandomForestClassificationModel}} and {{NaiveBayesModel}} inherit 
> {{JavaProbabilisticClassificationModel}}
> 5, {{GBTClassificationModel}} and {{MultilayerPerceptronClassificationModel}} 
> inherit {{JavaClassificationModel}}
> 6, {{OneVsRestModel}} inherits {{JavaModel}}; add {{setFeaturesCol}} and 
> {{setPredictionCol}} methods.
> With regard to models in clustering and features, I suggest that we first add 
> some abstract classes like {{ClusteringModel}}, 
> {{ProbabilisticClusteringModel}}, and {{FeatureModel}} on the Scala side; 
> otherwise we need to manually add setXXXCol methods one by one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19220) SSL redirect handler only redirects the server's root

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19220:


Assignee: Apache Spark

> SSL redirect handler only redirects the server's root
> -
>
> Key: SPARK-19220
> URL: https://issues.apache.org/jira/browse/SPARK-19220
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> The redirect handler that is started in the HTTP port when SSL is enabled 
> only redirects the root of the server. Additional handlers do not go through 
> the handler, so if you have a deep link to the non-https server, you won't be 
> redirected to the https port.
> I tested this with the history server, but it should be the same for the 
> normal UI; the fix should be the same for both too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19220) SSL redirect handler only redirects the server's root

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19220:


Assignee: (was: Apache Spark)

> SSL redirect handler only redirects the server's root
> -
>
> Key: SPARK-19220
> URL: https://issues.apache.org/jira/browse/SPARK-19220
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> The redirect handler that is started in the HTTP port when SSL is enabled 
> only redirects the root of the server. Additional handlers do not go through 
> the handler, so if you have a deep link to the non-https server, you won't be 
> redirected to the https port.
> I tested this with the history server, but it should be the same for the 
> normal UI; the fix should be the same for both too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19220) SSL redirect handler only redirects the server's root

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822478#comment-15822478
 ] 

Apache Spark commented on SPARK-19220:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16582

> SSL redirect handler only redirects the server's root
> -
>
> Key: SPARK-19220
> URL: https://issues.apache.org/jira/browse/SPARK-19220
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> The redirect handler that is started in the HTTP port when SSL is enabled 
> only redirects the root of the server. Additional handlers do not go through 
> the handler, so if you have a deep link to the non-https server, you won't be 
> redirected to the https port.
> I tested this with the history server, but it should be the same for the 
> normal UI; the fix should be the same for both too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19129) alter table table_name drop partition with a empty string will drop the whole table

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822445#comment-15822445
 ] 

Xiao Li commented on SPARK-19129:
-

This is actually a bug in Hive. Anyway, Spark can detect it and block it. 
Thanks for reporting it! 

> alter table table_name drop partition with a empty string will drop the whole 
> table
> ---
>
> Key: SPARK-19129
> URL: https://issues.apache.org/jira/browse/SPARK-19129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: lichenglin
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
>
> {code}
> val spark = SparkSession
>   .builder
>   .appName("PartitionDropTest")
>   .master("local[2]").enableHiveSupport()
>   .getOrCreate()
> val sentenceData = spark.createDataFrame(Seq(
>   (0, "a"),
>   (1, "b"),
>   (2, "c")))
>   .toDF("id", "name")
> spark.sql("drop table if exists licllocal.partition_table")
> 
> sentenceData.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("licllocal.partition_table")
> spark.sql("alter table licllocal.partition_table drop partition(id='')")
> spark.table("licllocal.partition_table").show()
> {code}
> the result is 
> {code}
> +----+---+
> |name| id|
> +----+---+
> +----+---+
> {code}
> Maybe the partition matching has something wrong when the partition value is 
> set to an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19213) FileSourceScanExec uses SparkSession from HadoopFsRelation creation time instead of the active session at execution time

2017-01-13 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-19213:
---
Summary: FileSourceScanExec uses SparkSession from HadoopFsRelation 
creation time instead of the active session at execution time  (was: 
FileSourceScanExec usese sparksession from hadoopfsrelation creation time 
instead of the one active at time of execution)

> FileSourceScanExec uses SparkSession from HadoopFsRelation creation time 
> instead of the active session at execution time
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
>  you'll notice that the sparksession used for execution is the one that was 
> captured from logicalplan. Whereas in other places you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
>  and SparkPlan captures active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the code, we should be using the SparkSession that is 
> currently active, hence take the one from the SparkPlan. However, if you want to 
> share Datasets across SparkSessions, that is not enough, since as soon as a 
> Dataset is executed its QueryExecution will have captured the SparkSession at 
> that point. If we want to share Datasets across users, we need to make the 
> configuration not fixed upon first execution. I consider the first part (using 
> the SparkSession from the logical plan) a bug, while the second (using the 
> SparkSession active at runtime) is an enhancement that makes sharing across 
> sessions easier.
> For example:
> {code}
> val df = spark.read.parquet(...)
> df.count()
> val newSession = spark.newSession()
> SparkSession.setActiveSession(newSession)
> //  (simplest one to try is disable 
> vectorized reads)
> val df2 = Dataset.ofRows(newSession, df.logicalPlan) // logical plan still 
> holds reference to original sparksession and changes don't take effect
> {code}
> I suggest that it shouldn't be necessary to create a new Dataset for changes to 
> take effect. For most plans, Dataset.ofRows works, but this is not the case for 
> HadoopFsRelation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18568) vertex attributes in the edge triplet not getting updated in super steps for Pregel API

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18568.
---
Resolution: Not A Problem

Yes, the result of updating a mutable object in an RDD is undefined. If that's 
the essence of this, then it's not a problem.

> vertex attributes in the edge triplet not getting updated in super steps for 
> Pregel API
> ---
>
> Key: SPARK-18568
> URL: https://issues.apache.org/jira/browse/SPARK-18568
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.2
>Reporter: Rohit
>
> When running the Pregel API with complex objects as vertex attributes, the 
> vertex attributes are not getting updated in the triplet view. For example, if 
> the attributes of vertex "a" change in the first superstep, the triplet src 
> attributes in the sendMsg program for that superstep get the latest attributes 
> of vertex "a"; but if the vertex attributes change again in vprog on the second 
> superstep, the edge triplets that have vertex "a" as src or destination are not 
> updated with this new state of the vertex. If I re-create the graph using g = 
> Graph(g.vertices, g.edges) in the while loop before the next superstep then it 
> gets updated, but this fix is not good performance-wise. A detailed description 
> of the bug along with the code to recreate it is in the attached URL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18568) vertex attributes in the edge triplet not getting updated in super steps for Pregel API

2017-01-13 Thread Andrew Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822397#comment-15822397
 ] 

Andrew Ray commented on SPARK-18568:


RDDs have the same problem for cached collections of mutable objects:
{code}
import scala.collection.mutable.{Map => MMap}
val rdd = sc.parallelize(MMap(1->1)::Nil)
rdd.cache()
val rdd2 = rdd.map(_ += 2 ->2)

scala> rdd2.collect()
res21: Array[scala.collection.mutable.Map[Int,Int]] = Array(Map(2 -> 2, 1 -> 1))
scala> rdd.collect()
res22: Array[scala.collection.mutable.Map[Int,Int]] = Array(Map(2 -> 2, 1 -> 1))
{code}

So I think the moral is to use immutable objects.
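
For comparison, the same experiment with an immutable map (a small sketch, 
assuming the same spark-shell session): the map is copied rather than mutated, so 
the cached RDD keeps its original contents.

{code}
// Immutable maps: `+` returns a new Map, so the cached rdd is unaffected.
val rdd = sc.parallelize(Seq(Map(1 -> 1)))
rdd.cache()
val rdd2 = rdd.map(_ + (2 -> 2))

// rdd2.collect() => Array(Map(1 -> 1, 2 -> 2))
// rdd.collect()  => Array(Map(1 -> 1))
{code}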

> vertex attributes in the edge triplet not getting updated in super steps for 
> Pregel API
> ---
>
> Key: SPARK-18568
> URL: https://issues.apache.org/jira/browse/SPARK-18568
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.2
>Reporter: Rohit
>
> When running the Pregel API with complex objects as vertex attributes, the 
> vertex attributes are not getting updated in the triplet view. For example, if 
> the attributes of vertex "a" change in the first superstep, the triplet src 
> attributes in the sendMsg program for that superstep get the latest attributes 
> of vertex "a"; but if the vertex attributes change again in vprog on the second 
> superstep, the edge triplets that have vertex "a" as src or destination are not 
> updated with this new state of the vertex. If I re-create the graph using g = 
> Graph(g.vertices, g.edges) in the while loop before the next superstep then it 
> gets updated, but this fix is not good performance-wise. A detailed description 
> of the bug along with the code to recreate it is in the attached URL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts

2017-01-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-19180.

   Resolution: Fixed
Fix Version/s: 2.0.3
   2.1.1

Issue resolved by pull request 16555
[https://github.com/apache/spark/pull/16555]

> the offset of short is 4 in OffHeapColumnVector's putShorts
> ---
>
> Key: SPARK-19180
> URL: https://issues.apache.org/jira/browse/SPARK-19180
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: yucai
> Fix For: 2.1.1, 2.0.3, 2.2.0
>
>
> The offset of short is 4 in OffHeapColumnVector's putShorts; it should actually 
> be 2, since a short is 2 bytes.
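
A small illustration of why the element size matters (this is not the 
OffHeapColumnVector code, just a sketch using Platform directly):

{code}
// Shorts are 2 bytes, so element i of a short[] lives at
// SHORT_ARRAY_OFFSET + 2 * i; a stride of 4 reads the wrong elements and, for
// later indices, past the end of the array.
import org.apache.spark.unsafe.Platform

val src = Array[Short](10, 20, 30)
val ok  = Platform.getShort(src, Platform.SHORT_ARRAY_OFFSET + 2 * 1)  // 20
val bad = Platform.getShort(src, Platform.SHORT_ARRAY_OFFSET + 4 * 1)  // 30, wrong element
{code}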



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18682) Batch Source for Kafka

2017-01-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18682:
-
Assignee: Tyson Condie

> Batch Source for Kafka
> --
>
> Key: SPARK-18682
> URL: https://issues.apache.org/jira/browse/SPARK-18682
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
>
> Today, you can start a stream that reads from Kafka. However, given Kafka's 
> configurable retention period, it seems like sometimes you might just want to 
> read all of the data that is available now. As such, we should add a version 
> that works with {{spark.read}} as well.
> The options should be the same as for the streaming Kafka source, with the 
> following differences:
>  - {{startingOffsets}} should default to earliest, and should not allow 
> {{latest}} (which would always be empty).
>  - {{endingOffsets}} should also be allowed and should default to {{latest}}. 
> The same assign JSON format as for {{startingOffsets}} should also be accepted.
> It would be really good if things like {{.limit\(n\)}} were enough to 
> prevent all the data from being read (this might just work).
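
Sketched usage of the proposed batch read, mirroring the streaming source's 
options (the broker address and topic name are placeholders):

{code}
// Illustrative sketch of the proposed spark.read API for Kafka.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")  // placeholder broker
  .option("subscribe", "events")                    // placeholder topic
  .option("startingOffsets", "earliest")            // proposed default
  .option("endingOffsets", "latest")                // proposed default
  .load()
{code}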



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19113:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source 
> should be sent to the user
> -
>
> Key: SPARK-19113
> URL: https://issues.apache.org/jira/browse/SPARK-19113
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19220) SSL redirect handler only redirects the server's root

2017-01-13 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-19220:
--

 Summary: SSL redirect handler only redirects the server's root
 Key: SPARK-19220
 URL: https://issues.apache.org/jira/browse/SPARK-19220
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.0.0
Reporter: Marcelo Vanzin


The redirect handler that is started in the HTTP port when SSL is enabled only 
redirects the root of the server. Additional handlers do not go through the 
handler, so if you have a deep link to the non-https server, you won't be 
redirected to the https port.

I tested this with the history server, but it should be the same for the normal 
UI; the fix should be the same for both too.
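
For illustration, a minimal Jetty-style handler that redirects any requested path 
(not just the root) to the HTTPS port. This is a sketch of the general fix using 
the plain Jetty API, not Spark's actual JettyUtils code; the class name and the 
port handling are simplified assumptions.

{code}
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}
import org.eclipse.jetty.server.Request
import org.eclipse.jetty.server.handler.AbstractHandler

// Sketch only: preserve the full request path and query string when redirecting
// to the secure port, instead of redirecting only "/".
class HttpsRedirectHandler(securePort: Int) extends AbstractHandler {
  override def handle(
      target: String,
      baseRequest: Request,
      request: HttpServletRequest,
      response: HttpServletResponse): Unit = {
    val query = Option(request.getQueryString).map("?" + _).getOrElse("")
    val url = s"https://${request.getServerName}:$securePort${request.getRequestURI}$query"
    response.sendRedirect(url)
    baseRequest.setHandled(true)
  }
}
{code}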



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19129) alter table table_name drop partition with a empty string will drop the whole table

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-19129:
---

Assignee: Xiao Li

> alter table table_name drop partition with a empty string will drop the whole 
> table
> ---
>
> Key: SPARK-19129
> URL: https://issues.apache.org/jira/browse/SPARK-19129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: lichenglin
>Assignee: Xiao Li
>  Labels: correctness
>
> {code}
> val spark = SparkSession
>   .builder
>   .appName("PartitionDropTest")
>   .master("local[2]").enableHiveSupport()
>   .getOrCreate()
> val sentenceData = spark.createDataFrame(Seq(
>   (0, "a"),
>   (1, "b"),
>   (2, "c")))
>   .toDF("id", "name")
> spark.sql("drop table if exists licllocal.partition_table")
> 
> sentenceData.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("licllocal.partition_table")
> spark.sql("alter table licllocal.partition_table drop partition(id='')")
> spark.table("licllocal.partition_table").show()
> {code}
> the result is 
> {code}
> +----+---+
> |name| id|
> +----+---+
> +----+---+
> {code}
> Maybe the partition matching has something wrong when the partition value is 
> set to an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19129) alter table table_name drop partition with a empty string will drop the whole table

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822330#comment-15822330
 ] 

Xiao Li commented on SPARK-19129:
-

This is a bug we need to fix. Let me try it. Thanks!

> alter table table_name drop partition with a empty string will drop the whole 
> table
> ---
>
> Key: SPARK-19129
> URL: https://issues.apache.org/jira/browse/SPARK-19129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: lichenglin
>  Labels: correctness
>
> {code}
> val spark = SparkSession
>   .builder
>   .appName("PartitionDropTest")
>   .master("local[2]").enableHiveSupport()
>   .getOrCreate()
> val sentenceData = spark.createDataFrame(Seq(
>   (0, "a"),
>   (1, "b"),
>   (2, "c")))
>   .toDF("id", "name")
> spark.sql("drop table if exists licllocal.partition_table")
> 
> sentenceData.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("licllocal.partition_table")
> spark.sql("alter table licllocal.partition_table drop partition(id='')")
> spark.table("licllocal.partition_table").show()
> {code}
> the result is 
> {code}
> +----+---+
> |name| id|
> +----+---+
> +----+---+
> {code}
> Maybe the partition matching has something wrong when the partition value is 
> set to an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19129) alter table table_name drop partition with a empty string will drop the whole table

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19129:

Priority: Critical  (was: Major)

> alter table table_name drop partition with a empty string will drop the whole 
> table
> ---
>
> Key: SPARK-19129
> URL: https://issues.apache.org/jira/browse/SPARK-19129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: lichenglin
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
>
> {code}
> val spark = SparkSession
>   .builder
>   .appName("PartitionDropTest")
>   .master("local[2]").enableHiveSupport()
>   .getOrCreate()
> val sentenceData = spark.createDataFrame(Seq(
>   (0, "a"),
>   (1, "b"),
>   (2, "c")))
>   .toDF("id", "name")
> spark.sql("drop table if exists licllocal.partition_table")
> 
> sentenceData.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("licllocal.partition_table")
> spark.sql("alter table licllocal.partition_table drop partition(id='')")
> spark.table("licllocal.partition_table").show()
> {code}
> the result is 
> {code}
> +----+---+
> |name| id|
> +----+---+
> +----+---+
> {code}
> Maybe the partition matching has something wrong when the partition value is 
> set to an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19129) alter table table_name drop partition with a empty string will drop the whole table

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19129:

Labels: correctness  (was: )

> alter table table_name drop partition with a empty string will drop the whole 
> table
> ---
>
> Key: SPARK-19129
> URL: https://issues.apache.org/jira/browse/SPARK-19129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: lichenglin
>  Labels: correctness
>
> {code}
> val spark = SparkSession
>   .builder
>   .appName("PartitionDropTest")
>   .master("local[2]").enableHiveSupport()
>   .getOrCreate()
> val sentenceData = spark.createDataFrame(Seq(
>   (0, "a"),
>   (1, "b"),
>   (2, "c")))
>   .toDF("id", "name")
> spark.sql("drop table if exists licllocal.partition_table")
> 
> sentenceData.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("licllocal.partition_table")
> spark.sql("alter table licllocal.partition_table drop partition(id='')")
> spark.table("licllocal.partition_table").show()
> {code}
> the result is 
> {code}
> +----+---+
> |name| id|
> +----+---+
> +----+---+
> {code}
> Maybe the partition matching has something wrong when the partition value is 
> set to an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18589:


Assignee: Davies Liu  (was: Apache Spark)

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>Priority: Critical
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> 

[jira] [Assigned] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18589:


Assignee: Apache Spark  (was: Davies Liu)

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Apache Spark
>Priority: Critical
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> 

[jira] [Commented] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822278#comment-15822278
 ] 

Apache Spark commented on SPARK-18589:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/16581

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>Priority: Critical
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at 

[jira] [Closed] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2017-01-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-18475.

Resolution: Won't Fix

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer" 
> for what it is designed for (being cached) in this use case, but the extra 
> overhead is worth it for handling data skew and increasing parallelism, 
> especially in ETL use cases.
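
To make the idea concrete, a small sketch of splitting one TopicPartition's 
offset range across several tasks (pure offset arithmetic, not the connector's 
code; the function name is made up):

{code}
// Sketch only: split a (from, until) offset range into n roughly equal
// sub-ranges, so that several Spark tasks could read a single TopicPartition.
def splitRange(from: Long, until: Long, n: Int): Seq[(Long, Long)] = {
  val size = until - from
  (0 until n).map { i =>
    val start = from + i * size / n
    val end   = from + (i + 1) * size / n
    (start, end)
  }.filter { case (s, e) => e > s }
}

// splitRange(0L, 10L, 3) => Vector((0,3), (3,6), (6,10))
{code}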



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing

2017-01-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18970:
-
Description: 
A Spark streaming application uses S3 files as streaming sources. After running 
for several days, processing stopped even though the application continued to run. 
Stack trace:
{code}
java.io.FileNotFoundException: No such file or directory 
's3n://X'
at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818)
at 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511)
at 
org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
at 
org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
I believe 2 things should (or can) be fixed:
1. The application should fail in case of such an error.
2. Allow the application to ignore such a failure, since there is a chance that 
the error will not resurface during the next refresh. (In my case I believe the 
error was caused by S3 cleaning the bucket at exactly the moment the refresh 
was running.)

My code that creates the streaming query looks like the following:
{code}
  val cq = sqlContext.readStream
.format("json")
.schema(struct)
.load(s"input")
.writeStream
.option("checkpointLocation", s"checkpoints")
.foreach(new ForeachWriter[Row] {...})
.trigger(ProcessingTime("10 seconds")).start()

  cq.awaitTermination() 
{code}
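
A hedged sketch, not part of the report, of how point 2 could be approximated on 
the user side today: let the query die, then restart it from the same checkpoint 
when the failure looks like a transient listing error. buildQuery() is a 
hypothetical helper that rebuilds the writeStream query shown above.
{code}
import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryException}

// Walk the cause chain to see whether the query died on a missing file.
def causedByMissingFile(t: Throwable): Boolean = t match {
  case null => false
  case _: java.io.FileNotFoundException => true
  case other => causedByMissingFile(other.getCause)
}

var stopped = false
while (!stopped) {
  val cq: StreamingQuery = buildQuery()  // hypothetical: recreates the query above
  try {
    cq.awaitTermination()
    stopped = true                       // the query stopped cleanly
  } catch {
    case e: StreamingQueryException if causedByMissingFile(e) =>
      // Transient S3 listing failure: restart from the checkpoint and retry.
      println(s"Restarting query after transient failure: ${e.getMessage}")
    case other: Throwable =>
      throw other                        // anything else should still fail the app
  }
}
{code}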

  was:
A Spark streaming application uses S3 files as streaming sources. After running 
for several days, processing stopped even though the application continued to 
run. 
Stack trace:
java.io.FileNotFoundException: No such file or directory 
's3n://X'
at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818)
at 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511)
at 
org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
at 
org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
 

[jira] [Updated] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2017-01-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-18589:
---
Priority: Critical  (was: Minor)

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>Priority: Critical
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> 

[jira] [Assigned] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2017-01-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-18589:
--

Assignee: Davies Liu

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>Priority: Minor
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:93)

[jira] [Commented] (SPARK-17993) Spark prints an avalanche of warning messages from Parquet when reading parquet files written by older versions of Parquet-mr

2017-01-13 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822248#comment-15822248
 ] 

Michael Allman commented on SPARK-17993:


[~emre.colak] FYI https://github.com/apache/spark/pull/16580

> Spark prints an avalanche of warning messages from Parquet when reading 
> parquet files written by older versions of Parquet-mr
> -
>
> Key: SPARK-17993
> URL: https://issues.apache.org/jira/browse/SPARK-17993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Allman
>Assignee: Michael Allman
> Fix For: 2.1.0
>
>
> It looks like https://github.com/apache/spark/pull/14690 broke parquet log 
> output redirection. After that patch, when querying parquet files written by 
> Parquet-mr 1.6.0 Spark prints a torrent of (harmless) warning messages from 
> the Parquet reader:
> {code}
> Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: 
> Ignoring statistics because created_by could not be parsed (see PARQUET-251): 
> parquet-mr version 1.6.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
>   at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
>   at 
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
>   at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This only happens during execution, not planning, and it doesn't matter what 
> log level the {{SparkContext}} is set to.
> This is a regression I noted as something we needed to fix as a follow-up to 
> PR 14690. I feel responsible, so I'm going to expedite a fix for it. I 
> suspect that PR broke Spark's Parquet log output redirection; that's the 
> premise I'm going by.
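
The warnings above come through java.util.logging, which is why the SparkContext 
log level has no effect on them. Until the fix lands, one heavily hedged user-side 
mitigation (an assumption on my part, not the approach taken in the PR) is to raise 
the JUL level for the Parquet loggers and keep strong references to them so the 
setting is not lost to garbage collection:
{code}
import java.util.logging.{Level, Logger}

object ParquetJulSilencer {
  // Hold references so java.util.logging does not GC these loggers.
  private val parquetLoggers = Seq(
    Logger.getLogger("org.apache.parquet"),
    Logger.getLogger("parquet")
  )
  parquetLoggers.foreach(_.setLevel(Level.SEVERE))

  def init(): Unit = ()  // referencing the object runs the initializer once per JVM
}
{code}
Calling ParquetJulSilencer.init() early in each executor JVM (for example from a 
mapPartitions, or at application start in local mode) is the hypothetical usage; 
whether it helps depends on how parquet-mr routes its logging in your build.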



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-19213) FileSourceScanExec uses sparksession from hadoopfsrelation creation time instead of the one active at time of execution

2017-01-13 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-19213:
--
Description: 
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code it looks like we should be using the 
SparkSession that is currently active, hence take the one from the SparkPlan. 
However, if you want to share Datasets across SparkSessions that is not enough, 
since as soon as the Dataset is executed the QueryExecution will have captured 
the SparkSession at that point. If we want to share Datasets across users we 
need to make configurations not fixed upon first execution. I consider the 
first part (using the SparkSession from the logical plan) a bug, while the 
second (using the SparkSession active at runtime) an enhancement, so that 
sharing across sessions is made easier.

For example:
{code}
val df = spark.read.parquet(...)
df.count()
val newSession = spark.newSession()
SparkSession.setActiveSession(newSession)
//  (simplest one to try is disable vectorized reads)
val df2 = Dataset.ofRows(newSession, df.logicalPlan) // logical plan still
// holds reference to original sparksession and changes don't take effect
{code}
I suggest that it shouldn't be necessary to create a new Dataset for changes 
to take effect. Doing Dataset.ofRows works for most plans, but this is not the 
case for HadoopFsRelation.

  was:
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code it looks like we should be using the 
sparksession that is currently active hence take the one from spark plan. 
However, in case you want share Datasets across SparkSessions that is not 
enough since as soon as dataset is executed the queryexecution will have 
capture spark session at that point. If we want to share datasets across users 
we need to make configurations not fixed upon first execution. I consider 1st 
part (using sparksession from logical plan) a bug while the second (using 
sparksession active at runtime) an enhancement so that sharing across sessions 
is made easier.

For example:

val df = spark.read.parquet(...)
df.count()
val newSession = spark.newSession()
SparkSession.setActiveSession(newSession)
 (simplest one to try is disable vectorized 
reads)
val df2 = Dataset.ofRows(newSession, df.logicalPlan) <- logical plan still 
holds reference to original sparksession and changes don't take effect

I suggest that it shouldn't be necessary to create a new dataset for changes to 
take effect. For most of the plans doing Dataset.ofRows work but this is not 
the case for hadoopfsrelation.


> FileSourceScanExec uses sparksession from hadoopfsrelation creation time 
> instead of the one active at time of execution
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
>  you'll notice that the sparksession used for execution is the one that was 
> captured from logicalplan. Whereas in other places you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
>  and SparkPlan captures active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the code it looks like we should be using the 
> sparksession that is currently active hence 

[jira] [Updated] (SPARK-19213) FileSourceScanExec uses sparksession from hadoopfsrelation creation time instead of the one active at time of execution

2017-01-13 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-19213:
--
Description: 
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code it looks like we should be using the 
SparkSession that is currently active, hence take the one from the SparkPlan. 
However, if you want to share Datasets across SparkSessions that is not enough, 
since as soon as the Dataset is executed the QueryExecution will have captured 
the SparkSession at that point. If we want to share Datasets across users we 
need to make configurations not fixed upon first execution. I consider the 
first part (using the SparkSession from the logical plan) a bug, while the 
second (using the SparkSession active at runtime) an enhancement, so that 
sharing across sessions is made easier.

For example:
{code}
val df = spark.read.parquet(...)
df.count()
val newSession = spark.newSession()
SparkSession.setActiveSession(newSession)
//  (simplest one to try is disable vectorized reads)
val df2 = Dataset.ofRows(newSession, df.logicalPlan) // logical plan still
// holds reference to original sparksession and changes don't take effect
{code}
I suggest that it shouldn't be necessary to create a new Dataset for changes 
to take effect. Doing Dataset.ofRows works for most plans, but this is not the 
case for HadoopFsRelation.

  was:
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code it looks like we should be using the 
sparksession that is currently active hence take the one from spark plan. 
However, in case you want share Datasets across SparkSessions that is not 
enough since as soon as dataset is executed the queryexecution will have 
capture spark session at that point. If we want to share datasets across users 
we need to make configurations not fixed upon first execution. I consider 1st 
part (using sparksession from logical plan) a bug while the second (using 
sparksession active at runtime) an enhancement so that sharing across sessions 
is made easier.

For example:
{code}
val df = spark.read.parquet(...)
df.count()
val newSession = spark.newSession()
SparkSession.setActiveSession(newSession)
 (simplest one to try is disable vectorized 
reads)
val df2 = Dataset.ofRows(newSession, df.logicalPlan) <- logical plan still 
holds reference to original sparksession and changes don't take effect
{code}
I suggest that it shouldn't be necessary to create a new dataset for changes to 
take effect. For most of the plans doing Dataset.ofRows work but this is not 
the case for hadoopfsrelation.


> FileSourceScanExec uses sparksession from hadoopfsrelation creation time 
> instead of the one active at time of execution
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
>  you'll notice that the sparksession used for execution is the one that was 
> captured from logicalplan. Whereas in other places you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
>  and SparkPlan captures active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the code it looks like we should be using the 
> sparksession that is currently 

[jira] [Updated] (SPARK-19213) FileSourceScanExec uses sparksession from hadoopfsrelation creation time instead of the one active at time of execution

2017-01-13 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-19213:
--
Description: 
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code it looks like we should be using the 
SparkSession that is currently active, hence take the one from the SparkPlan. 
However, if you want to share Datasets across SparkSessions that is not enough, 
since as soon as the Dataset is executed the QueryExecution will have captured 
the SparkSession at that point. If we want to share Datasets across users we 
need to make configurations not fixed upon first execution. I consider the 
first part (using the SparkSession from the logical plan) a bug, while the 
second (using the SparkSession active at runtime) an enhancement, so that 
sharing across sessions is made easier.

For example:

val df = spark.read.parquet(...)
df.count()
val newSession = spark.newSession()
SparkSession.setActiveSession(newSession)
 (simplest one to try is disable vectorized 
reads)
val df2 = Dataset.ofRows(newSession, df.logicalPlan) <- logical plan still 
holds reference to original sparksession and changes don't take effect

I suggest that it shouldn't be necessary to create a new Dataset for changes 
to take effect. Doing Dataset.ofRows works for most plans, but this is not the 
case for HadoopFsRelation.

  was:
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code it looks like we should be using the 
sparksession that is currently active hence take the one from spark plan. 
However, in case you want share Datasets across SparkSessions that is not 
enough since as soon as dataset is executed the queryexecution will have 
capture spark session at that point. If we want to share datasets across users 
we need to make configurations not fixed upon first execution. I consider 1st 
part (using sparksession from logical plan) a bug while the second (using 
sparksession active at runtime) an enhancement so that sharing across sessions 
is made easier.


> FileSourceScanExec uses sparksession from hadoopfsrelation creation time 
> instead of the one active at time of execution
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
>  you'll notice that the sparksession used for execution is the one that was 
> captured from logicalplan. Whereas in other places you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
>  and SparkPlan captures active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the code it looks like we should be using the 
> SparkSession that is currently active, hence take the one from the SparkPlan. 
> However, if you want to share Datasets across SparkSessions that is not 
> enough, since as soon as the Dataset is executed the QueryExecution will have 
> captured the SparkSession at that point. If we want to share Datasets across 
> users we need to make configurations not fixed upon first execution. I 
> consider the first part (using the SparkSession from the logical plan) a bug, 
> while the second (using the SparkSession active at runtime) an enhancement, 
> so that sharing across sessions is made easier.
> For example:
> 

[jira] [Commented] (SPARK-19219) Parquet log output overly verbose by default

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822238#comment-15822238
 ] 

Apache Spark commented on SPARK-19219:
--

User 'nicklavers' has created a pull request for this issue:
https://github.com/apache/spark/pull/16580

> Parquet log output overly verbose by default
> 
>
> Key: SPARK-19219
> URL: https://issues.apache.org/jira/browse/SPARK-19219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas
>  Labels: easyfix
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> PR #15538 addressed the problematically verbose logging when reading from 
> older parquet files, but did not change the default logging properties in 
> order to incorporate that fix into the default behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19219) Parquet log output overly verbose by default

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19219:


Assignee: Apache Spark

> Parquet log output overly verbose by default
> 
>
> Key: SPARK-19219
> URL: https://issues.apache.org/jira/browse/SPARK-19219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas
>Assignee: Apache Spark
>  Labels: easyfix
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> PR #15538 addressed the problematically verbose logging when reading from 
> older parquet files, but did not change the default logging properties in 
> order to incorporate that fix into the default behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19219) Parquet log output overly verbose by default

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19219:


Assignee: (was: Apache Spark)

> Parquet log output overly verbose by default
> 
>
> Key: SPARK-19219
> URL: https://issues.apache.org/jira/browse/SPARK-19219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas
>  Labels: easyfix
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> PR #15538 addressed the problematically verbose logging when reading from 
> older parquet files, but did not change the default logging properties in 
> order to incorporate that fix into the default behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1585#comment-1585
 ] 

Sean Owen commented on SPARK-19217:
---

It makes some sense to me, as I also find I write a UDF to do this just about 
every time.
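
For reference, a hedged sketch of the kind of UDF being described (Scala, Spark 2.x 
ml vectors; the DataFrame {{df}} and the column name are placeholders):
{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Turn an ml Vector column into a plain array column so the DataFrame can be
// written to formats that have no vector type.
val vectorToArray = udf((v: Vector) => v.toArray)

val withArrayCol = df.withColumn("features", vectorToArray(col("features")))
{code}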

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage without converting the vector columns 
> to array columns, and there doesn't appear to be an easy way to make that 
> conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1584#comment-1584
 ] 

Xiao Li commented on SPARK-19209:
-

This could be caused by the classLoader issue. Anyway, let me first move the 
driverClass initialization back to createConnectionFactory. Thanks! 

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19213) FileSourceScanExec uses sparksession from hadoopfsrelation creation time instead of the one active at time of execution

2017-01-13 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-19213:
--
Description: 
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code it looks like we should be using the 
SparkSession that is currently active, hence take the one from the SparkPlan. 
However, if you want to share Datasets across SparkSessions that is not enough, 
since as soon as the Dataset is executed the QueryExecution will have captured 
the SparkSession at that point. If we want to share Datasets across users we 
need to make configurations not fixed upon first execution. I consider the 
first part (using the SparkSession from the logical plan) a bug, while the 
second (using the SparkSession active at runtime) an enhancement, so that 
sharing across sessions is made easier.

  was:
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the io code it would be beneficial to be able to use 
the active session in order to be able to modify hadoop config without 
recreating the dataset. What would be interesting is to not lock the spark 
session in the physical plan for ios and let you share datasets across spark 
sessions. Is that supposed to work? Otherwise you'd have to get a new query 
execution to bind to new sparksession which would only let you share logical 
plans. 


> FileSourceScanExec uses sparksession from hadoopfsrelation creation time 
> instead of the one active at time of execution
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
>  you'll notice that the sparksession used for execution is the one that was 
> captured from logicalplan. Whereas in other places you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
>  and SparkPlan captures active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the code it looks like we should be using the 
> SparkSession that is currently active, hence take the one from the SparkPlan. 
> However, if you want to share Datasets across SparkSessions that is not 
> enough, since as soon as the Dataset is executed the QueryExecution will have 
> captured the SparkSession at that point. If we want to share Datasets across 
> users we need to make configurations not fixed upon first execution. I 
> consider the first part (using the SparkSession from the logical plan) a bug, 
> while the second (using the SparkSession active at runtime) an enhancement, 
> so that sharing across sessions is made easier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19219) Parquet log output overly verbose by default

2017-01-13 Thread Nicholas (JIRA)
Nicholas created SPARK-19219:


 Summary: Parquet log output overly verbose by default
 Key: SPARK-19219
 URL: https://issues.apache.org/jira/browse/SPARK-19219
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.1.0
Reporter: Nicholas


PR #15538 addressed the problematically verbose logging when reading from older 
parquet files, but did not change the default logging properties in order to 
incorporate that fix into the default behavior.
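
Until the defaults change, a hedged sketch of the kind of entries a user can add to 
their own conf/log4j.properties to quiet these loggers (the logger names are assumed 
from the package names seen in the warnings):
{code}
# Quiet the chatty Parquet loggers (old and new package names).
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
{code}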



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822201#comment-15822201
 ] 

Xiao Li commented on SPARK-19209:
-

I am trying to find a workaround for your case. Could you add an extra option 
{{.option("driver", "com.mysql.jdbc.Driver")}} to your code? 

Note, I do not have your driver class name. Could you replace it with your 
class name in the option?
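
For the sqlite setup in this report, a hedged sketch of that workaround (assuming 
org.sqlite.JDBC is the driver class shipped in the xerial sqlite-jdbc jar):
{code}
// Pin the driver class explicitly so DriverManager does not need to locate it
// on the first call.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlite:")
  .option("dbtable", "x")
  .option("driver", "org.sqlite.JDBC")
  .load()
{code}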


> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822186#comment-15822186
 ] 

Xiao Li edited comment on SPARK-19209 at 1/13/17 7:10 PM:
--

Do you also hit the same exception {{java.sql.SQLException: No suitable 
driver}} when the table exists?


was (Author: smilegator):
Did you also hit the same exception `java.sql.SQLException: No suitable driver` 
when the table exists?

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822186#comment-15822186
 ] 

Xiao Li commented on SPARK-19209:
-

Did you also hit the same exception `java.sql.SQLException: No suitable driver` 
when the table exists?

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19218) SET command should show a sorted result

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822181#comment-15822181
 ] 

Apache Spark commented on SPARK-19218:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16579

> SET command should show a sorted result
> ---
>
> Key: SPARK-19218
> URL: https://issues.apache.org/jira/browse/SPARK-19218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Currently, the `SET` command shows an unsorted result. We had better show a 
> sorted result for better UX; this also matches Hive's behavior.
> *BEFORE*
> {code}
> scala> sql("set").show(false)
> +---+-+
> |key|value
>   
>   |
> +---+-+
> |spark.driver.host  |10.22.16.140 
>   
>   |
> |spark.driver.port  |63893
>   
>   |
> |hive.metastore.warehouse.dir   |file:/Users/dhyun/spark/spark-warehouse  
>   
>   |
> |spark.repl.class.uri   |spark://10.22.16.140:63893/classes   
>   
>   |
> |spark.jars | 
>   
>   |
> |spark.repl.class.outputDir 
> |/private/var/folders/bl/67vhzgqs1ks88l92h8dy8_1rgp/T/spark-43da424e-7530-4053-b30e-4068e8424dc9/repl-f1c957c7-2e4a-4f14-b234-f7b9f2447971|
> |spark.app.name |Spark shell  
>   
>   |
> |spark.driver.memory|4G   
>   
>   |
> |spark.executor.id  |driver   
>   
>   |
> |spark.submit.deployMode|client   
>   
>   |
> |spark.master   |local[*] 
>   
>   |
> |spark.home |/Users/dhyun/spark   
>   
>   |
> |spark.sql.catalogImplementation|hive 
>   
>   |
> |spark.app.id   |local-1484333618945  
>   
>   |
> +---+-+
> {code}
> *AFTER*
> {code}
> scala> sql("set").show(false)
> +---+-+
> |key|value
>   
>   |
> +---+-+
> |hive.metastore.warehouse.dir   
> |file:/Users/dhyun/SPARK-SORTED-SET/spark-warehouse   
> |
> |spark.app.id   

[jira] [Assigned] (SPARK-19218) SET command should show a sorted result

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19218:


Assignee: (was: Apache Spark)

> SET command should show a sorted result
> ---
>
> Key: SPARK-19218
> URL: https://issues.apache.org/jira/browse/SPARK-19218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Currently, the `SET` command shows an unsorted result. We had better show a 
> sorted result for better UX; this also matches Hive's behavior.
> *BEFORE*
> {code}
> scala> sql("set").show(false)
> +-------------------------------+-----------------------------------------+
> |key                            |value                                    |
> +-------------------------------+-----------------------------------------+
> |spark.driver.host              |10.22.16.140                             |
> |spark.driver.port              |63893                                    |
> |hive.metastore.warehouse.dir   |file:/Users/dhyun/spark/spark-warehouse  |
> |spark.repl.class.uri           |spark://10.22.16.140:63893/classes       |
> |spark.jars                     |                                         |
> |spark.repl.class.outputDir     |/private/var/folders/bl/67vhzgqs1ks88l92h8dy8_1rgp/T/spark-43da424e-7530-4053-b30e-4068e8424dc9/repl-f1c957c7-2e4a-4f14-b234-f7b9f2447971|
> |spark.app.name                 |Spark shell                              |
> |spark.driver.memory            |4G                                       |
> |spark.executor.id              |driver                                   |
> |spark.submit.deployMode        |client                                   |
> |spark.master                   |local[*]                                 |
> |spark.home                     |/Users/dhyun/spark                       |
> |spark.sql.catalogImplementation|hive                                     |
> |spark.app.id                   |local-1484333618945                      |
> +-------------------------------+-----------------------------------------+
> {code}
> *AFTER*
> {code}
> scala> sql("set").show(false)
> +-------------------------------+---------------------------------------------------+
> |key                            |value                                              |
> +-------------------------------+---------------------------------------------------+
> |hive.metastore.warehouse.dir   |file:/Users/dhyun/SPARK-SORTED-SET/spark-warehouse |
> |spark.app.id                   |local-1484333925649                                |

[jira] [Assigned] (SPARK-19218) SET command should show a sorted result

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19218:


Assignee: Apache Spark

> SET command should show a sorted result
> ---
>
> Key: SPARK-19218
> URL: https://issues.apache.org/jira/browse/SPARK-19218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently, the `SET` command shows an unsorted result. We should show a sorted 
> result for better UX. This also matches Hive's behavior.
> *BEFORE*
> {code}
> scala> sql("set").show(false)
> +-------------------------------+-----------------------------------------+
> |key                            |value                                    |
> +-------------------------------+-----------------------------------------+
> |spark.driver.host              |10.22.16.140                             |
> |spark.driver.port              |63893                                    |
> |hive.metastore.warehouse.dir   |file:/Users/dhyun/spark/spark-warehouse  |
> |spark.repl.class.uri           |spark://10.22.16.140:63893/classes       |
> |spark.jars                     |                                         |
> |spark.repl.class.outputDir     |/private/var/folders/bl/67vhzgqs1ks88l92h8dy8_1rgp/T/spark-43da424e-7530-4053-b30e-4068e8424dc9/repl-f1c957c7-2e4a-4f14-b234-f7b9f2447971|
> |spark.app.name                 |Spark shell                              |
> |spark.driver.memory            |4G                                       |
> |spark.executor.id              |driver                                   |
> |spark.submit.deployMode        |client                                   |
> |spark.master                   |local[*]                                 |
> |spark.home                     |/Users/dhyun/spark                       |
> |spark.sql.catalogImplementation|hive                                     |
> |spark.app.id                   |local-1484333618945                      |
> +-------------------------------+-----------------------------------------+
> {code}
> *AFTER*
> {code}
> scala> sql("set").show(false)
> +-------------------------------+---------------------------------------------------+
> |key                            |value                                              |
> +-------------------------------+---------------------------------------------------+
> |hive.metastore.warehouse.dir   |file:/Users/dhyun/SPARK-SORTED-SET/spark-warehouse |
> |spark.app.id                   |local-1484333925649                                |

[jira] [Created] (SPARK-19218) SET command should show a sorted result

2017-01-13 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-19218:
-

 Summary: SET command should show a sorted result
 Key: SPARK-19218
 URL: https://issues.apache.org/jira/browse/SPARK-19218
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Dongjoon Hyun
Priority: Trivial


Currently, the `SET` command shows an unsorted result. We should show a sorted 
result for better UX. This also matches Hive's behavior.

*BEFORE*
{code}
scala> sql("set").show(false)
+-------------------------------+-----------------------------------------+
|key                            |value                                    |
+-------------------------------+-----------------------------------------+
|spark.driver.host              |10.22.16.140                             |
|spark.driver.port              |63893                                    |
|hive.metastore.warehouse.dir   |file:/Users/dhyun/spark/spark-warehouse  |
|spark.repl.class.uri           |spark://10.22.16.140:63893/classes       |
|spark.jars                     |                                         |
|spark.repl.class.outputDir     |/private/var/folders/bl/67vhzgqs1ks88l92h8dy8_1rgp/T/spark-43da424e-7530-4053-b30e-4068e8424dc9/repl-f1c957c7-2e4a-4f14-b234-f7b9f2447971|
|spark.app.name                 |Spark shell                              |
|spark.driver.memory            |4G                                       |
|spark.executor.id              |driver                                   |
|spark.submit.deployMode        |client                                   |
|spark.master                   |local[*]                                 |
|spark.home                     |/Users/dhyun/spark                       |
|spark.sql.catalogImplementation|hive                                     |
|spark.app.id                   |local-1484333618945                      |
+-------------------------------+-----------------------------------------+
{code}

*AFTER*
{code}
scala> sql("set").show(false)
+-------------------------------+---------------------------------------------------+
|key                            |value                                              |
+-------------------------------+---------------------------------------------------+
|hive.metastore.warehouse.dir   |file:/Users/dhyun/SPARK-SORTED-SET/spark-warehouse |
|spark.app.id                   |local-1484333925649                                |
|spark.app.name                 |Spark shell                                        |
|spark.driver.host              |10.22.16.140                                       |

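Until the output is sorted at the source, a user-side workaround (a sketch, assuming the {{key}}/{{value}} result columns shown above) is to sort the returned DataFrame explicitly:

{code}
scala> sql("set").orderBy("key").show(false)
{code}
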
[jira] [Commented] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822154#comment-15822154
 ] 

Xiao Li commented on SPARK-19209:
-

Thanks for reporting the regression. Let me take a look at this. 

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But that code is hard for me to 
> understand, so I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.
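
A commonly suggested workaround for this symptom (a sketch, not the fix; it assumes the Xerial driver's class name is {{org.sqlite.JDBC}}) is to pass the driver class explicitly, which skips the {{DriverManager.getDriver}} lookup that fails on the first call:

{code}
// Workaround sketch: name the JDBC driver class explicitly.
spark.read.format("jdbc")
  .option("url", "jdbc:sqlite:")
  .option("dbtable", "x")
  .option("driver", "org.sqlite.JDBC")   // assumed Xerial sqlite-jdbc driver class
  .load()
{code}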



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19205) "No suitable driver" on first try

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19205:
--


> "No suitable driver" on first try
> -
>
> Key: SPARK-19205
> URL: https://issues.apache.org/jira/browse/SPARK-19205
> Project: Spark
>  Issue Type: Bug
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But that code is hard for me to 
> understand, so I could be totally wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19204) "No suitable driver" on first try

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19204:
--


> "No suitable driver" on first try
> -
>
> Key: SPARK-19204
> URL: https://issues.apache.org/jira/browse/SPARK-19204
> Project: Spark
>  Issue Type: Bug
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But that code is hard for me to 
> understand, so I could be totally wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822145#comment-15822145
 ] 

Xiao Li edited comment on SPARK-19209 at 1/13/17 6:50 PM:
--

 It sounds like you created multiple duplicate JIRAs: SPARK-19204, SPARK-19205, 
and SPARK-19209. Could you delete the first two?

Let me reopen this one.


was (Author: smilegator):
 It sounds like you created multiple duplicate JIRAs: SPARK-19204, SPARK-19205, 
and SPARK-19209. Let me close the last two. 

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But that code is hard for me to 
> understand, so I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-19209:
-

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But that code is hard for me to 
> understand, so I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822145#comment-15822145
 ] 

Xiao Li commented on SPARK-19209:
-

 It sounds like you created multiple duplicate JIRAs: SPARK-19204, SPARK-19205, 
and SPARK-19209. Let me close the last two. 

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But that code is hard for me to 
> understand, so I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user

2017-01-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19113:
-
Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)

> Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source 
> should be sent to the user
> -
>
> Key: SPARK-19113
> URL: https://issues.apache.org/jira/browse/SPARK-19113
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user

2017-01-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-19113:
--

Reopened it as it's still flaky

> Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source 
> should be sent to the user
> -
>
> Key: SPARK-19113
> URL: https://issues.apache.org/jira/browse/SPARK-19113
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-19209.
---
Resolution: Duplicate

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But that code is hard for me to 
> understand, so I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19217) Offer easy cast from vector to array

2017-01-13 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-19217:


 Summary: Offer easy cast from vector to array
 Key: SPARK-19217
 URL: https://issues.apache.org/jira/browse/SPARK-19217
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark, SQL
Affects Versions: 2.1.0
Reporter: Nicholas Chammas
Priority: Minor


Working with ML often means working with DataFrames with vector columns. You 
can't save these DataFrames to storage without converting the vector columns to 
array columns, and there doesn't appear to be an easy way to make that conversion.

This is a common enough problem that it is [documented on Stack 
Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions to 
making the conversion from a vector column to an array column are:
# Convert the DataFrame to an RDD and back
# Use a UDF

Both approaches work fine, but it really seems like you should be able to do 
something like this instead:

{code}
(le_data
.select(
col('features').cast('array').alias('features')
))
{code}

We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears that 
{{cast()}} doesn't support this conversion.

Would this be an appropriate thing to add?
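
For reference, a minimal Scala sketch of the UDF workaround mentioned above ({{df}} and the column names are hypothetical):

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical DataFrame `df` with an ML vector column named "features".
val vecToArray = udf((v: Vector) => v.toArray)
val withArrays = df.withColumn("features_arr", vecToArray(col("features")))
{code}

The same idea can be expressed from PySpark with a Python UDF that returns {{ArrayType(DoubleType())}}.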



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19131) Support "alter table drop partition [if exists]"

2017-01-13 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-19131.
-
Resolution: Invalid

Hi, [~licl].

I'm closing this issue because this feature is already supported. Please try the 
following.

{code}
scala> spark.version
res0: String = 2.1.0

scala> sql("create table t(a int) partitioned by (p int)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("alter table t drop if exists partition (p=1)")
res2: org.apache.spark.sql.DataFrame = []
{code}

> Support "alter table drop partition [if exists]"
> 
>
> Key: SPARK-19131
> URL: https://issues.apache.org/jira/browse/SPARK-19131
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 2.1.0
>Reporter: lichenglin
>
> {code}
> val parts = client.getPartitions(hiveTable, s.asJava).asScala
> if (parts.isEmpty && !ignoreIfNotExists) {
>   throw new AnalysisException(
> s"No partition is dropped. One partition spec '$s' does not exist 
> in table '$table' " +
> s"database '$db'")
> }
> parts.map(_.getValues)
> {code}
> Up to 2.1.0, dropping a partition throws an exception when there is no partition to drop.
> I notice there is a param named ignoreIfNotExists,
> but I don't know how to set it.
> Maybe we could implement "alter table drop partition [if exists]". 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822120#comment-15822120
 ] 

Apache Spark commented on SPARK-4502:
-

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/16578

> Spark SQL reads unneccesary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 
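
A manual mitigation that is sometimes used until pruning is implemented (a sketch under assumptions: the path is hypothetical and {{contributors_enabled}} is assumed to be a boolean) is to read the Parquet data with an explicitly trimmed schema, so only the needed nested field is requested from the Parquet reader:

{code}
import org.apache.spark.sql.types._

// Declare only the nested field that is actually needed.
val prunedSchema = StructType(Seq(
  StructField("User", StructType(Seq(
    StructField("contributors_enabled", BooleanType))))))

// Reading with the trimmed schema may avoid assembling the other 37 fields.
val tweets = spark.read.schema(prunedSchema).parquet("/path/to/tweets")
tweets.select("User.contributors_enabled").show()
{code}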



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet

2017-01-13 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822111#comment-15822111
 ] 

Michael Allman commented on SPARK-4502:
---

Hi Guys,

I'm going to submit a PR for this shortly. We've had a patch for this 
functionality in production for a year now but are just now getting around to 
contributing it.

I've examined the other two PRs. Our patch is substantially different from the 
other two and provides a superset of their functionality. We've added over two 
dozen new unit tests to guard against regressions and test expected pruning. 
We've built and tested the latest patch, and found a significant number of test 
failures from our suite. I also found test failures in the unmodified codebase 
when enabling the schema pruning functionality.

I do not take the idea of submitting a parallel, "competing" PR lightly, but in 
this case I think we can offer a better foundation for review. Please examine 
our PR and judge for yourself.

Cheers.

> Spark SQL reads unneccesary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19216) LogisticRegressionModel is missing getThreshold()

2017-01-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822094#comment-15822094
 ] 

Nicholas Chammas commented on SPARK-19216:
--

cc [~josephkb] - Is this a valid gap in Python's API, or did I just 
misunderstand things?

> LogisticRegressionModel is missing getThreshold()
> -
>
> Key: SPARK-19216
> URL: https://issues.apache.org/jira/browse/SPARK-19216
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Say I just loaded a logistic regression model from storage. How do I check 
> that model's threshold in PySpark? From what I can see, the only way to do 
> that is to dip into the Java object:
> {code}
> model._java_obj.getThreshold()
> {code}
> It seems like PySpark's version of {{LogisticRegressionModel}} should include 
> this method.
> Another issue is that it's not clear whether the threshold is for the raw 
> prediction or the probability. Maybe it's obvious to machine learning 
> practitioners, but I couldn't tell from reading the docs or skimming the code 
> what the threshold was for exactly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19216) LogisticRegressionModel is missing getThreshold()

2017-01-13 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-19216:


 Summary: LogisticRegressionModel is missing getThreshold()
 Key: SPARK-19216
 URL: https://issues.apache.org/jira/browse/SPARK-19216
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.1.0
Reporter: Nicholas Chammas
Priority: Minor


Say I just loaded a logistic regression model from storage. How do I check that 
model's threshold in PySpark? From what I can see, the only way to do that is 
to dip into the Java object:

{code}
model._java_obj.getThreshold()
{code}

It seems like PySpark's version of {{LogisticRegressionModel}} should include 
this method.

Another issue is that it's not clear whether the threshold is for the raw 
prediction or the probability. Maybe it's obvious to machine learning 
practitioners, but I couldn't tell from reading the docs or skimming the code 
what the threshold was for exactly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18335) Add a numSlices parameter to SparkR's createDataFrame

2017-01-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-18335.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16512
[https://github.com/apache/spark/pull/16512]

> Add a numSlices parameter to SparkR's createDataFrame
> -
>
> Key: SPARK-18335
> URL: https://issues.apache.org/jira/browse/SPARK-18335
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> SparkR's createDataFrame doesn't have a `numSlices` parameter. The user 
> cannot set a partition number when converting a large R data frame to a SparkR 
> DataFrame. A workaround is to use `repartition`, but that requires a shuffle 
> stage. It would be better to support the `numSlices` parameter in the 
> `createDataFrame` method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19186) Hash symbol in middle of Sybase database table name causes Spark Exception

2017-01-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822053#comment-15822053
 ] 

Dongjoon Hyun commented on SPARK-19186:
---

Hi, [~schulewa].
It looks like 
`net.sourceforge.jtds.jdbc.SQLDiagnostic.addDiagnostic(SQLDiagnostic.java)` 
(instead of Spark) complains about the query.
Could you try that SQL syntax directly on that library, without Spark?

> Hash symbol in middle of Sybase database table name causes Spark Exception
> --
>
> Key: SPARK-19186
> URL: https://issues.apache.org/jira/browse/SPARK-19186
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Adrian Schulewitz
>Priority: Minor
>
> If I use a table name without a '#' symbol in the middle, no exception 
> occurs, but with one an exception is thrown. According to the Sybase 15 
> documentation, '#' is a legal character.
> val testSql = "SELECT * FROM CTP#ADR_TYPE_DBF"
> val conf = new SparkConf().setAppName("MUREX DMart Simple Reader via 
> SQL").setMaster("local[2]")
> val sess = SparkSession
>   .builder()
>   .appName("MUREX DMart Simple SQL Reader")
>   .config(conf)
>   .getOrCreate()
> import sess.implicits._
> val df = sess.read
> .format("jdbc")
> .option("url", 
> "jdbc:jtds:sybase://auq7064s.unix.anz:4020/mxdmart56")
> .option("driver", "net.sourceforge.jtds.jdbc.Driver")
> .option("dbtable", "CTP#ADR_TYPE_DBF")
> .option("UDT_DEALCRD_REP", "mxdmart56")
> .option("user", "INSTAL")
> .option("password", "INSTALL")
> .load()
> df.createOrReplaceTempView("trades")
> val resultsDF = sess.sql(testSql)
> resultsDF.show()
> 17/01/12 14:30:01 INFO SharedState: Warehouse path is 
> 'file:/C:/DEVELOPMENT/Projects/MUREX/trunk/murex-eom-reporting/spark-warehouse/'.
> 17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: trades
> 17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: SELECT * FROM 
> CTP#ADR_TYPE_DBF
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '#' expecting {, ',', 'SELECT', 'FROM', 'ADD', 'AS', 
> 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 
> 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 
> 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 
> 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 
> 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 
> 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 
> 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 
> 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 
> 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 
> 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 
> 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 
> 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 
> 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 
> 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 
> 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 
> 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 
> 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 
> 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 
> 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 
> 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 
> 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 
> 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, 
> BACKQUOTED_IDENTIFIER}(line 1, pos 17)
> == SQL ==
> SELECT * FROM CTP#ADR_TYPE_DBF
> -^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> 
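
A hedged workaround sketch for the reproduction quoted above (not a fix): the JDBC relation was already registered as the temp view {{trades}}, so querying the view keeps the '#'-containing name out of Spark's SQL parser entirely:

{code}
// Query the registered temp view instead of the raw Sybase table name.
val resultsDF = sess.sql("SELECT * FROM trades")
resultsDF.show()
{code}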

[jira] [Resolved] (SPARK-19092) Save() API of DataFrameWriter should not scan all the saved files

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19092.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Save() API of DataFrameWriter should not scan all the saved files
> -
>
> Key: SPARK-19092
> URL: https://issues.apache.org/jira/browse/SPARK-19092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> `DataFrameWriter`'s save() API performs an unnecessary full filesystem 
> scan of the saved files. The save() API is the most basic/core API in 
> `DataFrameWriter`; we should avoid this unnecessary file scan. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19092) Save() API of DataFrameWriter should not scan all the saved files

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-19092:
---

Assignee: Xiao Li

> Save() API of DataFrameWriter should not scan all the saved files
> -
>
> Key: SPARK-19092
> URL: https://issues.apache.org/jira/browse/SPARK-19092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> `DataFrameWriter`'s save() API performs an unnecessary full filesystem 
> scan of the saved files. The save() API is the most basic/core API in 
> `DataFrameWriter`; we should avoid this unnecessary file scan. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17237) DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-17237.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.2.0
   2.1.1

> DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
> -
>
> Key: SPARK-17237
> URL: https://issues.apache.org/jira/browse/SPARK-17237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jiang Qiqi
>Assignee: Takeshi Yamamuro
>  Labels: newbie
> Fix For: 2.1.1, 2.2.0
>
>
> I am trying to run a pivot transformation that I previously ran on a Spark 1.6 
> cluster, namely:
> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res1: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> res2: org.apache.spark.sql.DataFrame = [a: int, 3_count(c): bigint, 3_avg(c): 
> double, 4_count(c): bigint, 4_avg(c): double]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0).show
> +---+--++--++
> |  a|3_count(c)|3_avg(c)|4_count(c)|4_avg(c)|
> +---+--++--++
> |  2| 1| 4.0| 0| 0.0|
> |  3| 0| 0.0| 1| 5.0|
> +---+--++--++
> After upgrading the environment to Spark 2.0, I got an error while executing 
> the .na.fill method:
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> res3.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: 
> `3_count(`c`)`;
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:113)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:921)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:411)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:162)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:159)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:159)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:149)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
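
A hedged workaround sketch for the reported case (not the fix that was merged): aliasing the aggregate expressions keeps parentheses out of the pivoted column names, which is what the {{na.fill}} name resolution appears to trip over above:

{code}
import org.apache.spark.sql.functions.{avg, count}
import spark.implicits._

val df = sc.parallelize(Seq((2, 3, 4), (3, 4, 5))).toDF("a", "b", "c")
df.groupBy("a").pivot("b")
  .agg(count("c").alias("cnt"), avg("c").alias("avg"))   // columns become e.g. 3_cnt, 3_avg
  .na.fill(0)
  .show()
{code}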



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19142) spark.kmeans should take seed, initSteps, and tol as parameters

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19142.
-
   Resolution: Fixed
 Assignee: Miao Wang
Fix Version/s: 2.2.0

> spark.kmeans should take seed, initSteps, and tol as parameters
> ---
>
> Key: SPARK-19142
> URL: https://issues.apache.org/jira/browse/SPARK-19142
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> spark.kmeans doesn't have an interface to set initSteps, seed, and tol. As 
> Spark's k-means algorithm doesn't take the same set of parameters as R's kmeans, 
> we should maintain a different interface in spark.kmeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19178) convert string of large numbers to int should return null

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19178.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> convert string of large numbers to int should return null
> -
>
> Key: SPARK-19178
> URL: https://issues.apache.org/jira/browse/SPARK-19178
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-01-13 Thread Danilo Ascione (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822021#comment-15822021
 ] 

Danilo Ascione commented on SPARK-13857:


I have a pipeline similar to [~abudd2014]'s. I have implemented a DataFrame-API-based 
RankingEvaluator that takes care of getting the top K recommendations 
at the evaluation phase of the pipeline, and it can be used in a model selection 
pipeline (cross-validation). 
Sample usage code:
{code}
val als = new ALS() //input dataframe (userId, itemId, clicked)
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("clicked")
  .setImplicitPrefs(true)

val paramGrid = new ParamGridBuilder()
.addGrid(als.regParam, Array(0.01,0.1))
.addGrid(als.alpha, Array(40.0, 1.0))
.build()

val evaluator = new RankingEvaluator()
.setMetricName("mpr") //Mean Percentile Rank
.setLabelCol("itemId")
.setPredictionCol("prediction")
.setQueryCol("userId")
.setK(5) //Top K
 
val cv = new CrossValidator()
  .setEstimator(als)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val crossValidatorModel = cv.fit(inputDF)

// Print the average metrics per ParamGrid entry
val avgMetricsParamGrid = crossValidatorModel.avgMetrics

// Combine with paramGrid to see how they affect the overall metrics
val combined = paramGrid.zip(avgMetricsParamGrid)
{code}

Then the resulting "bestModel" from the cross-validation run is used to generate 
the top K recommendations in batches (a short sketch follows below).

RankingEvaluator code is here 
[https://github.com/daniloascione/spark/commit/c93ab86d35984e9f70a3b4f543fb88f5541333f0]

I would appreciate any feedback. Thanks!
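
A hedged sketch of that last step, assuming the {{crossValidatorModel}} and {{inputDF}} from the snippet above:

{code}
import org.apache.spark.ml.recommendation.ALSModel

// The winning model from cross-validation, cast back to its concrete type.
val bestAls = crossValidatorModel.bestModel.asInstanceOf[ALSModel]

// Score candidate (userId, itemId) pairs; per-user top-K selection is then
// done on the resulting "prediction" column.
val scored = bestAls.transform(inputDF)
{code}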


> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18687) Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18687.
-
   Resolution: Fixed
 Assignee: Vinayak Joshi
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> Backward compatibility - creating a Dataframe on a new SQLContext object 
> fails with a Derby error
> -
>
> Key: SPARK-18687
> URL: https://issues.apache.org/jira/browse/SPARK-18687
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Spark built with hive support
>Reporter: Vinayak Joshi
>Assignee: Vinayak Joshi
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> With a local Spark instance built with Hive support (-Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver),
> the following script/sequence works in PySpark without any error in 1.6.x, 
> but fails in 2.x.
> {code}
> people = sc.parallelize(["Michael,30", "Andy,12", "Justin,19"])
> peoplePartsRDD = people.map(lambda p: p.split(","))
> peopleRDD = peoplePartsRDD.map(lambda p: pyspark.sql.Row(name=p[0], 
> age=int(p[1])))
> peopleDF= sqlContext.createDataFrame(peopleRDD)
> peopleDF.first()
> sqlContext2 = SQLContext(sc)
> people2 = sc.parallelize(["Abcd,40", "Efgh,14", "Ijkl,16"])
> peoplePartsRDD2 = people2.map(lambda l: l.split(","))
> peopleRDD2 = peoplePartsRDD2.map(lambda p: pyspark.sql.Row(fname=p[0], 
> age=int(p[1])))
> peopleDF2 = sqlContext2.createDataFrame(peopleRDD2) # < error here
> {code}
> The error produced is:
> {noformat}
> 16/12/01 22:35:36 ERROR Schema: Failed initialising database.
> Unable to open a test connection to the given database. JDBC url = 
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@4494053, 
> see the next exception for details.
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
> .
> .
> --
> org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test 
> connection to the given database. JDBC url = 
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, see 
> the next exception for details.
> at org.apache.derby.impl.jdb
> .
> .
> .
> NestedThrowables:
> java.sql.SQLException: Unable to open a test connection to the given 
> database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, 
> username = APP. Terminating connection pool (set lazyInit to true if you 
> expect to start your database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, see 
> the next exception for details.
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> .
> .
> .
> Caused by: java.sql.SQLException: Unable to open a test connection to the 
> given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, 
> username = APP. Terminating connection pool (set lazyInit to true if you 
> expect to start your database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, see 
> the next exception for details.
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
> at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown 
> Source)
> at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
> .
> .
> .
> 16/12/01 22:48:09 ERROR Schema: Failed initialising database.
> Unable to open a test connection to the given database. JDBC url = 
> jdbc:derby:;databaseName=metastore_db;create=true, username 

[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA cartesian RDDB 
> generates RDDC: each RDDA partition is read by multiple RDDC partitions, and 
> RDDB has the same problem.
> As a result, while an RDDC partition is being computed, the corresponding 
> partition data from RDDA or RDDB is serialized (and transferred over the 
> network) repeatedly, and if RDDA or RDDB has not been persisted, its 
> partitions are recomputed repeatedly as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: Apache Spark

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA cartesian RDDB 
> generates RDDC: each RDDA partition is read by multiple RDDC partitions, and 
> RDDB has the same problem.
> As a result, while an RDDC partition is being computed, the corresponding 
> partition data from RDDA or RDDB is serialized (and transferred over the 
> network) repeatedly, and if RDDA or RDDB has not been persisted, its 
> partitions are recomputed repeatedly as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


