[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA cartesian RDDB 
> produces RDDC: each partition of RDDA is read by multiple partitions of RDDC, 
> and RDDB has the same problem.
> As a result, while an RDDC partition is being computed, the data of each RDDA 
> or RDDB partition is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also repeatedly 
> recomputed.
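A minimal, illustrative sketch (assuming a spark-shell session named {{spark}}; not part of any proposed fix): persisting both parents before the cartesian product avoids the repeated recomputation, though not the repeated transfer.

{code}
val sc = spark.sparkContext
val rddA = sc.parallelize(1 to 1000, numSlices = 4)
val rddB = sc.parallelize(1 to 1000, numSlices = 4)

// Every partition of rddA is paired with every partition of rddB, so
// without caching each parent partition is recomputed (and re-serialized)
// once per result partition that reads it.
val rddC = rddA.cache().cartesian(rddB.cache())
rddC.count()
{code}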



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA cartesian RDDB 
> produces RDDC: each partition of RDDA is read by multiple partitions of RDDC, 
> and RDDB has the same problem.
> As a result, while an RDDC partition is being computed, the data of each RDDA 
> or RDDB partition is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also repeatedly 
> recomputed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19215) Add necessary check for `RDD.checkpoint` to avoid potential mistakes

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821871#comment-15821871
 ] 

Apache Spark commented on SPARK-19215:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/16576

> Add necessary check for `RDD.checkpoint` to avoid potential mistakes
> 
>
> Key: SPARK-19215
> URL: https://issues.apache.org/jira/browse/SPARK-19215
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently RDD.checkpoint must be called before any job is executed on the RDD; 
> otherwise `doCheckpoint` will never be called. This is a pitfall: we should 
> check for this and throw an exception (or at least log a warning) in such a 
> case.
> Also, if the RDD has not been persisted, checkpointing will recompute it, 
> because the current implementation runs a separate job for checkpointing. In 
> that case we should also print a warning message reminding the user to check 
> whether they forgot to persist the RDD.
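A minimal usage sketch of the pattern implied above (assuming a spark-shell session named {{spark}} and a writable checkpoint directory): persist the RDD and call checkpoint before the first action, so the separate checkpointing job does not recompute the lineage.

{code}
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/checkpoints")   // assumed path

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.cache()        // avoids recomputation when the checkpoint job runs
rdd.checkpoint()   // must be called before the first action on rdd
rdd.count()        // runs the job, then doCheckpoint() materializes the checkpoint
{code}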



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19215) Add necessary check for `RDD.checkpoint` to avoid potential mistakes

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19215:


Assignee: (was: Apache Spark)

> Add necessary check for `RDD.checkpoint` to avoid potential mistakes
> 
>
> Key: SPARK-19215
> URL: https://issues.apache.org/jira/browse/SPARK-19215
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently RDD.checkpoint must be called before any job is executed on the RDD; 
> otherwise `doCheckpoint` will never be called. This is a pitfall: we should 
> check for this and throw an exception (or at least log a warning) in such a 
> case.
> Also, if the RDD has not been persisted, checkpointing will recompute it, 
> because the current implementation runs a separate job for checkpointing. In 
> that case we should also print a warning message reminding the user to check 
> whether they forgot to persist the RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: Apache Spark

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA cartesian RDDB 
> produces RDDC: each partition of RDDA is read by multiple partitions of RDDC, 
> and RDDB has the same problem.
> As a result, while an RDDC partition is being computed, the data of each RDDA 
> or RDDB partition is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also repeatedly 
> recomputed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA cartesian RDDB 
> produces RDDC: each partition of RDDA is read by multiple partitions of RDDC, 
> and RDDB has the same problem.
> As a result, while an RDDC partition is being computed, the data of each RDDA 
> or RDDB partition is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also repeatedly 
> recomputed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19214) Inconsistencies between DataFrame and Dataset APIs

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821905#comment-15821905
 ] 

Apache Spark commented on SPARK-19214:
--

User 'aray' has created a pull request for this issue:
https://github.com/apache/spark/pull/16577

> Inconsistencies between DataFrame and Dataset APIs
> --
>
> Key: SPARK-19214
> URL: https://issues.apache.org/jira/browse/SPARK-19214
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Alexander Alexandrov
>Priority: Trivial
>
> I am not sure whether this has been reported already, but there are some 
> confusing & annoying inconsistencies when programming the same expression in 
> the Dataset and the DataFrame APIs.
> Consider the following minimal example executed in a Spark Shell:
> {code}
> case class Point(x: Int, y: Int, z: Int)
> val ps = spark.createDataset(for {
>   x <- 1 to 10; 
>   y <- 1 to 10; 
>   z <- 1 to 10
> } yield Point(x, y, z))
> // Problem 1:
> // count produces different fields in the Dataset / DataFrame variants
> // count() on grouped DataFrame: field name is `count`
> ps.groupBy($"x").count().printSchema
> // root
> //  |-- x: integer (nullable = false)
> //  |-- count: long (nullable = false)
> // count() on grouped Dataset: field name is `count(1)`
> ps.groupByKey(_.x).count().printSchema
> // root
> //  |-- value: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> // Problem 2:
> // groupByKey produces different `key` field name depending
> // on the result type
> // this is especially confusing in the first case below (simple key types)
> // where the key field is actually named `value`
> // simple key types
> ps.groupByKey(p => p.x).count().printSchema
> // root
> //  |-- value: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> // complex key types
> ps.groupByKey(p => (p.x, p.y)).count().printSchema
> // root
> //  |-- key: struct (nullable = false)
> //  ||-- _1: integer (nullable = true)
> //  ||-- _2: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> {code}
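A hedged workaround sketch for the naming difference (it does not address the inconsistency itself): rename the grouped-Dataset result to match the DataFrame variant.

{code}
ps.groupByKey(_.x).count().toDF("x", "count").printSchema
// root
//  |-- x: integer (nullable = true)
//  |-- count: long (nullable = false)
{code}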



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19214) Inconsistencies between DataFrame and Dataset APIs

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19214:


Assignee: Apache Spark

> Inconsistencies between DataFrame and Dataset APIs
> --
>
> Key: SPARK-19214
> URL: https://issues.apache.org/jira/browse/SPARK-19214
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Alexander Alexandrov
>Assignee: Apache Spark
>Priority: Trivial
>
> I am not sure whether this has been reported already, but there are some 
> confusing & annoying inconsistencies when programming the same expression in 
> the Dataset and the DataFrame APIs.
> Consider the following minimal example executed in a Spark Shell:
> {code}
> case class Point(x: Int, y: Int, z: Int)
> val ps = spark.createDataset(for {
>   x <- 1 to 10; 
>   y <- 1 to 10; 
>   z <- 1 to 10
> } yield Point(x, y, z))
> // Problem 1:
> // count produces different fields in the Dataset / DataFrame variants
> // count() on grouped DataFrame: field name is `count`
> ps.groupBy($"x").count().printSchema
> // root
> //  |-- x: integer (nullable = false)
> //  |-- count: long (nullable = false)
> // count() on grouped Dataset: field name is `count(1)`
> ps.groupByKey(_.x).count().printSchema
> // root
> //  |-- value: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> // Problem 2:
> // groupByKey produces different `key` field name depending
> // on the result type
> // this is especially confusing in the first case below (simple key types)
> // where the key field is actually named `value`
> // simple key types
> ps.groupByKey(p => p.x).count().printSchema
> // root
> //  |-- value: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> // complex key types
> ps.groupByKey(p => (p.x, p.y)).count().printSchema
> // root
> //  |-- key: struct (nullable = false)
> //  ||-- _1: integer (nullable = true)
> //  ||-- _2: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19214) Inconsistencies between DataFrame and Dataset APIs

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19214:


Assignee: (was: Apache Spark)

> Inconsistencies between DataFrame and Dataset APIs
> --
>
> Key: SPARK-19214
> URL: https://issues.apache.org/jira/browse/SPARK-19214
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Alexander Alexandrov
>Priority: Trivial
>
> I am not sure whether this has been reported already, but there are some 
> confusing & annoying inconsistencies when programming the same expression in 
> the Dataset and the DataFrame APIs.
> Consider the following minimal example executed in a Spark Shell:
> {code}
> case class Point(x: Int, y: Int, z: Int)
> val ps = spark.createDataset(for {
>   x <- 1 to 10; 
>   y <- 1 to 10; 
>   z <- 1 to 10
> } yield Point(x, y, z))
> // Problem 1:
> // count produces different fields in the Dataset / DataFrame variants
> // count() on grouped DataFrame: field name is `count`
> ps.groupBy($"x").count().printSchema
> // root
> //  |-- x: integer (nullable = false)
> //  |-- count: long (nullable = false)
> // count() on grouped Dataset: field name is `count(1)`
> ps.groupByKey(_.x).count().printSchema
> // root
> //  |-- value: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> // Problem 2:
> // groupByKey produces different `key` field name depending
> // on the result type
> // this is especially confusing in the first case below (simple key types)
> // where the key field is actually named `value`
> // simple key types
> ps.groupByKey(p => p.x).count().printSchema
> // root
> //  |-- value: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> // complex key types
> ps.groupByKey(p => (p.x, p.y)).count().printSchema
> // root
> //  |-- key: struct (nullable = false)
> //  ||-- _1: integer (nullable = true)
> //  ||-- _2: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18801) Support resolve a nested view

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821910#comment-15821910
 ] 

Apache Spark commented on SPARK-18801:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/16561

> Support resolve a nested view
> -
>
> Key: SPARK-18801
> URL: https://issues.apache.org/jira/browse/SPARK-18801
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>
> We should be able to resolve a nested view. The main advantage is that if you 
> update an underlying view, the current view also gets updated.
> The new approach should be compatible with older versions of SPARK/HIVE, which 
> means:
>   1. The new approach should be able to resolve the views that were created 
> by older versions of SPARK/HIVE;
>   2. The new approach should be able to resolve the views that are 
> currently supported by SPARK SQL.
> The new approach mainly brings in the following changes:
>   1. Add a new operator called `View` to keep track of the CatalogTable 
> that describes the view, and the output attributes as well as the child of 
> the view;
>   2. Update the `ResolveRelations` rule to resolve the relations and 
> views, note that a nested view should be resolved correctly;
>   3. Add `AnalysisContext` to enable us to still support a view created 
> with CTE/Windows query.
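A small illustrative example of the nested-view case being targeted (hypothetical table and view names): resolving {{v2}} requires resolving {{v1}} first, and redefining {{v1}} changes what {{v2}} returns because views are resolved at query time.

{code}
spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("CREATE VIEW v1 AS SELECT id FROM t WHERE id > 0")
spark.sql("CREATE VIEW v2 AS SELECT id FROM v1 WHERE id < 100")  // nested view

spark.sql("SELECT * FROM v2").show()
{code}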



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19116) LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file

2017-01-13 Thread Andrew Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821914#comment-15821914
 ] 

Andrew Ray commented on SPARK-19116:


The 2318 number is the size of the parquet files written to disk

> LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
> -
>
> Key: SPARK-19116
> URL: https://issues.apache.org/jira/browse/SPARK-19116
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.1, 2.0.2
> Environment: Python 3.5.x
> Windows 10
>Reporter: Shea Parkes
>
> We're having some modestly severe issues with broadcast join inference, and 
> I've been chasing them through the join heuristics in the catalyst engine.  
> I've made it as far as I can, and I've hit upon something that does not make 
> any sense to me.
> I thought that loading from parquet would be a RelationPlan, which would just 
> use the sum of default sizeInBytes for each column times the number of rows.  
> But this trivial example shows that I am not correct:
> {code}
> import pyspark.sql.functions as F
> df_range = session.range(100).select(F.col('id').cast('integer'))
> df_range.write.parquet('c:/scratch/hundred_integers.parquet')
> df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet')
> df_parquet.explain(True)
> # Expected sizeInBytes
> integer_default_sizeinbytes = 4
> print(df_parquet.count() * integer_default_sizeinbytes)  # = 400
> # Inferred sizeInBytes
> print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes())  # = 2318
> # For posterity (Didn't really expect this to match anything above)
> print(df_range._jdf.logicalPlan().statistics().sizeInBytes())  # = 600
> {code}
> And here's the results of explain(True) on df_parquet:
> {code}
> In [456]: == Parsed Logical Plan ==
> Relation[id#794] parquet
> == Analyzed Logical Plan ==
> id: int
> Relation[id#794] parquet
> == Optimized Logical Plan ==
> Relation[id#794] parquet
> == Physical Plan ==
> *BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: 
> file:/c:/scratch/hundred_integers.parquet, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<id:int>
> {code}
> So basically, I don't understand how the size of the parquet file is 
> being estimated.  I don't expect it to be extremely accurate, but empirically 
> it's so inaccurate that we're having to mess with autoBroadcastJoinThreshold 
> way too much.  (It's not always too high like the example above, it's often 
> way too low.)
> Without deeper understanding, I'm considering a result of 2318 instead of 400 
> to be a bug.  My apologies if I'm missing something obvious.
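Not an explanation of the estimate itself, but a hedged workaround sketch when the size-based heuristic misfires: the broadcast hint bypasses the sizeInBytes-based auto-broadcast decision entirely (shown in Scala; {{pyspark.sql.functions.broadcast}} behaves the same way, and the path below is assumed).

{code}
import org.apache.spark.sql.functions.broadcast

val small = spark.read.parquet("/scratch/hundred_integers.parquet")
val large = spark.range(1000000).toDF("id")

// Force a broadcast join regardless of the estimated sizeInBytes.
large.join(broadcast(small), "id").explain()
{code}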



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18667) input_file_name function does not work with UDF

2017-01-13 Thread Ben (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821506#comment-15821506
 ] 

Ben edited comment on SPARK-18667 at 1/13/17 3:51 PM:
--

So, I created a new example now, and here is the code for everything:

a.xml:
{noformat}
<root>
  <x>TEXT</x>
  <y>TEXT2</y>
</root>
{noformat}

b.xml:
{noformat}
<root>
  <file>file:/C:/Users/SSS/a.xml</file>
  <other>AAA</other>
</root>
{noformat}

code:
{noformat}
from pyspark.sql.functions import udf,input_file_name
from pyspark.sql.types import StringType
from pyspark.sql import SparkSession

def filename(path):
return path

session = SparkSession.builder.appName('APP').getOrCreate()

session.udf.register('sameText',filename)
sameText = udf(filename, StringType())

df = session.read.format('xml').load('../../res/Other/a.xml', 
rowTag='root').select('*',input_file_name().alias('file'))
df.select('file').show()
df.select(sameText(df['file'])).show()

df2 = session.read.format('xml').load('../../res/Other/b.xml', rowTag='root')
df3 = df.join(df2, 'file')

df.show()
df2.show()
df3.show()
df3.selectExpr('file as FILE','x AS COL1','sameText(y) AS COL2').show()
{noformat}

and this is the console output:
{noformat}
+--------------------+
|                file|
+--------------------+
|file:/C:/Users/SS...|
+--------------------+

+--------------+
|filename(file)|
+--------------+
|              |
+--------------+

+----+-----+--------------------+
|   x|    y|                file|
+----+-----+--------------------+
|TEXT|TEXT2|file:/C:/Users/SS...|
+----+-----+--------------------+

+--------------------+-----+
|                file|other|
+--------------------+-----+
|file:/C:/Users/SS...|  AAA|
+--------------------+-----+

+--------------------+----+-----+-----+
|                file|   x|    y|other|
+--------------------+----+-----+-----+
|file:/C:/Users/SS...|TEXT|TEXT2|  AAA|
+--------------------+----+-----+-----+


[Stage 26:> (0 + 4) / 4]
[Stage 29:> (0 + 8) / 20]  ...  [Stage 29:==> (15 + 5) / 20]
[Stage 32:> (0 + 8) / 100]  ...  [Stage 32:> (67 + 8) / 100]
{noformat}

[jira] [Resolved] (SPARK-19187) querying from parquet partitioned table throws FileNotFoundException when some partitions' hdfs locations do not exist

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19187.
---
Resolution: Duplicate

> querying from parquet partitioned table throws FileNotFoundException when 
> some partitions' hdfs locations do not exist
> --
>
> Key: SPARK-19187
> URL: https://issues.apache.org/jira/browse/SPARK-19187
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: roncenzhao
>
> Hi, all.
> When some partitions' HDFS paths of a Parquet partitioned table do not exist, 
> querying the table throws a FileNotFoundException.
> The error stack is:
> `
> TaskSetManager: Lost task 522.0 in stage 1.0 (TID 523, 
> sd-hadoop-datanode-50-135.i
> dc.vip.com): java.io.FileNotFoundException: File does not exist: 
> hdfs://bipcluster/bip/external_table/vip
> dw/dw_log_app_pageinfo_clean_spark_parquet/dt=20161223/hm=1730
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
> at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(
> fileSourceInterfaces.scala:465)
> at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(
> fileSourceInterfaces.scala:462)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> `
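A rough, path-based reproduction sketch under assumed paths (the report itself concerns Hive-partitioned tables whose partition locations were removed from HDFS, but the failure mode is analogous): the partition directory disappears after the relation has listed it, so the scan fails instead of returning no rows.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

val base = "/tmp/parted"   // hypothetical path
spark.range(10).selectExpr("id", "cast(id % 2 as string) AS dt")
  .write.partitionBy("dt").mode("overwrite").parquet(base)

val df = spark.read.parquet(base)   // partition listing captured here

// Remove one partition directory behind the relation's back.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path(s"$base/dt=1"), true)

// The scan now throws FileNotFoundException for the missing files
// rather than returning zero rows for that partition.
df.count()
{code}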



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18667) input_file_name function does not work with UDF

2017-01-13 Thread Ben (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821929#comment-15821929
 ] 

Ben commented on SPARK-18667:
-

Happy to help.
Were you also able to reproduce the second issue, regarding the join and UDF?
As I said, in my script the count shows the correct number, but the join returns 
an empty table, whereas the example I provided seems to work, though very slowly, 
which is also suspicious for a single small row.
Would this also be caused by the same bug behind the UDF-on-input_file_name 
issue, or is it unrelated?

> input_file_name function does not work with UDF
> ---
>
> Key: SPARK-18667
> URL: https://issues.apache.org/jira/browse/SPARK-18667
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> {{input_file_name()}} does not return the file name but empty string instead 
> when it is used as input for UDF in PySpark as below: 
> with the data as below:
> {code}
> {"a": 1}
> {code}
> with the codes below:
> {code}
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def filename(path):
> return path
> sourceFile = udf(filename, StringType())
> spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
> {code}
> prints as below:
> {code}
> +---+
> |filename(input_file_name())|
> +---+
> |   |
> +---+
> {code}
> but the codes below:
> {code}
> spark.read.json("tmp.json").select(input_file_name()).show()
> {code}
> prints correctly as below:
> {code}
> ++
> |   input_file_name()|
> ++
> |file:///Users/hyu...|
> ++
> {code}
> This seems PySpark specific issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19192) collection function: index

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821944#comment-15821944
 ] 

Sean Owen commented on SPARK-19192:
---

Deleting as a corrupted duplicate of SPARK-19194

> collection function: index
> --
>
> Key: SPARK-19192
> URL: https://issues.apache.org/jira/browse/SPARK-19192
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Chenzhao Guo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19191) collection function:index

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821943#comment-15821943
 ] 

Sean Owen commented on SPARK-19191:
---

Deleting as a corrupted duplicate of SPARK-19194

> collection function:index
> -
>
> Key: SPARK-19191
> URL: https://issues.apache.org/jira/browse/SPARK-19191
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Chenzhao Guo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19193) collection function: index

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19193:
--


> collection function: index
> --
>
> Key: SPARK-19193
> URL: https://issues.apache.org/jira/browse/SPARK-19193
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Chenzhao Guo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19195) collection function:index

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821946#comment-15821946
 ] 

Sean Owen commented on SPARK-19195:
---

Deleting as a corrupted duplicate of SPARK-19194

> collection function:index
> -
>
> Key: SPARK-19195
> URL: https://issues.apache.org/jira/browse/SPARK-19195
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Chenzhao Guo
>
> The Index expression returns the value at index n (the right operand) of the 
> array/map (the left operand).
> To be consistent with Hive, the left parameter should be of ArrayType or 
> MapType, while the right parameter should be of AtomicType or NullType.
> Apart from the rules above, Hive does not throw any exception: even if there 
> is a type mismatch, it outputs NULL to indicate that the element was not 
> found. This implementation follows that rule.
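An illustrative sketch of the described semantics using the existing bracket-style access on arrays and maps, which already returns NULL on a miss; the proposed {{index}} function is assumed to mirror this behavior.

{code}
import spark.implicits._
import org.apache.spark.sql.functions.col

val df = Seq((Seq("a", "b", "c"), Map("k" -> 1))).toDF("arr", "m")

df.select(
  col("arr").getItem(1),    // "b"
  col("arr").getItem(10),   // NULL: index out of bounds, no exception
  col("m").getItem("k"),    // 1
  col("m").getItem("x")     // NULL: key not found
).show()
{code}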



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19193) collection function: index

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821945#comment-15821945
 ] 

Sean Owen commented on SPARK-19193:
---

Deleting as a corrupted duplicate of SPARK-19194

> collection function: index
> --
>
> Key: SPARK-19193
> URL: https://issues.apache.org/jira/browse/SPARK-19193
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Chenzhao Guo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19191) collection function:index

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19191:
--


> collection function:index
> -
>
> Key: SPARK-19191
> URL: https://issues.apache.org/jira/browse/SPARK-19191
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Chenzhao Guo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19195) collection function:index

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19195:
--


> collection function:index
> -
>
> Key: SPARK-19195
> URL: https://issues.apache.org/jira/browse/SPARK-19195
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Chenzhao Guo
>
> The Index expression returns the value at index n (the right operand) of the 
> array/map (the left operand).
> To be consistent with Hive, the left parameter should be of ArrayType or 
> MapType, while the right parameter should be of AtomicType or NullType.
> Apart from the rules above, Hive does not throw any exception: even if there 
> is a type mismatch, it outputs NULL to indicate that the element was not 
> found. This implementation follows that rule.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19192) collection function: index

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19192:
--


> collection function: index
> --
>
> Key: SPARK-19192
> URL: https://issues.apache.org/jira/browse/SPARK-19192
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Chenzhao Guo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19198) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19198:
--


> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19198
> URL: https://issues.apache.org/jira/browse/SPARK-19198
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.
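For concreteness, the two statements in question as a small sketch (hypothetical table and view names); today both fail late with the exceptions quoted above instead of being rejected up front during analysis.

{code}
spark.sql("CREATE TABLE tbl (id INT) USING parquet")
spark.sql("CREATE VIEW v AS SELECT id FROM tbl")

// 1. CREATE VIEW ... AS INSERT INTO should be rejected explicitly:
spark.sql("CREATE VIEW v2 AS INSERT INTO tbl VALUES (1)")

// 2. INSERT INTO a view should likewise be rejected explicitly:
spark.sql("INSERT INTO v VALUES (1)")
{code}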



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19199) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821950#comment-15821950
 ] 

Sean Owen commented on SPARK-19199:
---

Deleting as a corrupted duplicate of SPARK-19196

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19199
> URL: https://issues.apache.org/jira/browse/SPARK-19199
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19200) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821952#comment-15821952
 ] 

Sean Owen commented on SPARK-19200:
---

Deleting as a corrupted duplicate of SPARK-19196

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19200
> URL: https://issues.apache.org/jira/browse/SPARK-19200
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jiang Xingbo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19200) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19200:
--


> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19200
> URL: https://issues.apache.org/jira/browse/SPARK-19200
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jiang Xingbo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19198) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821949#comment-15821949
 ] 

Sean Owen commented on SPARK-19198:
---

Deleting as a corrupted duplicate of SPARK-19196

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19198
> URL: https://issues.apache.org/jira/browse/SPARK-19198
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19197) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821947#comment-15821947
 ] 

Sean Owen commented on SPARK-19197:
---

Deleting as a corrupted duplicate of SPARK-19196

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19197
> URL: https://issues.apache.org/jira/browse/SPARK-19197
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19201) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821953#comment-15821953
 ] 

Sean Owen commented on SPARK-19201:
---

Deleting as a corrupted duplicate of SPARK-19196

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19201
> URL: https://issues.apache.org/jira/browse/SPARK-19201
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19199) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19199:
--


> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19199
> URL: https://issues.apache.org/jira/browse/SPARK-19199
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19197) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19197:
--


> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19197
> URL: https://issues.apache.org/jira/browse/SPARK-19197
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19202) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19202:
--


> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19202
> URL: https://issues.apache.org/jira/browse/SPARK-19202
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19194) collection function: index

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19194:
--


> collection function: index
> --
>
> Key: SPARK-19194
> URL: https://issues.apache.org/jira/browse/SPARK-19194
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Chenzhao Guo
>
> The Index expression returns the value at index n (the right operand) of the 
> array/map (the left operand).
> To be consistent with Hive, the left parameter should be of ArrayType or 
> MapType, while the right parameter should be of AtomicType or NullType.
> Apart from the rules above, Hive does not throw any exception: even if there 
> is a type mismatch, it outputs NULL to indicate that the element was not 
> found. This implementation follows that rule.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19202) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821954#comment-15821954
 ] 

Sean Owen commented on SPARK-19202:
---

Deleting as a corrupted duplicate of SPARK-19196

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19202
> URL: https://issues.apache.org/jira/browse/SPARK-19202
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19194) collection function: index

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821956#comment-15821956
 ] 

Sean Owen commented on SPARK-19194:
---

This issue has become corrupted somehow by JIRA and I'll have to delete it. 
Could you reopen it? More detail would be useful, and please don't set Target.

> collection function: index
> --
>
> Key: SPARK-19194
> URL: https://issues.apache.org/jira/browse/SPARK-19194
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Chenzhao Guo
>
> The Index expression returns the value at index n (the right operand) of the 
> array/map (the left operand).
> To be consistent with Hive, the left parameter should be of ArrayType or 
> MapType, while the right parameter should be of AtomicType or NullType.
> Apart from the rules above, Hive does not throw any exception: even if there 
> is a type mismatch, it outputs NULL to indicate that the element was not 
> found. This implementation follows that rule.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19201) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19201:
--


> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19201
> URL: https://issues.apache.org/jira/browse/SPARK-19201
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19196) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19196:
--


> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19196
> URL: https://issues.apache.org/jira/browse/SPARK-19196
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19196) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821957#comment-15821957
 ] 

Sean Owen commented on SPARK-19196:
---

This JIRA has become corrupted somehow. Could you reopen a new one?

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19196
> URL: https://issues.apache.org/jira/browse/SPARK-19196
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19213) FileSourceScanExec uses sparksession from hadoopfsrelation creation time instead of the one active at time of execution

2017-01-13 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-19213:
--
Description: 
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the I/O code it would be beneficial to be able to use 
the active session in order to be able to modify the Hadoop config without 
recreating the dataset. What would be interesting is to not lock the Spark 
session into the physical plan for I/O and let you share datasets across Spark 
sessions. Is that supposed to work? Otherwise you'd have to get a new query 
execution to bind to the new SparkSession, which would only let you share 
logical plans.

  was:
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the I/O code it would be beneficial to be able to use 
the active session in order to be able to modify the Hadoop config without 
recreating the dataset. What would be interesting is to not lock the Spark 
session into the physical plan for I/O and let you share datasets across Spark 
sessions. Is that supposed to work? Otherwise you'd have to get a new query 
execution to bind to the new SparkSession, which would only let you share 
logical plans.

I am sending pr along with the latter.


> FileSourceScanExec uses sparksession from hadoopfsrelation creation time 
> instead of the one active at time of execution
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
> you'll notice that the SparkSession used for execution is the one that was 
> captured from the logical plan. Whereas in other places you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
> and SparkPlan captures the active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the I/O code, it would be beneficial to be able to use 
> the active session in order to modify the Hadoop config without recreating the 
> dataset. What would be interesting is to not lock the Spark session into the 
> physical plan for I/O, and to let you share datasets across Spark sessions. Is 
> that supposed to work? Otherwise you'd have to get a new query execution to 
> bind to the new SparkSession, which would only let you share logical plans. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821984#comment-15821984
 ] 

Apache Spark commented on SPARK-19113:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16567

> Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source 
> should be sent to the user
> -
>
> Key: SPARK-19113
> URL: https://issues.apache.org/jira/browse/SPARK-19113
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA cartesian RDDB 
> generates RDDC:
> each RDDA partition will be read by multiple RDDC partitions, and RDDB has a 
> similar problem.
> As a result, when an RDDC partition is computed, the corresponding partition 
> data in RDDA or RDDB is repeatedly serialized (then transferred over the network), 
> and if RDDA or RDDB has not been persisted, it also causes repeated RDD recomputation.
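
For concreteness, a minimal sketch of the pattern described above ({{sc}} is a SparkContext; the sizes and partition counts are illustrative):

{code}
// Sketch only: cartesian of two un-persisted parent RDDs.
val rddA = sc.parallelize(1 to 1000, 4)
val rddB = sc.parallelize(1 to 1000, 4)

// Each rddC partition reads one rddA partition and one rddB partition, so every
// parent partition ends up being read by several rddC partitions; without
// persisting the parents, that data is recomputed and re-serialized each time.
val rddC = rddA.cartesian(rddB)
rddC.count()
{code}

Persisting rddA and rddB before running jobs on rddC is the usual workaround today; the improvement proposed here is to avoid needing it.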



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17237) DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821983#comment-15821983
 ] 

Apache Spark commented on SPARK-17237:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/16565

> DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
> -
>
> Key: SPARK-17237
> URL: https://issues.apache.org/jira/browse/SPARK-17237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jiang Qiqi
>  Labels: newbie
>
> I am trying to run a pivot transformation which I ran on a Spark 1.6 cluster, 
> namely
> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res1: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> res2: org.apache.spark.sql.DataFrame = [a: int, 3_count(c): bigint, 3_avg(c): 
> double, 4_count(c): bigint, 4_avg(c): double]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0).show
> +---+--++--++
> |  a|3_count(c)|3_avg(c)|4_count(c)|4_avg(c)|
> +---+--++--++
> |  2| 1| 4.0| 0| 0.0|
> |  3| 0| 0.0| 1| 5.0|
> +---+--++--++
> after upgrading the environment to Spark 2.0, I got an error while executing the 
> .na.fill method
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> res3.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: 
> `3_count(`c`)`;
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:113)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:921)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:411)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:162)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:159)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:159)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:149)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
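
A possible workaround sketch (this is not the fix in the pull request above; it simply renames the pivoted columns so they contain no parentheses or backticks before filling):

{code}
// Hedged workaround sketch, reusing the names from the report above.
val pivoted = res3.groupBy("a").pivot("b").agg(count("c"), avg("c"))
val renamed = pivoted.columns.foldLeft(pivoted) { (df, c) =>
  df.withColumnRenamed(c, c.replaceAll("[`()]", "_"))
}
renamed.na.fill(0).show()
{code}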



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: Apache Spark

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Current CartesianRDD implementation, suppose RDDA cartisian RDDB, generating 
> RDDC,
> each RDDA partition will be reading by multiple RDDC partition, and RDDB has 
> similar problem.
> This will cause, when RDDC partition computing, each partition's data in RDDA 
> or RDDB will be repeatedly serialized (then transfer through network), if 
> RDDA or RDDB haven't been persist, it will cause RDD recomputation repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19177) SparkR Data Frame operation between columns elements

2017-01-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821986#comment-15821986
 ] 

Shivaram Venkataraman commented on SPARK-19177:
---

Thanks for the example - this is very useful. Just to confirm: the problem here 
is in dealing with schema objects, and specifically in being able to append and/or 
optionally mutate the schema? I'll update the JIRA title appropriately and 
we'll work on a fix for this.

> SparkR Data Frame operation between columns elements
> 
>
> Key: SPARK-19177
> URL: https://issues.apache.org/jira/browse/SPARK-19177
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
>Reporter: Vicente Masip
>Priority: Minor
>  Labels: schema, sparkR, struct
>
> I have commented on this in another thread, but I think it is important to 
> clarify:
> What happens when you are working with 50 columns and gapply? Do I rewrite a 
> 50-column schema with its new column from the gapply operation? I think there is 
> no alternative because structFields cannot be appended to a structType. Any 
> suggestions?
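
For reference, the Scala-side schema API does support appending fields (a sketch with illustrative names; the open question above is whether SparkR exposes an equivalent):

{code}
// Sketch: appending a field to an existing schema on the Scala side.
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val base = new StructType().add(StructField("a", IntegerType))
val extended = base.add(StructField("gapply_result", IntegerType))   // e.g. the 51st column
{code}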



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: Apache Spark

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA cartesian RDDB 
> generates RDDC:
> each RDDA partition will be read by multiple RDDC partitions, and RDDB has a 
> similar problem.
> As a result, when an RDDC partition is computed, the corresponding partition 
> data in RDDA or RDDB is repeatedly serialized (then transferred over the network), 
> and if RDDA or RDDB has not been persisted, it also causes repeated RDD recomputation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA cartesian RDDB 
> generates RDDC:
> each RDDA partition will be read by multiple RDDC partitions, and RDDB has a 
> similar problem.
> As a result, when an RDDC partition is computed, the corresponding partition 
> data in RDDA or RDDB is repeatedly serialized (then transferred over the network), 
> and if RDDA or RDDB has not been persisted, it also causes repeated RDD recomputation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: Apache Spark

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA cartesian RDDB 
> generates RDDC:
> each RDDA partition will be read by multiple RDDC partitions, and RDDB has a 
> similar problem.
> As a result, when an RDDC partition is computed, the corresponding partition 
> data in RDDA or RDDB is repeatedly serialized (then transferred over the network), 
> and if RDDA or RDDB has not been persisted, it also causes repeated RDD recomputation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA cartesian RDDB 
> generates RDDC:
> each RDDA partition will be read by multiple RDDC partitions, and RDDB has a 
> similar problem.
> As a result, when an RDDC partition is computed, the corresponding partition 
> data in RDDA or RDDB is repeatedly serialized (then transferred over the network), 
> and if RDDA or RDDB has not been persisted, it also causes repeated RDD recomputation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18687) Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18687.
-
   Resolution: Fixed
 Assignee: Vinayak Joshi
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> Backward compatibility - creating a Dataframe on a new SQLContext object 
> fails with a Derby error
> -
>
> Key: SPARK-18687
> URL: https://issues.apache.org/jira/browse/SPARK-18687
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Spark built with hive support
>Reporter: Vinayak Joshi
>Assignee: Vinayak Joshi
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> With a local spark instance built with hive support, (-Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver)
> The following script/sequence works in PySpark without any error in 1.6.x, 
> but fails in 2.x.
> {code}
> people = sc.parallelize(["Michael,30", "Andy,12", "Justin,19"])
> peoplePartsRDD = people.map(lambda p: p.split(","))
> peopleRDD = peoplePartsRDD.map(lambda p: pyspark.sql.Row(name=p[0], 
> age=int(p[1])))
> peopleDF= sqlContext.createDataFrame(peopleRDD)
> peopleDF.first()
> sqlContext2 = SQLContext(sc)
> people2 = sc.parallelize(["Abcd,40", "Efgh,14", "Ijkl,16"])
> peoplePartsRDD2 = people2.map(lambda l: l.split(","))
> peopleRDD2 = peoplePartsRDD2.map(lambda p: pyspark.sql.Row(fname=p[0], 
> age=int(p[1])))
> peopleDF2 = sqlContext2.createDataFrame(peopleRDD2) # < error here
> {code}
> The error produced is:
> {noformat}
> 16/12/01 22:35:36 ERROR Schema: Failed initialising database.
> Unable to open a test connection to the given database. JDBC url = 
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@4494053, 
> see the next exception for details.
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
> .
> .
> --
> org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test 
> connection to the given database. JDBC url = 
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, see 
> the next exception for details.
> at org.apache.derby.impl.jdb
> .
> .
> .
> NestedThrowables:
> java.sql.SQLException: Unable to open a test connection to the given 
> database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, 
> username = APP. Terminating connection pool (set lazyInit to true if you 
> expect to start your database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, see 
> the next exception for details.
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> .
> .
> .
> Caused by: java.sql.SQLException: Unable to open a test connection to the 
> given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, 
> username = APP. Terminating connection pool (set lazyInit to true if you 
> expect to start your database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, see 
> the next exception for details.
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
> at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown 
> Source)
> at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
> .
> .
> .
> 16/12/01 22:48:09 ERROR Schema: Failed initialising database.
> Unable to open a test connection to the given database. JDBC url = 
> jdbc:derby:;databaseName=metastore_db;create=true, username 

[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-01-13 Thread Danilo Ascione (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822021#comment-15822021
 ] 

Danilo Ascione commented on SPARK-13857:


I have a pipeline similar to [~abudd2014]'s. I have implemented a DataFrame-API-based 
RankingEvaluator that takes care of getting the top K recommendations 
at the evaluation phase of the pipeline, and it can be used in a model selection 
pipeline (cross-validation). 
Sample usage code:
{code}
val als = new ALS() //input dataframe (userId, itemId, clicked)
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("clicked")
  .setImplicitPrefs(true)

val paramGrid = new ParamGridBuilder()
.addGrid(als.regParam, Array(0.01,0.1))
.addGrid(als.alpha, Array(40.0, 1.0))
.build()

val evaluator = new RankingEvaluator()
.setMetricName("mpr") //Mean Percentile Rank
.setLabelCol("itemId")
.setPredictionCol("prediction")
.setQueryCol("userId")
.setK(5) //Top K
 
val cv = new CrossValidator()
  .setEstimator(als)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val crossValidatorModel = cv.fit(inputDF)

// Print the average metrics per ParamGrid entry
val avgMetricsParamGrid = crossValidatorModel.avgMetrics

// Combine with paramGrid to see how they affect the overall metrics
val combined = paramGrid.zip(avgMetricsParamGrid)
{code}

Then the resulting "bestModel" from the cross-validation model is used to generate 
the top K recommendations in batches.
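
A sketch of that step, using the names from the snippet above (the cast is an assumption based on the ALS estimator used there):

{code}
// Sketch: pull the tuned ALS model back out of the CrossValidatorModel and score the input.
import org.apache.spark.ml.recommendation.ALSModel

val bestAls = crossValidatorModel.bestModel.asInstanceOf[ALSModel]
val scored = bestAls.transform(inputDF)   // predictions for the observed (userId, itemId) pairs
{code}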

RankingEvaluator code is here 
[https://github.com/daniloascione/spark/commit/c93ab86d35984e9f70a3b4f543fb88f5541333f0]

I would appreciate any feedback. Thanks!


> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19178) convert string of large numbers to int should return null

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19178.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> convert string of large numbers to int should return null
> -
>
> Key: SPARK-19178
> URL: https://issues.apache.org/jira/browse/SPARK-19178
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19142) spark.kmeans should take seed, initSteps, and tol as parameters

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19142.
-
   Resolution: Fixed
 Assignee: Miao Wang
Fix Version/s: 2.2.0

> spark.kmeans should take seed, initSteps, and tol as parameters
> ---
>
> Key: SPARK-19142
> URL: https://issues.apache.org/jira/browse/SPARK-19142
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> spark.kmeans doesn't have an interface to set initSteps, seed and tol. As the Spark 
> k-means algorithm doesn't take the same set of parameters as R's kmeans, we 
> should maintain a different interface in spark.kmeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17237) DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-17237.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.2.0
   2.1.1

> DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
> -
>
> Key: SPARK-17237
> URL: https://issues.apache.org/jira/browse/SPARK-17237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jiang Qiqi
>Assignee: Takeshi Yamamuro
>  Labels: newbie
> Fix For: 2.1.1, 2.2.0
>
>
> I am trying to run a pivot transformation which I ran on a Spark 1.6 cluster, 
> namely
> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res1: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> res2: org.apache.spark.sql.DataFrame = [a: int, 3_count(c): bigint, 3_avg(c): 
> double, 4_count(c): bigint, 4_avg(c): double]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0).show
> +---+--++--++
> |  a|3_count(c)|3_avg(c)|4_count(c)|4_avg(c)|
> +---+--++--++
> |  2| 1| 4.0| 0| 0.0|
> |  3| 0| 0.0| 1| 5.0|
> +---+--++--++
> after upgrading the environment to Spark 2.0, I got an error while executing the 
> .na.fill method
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> res3.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: 
> `3_count(`c`)`;
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:113)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:921)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:411)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:162)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:159)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:159)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:149)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19092) Save() API of DataFrameWriter should not scan all the saved files

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-19092:
---

Assignee: Xiao Li

> Save() API of DataFrameWriter should not scan all the saved files
> -
>
> Key: SPARK-19092
> URL: https://issues.apache.org/jira/browse/SPARK-19092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> `DataFrameWriter`'s save() API is performing an unnecessary full filesystem 
> scan for the saved files. The save() API is the most basic/core API in 
> `DataFrameWriter`. We should avoid this unnecessary file scan. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19092) Save() API of DataFrameWriter should not scan all the saved files

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19092.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Save() API of DataFrameWriter should not scan all the saved files
> -
>
> Key: SPARK-19092
> URL: https://issues.apache.org/jira/browse/SPARK-19092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> `DataFrameWriter`'s save() API is performing an unnecessary full filesystem 
> scan for the saved files. The save() API is the most basic/core API in 
> `DataFrameWriter`. We should avoid this unnecessary file scan. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19186) Hash symbol in middle of Sybase database table name causes Spark Exception

2017-01-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822053#comment-15822053
 ] 

Dongjoon Hyun commented on SPARK-19186:
---

Hi, [~schulewa].
It looks like 
`net.sourceforge.jtds.jdbc.SQLDiagnostic.addDiagnostic(SQLDiagnostic.java)` 
(instead of Spark) complains about the query.
Could you try that SQL Syntax directly on that library without Spark?

> Hash symbol in middle of Sybase database table name causes Spark Exception
> --
>
> Key: SPARK-19186
> URL: https://issues.apache.org/jira/browse/SPARK-19186
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Adrian Schulewitz
>Priority: Minor
>
> If I use a table name without a '#' symbol in the middle, then no exception 
> occurs, but with one, an exception is thrown. According to the Sybase 15 
> documentation, '#' is a legal character.
> val testSql = "SELECT * FROM CTP#ADR_TYPE_DBF"
> val conf = new SparkConf().setAppName("MUREX DMart Simple Reader via 
> SQL").setMaster("local[2]")
> val sess = SparkSession
>   .builder()
>   .appName("MUREX DMart Simple SQL Reader")
>   .config(conf)
>   .getOrCreate()
> import sess.implicits._
> val df = sess.read
> .format("jdbc")
> .option("url", 
> "jdbc:jtds:sybase://auq7064s.unix.anz:4020/mxdmart56")
> .option("driver", "net.sourceforge.jtds.jdbc.Driver")
> .option("dbtable", "CTP#ADR_TYPE_DBF")
> .option("UDT_DEALCRD_REP", "mxdmart56")
> .option("user", "INSTAL")
> .option("password", "INSTALL")
> .load()
> df.createOrReplaceTempView("trades")
> val resultsDF = sess.sql(testSql)
> resultsDF.show()
> 17/01/12 14:30:01 INFO SharedState: Warehouse path is 
> 'file:/C:/DEVELOPMENT/Projects/MUREX/trunk/murex-eom-reporting/spark-warehouse/'.
> 17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: trades
> 17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: SELECT * FROM 
> CTP#ADR_TYPE_DBF
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '#' expecting {, ',', 'SELECT', 'FROM', 'ADD', 'AS', 
> 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 
> 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 
> 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 
> 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 
> 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 
> 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 
> 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 
> 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 
> 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 
> 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 
> 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 
> 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 
> 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 
> 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 
> 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 
> 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 
> 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 
> 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 
> 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 
> 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 
> 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 
> 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, 
> BACKQUOTED_IDENTIFIER}(line 1, pos 17)
> == SQL ==
> SELECT * FROM CTP#ADR_TYPE_DBF
> -^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(Sp
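
For what it's worth, a sketch (reusing the names from the repro above) of a query the Spark parser does accept, since the DataFrame was registered as the temp view "trades"; a backquoted identifier would only be needed if the view name itself contained a '#':

{code}
// Sketch: query the registered temp view rather than the raw Sybase table name.
val resultsDF = sess.sql("SELECT * FROM trades")
resultsDF.show()
{code}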

[jira] [Resolved] (SPARK-18335) Add a numSlices parameter to SparkR's createDataFrame

2017-01-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-18335.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16512
[https://github.com/apache/spark/pull/16512]

> Add a numSlices parameter to SparkR's createDataFrame
> -
>
> Key: SPARK-18335
> URL: https://issues.apache.org/jira/browse/SPARK-18335
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> SparkR's createDataFrame doesn't have a `numSlices` parameter. The user 
> cannot set a partition number when converting a large R dataframe to SparkR 
> dataframe. A workaround is using `repartition`, but it requires a shuffle 
> stage. It's better to support the `numSlices` parameter in the 
> `createDataFrame` method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19216) LogisticRegressionModel is missing getThreshold()

2017-01-13 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-19216:


 Summary: LogisticRegressionModel is missing getThreshold()
 Key: SPARK-19216
 URL: https://issues.apache.org/jira/browse/SPARK-19216
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.1.0
Reporter: Nicholas Chammas
Priority: Minor


Say I just loaded a logistic regression model from storage. How do I check that 
model's threshold in PySpark? From what I can see, the only way to do that is 
to dip into the Java object:

{code}
model._java_obj.getThreshold()
{code}

It seems like PySpark's version of {{LogisticRegressionModel}} should include 
this method.

Another issue is that it's not clear whether the threshold is for the raw 
prediction or the probability. Maybe it's obvious to machine learning 
practitioners, but I couldn't tell from reading the docs or skimming the code 
what the threshold was for exactly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19216) LogisticRegressionModel is missing getThreshold()

2017-01-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822094#comment-15822094
 ] 

Nicholas Chammas commented on SPARK-19216:
--

cc [~josephkb] - Is this a valid gap in Python's API, or I did just 
misunderstand things?

> LogisticRegressionModel is missing getThreshold()
> -
>
> Key: SPARK-19216
> URL: https://issues.apache.org/jira/browse/SPARK-19216
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Say I just loaded a logistic regression model from storage. How do I check 
> that model's threshold in PySpark? From what I can see, the only way to do 
> that is to dip into the Java object:
> {code}
> model._java_obj.getThreshold()
> {code}
> It seems like PySpark's version of {{LogisticRegressionModel}} should include 
> this method.
> Another issue is that it's not clear whether the threshold is for the raw 
> prediction or the probability. Maybe it's obvious to machine learning 
> practitioners, but I couldn't tell from reading the docs or skimming the code 
> what the threshold was for exactly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet

2017-01-13 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822111#comment-15822111
 ] 

Michael Allman commented on SPARK-4502:
---

Hi Guys,

I'm going to submit a PR for this shortly. We've had a patch for this 
functionality in production for a year now but are just now getting around to 
contributing it.

I've examined the other two PRs. Our patch is substantially different from the 
other two and provides a superset of their functionality. We've added over two 
dozen new unit tests to guard against regressions and test expected pruning. 
We've built and tested the latest patch, and found a significant number of test 
failures from our suite. I also found test failures in the unmodified codebase 
when enabling the schema pruning functionality.

I do not take the idea of submitting a parallel, "competing" PR lightly, but in 
this case I think we can offer a better foundation for review. Please examine 
our PR and judge for yourself.

Cheers.

> Spark SQL reads unnecessary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade the performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822120#comment-15822120
 ] 

Apache Spark commented on SPARK-4502:
-

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/16578

> Spark SQL reads unnecessary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade the performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19131) Support "alter table drop partition [if exists]"

2017-01-13 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-19131.
-
Resolution: Invalid

Hi, [~licl].

I'm closing this issue because this is an already-supported feature. Please try the 
following.

{code}
scala> spark.version
res0: String = 2.1.0

scala> sql("create table t(a int) partitioned by (p int)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("alter table t drop if exists partition (p=1)")
res2: org.apache.spark.sql.DataFrame = []
{code}

> Support "alter table drop partition [if exists]"
> 
>
> Key: SPARK-19131
> URL: https://issues.apache.org/jira/browse/SPARK-19131
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 2.1.0
>Reporter: lichenglin
>
> {code}
> val parts = client.getPartitions(hiveTable, s.asJava).asScala
> if (parts.isEmpty && !ignoreIfNotExists) {
>   throw new AnalysisException(
> s"No partition is dropped. One partition spec '$s' does not exist 
> in table '$table' " +
> s"database '$db'")
> }
> parts.map(_.getValues)
> {code}
> Until 2.1.0, drop partition will throw an exception when there is no partition to drop.
> I notice there is a param named ignoreIfNotExists.
> But I don't know how to set it.
> Maybe we can implement "alter table drop partition [if exists]". 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19217) Offer easy cast from vector to array

2017-01-13 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-19217:


 Summary: Offer easy cast from vector to array
 Key: SPARK-19217
 URL: https://issues.apache.org/jira/browse/SPARK-19217
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark, SQL
Affects Versions: 2.1.0
Reporter: Nicholas Chammas
Priority: Minor


Working with ML often means working with DataFrames with vector columns. You 
can't save these DataFrames to storage without converting the vector columns to 
array columns, and there doesn't appear to be an easy way to make that conversion.

This is a common enough problem that it is [documented on Stack 
Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions to 
making the conversion from a vector column to an array column are:
# Convert the DataFrame to an RDD and back
# Use a UDF

Both approaches work fine, but it really seems like you should be able to do 
something like this instead:

{code}
(le_data
.select(
col('features').cast('array').alias('features')
))
{code}

We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears that 
{{cast()}} doesn't support this conversion.

Would this be an appropriate thing to add?
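
For reference, the UDF route looks roughly like this on the Scala side (a sketch; {{df}} and the column names are assumptions):

{code}
// Sketch: convert an ML Vector column to an array column via a UDF.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

val vecToArray = udf { v: Vector => v.toArray }
val withArray = df.withColumn("features_arr", vecToArray(col("features")))
{code}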



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user

2017-01-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-19113:
--

Reopened it as it's still flaky

> Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source 
> should be sent to the user
> -
>
> Key: SPARK-19113
> URL: https://issues.apache.org/jira/browse/SPARK-19113
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-19209.
---
Resolution: Duplicate

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.
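
A possible workaround sketch (an assumption on my part, not something taken from this issue): name the driver class explicitly so DriverManager does not have to locate it.

{code}
// Sketch: same read as above, with the driver class stated explicitly.
spark.read.format("jdbc")
  .option("url", "jdbc:sqlite:")
  .option("dbtable", "x")
  .option("driver", "org.sqlite.JDBC")
  .load
{code}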



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user

2017-01-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19113:
-
Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)

> Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source 
> should be sent to the user
> -
>
> Key: SPARK-19113
> URL: https://issues.apache.org/jira/browse/SPARK-19113
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822145#comment-15822145
 ] 

Xiao Li commented on SPARK-19209:
-

It sounds like you created multiple duplicate JIRAs: SPARK-19204, SPARK-19205 
and SPARK-19209. Let me close the last two. 

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-19209:
-

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822145#comment-15822145
 ] 

Xiao Li edited comment on SPARK-19209 at 1/13/17 6:50 PM:
--

It sounds like you created multiple duplicate JIRAs: SPARK-19204, SPARK-19205, 
and SPARK-19209. Could you delete the first two?

Let me reopen this one.


was (Author: smilegator):
 It sounds like you create multiple duplicate JIRAs: SPARK-19204, SPARK-19205 
and SPARK-19209. Let me close the last two. 

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19204) "No suitable driver" on first try

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19204:
--


> "No suitable driver" on first try
> -
>
> Key: SPARK-19204
> URL: https://issues.apache.org/jira/browse/SPARK-19204
> Project: Spark
>  Issue Type: Bug
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19205) "No suitable driver" on first try

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen deleted SPARK-19205:
--


> "No suitable driver" on first try
> -
>
> Key: SPARK-19205
> URL: https://issues.apache.org/jira/browse/SPARK-19205
> Project: Spark
>  Issue Type: Bug
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822154#comment-15822154
 ] 

Xiao Li commented on SPARK-19209:
-

Thanks for reporting the regression. Let me take a look at this. 

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19218) SET command should show a sorted result

2017-01-13 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-19218:
-

 Summary: SET command should show a sorted result
 Key: SPARK-19218
 URL: https://issues.apache.org/jira/browse/SPARK-19218
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Dongjoon Hyun
Priority: Trivial


Currently, the `SET` command shows an unsorted result. We had better show a 
sorted result for a better user experience; this also matches Hive's behavior.
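
In the meantime, a user-side way to get sorted output is to sort the returned 
DataFrame explicitly; a minimal sketch using only the public DataFrame API:

{code}
// The SET command returns a DataFrame with `key` and `value` columns,
// so an explicit sort already gives the desired ordering on the user side.
sql("set").orderBy("key").show(false)
{code}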

*BEFORE*
{code}
scala> sql("set").show(false)
+---+-+
|key|value  

  |
+---+-+
|spark.driver.host  |10.22.16.140   

  |
|spark.driver.port  |63893  

  |
|hive.metastore.warehouse.dir   |file:/Users/dhyun/spark/spark-warehouse

  |
|spark.repl.class.uri   |spark://10.22.16.140:63893/classes 

  |
|spark.jars |   

  |
|spark.repl.class.outputDir 
|/private/var/folders/bl/67vhzgqs1ks88l92h8dy8_1rgp/T/spark-43da424e-7530-4053-b30e-4068e8424dc9/repl-f1c957c7-2e4a-4f14-b234-f7b9f2447971|
|spark.app.name |Spark shell

  |
|spark.driver.memory|4G 

  |
|spark.executor.id  |driver 

  |
|spark.submit.deployMode|client 

  |
|spark.master   |local[*]   

  |
|spark.home |/Users/dhyun/spark 

  |
|spark.sql.catalogImplementation|hive   

  |
|spark.app.id   |local-1484333618945

  |
+---+-+
{code}

*AFTER*
{code}
scala> sql("set").show(false)
+---+-+
|key|value  

  |
+---+-+
|hive.metastore.warehouse.dir   
|file:/Users/dhyun/SPARK-SORTED-SET/spark-warehouse 
  |
|spark.app.id   |local-1484333925649

  |
|spark.app.name |Spark shell

  |
|spark.driver.host  |10.22.16.140   
   

[jira] [Assigned] (SPARK-19218) SET command should show a sorted result

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19218:


Assignee: Apache Spark

> SET command should show a sorted result
> ---
>
> Key: SPARK-19218
> URL: https://issues.apache.org/jira/browse/SPARK-19218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently, `SET` command shows unsorted result. We had better show a sorted 
> result for UX. Also, this is compatible with Hive.
> *BEFORE*
> {code}
> scala> sql("set").show(false)
> +---+-+
> |key|value
>   
>   |
> +---+-+
> |spark.driver.host  |10.22.16.140 
>   
>   |
> |spark.driver.port  |63893
>   
>   |
> |hive.metastore.warehouse.dir   |file:/Users/dhyun/spark/spark-warehouse  
>   
>   |
> |spark.repl.class.uri   |spark://10.22.16.140:63893/classes   
>   
>   |
> |spark.jars | 
>   
>   |
> |spark.repl.class.outputDir 
> |/private/var/folders/bl/67vhzgqs1ks88l92h8dy8_1rgp/T/spark-43da424e-7530-4053-b30e-4068e8424dc9/repl-f1c957c7-2e4a-4f14-b234-f7b9f2447971|
> |spark.app.name |Spark shell  
>   
>   |
> |spark.driver.memory|4G   
>   
>   |
> |spark.executor.id  |driver   
>   
>   |
> |spark.submit.deployMode|client   
>   
>   |
> |spark.master   |local[*] 
>   
>   |
> |spark.home |/Users/dhyun/spark   
>   
>   |
> |spark.sql.catalogImplementation|hive 
>   
>   |
> |spark.app.id   |local-1484333618945  
>   
>   |
> +---+-+
> {code}
> *AFTER*
> {code}
> scala> sql("set").show(false)
> +---+-+
> |key|value
>   
>   |
> +---+-+
> |hive.metastore.warehouse.dir   
> |file:/Users/dhyun/SPARK-SORTED-SET/spark-warehouse   
> |
> |spark.app.id   |local-1484333925649  
>  

[jira] [Assigned] (SPARK-19218) SET command should show a sorted result

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19218:


Assignee: (was: Apache Spark)

> SET command should show a sorted result
> ---
>
> Key: SPARK-19218
> URL: https://issues.apache.org/jira/browse/SPARK-19218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Currently, `SET` command shows unsorted result. We had better show a sorted 
> result for UX. Also, this is compatible with Hive.
> *BEFORE*
> {code}
> scala> sql("set").show(false)
> +---+-+
> |key|value
>   
>   |
> +---+-+
> |spark.driver.host  |10.22.16.140 
>   
>   |
> |spark.driver.port  |63893
>   
>   |
> |hive.metastore.warehouse.dir   |file:/Users/dhyun/spark/spark-warehouse  
>   
>   |
> |spark.repl.class.uri   |spark://10.22.16.140:63893/classes   
>   
>   |
> |spark.jars | 
>   
>   |
> |spark.repl.class.outputDir 
> |/private/var/folders/bl/67vhzgqs1ks88l92h8dy8_1rgp/T/spark-43da424e-7530-4053-b30e-4068e8424dc9/repl-f1c957c7-2e4a-4f14-b234-f7b9f2447971|
> |spark.app.name |Spark shell  
>   
>   |
> |spark.driver.memory|4G   
>   
>   |
> |spark.executor.id  |driver   
>   
>   |
> |spark.submit.deployMode|client   
>   
>   |
> |spark.master   |local[*] 
>   
>   |
> |spark.home |/Users/dhyun/spark   
>   
>   |
> |spark.sql.catalogImplementation|hive 
>   
>   |
> |spark.app.id   |local-1484333618945  
>   
>   |
> +---+-+
> {code}
> *AFTER*
> {code}
> scala> sql("set").show(false)
> +---+-+
> |key|value
>   
>   |
> +---+-+
> |hive.metastore.warehouse.dir   
> |file:/Users/dhyun/SPARK-SORTED-SET/spark-warehouse   
> |
> |spark.app.id   |local-1484333925649  
>   

[jira] [Commented] (SPARK-19218) SET command should show a sorted result

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822181#comment-15822181
 ] 

Apache Spark commented on SPARK-19218:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16579

> SET command should show a sorted result
> ---
>
> Key: SPARK-19218
> URL: https://issues.apache.org/jira/browse/SPARK-19218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Currently, `SET` command shows unsorted result. We had better show a sorted 
> result for UX. Also, this is compatible with Hive.
> *BEFORE*
> {code}
> scala> sql("set").show(false)
> +---+-+
> |key|value
>   
>   |
> +---+-+
> |spark.driver.host  |10.22.16.140 
>   
>   |
> |spark.driver.port  |63893
>   
>   |
> |hive.metastore.warehouse.dir   |file:/Users/dhyun/spark/spark-warehouse  
>   
>   |
> |spark.repl.class.uri   |spark://10.22.16.140:63893/classes   
>   
>   |
> |spark.jars | 
>   
>   |
> |spark.repl.class.outputDir 
> |/private/var/folders/bl/67vhzgqs1ks88l92h8dy8_1rgp/T/spark-43da424e-7530-4053-b30e-4068e8424dc9/repl-f1c957c7-2e4a-4f14-b234-f7b9f2447971|
> |spark.app.name |Spark shell  
>   
>   |
> |spark.driver.memory|4G   
>   
>   |
> |spark.executor.id  |driver   
>   
>   |
> |spark.submit.deployMode|client   
>   
>   |
> |spark.master   |local[*] 
>   
>   |
> |spark.home |/Users/dhyun/spark   
>   
>   |
> |spark.sql.catalogImplementation|hive 
>   
>   |
> |spark.app.id   |local-1484333618945  
>   
>   |
> +---+-+
> {code}
> *AFTER*
> {code}
> scala> sql("set").show(false)
> +---+-+
> |key|value
>   
>   |
> +---+-+
> |hive.metastore.warehouse.dir   
> |file:/Users/dhyun/SPARK-SORTED-SET/spark-warehouse   
> |
>

[jira] [Commented] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822186#comment-15822186
 ] 

Xiao Li commented on SPARK-19209:
-

Did you also hit the same exception `java.sql.SQLException: No suitable driver` 
when the table exists?

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822186#comment-15822186
 ] 

Xiao Li edited comment on SPARK-19209 at 1/13/17 7:10 PM:
--

Do you also hit the same exception {{java.sql.SQLException: No suitable 
driver}} when the table exists?


was (Author: smilegator):
Did you also hit the same exception `java.sql.SQLException: No suitable driver` 
when the table exists?

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822201#comment-15822201
 ] 

Xiao Li commented on SPARK-19209:
-

I am trying to find a workaround for your case. Could you add an extra option 
{{.option("driver", "com.mysql.jdbc.Driver")}} to your code?

Note, I do not know your driver class name. Could you replace it with your class 
name in the option?
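
For the SQLite setup in the report, that would look roughly like the sketch 
below. The class name org.sqlite.JDBC is an assumption based on the xerial 
sqlite-jdbc jar; substitute your actual driver class:

{code}
// Hedged sketch of the workaround: name the driver class explicitly so the reader
// does not depend on a DriverManager lookup succeeding on the first call.
// org.sqlite.JDBC is assumed to be the class in org.xerial.sqlite-jdbc-3.8.11.2.jar.
spark.read.format("jdbc")
  .option("url", "jdbc:sqlite:")
  .option("driver", "org.sqlite.JDBC")
  .option("dbtable", "x")
  .load()
{code}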


> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19219) Parquet log output overly verbose by default

2017-01-13 Thread Nicholas (JIRA)
Nicholas created SPARK-19219:


 Summary: Parquet log output overly verbose by default
 Key: SPARK-19219
 URL: https://issues.apache.org/jira/browse/SPARK-19219
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.1.0
Reporter: Nicholas


PR #15538 addressed the problematically verbose logging when reading from older 
parquet files, but did not update the default logging properties, so that fix is 
not part of the default behavior.
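
Until the defaults are updated, one workaround is to raise the Parquet loggers' 
level in conf/log4j.properties; a sketch (the logger names are assumptions based 
on the packages the messages come from):

{code}
# Sketch: append to the log4j.properties template that ships in Spark's conf/ directory.
# Logger names are assumed from the org.apache.parquet / parquet packages.
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
{code}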



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19213) FileSourceScanExec uses sparksession from hadoopfsrelation creation time instead of the one active at time of execution

2017-01-13 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-19213:
--
Description: 
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code, it looks like we should be using the 
SparkSession that is currently active, hence take the one from the SparkPlan. 
However, if you want to share Datasets across SparkSessions, that is not enough, 
since as soon as the dataset is executed the QueryExecution will have captured 
the SparkSession at that point. If we want to share datasets across users, we 
need to make configurations not fixed upon first execution. I consider the first 
part (using the SparkSession from the logical plan) a bug, while the second 
(using the SparkSession active at runtime) an enhancement that makes sharing 
across sessions easier.

  was:
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the I/O code, it would be beneficial to be able to use 
the active session in order to modify the Hadoop config without recreating the 
dataset. What would be interesting is to not lock the SparkSession into the 
physical plan for I/O and to let you share datasets across SparkSessions. Is 
that supposed to work? Otherwise you'd have to get a new QueryExecution to bind 
to the new SparkSession, which would only let you share logical plans.


> FileSourceScanExec usese sparksession from hadoopfsrelation creation time 
> instead of the one active at time of execution
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
>  you'll notice that the sparksession used for execution is the one that was 
> captured from logicalplan. Whereas in other places you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
>  and SparkPlan captures active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the code it looks like we should be using the 
> sparksession that is currently active hence take the one from spark plan. 
> However, in case you want share Datasets across SparkSessions that is not 
> enough since as soon as dataset is executed the queryexecution will have 
> capture spark session at that point. If we want to share datasets across 
> users we need to make configurations not fixed upon first execution. I 
> consider 1st part (using sparksession from logical plan) a bug while the 
> second (using sparksession active at runtime) an enhancement so that sharing 
> across sessions is made easier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1584#comment-1584
 ] 

Xiao Li commented on SPARK-19209:
-

This could be caused by a classloader issue. Anyway, let me first move the 
driverClass initialization back to createConnectionFactory. Thanks! 

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1585#comment-1585
 ] 

Sean Owen commented on SPARK-19217:
---

It makes some sense to me, as I also find myself writing a UDF to do this just 
about every time.
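
For reference, the Scala version of that UDF is short; a sketch assuming 
spark.ml vectors and a DataFrame {{df}} with a vector column named "features":

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Sketch of the usual workaround: a UDF that unpacks the vector into Array[Double].
// `df` and the "features" column name are assumptions for illustration.
val vecToArray = udf((v: Vector) => v.toArray)
val withArrays = df.withColumn("features", vecToArray(col("features")))
{code}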

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage without converting the vector columns 
> to array columns, and there doesn't appear to an easy way to make that 
> conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19219) Parquet log output overly verbose by default

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19219:


Assignee: Apache Spark

> Parquet log output overly verbose by default
> 
>
> Key: SPARK-19219
> URL: https://issues.apache.org/jira/browse/SPARK-19219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas
>Assignee: Apache Spark
>  Labels: easyfix
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> PR #15538 addressed the problematically verbose logging when reading from 
> older parquet files, but did not change the default logging properties in 
> order to incorporate that fix into the default behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19219) Parquet log output overly verbose by default

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19219:


Assignee: (was: Apache Spark)

> Parquet log output overly verbose by default
> 
>
> Key: SPARK-19219
> URL: https://issues.apache.org/jira/browse/SPARK-19219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas
>  Labels: easyfix
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> PR #15538 addressed the problematically verbose logging when reading from 
> older parquet files, but did not change the default logging properties in 
> order to incorporate that fix into the default behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19219) Parquet log output overly verbose by default

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822238#comment-15822238
 ] 

Apache Spark commented on SPARK-19219:
--

User 'nicklavers' has created a pull request for this issue:
https://github.com/apache/spark/pull/16580

> Parquet log output overly verbose by default
> 
>
> Key: SPARK-19219
> URL: https://issues.apache.org/jira/browse/SPARK-19219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas
>  Labels: easyfix
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> PR #15538 addressed the problematically verbose logging when reading from 
> older parquet files, but did not change the default logging properties in 
> order to incorporate that fix into the default behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19213) FileSourceScanExec uses sparksession from hadoopfsrelation creation time instead of the one active at time of execution

2017-01-13 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-19213:
--
Description: 
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code, it looks like we should be using the 
SparkSession that is currently active, hence take the one from the SparkPlan. 
However, if you want to share Datasets across SparkSessions, that is not enough, 
since as soon as the dataset is executed the QueryExecution will have captured 
the SparkSession at that point. If we want to share datasets across users, we 
need to make configurations not fixed upon first execution. I consider the first 
part (using the SparkSession from the logical plan) a bug, while the second 
(using the SparkSession active at runtime) an enhancement that makes sharing 
across sessions easier.

For example:
{code}
val df = spark.read.parquet(...)
df.count()
val newSession = spark.newSession()
SparkSession.setActiveSession(newSession)
// change some config on the new session (the simplest one to try is disabling vectorized reads)
val df2 = Dataset.ofRows(newSession, df.logicalPlan) // the logical plan still holds a reference to the original SparkSession, so the change doesn't take effect
{code}
I suggest that it shouldn't be necessary to create a new dataset for the change 
to take effect. For most plans Dataset.ofRows works, but this is not the case 
for HadoopFsRelation.

  was:
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

>From my understanding of the code it looks like we should be using the 
>sparksession that is currently active hence take the one from spark plan. 
>However, in case you want share Datasets across SparkSessions that is not 
>enough since as soon as dataset is executed the queryexecution will have 
>capture spark session at that point. If we want to share datasets across users 
>we need to make configurations not fixed upon first execution. I consider 1st 
>part (using sparksession from logical plan) a bug while the second (using 
>sparksession active at runtime) an enhancement so that sharing across sessions 
>is made easier.

For example:
{code}
val df = spark.read.parquet(...)
df.count()
val newSession = spark.newSession()
SparkSession.setActiveSession(newSession)
 (simplest one to try is disable vectorized 
reads)
val df2 = Dataset.ofRows(newSession, df.logicalPlan) <- logical plan still 
holds reference to original sparksession and changes don't take effect
{code}
I suggest that it shouldn't be necessary to create a new dataset for changes to 
take effect. For most of the plans doing Dataset.ofRows work but this is not 
the case for hadoopfsrelation.


> FileSourceScanExec usese sparksession from hadoopfsrelation creation time 
> instead of the one active at time of execution
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
>  you'll notice that the sparksession used for execution is the one that was 
> captured from logicalplan. Whereas in other places you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
>  and SparkPlan captures active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the code it looks like we should be using the 
> sparksession that is currently a

[jira] [Updated] (SPARK-19213) FileSourceScanExec uses sparksession from hadoopfsrelation creation time instead of the one active at time of execution

2017-01-13 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-19213:
--
Description: 
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code, it looks like we should be using the 
SparkSession that is currently active, hence take the one from the SparkPlan. 
However, if you want to share Datasets across SparkSessions, that is not enough, 
since as soon as the dataset is executed the QueryExecution will have captured 
the SparkSession at that point. If we want to share datasets across users, we 
need to make configurations not fixed upon first execution. I consider the first 
part (using the SparkSession from the logical plan) a bug, while the second 
(using the SparkSession active at runtime) an enhancement that makes sharing 
across sessions easier.

For example:

val df = spark.read.parquet(...)
df.count()
val newSession = spark.newSession()
SparkSession.setActiveSession(newSession)
 (simplest one to try is disable vectorized 
reads)
val df2 = Dataset.ofRows(newSession, df.logicalPlan) <- logical plan still 
holds reference to original sparksession and changes don't take effect

I suggest that it shouldn't be necessary to create a new dataset for the change 
to take effect. For most plans Dataset.ofRows works, but this is not the case 
for HadoopFsRelation.

  was:
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

>From my understanding of the code it looks like we should be using the 
>sparksession that is currently active hence take the one from spark plan. 
>However, in case you want share Datasets across SparkSessions that is not 
>enough since as soon as dataset is executed the queryexecution will have 
>capture spark session at that point. If we want to share datasets across users 
>we need to make configurations not fixed upon first execution. I consider 1st 
>part (using sparksession from logical plan) a bug while the second (using 
>sparksession active at runtime) an enhancement so that sharing across sessions 
>is made easier.


> FileSourceScanExec usese sparksession from hadoopfsrelation creation time 
> instead of the one active at time of execution
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
>  you'll notice that the sparksession used for execution is the one that was 
> captured from logicalplan. Whereas in other places you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
>  and SparkPlan captures active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the code it looks like we should be using the 
> sparksession that is currently active hence take the one from spark plan. 
> However, in case you want share Datasets across SparkSessions that is not 
> enough since as soon as dataset is executed the queryexecution will have 
> capture spark session at that point. If we want to share datasets across 
> users we need to make configurations not fixed upon first execution. I 
> consider 1st part (using sparksession from logical plan) a bug while the 
> second (using sparksession active at runtime) an enhancement so that sharing 
> across sessions is made easier.
> For example:
> v

[jira] [Updated] (SPARK-19213) FileSourceScanExec uses sparksession from hadoopfsrelation creation time instead of the one active at time of execution

2017-01-13 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-19213:
--
Description: 
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the sparksession used for execution is the one that was 
captured from logicalplan. Whereas in other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code, it looks like we should be using the 
SparkSession that is currently active, hence take the one from the SparkPlan. 
However, if you want to share Datasets across SparkSessions, that is not enough, 
since as soon as the dataset is executed the QueryExecution will have captured 
the SparkSession at that point. If we want to share datasets across users, we 
need to make configurations not fixed upon first execution. I consider the first 
part (using the SparkSession from the logical plan) a bug, while the second 
(using the SparkSession active at runtime) an enhancement that makes sharing 
across sessions easier.

For example:
{code}
val df = spark.read.parquet(...)
df.count()
val newSession = spark.newSession()
SparkSession.setActiveSession(newSession)
// change some config on the new session (the simplest one to try is disabling vectorized reads)
val df2 = Dataset.ofRows(newSession, df.logicalPlan) // the logical plan still holds a reference to the original SparkSession, so the change doesn't take effect
{code}
I suggest that it shouldn't be necessary to create a new dataset for the change 
to take effect. For most plans Dataset.ofRows works, but this is not the case 
for HadoopFsRelation.
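
For concreteness, a fuller sketch of the snippet above. The input path and the 
config flag (spark.sql.parquet.enableVectorizedReader, standing in for "disable 
vectorized reads") are illustrative assumptions, and Dataset.ofRows / 
Dataset.logicalPlan are private[sql], so this only compiles from code inside the 
org.apache.spark.sql package (e.g. a Spark test):

{code}
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.read.parquet("/tmp/example")   // assumed path, for illustration only
df.count()                                    // executing the plan captures the original session

val newSession = spark.newSession()
SparkSession.setActiveSession(newSession)
// Example per-session config change (assumed flag name): disable vectorized Parquet reads.
newSession.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Re-binding the plan to the new session is not enough for HadoopFsRelation scans:
// FileSourceScanExec still uses the SparkSession captured when the relation was created,
// so this count still runs with the original session's configuration.
val df2 = Dataset.ofRows(newSession, df.logicalPlan)
df2.count()
{code}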

  was:
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the SparkSession used for execution is the one that was 
captured from the logical plan. In other places you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 and SparkPlan captures the active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code, we should be using the SparkSession that is 
currently active, i.e. take the one from SparkPlan. However, if you want to 
share Datasets across SparkSessions, that is not enough: as soon as the Dataset 
is executed, its QueryExecution will have captured the SparkSession at that 
point. If we want to share Datasets across users, we need configurations that 
are not fixed upon first execution. I consider the first part (using the 
SparkSession from the logical plan) a bug, while the second (using the 
SparkSession active at runtime) is an enhancement that would make sharing 
across sessions easier.

For example:

val df = spark.read.parquet(...)
df.count()
val newSession = spark.newSession()
SparkSession.setActiveSession(newSession)
// change some configuration on the new session
// (the simplest one to try is disabling vectorized reads)
val df2 = Dataset.ofRows(newSession, df.logicalPlan)
// the logical plan still holds a reference to the original SparkSession,
// so the changes don't take effect

I suggest that it shouldn't be necessary to create a new Dataset for changes to 
take effect. For most plans, going through Dataset.ofRows works, but this is 
not the case for HadoopFsRelation.


> FileSourceScanExec uses SparkSession from HadoopFsRelation creation time 
> instead of the one active at time of execution
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
>  you'll notice that the SparkSession used for execution is the one that was 
> captured from the logical plan. In other places you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
>  and SparkPlan captures the active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the code, we should be using the 
> SparkSession that is currently active, hence tak

[jira] [Commented] (SPARK-17993) Spark prints an avalanche of warning messages from Parquet when reading parquet files written by older versions of Parquet-mr

2017-01-13 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822248#comment-15822248
 ] 

Michael Allman commented on SPARK-17993:


[~emre.colak] FYI https://github.com/apache/spark/pull/16580
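
Until that lands, a possible stopgap (just a sketch, assuming the noise really 
comes from Parquet's java.util.logging loggers, as the JUL-style "WARNING:" 
prefix suggests) is to raise their level directly in the JVMs that perform the 
scans:
{code}
import java.util.logging.{Level, Logger}

// Keep a strong reference: java.util.logging only holds loggers weakly, so the
// level setting can silently be lost after a GC if the logger isn't retained.
val parquetLogger: Logger = Logger.getLogger("org.apache.parquet")
parquetLogger.setLevel(Level.SEVERE)
{code}
Since the messages are emitted during execution, this would have to run on the 
executors as well, not only on the driver.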

> Spark prints an avalanche of warning messages from Parquet when reading 
> parquet files written by older versions of Parquet-mr
> -
>
> Key: SPARK-17993
> URL: https://issues.apache.org/jira/browse/SPARK-17993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Allman
>Assignee: Michael Allman
> Fix For: 2.1.0
>
>
> It looks like https://github.com/apache/spark/pull/14690 broke parquet log 
> output redirection. After that patch, when querying parquet files written by 
> Parquet-mr 1.6.0, Spark prints a torrent of (harmless) warning messages from 
> the Parquet reader:
> {code}
> Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: 
> Ignoring statistics because created_by could not be parsed (see PARQUET-251): 
> parquet-mr version 1.6.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
>   at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
>   at 
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
>   at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This only happens during execution, not planning, and it doesn't matter what 
> log level the {{SparkContext}} is set to.
> This is a regression I noted as something we needed to fix as a follow up to 
> PR 14690. I feel responsible, so I'm going to expedite a fix for it. I 
> suspect that PR broke Spark's Parquet log output redirection. That's the 
> premise I'm going by.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscri

[jira] [Assigned] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2017-01-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-18589:
--

Assignee: Davies Liu

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>Priority: Minor
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:93)

[jira] [Updated] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2017-01-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-18589:
---
Priority: Critical  (was: Minor)

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>Priority: Critical
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.sc

[jira] [Updated] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing

2017-01-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18970:
-
Description: 
A Spark streaming application uses S3 files as streaming sources. After running 
for several days, processing stopped even though the application continued to 
run. Stack trace:
{code}
java.io.FileNotFoundException: No such file or directory 
's3n://X'
at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818)
at 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511)
at 
org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
at 
org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
I believe 2 things should (or can) be fixed:
1. The application should fail in case of such an error (one way to surface the 
termination is sketched after the code below).
2. Allow the application to ignore such a failure, since there is a chance that 
during the next refresh the error will not resurface. (In my case I believe the 
error was caused by S3 cleaning the bucket at exactly the moment the refresh 
was running.) 

My code that creates the streaming query looks like the following:
{code}
val cq = sqlContext.readStream
  .format("json")
  .schema(struct)
  .load(s"input")
  .writeStream
  .option("checkpointLocation", s"checkpoints")
  .foreach(new ForeachWriter[Row] {...})
  .trigger(ProcessingTime("10 seconds"))
  .start()

cq.awaitTermination()
{code}
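
For point 1, one way to at least surface the termination from the application 
side is a StreamingQueryListener (a rough sketch, assuming the Spark 2.1 
listener API, where the event class names differ slightly from 2.0, and that 
{{spark}} is the SparkSession):
{code}
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val lastFailure = new AtomicReference[String]()

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    // remember why the query stopped so the driver can fail fast or restart it
    lastFailure.set(event.exception.getOrElse("query terminated without exception"))
  }
})
{code}
The main thread can then poll {{lastFailure}} (or check {{cq.isActive}} / 
{{cq.exception}}) instead of blocking forever in {{awaitTermination()}}.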

  was:
A Spark streaming application uses S3 files as streaming sources. After running 
for several days, processing stopped even though the application continued to 
run. Stack trace:
java.io.FileNotFoundException: No such file or directory 
's3n://X'
at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818)
at 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511)
at 
org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
at 
org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
 

[jira] [Closed] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2017-01-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-18475.

Resolution: Won't Fix
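
For the skew part, a common mitigation that works today (a sketch with 
hypothetical broker/topic names, assuming an extra shuffle is acceptable for 
the job) is to repartition right after the source, so the heavy downstream 
work runs with more tasks than there are TopicPartitions:
{code}
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")   // hypothetical brokers
  .option("subscribe", "events")                    // hypothetical topic
  .load()

// the shuffle spreads skewed TopicPartitions across more tasks before the
// expensive ETL processing
val widened = kafkaDf.repartition(200)
{code}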

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> What this would mean is that we won't be able to use the "CachedKafkaConsumer" 
> for what it is designed for (being cached) in this use case, but the extra 
> overhead is worth it for handling data skew and increasing parallelism, 
> especially in ETL use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18589:


Assignee: Apache Spark  (was: Davies Liu)

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Apache Spark
>Priority: Critical
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecu

[jira] [Commented] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822278#comment-15822278
 ] 

Apache Spark commented on SPARK-18589:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/16581

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>Priority: Critical
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at scala.collection.immutable

[jira] [Assigned] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18589:


Assignee: Davies Liu  (was: Apache Spark)

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>Priority: Critical
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecuti

[jira] [Updated] (SPARK-19129) alter table table_name drop partition with an empty string will drop the whole table

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19129:

Priority: Critical  (was: Major)
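
To see what actually gets dropped in the quoted repro below, listing the 
partitions before and after the ALTER TABLE may help (a sketch reusing the 
table name from the description):
{code}
spark.sql("show partitions licllocal.partition_table").show()
spark.sql("alter table licllocal.partition_table drop partition(id='')")
// expected: dropping a partition with an empty value should fail or be a no-op;
// instead the table comes back empty, as shown in the description
spark.sql("show partitions licllocal.partition_table").show()
{code}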

> alter table table_name drop partition with an empty string will drop the whole 
> table
> ---
>
> Key: SPARK-19129
> URL: https://issues.apache.org/jira/browse/SPARK-19129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: lichenglin
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
>
> {code}
> val spark = SparkSession
>   .builder
>   .appName("PartitionDropTest")
>   .master("local[2]").enableHiveSupport()
>   .getOrCreate()
> val sentenceData = spark.createDataFrame(Seq(
>   (0, "a"),
>   (1, "b"),
>   (2, "c")))
>   .toDF("id", "name")
> spark.sql("drop table if exists licllocal.partition_table")
> 
> sentenceData.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("licllocal.partition_table")
> spark.sql("alter table licllocal.partition_table drop partition(id='')")
> spark.table("licllocal.partition_table").show()
> {code}
> the result is 
> {code}
> +----+---+
> |name| id|
> +----+---+
> +----+---+
> {code}
> Maybe the partition matching goes wrong when the partition value is 
> set to an empty string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19129) alter table table_name drop partition with an empty string will drop the whole table

2017-01-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19129:

Labels: correctness  (was: )

> alter table table_name drop partition with an empty string will drop the whole 
> table
> ---
>
> Key: SPARK-19129
> URL: https://issues.apache.org/jira/browse/SPARK-19129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: lichenglin
>  Labels: correctness
>
> {code}
> val spark = SparkSession
>   .builder
>   .appName("PartitionDropTest")
>   .master("local[2]").enableHiveSupport()
>   .getOrCreate()
> val sentenceData = spark.createDataFrame(Seq(
>   (0, "a"),
>   (1, "b"),
>   (2, "c")))
>   .toDF("id", "name")
> spark.sql("drop table if exists licllocal.partition_table")
> 
> sentenceData.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("licllocal.partition_table")
> spark.sql("alter table licllocal.partition_table drop partition(id='')")
> spark.table("licllocal.partition_table").show()
> {code}
> the result is 
> {code}
> +----+---+
> |name| id|
> +----+---+
> +----+---+
> {code}
> Maybe the partition matching goes wrong when the partition value is 
> set to an empty string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


