[jira] [Updated] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17618:
---
Priority: Blocker  (was: Minor)

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method: all rows 
> were being returned instead of only the rows that did not intersect. Calling 
> subtract on the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce; the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType)
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i => Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i => Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.






[jira] [Updated] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17618:
---
Target Version/s: 1.6.3

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method: all rows 
> were being returned instead of only the rows that did not intersect. Calling 
> subtract on the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce; the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType)
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i => Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i => Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.






[jira] [Commented] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15510934#comment-15510934
 ] 

Josh Rosen commented on SPARK-17618:


Yep, the problem is that {{Coalesce}} advertises that it accepts UnsafeRows but 
misdeclares its output row format as regular rows. Comparing an UnsafeRow to any 
other row type for equality always returns false (its {{equals()}} implementation 
follows Java universal equality, so it doesn't throw when compared against a 
different type). As a result, the Except operator ends up comparing safe and 
unsafe rows, the comparisons are incorrect, and you get the wrong answer you saw 
here.

I'm marking this as a blocker for 1.6.3 and am working on a patch to fix it.
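
To make the mechanism concrete, here is a minimal, self-contained Scala sketch using 
made-up row classes (not Spark's actual {{UnsafeRow}} / {{GenericInternalRow}} 
implementations): a type-checked {{equals()}} silently reports equal values in 
different encodings as unequal, so a set difference built on it removes nothing.

{code}
sealed trait MyRow
final case class SafeRow(value: Int) extends MyRow              // stand-in for a regular row
final case class BinaryRow(bytes: Array[Byte]) extends MyRow {  // stand-in for an UnsafeRow
  override def equals(other: Any): Boolean = other match {
    case BinaryRow(bs) => java.util.Arrays.equals(bs, bytes)
    case _             => false  // different row type: no exception, just "not equal"
  }
  override def hashCode: Int = java.util.Arrays.hashCode(bytes)
}

object ExceptModel extends App {
  def encode(i: Int): BinaryRow = BinaryRow(BigInt(i).toByteArray)

  val left: Seq[MyRow]  = (1 to 100).map(encode)          // one side produces "unsafe" rows
  val right: Seq[MyRow] = (5 to 10).map(i => SafeRow(i))  // the other side produces "safe" rows

  // A set difference built on equals() removes nothing, mirroring except() returning 100.
  println(left.filterNot(l => right.exists(_ == l)).size)  // 100, not 94
}
{code}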

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method: all rows 
> were being returned instead of only the rows that did not intersect. Calling 
> subtract on the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce; the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType)
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i => Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i => Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.






[jira] [Updated] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive

2016-09-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17592:

Labels:   (was: correctness)

> SQL: CAST string as INT inconsistent with Hive
> --
>
> Key: SPARK-17592
> URL: https://issues.apache.org/jira/browse/SPARK-17592
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>
> Hello,
> there seems to be an inconsistency between Spark and Hive when casting a 
> string into an Int.
> With Hive:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 0
> select cast("0.6" as INT) ;
> > 0
> {code}
> With Spark-SQL:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 1
> select cast("0.6" as INT) ;
> > 1
> {code}
> Hive seems to perform a floor(string.toDouble), while Spark seems to perform 
> a round(string.toDouble).
> I'm not sure there is any ISO standard for this; MySQL has the same behavior 
> as Hive, while PostgreSQL performs a string.toInt and throws a 
> NumberFormatException.
> Personally, I think Hive is right, hence my posting this here.
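
As a point of comparison, a small spark-shell sketch (assumed: a Spark 2.0 session 
with {{spark}} as the SparkSession, and the behaviour reported above). Casting 
through DOUBLE first should truncate toward zero, which would match the Hive/MySQL 
result for these inputs:

{code}
// Direct string -> INT cast: the rounding behaviour reported above.
spark.sql("""SELECT CAST("0.5" AS INT)""").show()                  // 1 on Spark 2.0.0, 0 on Hive

// Two-step cast: DOUBLE -> INT truncation, expected to match Hive/MySQL for these values.
spark.sql("""SELECT CAST(CAST("0.5" AS DOUBLE) AS INT)""").show()  // expected 0
{code}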






[jira] [Updated] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive

2016-09-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17592:

Fix Version/s: (was: 2.0.1)
   (was: 2.1.0)

> SQL: CAST string as INT inconsistent with Hive
> --
>
> Key: SPARK-17592
> URL: https://issues.apache.org/jira/browse/SPARK-17592
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>
> Hello,
> there seems to be an inconsistency between Spark and Hive when casting a 
> string into an Int.
> With Hive:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 0
> select cast("0.6" as INT) ;
> > 0
> {code}
> With Spark-SQL:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 1
> select cast("0.6" as INT) ;
> > 1
> {code}
> Hive seems to perform a floor(string.toDouble), while Spark seems to perform 
> a round(string.toDouble).
> I'm not sure there is any ISO standard for this; MySQL has the same behavior 
> as Hive, while PostgreSQL performs a string.toInt and throws a 
> NumberFormatException.
> Personally, I think Hive is right, hence my posting this here.






[jira] [Commented] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511034#comment-15511034
 ] 

Apache Spark commented on SPARK-17618:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15185

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Assignee: Josh Rosen
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method: all rows 
> were being returned instead of only the rows that did not intersect. Calling 
> subtract on the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce; the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType)
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i => Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i => Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.






[jira] [Assigned] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17618:


Assignee: Apache Spark  (was: Josh Rosen)

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method: all rows 
> were being returned instead of only the rows that did not intersect. Calling 
> subtract on the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce; the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType)
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i => Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i => Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.






[jira] [Assigned] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17618:


Assignee: Josh Rosen  (was: Apache Spark)

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Assignee: Josh Rosen
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method: all rows 
> were being returned instead of only the rows that did not intersect. Calling 
> subtract on the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce; the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType)
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i => Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i => Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.






[jira] [Assigned] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-17618:
--

Assignee: Josh Rosen

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Assignee: Josh Rosen
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method: all rows 
> were being returned instead of only the rows that did not intersect. Calling 
> subtract on the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce; the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType)
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i => Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i => Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.






[jira] [Resolved] (SPARK-14235) All downloads of spark .tgz archives seem to be corrupt

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-14235.

Resolution: Fixed

I believe that this is now resolved, but please re-open if this is still an 
issue. Thanks!

> All downloads of spark .tgz archives seem to be corrupt
> ---
>
> Key: SPARK-14235
> URL: https://issues.apache.org/jira/browse/SPARK-14235
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.6.1
> Environment: any browser...
>Reporter: Morgan Jones
>
> Hi Guys,
> there seems to be an issue with all of the builds currently sitting on 
> http://spark.apache.org/downloads.html. I can't seem to get any of them to 
> untar. They exit with the following error:
> gzip: stdin: unexpected end of file
> tar: Unexpected EOF in archive
> tar: Unexpected EOF in archive
> tar: Error is not recoverable: exiting now
> any help would be greatly appreciated.
> Cheers,






[jira] [Updated] (SPARK-17452) Spark 2.0.0 is not supporting the "partition" keyword on a "describe" statement when using Hive Support

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17452:
---
Component/s: (was: Build)
 SQL

> Spark 2.0.0 is not supporting the "partition" keyword on a "describe" 
> statement when using Hive Support
> ---
>
> Key: SPARK-17452
> URL: https://issues.apache.org/jira/browse/SPARK-17452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Amazon EMR 5.0.0
>Reporter: Hernan Vivani
>
> Changes introduced in Spark 2 dropped support for the "partition" keyword on a 
> "describe" statement.
> EMR 5 (Spark 2.0):
> ==
> scala> import org.apache.spark.sql.SparkSession
> scala> val 
> sess=SparkSession.builder().appName("test").enableHiveSupport().getOrCreate()
> scala> sess.sql("describe formatted page_view partition (dt='2008-06-08', 
> country='AR')").show 
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> describe formatted page_view partition (dt='2008-06-08', country='AR')
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   ... 48 elided
> The same statement works fine on Spark 1.6.2 and Spark 1.5.2.






[jira] [Commented] (SPARK-14209) Application failure during preemption.

2016-09-21 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511187#comment-15511187
 ] 

Josh Rosen commented on SPARK-14209:


I believe that this issue should have been fixed by SPARK-17485 in Spark 2.x.

> Application failure during preemption.
> --
>
> Key: SPARK-14209
> URL: https://issues.apache.org/jira/browse/SPARK-14209
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.6.1
> Environment: Spark on YARN
>Reporter: Miles Crawford
>
> We have a fair-sharing cluster set up, including the external shuffle 
> service.  When a new job arrives, existing jobs are successfully preempted 
> down to fit.
> A spate of these messages arrives:
>   ExecutorLostFailure (executor 48 exited unrelated to the running tasks) 
> Reason: Container container_1458935819920_0019_01_000143 on host: 
> ip-10-12-46-235.us-west-2.compute.internal was preempted.
> This seems fine - the problem is that soon thereafter, our whole application 
> fails because it is unable to fetch blocks from the pre-empted containers:
> org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
> locations. Most recent failure cause:
> Caused by: java.io.IOException: Failed to connect to 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Caused by: java.net.ConnectException: Connection refused: 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf
> Spark does not attempt to recreate these blocks - the tasks simply fail over 
> and over until the maxTaskAttempts value is reached.
> It appears to me that there is some fault in the way preempted containers are 
> being handled - shouldn't these blocks be recreated on demand?






[jira] [Reopened] (SPARK-17485) Failed remote cached block reads can lead to whole job failure

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reopened SPARK-17485:


Re-opening so I can backport to branch-1.6.

> Failed remote cached block reads can lead to whole job failure
> --
>
> Key: SPARK-17485
> URL: https://issues.apache.org/jira/browse/SPARK-17485
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> In Spark's RDD.getOrCompute we first try to read a local copy of a cached 
> block, then a remote copy, and only fall back to recomputing the block if no 
> cached copy (local or remote) can be read. This logic works correctly in the 
> case where no remote copies of the block exist, but if there _are_ remote 
> copies but reads of those copies fail (due to network issues or internal 
> Spark bugs) then the BlockManager will throw a {{BlockFetchException}} error 
> that fails the entire job.
> In the case of torrent broadcast we really _do_ want to fail the entire job 
> in case no remote blocks can be fetched, but this logic is inappropriate for 
> cached blocks because those can/should be recomputed.
> Therefore, I think that this exception should be thrown higher up the call 
> stack by the BlockManager client code and not the block manager itself.
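
A minimal sketch of the control flow argued for here, using hypothetical helper 
functions rather than Spark's actual BlockManager API: a failed remote read of a 
cached block is treated as a cache miss and the block is recomputed, instead of the 
fetch exception failing the job (torrent-broadcast blocks would keep the fail-fast 
behaviour, since they cannot be recomputed).

{code}
def getOrCompute[T](blockId: String,
                    readLocal: String => Option[T],
                    readRemote: String => Option[T],
                    compute: () => T): T = {
  readLocal(blockId).getOrElse {
    val remote =
      try readRemote(blockId)
      catch { case _: java.io.IOException => None }  // a BlockFetchException-style failure
    remote.getOrElse(compute())                      // cached blocks can always be recomputed
  }
}
{code}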






[jira] [Commented] (SPARK-17485) Failed remote cached block reads can lead to whole job failure

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511244#comment-15511244
 ] 

Apache Spark commented on SPARK-17485:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15186

> Failed remote cached block reads can lead to whole job failure
> --
>
> Key: SPARK-17485
> URL: https://issues.apache.org/jira/browse/SPARK-17485
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> In Spark's RDD.getOrCompute we first try to read a local copy of a cached 
> block, then a remote copy, and only fall back to recomputing the block if no 
> cached copy (local or remote) can be read. This logic works correctly in the 
> case where no remote copies of the block exist, but if there _are_ remote 
> copies but reads of those copies fail (due to network issues or internal 
> Spark bugs) then the BlockManager will throw a {{BlockFetchException}} error 
> that fails the entire job.
> In the case of torrent broadcast we really _do_ want to fail the entire job 
> in case no remote blocks can be fetched, but this logic is inappropriate for 
> cached blocks because those can/should be recomputed.
> Therefore, I think that this exception should be thrown higher up the call 
> stack by the BlockManager client code and not the block manager itself.






[jira] [Assigned] (SPARK-17485) Failed remote cached block reads can lead to whole job failure

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17485:


Assignee: Apache Spark  (was: Josh Rosen)

> Failed remote cached block reads can lead to whole job failure
> --
>
> Key: SPARK-17485
> URL: https://issues.apache.org/jira/browse/SPARK-17485
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> In Spark's RDD.getOrCompute we first try to read a local copy of a cached 
> block, then a remote copy, and only fall back to recomputing the block if no 
> cached copy (local or remote) can be read. This logic works correctly in the 
> case where no remote copies of the block exist, but if there _are_ remote 
> copies but reads of those copies fail (due to network issues or internal 
> Spark bugs) then the BlockManager will throw a {{BlockFetchException}} error 
> that fails the entire job.
> In the case of torrent broadcast we really _do_ want to fail the entire job 
> in case no remote blocks can be fetched, but this logic is inappropriate for 
> cached blocks because those can/should be recomputed.
> Therefore, I think that this exception should be thrown higher up the call 
> stack by the BlockManager client code and not the block manager itself.






[jira] [Commented] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive

2016-09-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511251#comment-15511251
 ] 

Reynold Xin commented on SPARK-17592:
-

What do other databases do, e.g. SQL Server and Oracle?


> SQL: CAST string as INT inconsistent with Hive
> --
>
> Key: SPARK-17592
> URL: https://issues.apache.org/jira/browse/SPARK-17592
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>
> Hello,
> there seems to be an inconsistency between Spark and Hive when casting a 
> string into an Int.
> With Hive:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 0
> select cast("0.6" as INT) ;
> > 0
> {code}
> With Spark-SQL:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 1
> select cast("0.6" as INT) ;
> > 1
> {code}
> Hive seems to perform a floor(string.toDouble), while Spark seems to perform 
> a round(string.toDouble).
> I'm not sure there is any ISO standard for this; MySQL has the same behavior 
> as Hive, while PostgreSQL performs a string.toInt and throws a 
> NumberFormatException.
> Personally, I think Hive is right, hence my posting this here.






[jira] [Commented] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde

2016-09-21 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511267#comment-15511267
 ] 

Dilip Biswal commented on SPARK-17620:
--

fix it now. Thanks!

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Priority: Minor
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}






[jira] [Resolved] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip

2016-09-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-4563.
-
   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.1.0

> Allow spark driver to bind to different ip then advertise ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The Spark driver's bind IP and advertised IP are not separately configurable: 
> spark.driver.host only sets the bind IP, and SPARK_PUBLIC_DNS does not work for 
> the Spark driver. Allow an option to set the advertised IP/hostname.






[jira] [Resolved] (SPARK-17623) Failed tasks end reason is always a TaskFailedReason, types should reflect this

2016-09-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-17623.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Failed tasks end reason is always a TaskFailedReason, types should reflect 
> this
> ---
>
> Key: SPARK-17623
> URL: https://issues.apache.org/jira/browse/SPARK-17623
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.1.0
>
>
> Minor code cleanup.  In TaskResultGetter, enqueueFailedTask currently 
> deserializes the result as a TaskEndReason.  But the type is actually more 
> specific: it's a TaskFailedReason.  This just leads to more blind casting 
> later on -- it would be clearer if the msg were cast to the right type 
> immediately, so method parameter types could be tightened.






[jira] [Commented] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511386#comment-15511386
 ] 

Apache Spark commented on SPARK-17616:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/15187

> Getting "java.lang.RuntimeException: Distinct columns cannot exist in 
> Aggregate "
> -
>
> Key: SPARK-17616
> URL: https://issues.apache.org/jira/browse/SPARK-17616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Minor
>
> I execute:
> {code}
> select platform, 
> collect_set(user_auth) as paid_types,
> count(distinct sessionid) as sessions
> from non_hss.session
> where
> event = 'stop' and platform != 'testplatform' and
> not (month = MONTH(current_date()) AND year = YEAR(current_date()) 
> and day = day(current_date())) and
> (
> (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = 
> YEAR(add_months(CURRENT_DATE(), -5)))
> OR
> (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > 
> YEAR(add_months(CURRENT_DATE(), -5)))
> )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate 
> operator containing aggregate functions which don't support partial 
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read the error 5 times and the code once, and I still 
> don't understand what I'm doing incorrectly.






[jira] [Updated] (SPARK-17627) Streaming Providers should be labeled Experimental

2016-09-21 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17627:
-
Component/s: SQL

> Streaming Providers should be labeled Experimental
> --
>
> Key: SPARK-17627
> URL: https://issues.apache.org/jira/browse/SPARK-17627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> All of structured streaming is experimental, but we missed the annotation on 
> two of the APIs.






[jira] [Updated] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17616:
---
Target Version/s: 2.0.1, 2.1.0  (was: 2.0.1)

> Getting "java.lang.RuntimeException: Distinct columns cannot exist in 
> Aggregate "
> -
>
> Key: SPARK-17616
> URL: https://issues.apache.org/jira/browse/SPARK-17616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Assignee: Herman van Hovell
>Priority: Minor
>
> I execute:
> {code}
> select platform, 
> collect_set(user_auth) as paid_types,
> count(distinct sessionid) as sessions
> from non_hss.session
> where
> event = 'stop' and platform != 'testplatform' and
> not (month = MONTH(current_date()) AND year = YEAR(current_date()) 
> and day = day(current_date())) and
> (
> (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = 
> YEAR(add_months(CURRENT_DATE(), -5)))
> OR
> (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > 
> YEAR(add_months(CURRENT_DATE(), -5)))
> )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate 
> operator containing aggregate functions which don't support partial 
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read the error 5 times and the code once, and I still 
> don't understand what I'm doing incorrectly.






[jira] [Updated] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17616:
---
Assignee: Herman van Hovell

> Getting "java.lang.RuntimeException: Distinct columns cannot exist in 
> Aggregate "
> -
>
> Key: SPARK-17616
> URL: https://issues.apache.org/jira/browse/SPARK-17616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Assignee: Herman van Hovell
>Priority: Minor
>
> I execute:
> {code}
> select platform, 
> collect_set(user_auth) as paid_types,
> count(distinct sessionid) as sessions
> from non_hss.session
> where
> event = 'stop' and platform != 'testplatform' and
> not (month = MONTH(current_date()) AND year = YEAR(current_date()) 
> and day = day(current_date())) and
> (
> (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = 
> YEAR(add_months(CURRENT_DATE(), -5)))
> OR
> (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > 
> YEAR(add_months(CURRENT_DATE(), -5)))
> )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate 
> operator containing aggregate functions which don't support partial 
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read the error 5 times and the code once, and I still 
> don't understand what I'm doing incorrectly.






[jira] [Updated] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17616:
---
Target Version/s: 2.0.1

> Getting "java.lang.RuntimeException: Distinct columns cannot exist in 
> Aggregate "
> -
>
> Key: SPARK-17616
> URL: https://issues.apache.org/jira/browse/SPARK-17616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Minor
>
> I execute:
> {code}
> select platform, 
> collect_set(user_auth) as paid_types,
> count(distinct sessionid) as sessions
> from non_hss.session
> where
> event = 'stop' and platform != 'testplatform' and
> not (month = MONTH(current_date()) AND year = YEAR(current_date()) 
> and day = day(current_date())) and
> (
> (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = 
> YEAR(add_months(CURRENT_DATE(), -5)))
> OR
> (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > 
> YEAR(add_months(CURRENT_DATE(), -5)))
> )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate 
> operator containing aggregate functions which don't support partial 
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read the error 5 times and the code once, and I still 
> don't understand what I'm doing incorrectly.






[jira] [Resolved] (SPARK-17615) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17615.

Resolution: Duplicate

> Getting "java.lang.RuntimeException: Distinct columns cannot exist in 
> Aggregate "
> -
>
> Key: SPARK-17615
> URL: https://issues.apache.org/jira/browse/SPARK-17615
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Minor
>
> I execute:
> {code}
> select platform, 
> collect_set(user_auth) as paid_types,
> count(distinct sessionid) as sessions
> from non_hss.session
> where
> event = 'stop' and platform != 'testplatform' and
> not (month = MONTH(current_date()) AND year = YEAR(current_date()) 
> and day = day(current_date())) and
> (
> (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = 
> YEAR(add_months(CURRENT_DATE(), -5)))
> OR
> (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > 
> YEAR(add_months(CURRENT_DATE(), -5)))
> )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate 
> operator containing aggregate functions which don't support partial 
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read the error 5 times and the code once, and I still 
> don't understand what I'm doing incorrectly.






[jira] [Created] (SPARK-17627) Streaming Providers should be labeled Experimental

2016-09-21 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-17627:


 Summary: Streaming Providers should be labeled Experimental
 Key: SPARK-17627
 URL: https://issues.apache.org/jira/browse/SPARK-17627
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker


All of structured streaming is experimental, but we missed the annotation on 
two of the APIs.






[jira] [Reopened] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reopened SPARK-17616:


> Getting "java.lang.RuntimeException: Distinct columns cannot exist in 
> Aggregate "
> -
>
> Key: SPARK-17616
> URL: https://issues.apache.org/jira/browse/SPARK-17616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Minor
>
> I execute:
> {code}
> select platform, 
> collect_set(user_auth) as paid_types,
> count(distinct sessionid) as sessions
> from non_hss.session
> where
> event = 'stop' and platform != 'testplatform' and
> not (month = MONTH(current_date()) AND year = YEAR(current_date()) 
> and day = day(current_date())) and
> (
> (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = 
> YEAR(add_months(CURRENT_DATE(), -5)))
> OR
> (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > 
> YEAR(add_months(CURRENT_DATE(), -5)))
> )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate 
> operator containing aggregate functions which don't support partial 
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read the error 5 times and the code once, and I still 
> don't understand what I'm doing incorrectly.






[jira] [Commented] (SPARK-17627) Streaming Providers should be labeled Experimental

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511420#comment-15511420
 ] 

Apache Spark commented on SPARK-17627:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/15188

> Streaming Providers should be labeled Experimental
> --
>
> Key: SPARK-17627
> URL: https://issues.apache.org/jira/browse/SPARK-17627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> All of structured streaming is experimental, but we missed the annotation on 
> two of the APIs.






[jira] [Assigned] (SPARK-17627) Streaming Providers should be labeled Experimental

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17627:


Assignee: Michael Armbrust  (was: Apache Spark)

> Streaming Providers should be labeled Experimental
> --
>
> Key: SPARK-17627
> URL: https://issues.apache.org/jira/browse/SPARK-17627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> All of structured streaming is experimental, but we missed the annotation on 
> two of the APIs.






[jira] [Assigned] (SPARK-17627) Streaming Providers should be labeled Experimental

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17627:


Assignee: Apache Spark  (was: Michael Armbrust)

> Streaming Providers should be labeled Experimental
> --
>
> Key: SPARK-17627
> URL: https://issues.apache.org/jira/browse/SPARK-17627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Blocker
>
> All of structured streaming is experimental, but we missed the annotation on 
> two of the APIs.






[jira] [Commented] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Graeme Edwards (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511443#comment-15511443
 ] 

Graeme Edwards commented on SPARK-17618:


Thanks Josh, that perfectly explains what we saw. Thanks for the quick response!

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Assignee: Josh Rosen
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method: all rows 
> were being returned instead of only the rows that did not intersect. Calling 
> subtract on the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce; the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType)
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i => Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i => Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.






[jira] [Commented] (SPARK-17549) InMemoryRelation doesn't scale to large tables

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511459#comment-15511459
 ] 

Apache Spark commented on SPARK-17549:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15189

> InMemoryRelation doesn't scale to large tables
> --
>
> Key: SPARK-17549
> URL: https://issues.apache.org/jira/browse/SPARK-17549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
> Fix For: 2.0.1, 2.1.0
>
> Attachments: create_parquet.scala, example_1.6_post_patch.png, 
> example_1.6_pre_patch.png, spark-1.6-2.patch, spark-1.6.patch, spark-2.0.patch
>
>
> An {{InMemoryRelation}} is created when you cache a table; but if the table 
> is large, defined by either having a really large number of columns, or a 
> really large number of partitions (in the file split sense, not the "table 
> partition" sense), or both, it causes an immense amount of memory to be used 
> in the driver.
> The reason is that it uses an accumulator to collect statistics about each 
> partition, and instead of summarizing the data in the driver, it keeps *all* 
> entries in memory.
> I'm attaching a script I used to create a parquet file with 20,000 columns 
> and a single row, which I then copied 500 times so I'd have 500 partitions.
> When doing the following:
> {code}
> sqlContext.read.parquet(...).count()
> {code}
> Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the 
> settings I used, but it works.)
> I ran spark-shell like this:
> {code}
> ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g 
> --conf spark.executor.memory=2g
> {code}
> And ran:
> {code}
> sqlContext.read.parquet(...).cache().count()
> {code}
> You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 
> partitions were processed, there were 40 GenericInternalRow objects with
> 100,000 items each (5 stat info fields * 20,000 columns). So, memory usage 
> was:
> {code}
>   40 * 10^5 * (4 * 20 + 24) = 416,000,000 =~ 400MB
> {code}
> (Note: Integer = 20 bytes, Long = 24 bytes.)
> If I waited until the end, there would be 500 partitions, so ~ 5GB of memory 
> to hold the stats.
> I'm also attaching a patch I made on top of 1.6 that uses just a long 
> accumulator to capture the table size; with that patch memory usage on the 
> driver doesn't keep growing. Also note in the patch that I'm multiplying the 
> column size by the row count, which I think is a different bug in the 
> existing code (those stats should be for the whole batch, not just a single 
> row, right?). I also added {{example_1.6_post_patch.png}} to show the 
> {{InMemoryRelation}} with the patch.
> I also applied a very similar patch on top of Spark 2.0. But there things 
> blow up even more spectacularly when I try to run the count on the cached 
> table. It starts with this error:
> {noformat}
> 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, 
> vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: 
> java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: 
> Index: 63235, Size: 1
> (lots of generated code here...)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
>   at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
>   at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
>   at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
>   at org.codehaus.janino.util.ClassFile.(ClassFile.java:280)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(Co
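
To illustrate the accumulator change described in the SPARK-17549 report above (a 
rough sketch with made-up per-row sizes and the Spark 2.x accumulator API, not the 
attached patch): keeping a single running total on the driver via a long accumulator 
stays O(1) in driver memory, instead of retaining every partition's stat rows.

{code}
import org.apache.spark.sql.SparkSession

object SizeOnlyStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("size-only-stats").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // One long on the driver, rather than a growing collection of per-partition stats.
    val totalBytes = sc.longAccumulator("cachedBatchBytes")
    val data = sc.parallelize(1 to 1000000, numSlices = 500)

    data.foreachPartition { rows =>
      // Pretend each row contributes a fixed 104-byte stats estimate (the 4 * 20 + 24 figure above).
      totalBytes.add(rows.size.toLong * 104L)
    }
    println(s"estimated cached size: ${totalBytes.value} bytes")
    spark.stop()
  }
}
{code}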

[jira] [Commented] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511467#comment-15511467
 ] 

Apache Spark commented on SPARK-17620:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/15190

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Priority: Minor
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}






[jira] [Assigned] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17620:


Assignee: (was: Apache Spark)

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Priority: Minor
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17620:


Assignee: Apache Spark

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Assignee: Apache Spark
>Priority: Minor
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17569) Don't recheck existence of files when generating File Relation resolution in StructuredStreaming

2016-09-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17569.
--
   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 2.1.0

> Don't recheck existence of files when generating File Relation resolution in 
> StructuredStreaming
> 
>
> Key: SPARK-17569
> URL: https://issues.apache.org/jira/browse/SPARK-17569
> Project: Spark
>  Issue Type: Improvement
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.0
>
>
> Structured Streaming's FileSource lists files and records each batch's file list as an 
> Offset. Once that file list is committed to the metadata log for a batch, it is turned 
> into a "Batch FileSource" Relation which acts as the source for the incremental 
> execution.
> While this "Batch FileSource" Relation is resolved, we re-check on the Driver that every 
> single file still exists. That takes a huge amount of time and is wasted work; we can 
> simply skip the file existence check during execution.
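For illustration, a rough sketch of the idea (a hypothetical helper, not Spark's actual FileSource code): once a batch's file list has been committed to the metadata log, it can be trusted as-is instead of stat-ing every path again on the driver.
{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: build a batch's file list from the committed metadata log.
// `recheck = true` models the current behavior (one existence check per file on the
// driver); `recheck = false` models the proposed behavior of trusting the log.
def filesForBatch(committedPaths: Seq[String], fs: FileSystem, recheck: Boolean): Seq[Path] = {
  val paths = committedPaths.map(new Path(_))
  if (recheck) paths.filter(p => fs.exists(p)) else paths
}
{code}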



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17628) Name of "object StreamingExamples" should be more self-explanatory

2016-09-21 Thread Xin Ren (JIRA)
Xin Ren created SPARK-17628:
---

 Summary: Name of "object StreamingExamples" should be more 
self-explanatory 
 Key: SPARK-17628
 URL: https://issues.apache.org/jira/browse/SPARK-17628
 Project: Spark
  Issue Type: Bug
  Components: Examples, Streaming
Affects Versions: 2.0.0
Reporter: Xin Ren
Priority: Minor


`object StreamingExamples` is more of a utility object, and the name is too 
general; at first I thought it was an actual streaming example.

{code}
/** Utility functions for Spark Streaming examples. */
object StreamingExamples extends Logging {

  /** Set reasonable logging levels for streaming if the user has not configured log4j. */
  def setStreamingLogLevels() {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      // We first log something to initialize Spark's default logging, then we
      // override the logging level.
      logInfo("Setting log level to [WARN] for streaming example." +
        " To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}
{code}
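For context, a minimal hedged sketch of how this utility is typically called from an actual example app (the app name below is made up), which is exactly the usage that makes the too-general name confusing:
{code}
// Hypothetical example app; only the setStreamingLogLevels() call comes from the utility object.
object MyStreamingWordCount {
  def main(args: Array[String]): Unit = {
    StreamingExamples.setStreamingLogLevels() // quiet noisy INFO logs before the demo starts
    // ... set up a StreamingContext and run the actual streaming example ...
  }
}
{code}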



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17628) Name of "object StreamingExamples" should be more self-explanatory

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17628:


Assignee: (was: Apache Spark)

> Name of "object StreamingExamples" should be more self-explanatory 
> ---
>
> Key: SPARK-17628
> URL: https://issues.apache.org/jira/browse/SPARK-17628
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Streaming
>Affects Versions: 2.0.0
>Reporter: Xin Ren
>Priority: Minor
>
> `object StreamingExamples` is more of a utility object, and the name is too 
> general; at first I thought it was an actual streaming example.
> {code}
> /** Utility functions for Spark Streaming examples. */
> object StreamingExamples extends Logging {
>   /** Set reasonable logging levels for streaming if the user has not configured log4j. */
>   def setStreamingLogLevels() {
>     val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
>     if (!log4jInitialized) {
>       // We first log something to initialize Spark's default logging, then we
>       // override the logging level.
>       logInfo("Setting log level to [WARN] for streaming example." +
>         " To override add a custom log4j.properties to the classpath.")
>       Logger.getRootLogger.setLevel(Level.WARN)
>     }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17628) Name of "object StreamingExamples" should be more self-explanatory

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511650#comment-15511650
 ] 

Apache Spark commented on SPARK-17628:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/15191

> Name of "object StreamingExamples" should be more self-explanatory 
> ---
>
> Key: SPARK-17628
> URL: https://issues.apache.org/jira/browse/SPARK-17628
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Streaming
>Affects Versions: 2.0.0
>Reporter: Xin Ren
>Priority: Minor
>
> `object StreamingExamples` is more of a utility object, and the name is too 
> general; at first I thought it was an actual streaming example.
> {code}
> /** Utility functions for Spark Streaming examples. */
> object StreamingExamples extends Logging {
>   /** Set reasonable logging levels for streaming if the user has not configured log4j. */
>   def setStreamingLogLevels() {
>     val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
>     if (!log4jInitialized) {
>       // We first log something to initialize Spark's default logging, then we
>       // override the logging level.
>       logInfo("Setting log level to [WARN] for streaming example." +
>         " To override add a custom log4j.properties to the classpath.")
>       Logger.getRootLogger.setLevel(Level.WARN)
>     }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17628) Name of "object StreamingExamples" should be more self-explanatory

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17628:


Assignee: Apache Spark

> Name of "object StreamingExamples" should be more self-explanatory 
> ---
>
> Key: SPARK-17628
> URL: https://issues.apache.org/jira/browse/SPARK-17628
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Streaming
>Affects Versions: 2.0.0
>Reporter: Xin Ren
>Assignee: Apache Spark
>Priority: Minor
>
> `object StreamingExamples` is more of a utility object, and the name is too 
> general; at first I thought it was an actual streaming example.
> {code}
> /** Utility functions for Spark Streaming examples. */
> object StreamingExamples extends Logging {
>   /** Set reasonable logging levels for streaming if the user has not configured log4j. */
>   def setStreamingLogLevels() {
>     val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
>     if (!log4jInitialized) {
>       // We first log something to initialize Spark's default logging, then we
>       // override the logging level.
>       logInfo("Setting log level to [WARN] for streaming example." +
>         " To override add a custom log4j.properties to the classpath.")
>       Logger.getRootLogger.setLevel(Level.WARN)
>     }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17616:


Assignee: Herman van Hovell  (was: Apache Spark)

> Getting "java.lang.RuntimeException: Distinct columns cannot exist in 
> Aggregate "
> -
>
> Key: SPARK-17616
> URL: https://issues.apache.org/jira/browse/SPARK-17616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Assignee: Herman van Hovell
>Priority: Minor
>
> I execute:
> {code}
> select platform, 
> collect_set(user_auth) as paid_types,
> count(distinct sessionid) as sessions
> from non_hss.session
> where
> event = 'stop' and platform != 'testplatform' and
> not (month = MONTH(current_date()) AND year = YEAR(current_date()) 
> and day = day(current_date())) and
> (
> (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = 
> YEAR(add_months(CURRENT_DATE(), -5)))
> OR
> (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > 
> YEAR(add_months(CURRENT_DATE(), -5)))
> )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate 
> operator containing aggregate functions which don't support partial 
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read the error 5 times, and read the code once. I still 
> don't understand what I'm doing incorrectly.
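A stripped-down sketch with the same shape as the failing query (table and column names are invented): it mixes {{collect_set}}, which does not support partial aggregation, with a {{count(DISTINCT ...)}} in the same aggregate, and should hit the same analysis error on 2.0.0 if that combination is indeed the cause.
{code}
spark.range(100)
  .selectExpr("id % 2 AS platform", "id % 5 AS user_auth", "id AS sessionid")
  .createOrReplaceTempView("session")

spark.sql(
  """SELECT platform,
    |       collect_set(user_auth)    AS paid_types,
    |       count(DISTINCT sessionid) AS sessions
    |FROM session
    |GROUP BY platform""".stripMargin).show()
{code}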



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17616:


Assignee: Apache Spark  (was: Herman van Hovell)

> Getting "java.lang.RuntimeException: Distinct columns cannot exist in 
> Aggregate "
> -
>
> Key: SPARK-17616
> URL: https://issues.apache.org/jira/browse/SPARK-17616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Assignee: Apache Spark
>Priority: Minor
>
> I execute:
> {code}
> select platform, 
> collect_set(user_auth) as paid_types,
> count(distinct sessionid) as sessions
> from non_hss.session
> where
> event = 'stop' and platform != 'testplatform' and
> not (month = MONTH(current_date()) AND year = YEAR(current_date()) 
> and day = day(current_date())) and
> (
> (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = 
> YEAR(add_months(CURRENT_DATE(), -5)))
> OR
> (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > 
> YEAR(add_months(CURRENT_DATE(), -5)))
> )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate 
> operator containing aggregate functions which don't support partial 
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read the error 5 times, and read the code once. I still 
> don't understand what I'm doing incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511714#comment-15511714
 ] 

Apache Spark commented on SPARK-14536:
--

User 'sureshthalamati' has created a pull request for this issue:
https://github.com/apache/spark/pull/15192

> NPE in JDBCRDD when array column contains nulls (postgresql)
> 
>
> Key: SPARK-14536
> URL: https://issues.apache.org/jira/browse/SPARK-14536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jeremy Smith
>  Labels: NullPointerException
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
>  it is assumed that the JDBC driver will definitely return a non-null `Array` 
> object from the call to `getArray`, and that in the event of a null array it 
> will return a non-null `Array` object with a null underlying array.  But as 
> you can see here 
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
>  that isn't the case, at least for PostgreSQL.  This causes a 
> `NullPointerException` whenever an array column contains null values. It 
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but 
> even so there should be a null check in JDBCRDD.  I'm happy to submit a PR if 
> that would be helpful.
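A hedged sketch of the kind of null check being suggested (an illustrative helper, not Spark's actual JDBCRDD code): treat both a null {{java.sql.Array}} and a null underlying array as a SQL NULL.
{code}
import java.sql.ResultSet

// Hypothetical conversion helper for a string-array column at position `pos`.
def readStringArray(rs: ResultSet, pos: Int): Option[Seq[String]] = {
  val sqlArray = rs.getArray(pos)        // PostgreSQL returns null here for a NULL column value
  if (sqlArray == null) None
  else Option(sqlArray.getArray).map {   // the wrapped array itself may also be null
    case a: Array[AnyRef] => a.toSeq.map(v => if (v == null) null else v.toString)
  }
}
{code}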



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2016-09-21 Thread Suresh Thalamati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Thalamati reopened SPARK-14536:
--

SPARK-10186 added array data type support for Postgres in 1.6. The NPE issue 
still exists; I was able to reproduce it on master.

> NPE in JDBCRDD when array column contains nulls (postgresql)
> 
>
> Key: SPARK-14536
> URL: https://issues.apache.org/jira/browse/SPARK-14536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jeremy Smith
>  Labels: NullPointerException
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
>  it is assumed that the JDBC driver will definitely return a non-null `Array` 
> object from the call to `getArray`, and that in the event of a null array it 
> will return a non-null `Array` object with a null underlying array.  But as 
> you can see here 
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
>  that isn't the case, at least for PostgreSQL.  This causes a 
> `NullPointerException` whenever an array column contains null values. It 
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but 
> even so there should be a null check in JDBCRDD.  I'm happy to submit a PR if 
> that would be helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15717) Cannot perform RDD operations on a checkpointed VertexRDD.

2016-09-21 Thread Asher Krim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511727#comment-15511727
 ] 

Asher Krim commented on SPARK-15717:


Any update on this issue? We are experiencing ClassCastExceptions when using 
checkpointing and LDA with the EM optimizer.

> Cannot perform RDD operations on a checkpointed VertexRDD.
> --
>
> Key: SPARK-15717
> URL: https://issues.apache.org/jira/browse/SPARK-15717
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1
>Reporter: Anderson de Andrade
>
> A checkpointed (materialized) VertexRDD throws the following exception when 
> collected:
> bq. java.lang.ArrayStoreException: 
> org.apache.spark.graphx.impl.ShippableVertexPartition
> Can be replicated by running:
> {code:java}
> graph.vertices.checkpoint()
> graph.vertices.count() // materialize
> graph.vertices.collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14536:


Assignee: (was: Apache Spark)

> NPE in JDBCRDD when array column contains nulls (postgresql)
> 
>
> Key: SPARK-14536
> URL: https://issues.apache.org/jira/browse/SPARK-14536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jeremy Smith
>  Labels: NullPointerException
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
>  it is assumed that the JDBC driver will definitely return a non-null `Array` 
> object from the call to `getArray`, and that in the event of a null array it 
> will return a non-null `Array` object with a null underlying array.  But as 
> you can see here 
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
>  that isn't the case, at least for PostgreSQL.  This causes a 
> `NullPointerException` whenever an array column contains null values. It 
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but 
> even so there should be a null check in JDBCRDD.  I'm happy to submit a PR if 
> that would be helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-09-21 Thread Gary Gregory (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511730#comment-15511730
 ] 

Gary Gregory commented on SPARK-6305:
-

Hi,

My name is Gary Gregory ([~garydgregory]) and I am also a committer and PMC 
member of Apache Logging.

I'd be happy to help along with [~jvz] and [~mikaelstaldal] which I see have 
commented here already.

Gary

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14536:


Assignee: Apache Spark

> NPE in JDBCRDD when array column contains nulls (postgresql)
> 
>
> Key: SPARK-14536
> URL: https://issues.apache.org/jira/browse/SPARK-14536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jeremy Smith
>Assignee: Apache Spark
>  Labels: NullPointerException
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
>  it is assumed that the JDBC driver will definitely return a non-null `Array` 
> object from the call to `getArray`, and that in the event of a null array it 
> will return a non-null `Array` object with a null underlying array.  But as 
> you can see here 
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
>  that isn't the case, at least for PostgreSQL.  This causes a 
> `NullPointerException` whenever an array column contains null values. It 
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but 
> even so there should be a null check in JDBCRDD.  I'm happy to submit a PR if 
> that would be helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-09-21 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511767#comment-15511767
 ] 

Charles Allen commented on SPARK-6305:
--

Just FYI, as I found out recently, Kafka (at least 8.x) requires log4j on the 
classpath 
(http://mail-archives.apache.org/mod_mbox/kafka-users/201401.mbox/%3ccaa7ooca0+3sltognxaxwofysedkysfyqt0hs_a6r3jy...@mail.gmail.com%3E
 is the only other reference to this problem I could find). But the slf4j-log4j12 
bridge can at least be removed.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17629) Should ml Word2Vec findSynonyms match the mllib implementation?

2016-09-21 Thread Asher Krim (JIRA)
Asher Krim created SPARK-17629:
--

 Summary: Should ml Word2Vec findSynonyms match the mllib 
implementation?
 Key: SPARK-17629
 URL: https://issues.apache.org/jira/browse/SPARK-17629
 Project: Spark
  Issue Type: Question
Reporter: Asher Krim
Priority: Minor


ml Word2Vec's findSynonyms methods depart from mllib in that they return 
distributed results, rather than the results directly:

{code}
  def findSynonyms(word: String, num: Int): DataFrame = {
val spark = SparkSession.builder().getOrCreate()
spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", 
"similarity")
  }
{code}

What was the reason for this decision? I would think that most users would 
request a reasonably small number of results back, and want to use them 
directly on the driver, similar to the _take_ method on dataframes. Returning 
parallelized results creates a costly round trip for the data that doesn't seem 
necessary.
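For comparison, a minimal sketch of the non-distributed alternative being described (the method name here is invented; the underlying mllib model already exposes the call):
{code}
// Hypothetical local variant: return the top matches directly instead of a DataFrame.
def findSynonymsArray(word: String, num: Int): Array[(String, Double)] =
  wordVectors.findSynonyms(word, num)
{code}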

The original PR: https://github.com/apache/spark/pull/7263
[~MechCoder] - do you perhaps recall the reason?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2016-09-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511837#comment-15511837
 ] 

Hyukjin Kwon commented on SPARK-14536:
--

I see. I rushed to read this and didn't notice that this is actually a 
PostgreSQL-specific issue (I thought this JIRA described a general JDBC 
problem).
Yea, {{ArrayType}} seems to be supported only for {{PostgreSQL}} in Spark. Maybe we 
should link the related JIRAs to SPARK-8500 to prevent 
confusion.

> NPE in JDBCRDD when array column contains nulls (postgresql)
> 
>
> Key: SPARK-14536
> URL: https://issues.apache.org/jira/browse/SPARK-14536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jeremy Smith
>  Labels: NullPointerException
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
>  it is assumed that the JDBC driver will definitely return a non-null `Array` 
> object from the call to `getArray`, and that in the event of a null array it 
> will return a non-null `Array` object with a null underlying array.  But as 
> you can see here 
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
>  that isn't the case, at least for PostgreSQL.  This causes a 
> `NullPointerException` whenever an array column contains null values. It 
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but 
> even so there should be a null check in JDBCRDD.  I'm happy to submit a PR if 
> that would be helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8500) Support for array types in JDBCRDD

2016-09-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511840#comment-15511840
 ] 

Hyukjin Kwon commented on SPARK-8500:
-

I am leaving a note that {{ArrayType}} is supported for PostgreSQL via its dialect in 
[PostgresDialect.scala|https://github.com/apache/spark/blob/a133057ce5817f834babe9f25023092aec3c321d/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L47-L65].
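For readers who want the shape of that mapping, a simplified hedged sketch of a dialect that maps a database array type to Catalyst's {{ArrayType}} by type name (the object name is invented; see the linked PostgresDialect.scala for the real mapping):
{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.{ArrayType, DataType, MetadataBuilder, StringType}

object SketchArrayDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  // Map PostgreSQL's text[] (reported with the type name "_text") to ArrayType(StringType).
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (sqlType == Types.ARRAY && typeName == "_text") Some(ArrayType(StringType)) else None
}
{code}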

> Support for array types in JDBCRDD
> --
>
> Key: SPARK-8500
> URL: https://issues.apache.org/jira/browse/SPARK-8500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: MacOSX 10.10.3, Postgres 9.3.5, Spark 1.4 hadoop 2.6, 
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
> spark-shell --driver-class-path ./postgresql-9.3-1103.jdbc41.jar
>Reporter: michal pisanko
>
> Loading a table with a text[] column via sqlContext causes an error.
> sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://localhost/my_db", 
> "dbtable" -> "table"))
> Table has a column:
> my_col  | text[]  |
> Stacktrace: https://gist.github.com/8b163bf5fdc2aea7dbb6.git
> The same occurs in the pyspark shell.
> Loading another table without a text array column works all right.
> Possible hint:
> https://github.com/apache/spark/blob/d986fb9a378416248768828e6e6c7405697f9a5a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L57



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8500) Support for array types in JDBCRDD

2016-09-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511840#comment-15511840
 ] 

Hyukjin Kwon edited comment on SPARK-8500 at 9/22/16 2:04 AM:
--

I am leaving a note that {{ArrayType}} is supported for PostgreSQL via its dialect in 
[PostgresDialect.scala|https://github.com/apache/spark/blob/a133057ce5817f834babe9f25023092aec3c321d/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L43],
 keyed off the type name. See SPARK-10186.


was (Author: hyukjin.kwon):
I am leaving a note that PostgreSQL is supporting {{ArrayType}} as a dialect in 
[PostgresDialect.scala|https://github.com/apache/spark/blob/a133057ce5817f834babe9f25023092aec3c321d/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L47-L65].

> Support for array types in JDBCRDD
> --
>
> Key: SPARK-8500
> URL: https://issues.apache.org/jira/browse/SPARK-8500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: MacOSX 10.10.3, Postgres 9.3.5, Spark 1.4 hadoop 2.6, 
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
> spark-shell --driver-class-path ./postgresql-9.3-1103.jdbc41.jar
>Reporter: michal pisanko
>
> Loading a table with a text[] column via sqlContext causes an error.
> sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://localhost/my_db", 
> "dbtable" -> "table"))
> Table has a column:
> my_col  | text[]  |
> Stacktrace: https://gist.github.com/8b163bf5fdc2aea7dbb6.git
> The same occurs in the pyspark shell.
> Loading another table without a text array column works all right.
> Possible hint:
> https://github.com/apache/spark/blob/d986fb9a378416248768828e6e6c7405697f9a5a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L57



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17577) SparkR support add files to Spark job and get by executors

2016-09-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17577.
-
   Resolution: Fixed
 Assignee: Yanbo Liang
Fix Version/s: 2.1.0

> SparkR support add files to Spark job and get by executors
> --
>
> Key: SPARK-17577
> URL: https://issues.apache.org/jira/browse/SPARK-17577
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.1.0
>
>
> Scala/Python users can add files to Spark job by submit options {{--files}} 
> or {{SparkContext.addFile()}}. Meanwhile, users can get the added file by 
> {{SparkFiles.get(filename)}}.
> We should also support this function for SparkR users, since they also have 
> the requirements for some shared dependency files. For example, SparkR users 
> can download third party R packages to driver firstly, add these files to the 
> Spark job as dependency by this API and then each executor can install these 
> packages by {{install.packages}}.
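For reference, the existing Scala counterpart that this SparkR feature mirrors (a sketch assuming an active {{SparkContext}} named {{sc}}; the file path is illustrative):
{code}
import org.apache.spark.SparkFiles

sc.addFile("/path/to/my_r_package.tar.gz")            // ship a dependency file with the job
val localPath = SparkFiles.get("my_r_package.tar.gz") // resolve the local copy on the driver or an executor
{code}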



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2016-09-21 Thread kui xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511989#comment-15511989
 ] 

kui xiang commented on SPARK-13928:
---

Just copy org.apache.spark.internal.Logging into your project under the package 
org.apache.spark, and everything works.
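Alternatively, a hedged sketch of the "create a Logging trait themselves" option mentioned in the description, built directly on SLF4J rather than copying Spark's internal class (trait and method names are illustrative):
{code}
import org.slf4j.{Logger, LoggerFactory}

trait MyLogging {
  // Lazily resolve a logger named after the concrete class mixing in this trait.
  @transient protected lazy val log: Logger = LoggerFactory.getLogger(getClass.getName)

  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
  protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
  protected def logError(msg: => String, e: Throwable = null): Unit =
    if (e == null) log.error(msg) else log.error(msg, e)
}
{code}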

> Move org.apache.spark.Logging into org.apache.spark.internal.Logging
> 
>
> Key: SPARK-13928
> URL: https://issues.apache.org/jira/browse/SPARK-13928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> Logging was made private in Spark 2.0. If we move it, then users would be 
> able to create a Logging trait themselves to avoid changing their own code. 
> Alternatively, we can also provide in a compatibility package that adds 
> logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0

2016-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17622:
--
Target Version/s:   (was: 2.0.0)
   Fix Version/s: (was: 1.6.2)
  (was: 1.6.1)
 Component/s: (was: Java API)
  SparkR

This doesn't actually show the underlying error.

> Cannot run create or load DF on Windows- Spark 2.0.0
> 
>
> Key: SPARK-17622
> URL: https://issues.apache.org/jira/browse/SPARK-17622
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
> Environment: windows 10
> R 3.3.1
> RStudio 1.0.20
>Reporter: renzhi he
>  Labels: windows
>
> Under Spark 2.0.0 on Windows, when I try to load or create data with code 
> similar to the snippets below, I get an error message and cannot execute the 
> functions.
> |sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = 
> "2g")) |
> |df <- as.DataFrame(faithful) |
> Here is the error message:
> #Error in invokeJava(isStatic = TRUE, className, methodName, ...) :   
>  
> #java.lang.reflect.InvocationTargetException
> #at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> #at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> #at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> #at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> #at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> #at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> #at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
> #at org.apache.spark.sql.hive.HiveSharedSt
> However, under Spark 1.6.1 or 1.6.2, running the equivalent code works without 
> problems.
> |sc1 <- sparkR.init(master = "local", sparkEnvir = 
> list(spark.driver.memory="2g"))|
> |sqlContext <- sparkRSQL.init(sc1)|
> |df <- as.DataFrame(sqlContext, faithful)|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17627) Streaming Providers should be labeled Experimental

2016-09-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17627.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> Streaming Providers should be labeled Experimental
> --
>
> Key: SPARK-17627
> URL: https://issues.apache.org/jira/browse/SPARK-17627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 2.0.1, 2.1.0
>
>
> All of structured streaming is experimental, but we missed the annotation on 
> two of the APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17494) Floor/ceil of decimal returns wrong result if it's in compact format

2016-09-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17494.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> Floor/ceil of decimal returns wrong result if it's in compact format
> 
>
> Key: SPARK-17494
> URL: https://issues.apache.org/jira/browse/SPARK-17494
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Gokhan Civan
>Assignee: Davies Liu
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> If you create tables as follows:
> create table a as select 'A' as str, cast(10.5 as decimal(15,6)) as num;
> create table b as select 'A' as str;
> Then
> select floor(num) from a;
> returns 10
> but
> select floor(num) from a join b on a.str = b.str;
> returns 11



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0

2016-09-21 Thread renzhi he (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15512050#comment-15512050
 ] 

renzhi he commented on SPARK-17622:
---

Hi Sean,


Sorry, I just stepped into this field.

For this "bug", I just added a spark.sql.warehouse.dir="my/own/drive" in my 
sparkConfig list, and then my spark worked.

Just new to spark and R, so I am still confused to few things, I will spend 
more on reading the official docs, and sorry for bothering you :)

Best wishes,
Renzhi

 

> Cannot run create or load DF on Windows- Spark 2.0.0
> 
>
> Key: SPARK-17622
> URL: https://issues.apache.org/jira/browse/SPARK-17622
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
> Environment: windows 10
> R 3.3.1
> RStudio 1.0.20
>Reporter: renzhi he
>  Labels: windows
>
> Under Spark 2.0.0 on Windows, when I try to load or create data with code 
> similar to the snippets below, I get an error message and cannot execute the 
> functions.
> |sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = 
> "2g")) |
> |df <- as.DataFrame(faithful) |
> Here is the error message:
> #Error in invokeJava(isStatic = TRUE, className, methodName, ...) :   
>  
> #java.lang.reflect.InvocationTargetException
> #at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> #at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> #at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> #at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> #at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> #at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> #at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
> #at org.apache.spark.sql.hive.HiveSharedSt
> However, under Spark 1.6.1 or 1.6.2, running the equivalent code works without 
> problems.
> |sc1 <- sparkR.init(master = "local", sparkEnvir = 
> list(spark.driver.memory="2g"))|
> |sqlContext <- sparkRSQL.init(sc1)|
> |df <- as.DataFrame(sqlContext, faithful)|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14709) spark.ml API for linear SVM

2016-09-21 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15512085#comment-15512085
 ] 

Yanbo Liang commented on SPARK-14709:
-

[~yuhaoyan] Any update on this? I think providing a DataFrame-based SVM 
algorithm is very important to users, so it would be good to get it in ASAP. I'd 
like to get the OWLQN + hinge loss implementation in first, and discuss an SMO 
version later. Like [~mlnick] said, it's better to gather more performance 
numbers and use cases for an SMO impl, and it's not very hard to add a new 
internal implementation once we have the basic SVM API. I saw you already have 
an implementation with OWLQN and hinge loss; could you send the PR? If you are 
busy with other things, I can help, and you would still be the primary author of 
this PR. Thanks!
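For context (a hedged aside, not a claim about the eventual PR's exact objective): "hinge loss" refers to the standard SVM formulation below for labels y_i in {-1, +1}, which OWL-QN, an L-BFGS-family optimizer, would minimize together with a regularization term R(w):
{noformat}
\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\, 1 - y_i\, w^{\top} x_i\right) \; + \; \lambda\, R(w), \qquad y_i \in \{-1, +1\}
{noformat}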

> spark.ml API for linear SVM
> ---
>
> Key: SPARK-14709
> URL: https://issues.apache.org/jira/browse/SPARK-14709
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Provide API for SVM algorithm for DataFrames.  I would recommend using 
> OWL-QN, rather than wrapping spark.mllib's SGD-based implementation.
> The API should mimic existing spark.ml.classification APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14709) spark.ml API for linear SVM

2016-09-21 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15512085#comment-15512085
 ] 

Yanbo Liang edited comment on SPARK-14709 at 9/22/16 4:27 AM:
--

[~yuhaoyan] Any update on this? I think providing a DataFrame-based SVM 
algorithm is very important to users, so it would be good to get it in ASAP. I'd 
like to get the OWLQN + hinge loss implementation in first, and discuss an SMO 
version later. Like [~mlnick] said, it's better to gather more performance 
numbers and use cases for an SMO impl, and it's not very hard to add a new 
internal implementation once we have the basic SVM API. I saw you already have 
an implementation with OWLQN and hinge loss; could you send the PR? If you are 
busy with other things, I can help, and you would still be the primary author of 
this PR. Thanks!


was (Author: yanboliang):
[~yuhaoyan] Any update about this? I think providing DataFrame-based SVM 
algorithm is very important to users, so it's better we can get it in ASAP. I'd 
like to get in the implementation with OWLQN and Hinge loss firstly, and to 
discuss SMO version later. Like [~mlnick] said, it's better to get more 
performance number and user case of SMO impl. And it's not very hard to add a 
new internal implementation after we have the basic SVM API. I saw you have a 
implementation with OWLQN and Hinge loss already, could you send the PR? If you 
are busy with other things, I can help and you are still the primary author of 
this PR. Thanks!

> spark.ml API for linear SVM
> ---
>
> Key: SPARK-14709
> URL: https://issues.apache.org/jira/browse/SPARK-14709
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Provide API for SVM algorithm for DataFrames.  I would recommend using 
> OWL-QN, rather than wrapping spark.mllib's SGD-based implementation.
> The API should mimic existing spark.ml.classification APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17609) SessionCatalog.tableExists should not check temp view

2016-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17609.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15160
[https://github.com/apache/spark/pull/15160]

> SessionCatalog.tableExists should not check temp view
> -
>
> Key: SPARK-17609
> URL: https://issues.apache.org/jira/browse/SPARK-17609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14709) spark.ml API for linear SVM

2016-09-21 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15512158#comment-15512158
 ] 

yuhao yang commented on SPARK-14709:


Thanks [~yanboliang] for picking this up. I'll try to send a PR tomorrow and we 
can work together on it. Thanks.

> spark.ml API for linear SVM
> ---
>
> Key: SPARK-14709
> URL: https://issues.apache.org/jira/browse/SPARK-14709
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Provide API for SVM algorithm for DataFrames.  I would recommend using 
> OWL-QN, rather than wrapping spark.mllib's SGD-based implementation.
> The API should mimic existing spark.ml.classification APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReuseExchange work in text format table

2016-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17425.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14988
[https://github.com/apache/spark/pull/14988]

> Override sameResult in HiveTableScanExec to make ReuseExchange work in text 
> format table
> 
>
> Key: SPARK-17425
> URL: https://issues.apache.org/jira/browse/SPARK-17425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yadong Qi
> Fix For: 2.1.0
>
>
> When I run the SQL below (table src is in text format):
> {code:sql}
> SELECT * FROM src t1
> JOIN src t2 ON t1.key = t2.key
> JOIN src t3 ON t1.key = t3.key;
> {code}
> The PhysicalPlan doesn't contain *ReuseExchange*. When I use src_pqt (parquet 
> format) instead of src (text format), the PhysicalPlan does contain 
> *ReuseExchange*.
> I found that *sameResult* is already overridden in *FileSourceScanExec*, but not 
> in *HiveTableScanExec*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReuseExchange work in text format table

2016-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17425:

Assignee: Yadong Qi

> Override sameResult in HiveTableScanExec to make ReuseExchange work in text 
> format table
> 
>
> Key: SPARK-17425
> URL: https://issues.apache.org/jira/browse/SPARK-17425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yadong Qi
>Assignee: Yadong Qi
> Fix For: 2.1.0
>
>
> When I run the SQL below (table src is in text format):
> {code:sql}
> SELECT * FROM src t1
> JOIN src t2 ON t1.key = t2.key
> JOIN src t3 ON t1.key = t3.key;
> {code}
> The PhysicalPlan doesn't contain *ReuseExchange*. When I use src_pqt (parquet 
> format) instead of src (text format), the PhysicalPlan does contain 
> *ReuseExchange*.
> I found that *sameResult* is already overridden in *FileSourceScanExec*, but not 
> in *HiveTableScanExec*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17492) Reading Cataloged Data Sources without Extending SchemaRelationProvider

2016-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17492:

Assignee: Xiao Li

> Reading Cataloged Data Sources without Extending SchemaRelationProvider
> ---
>
> Key: SPARK-17492
> URL: https://issues.apache.org/jira/browse/SPARK-17492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> For data sources that do not extend `SchemaRelationProvider`, we expect users 
> not to specify schemas when they create tables. If a schema is supplied by the 
> user, an exception is issued. 
> Since Spark 2.1, to avoid inferring the schema every time, we store the schema 
> of any data source in the metastore catalog. Thus, when reading a cataloged 
> data source table, the schema can be read from the metastore catalog. In this 
> case, we also get an exception. For example, 
> {noformat}
> sql(
>   s"""
>  |CREATE TABLE relationProvierWithSchema
>  |USING org.apache.spark.sql.sources.SimpleScanSource
>  |OPTIONS (
>  |  From '1',
>  |  To '10'
>  |)
>""".stripMargin)
> spark.table(tableName).show()
> {noformat}
> {noformat}
> org.apache.spark.sql.sources.SimpleScanSource does not allow user-specified 
> schemas.;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17492) Reading Cataloged Data Sources without Extending SchemaRelationProvider

2016-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17492.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15046
[https://github.com/apache/spark/pull/15046]

> Reading Cataloged Data Sources without Extending SchemaRelationProvider
> ---
>
> Key: SPARK-17492
> URL: https://issues.apache.org/jira/browse/SPARK-17492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
> Fix For: 2.1.0
>
>
> For data sources that do not extend `SchemaRelationProvider`, we expect users 
> not to specify schemas when they create tables. If a schema is supplied by the 
> user, an exception is issued. 
> Since Spark 2.1, to avoid inferring the schema every time, we store the schema 
> of any data source in the metastore catalog. Thus, when reading a cataloged 
> data source table, the schema can be read from the metastore catalog. In this 
> case, we also get an exception. For example, 
> {noformat}
> sql(
>   s"""
>  |CREATE TABLE relationProvierWithSchema
>  |USING org.apache.spark.sql.sources.SimpleScanSource
>  |OPTIONS (
>  |  From '1',
>  |  To '10'
>  |)
>""".stripMargin)
> spark.table(tableName).show()
> {noformat}
> {noformat}
> org.apache.spark.sql.sources.SimpleScanSource does not allow user-specified 
> schemas.;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17624) Flaky test? StateStoreSuite maintenance

2016-09-21 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15512237#comment-15512237
 ] 

Saisai Shao commented on SPARK-17624:
-

I cannot reproduce locally on my 

> Flaky test? StateStoreSuite maintenance
> ---
>
> Key: SPARK-17624
> URL: https://issues.apache.org/jira/browse/SPARK-17624
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.0.1
>Reporter: Adam Roberts
>Priority: Minor
>
> I've noticed this test failing consistently (25x in a row) on a two-core 
> machine but not on an eight-core machine.
> If we increase the spark.rpc.numRetries value used in the test from 1 to 2 (3 
> being the default in Spark), the test reliably passes; we can also gain 
> reliability by setting the master to anything other than just local.
> Is there a reason spark.rpc.numRetries is set to 1?
> I see this failure is also mentioned here, so it's been flaky for a while: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC5-td18367.html
> If we run without the "quietly" code so that we get debug info:
> {code}
> 16:26:15.213 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error 
> sending message [message = 
> VerifyIfInstanceActive(StateStoreId(/home/aroberts/Spark-DK/sql/core/target/tmp/spark-cc44f5fa-b675-426f-9440-76785c365507/ૺꎖ鮎衲넅-28e9196f-8b2d-43ba-8421-44a5c5e98ceb,0,0),driver)]
>  in 1 attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:91)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$verifyIfStoreInstanceActive(StateStore.scala:227)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:199)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:197)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance(StateStore.scala:197)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anon$1.run(StateStore.scala:180)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:319)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:191)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(Thread.java:785)
> Caused by: org.apache.spark.SparkException: Could not find 
> StateStoreCoordinator.
> at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
> at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:129)
> at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:225)
> at 
> org.ap
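
As an illustration of the workaround described in the report, a minimal sketch assuming the suite builds its own SparkConf (the property names are real Spark configuration keys, but this is not the suite's actual setup code and the app name is made up):

{code}
import org.apache.spark.SparkConf

// Sketch of the reporter's workaround, not StateStoreSuite's real code:
// raise the RPC retry count and avoid a bare "local" master.
val conf = new SparkConf()
  .setMaster("local[2]")                 // anything other than plain "local"
  .setAppName("StateStoreSuite-repro")   // app name is illustrative
  .set("spark.rpc.numRetries", "2")      // the test currently uses 1; Spark's default is 3
{code}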

[jira] [Comment Edited] (SPARK-17624) Flaky test? StateStoreSuite maintenance

2016-09-21 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15512237#comment-15512237
 ] 

Saisai Shao edited comment on SPARK-17624 at 9/22/16 5:36 AM:
--

I cannot reproduce this locally on my Mac laptop. Maybe your test machine is 
not powerful enough to handle this unit test?


was (Author: jerryshao):
I cannot reproduce locally on my 

> Flaky test? StateStoreSuite maintenance
> ---
>
> Key: SPARK-17624
> URL: https://issues.apache.org/jira/browse/SPARK-17624
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.0.1
>Reporter: Adam Roberts
>Priority: Minor
>
> I've noticed this test failing consistently (25x in a row) on a two-core 
> machine but not on an eight-core machine.
> If we increase the spark.rpc.numRetries value used in the test from 1 to 2 (3 
> being the default in Spark), the test reliably passes; we can also gain 
> reliability by setting the master to anything other than just local.
> Is there a reason spark.rpc.numRetries is set to 1?
> I see this failure is also mentioned here, so it's been flaky for a while: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC5-td18367.html
> If we run without the "quietly" code so that we get debug info:
> {code}
> 16:26:15.213 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error 
> sending message [message = 
> VerifyIfInstanceActive(StateStoreId(/home/aroberts/Spark-DK/sql/core/target/tmp/spark-cc44f5fa-b675-426f-9440-76785c365507/ૺꎖ鮎衲넅-28e9196f-8b2d-43ba-8421-44a5c5e98ceb,0,0),driver)]
>  in 1 attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:91)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$verifyIfStoreInstanceActive(StateStore.scala:227)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:199)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:197)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance(StateStore.scala:197)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anon$1.run(StateStore.scala:180)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:319)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:191)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(Thread.java:785)
> Caused by: org.apache.spark.SparkException: Could not find 
> StateStoreCoordinator.
> at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
>   

[jira] [Created] (SPARK-17630) jvm-exit-on-fatal-error for spark.rpc.netty like there is available for akka

2016-09-21 Thread Mario Briggs (JIRA)
Mario Briggs created SPARK-17630:


 Summary: jvm-exit-on-fatal-error for spark.rpc.netty like there is 
available for akka
 Key: SPARK-17630
 URL: https://issues.apache.org/jira/browse/SPARK-17630
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Mario Briggs


Hi,

I have 2 code paths in my app that result in a JVM OOM. 

In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts down 
the JVM, so the caller (py4J) gets notified with a proper stack trace.

In the 2nd code path (rpc.netty), no such handler kicks in to shut down the 
JVM, so the caller does not get notified.

Is it possible to have a JVM exit handler for the rpc.netty path?

First code path trace
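
A minimal sketch of the kind of handler being asked for on the rpc.netty path. This is not Spark's code; the object name, exit code and installation point are assumptions made only for illustration:

{code}
// Sketch only: exit the JVM on fatal errors (e.g. OutOfMemoryError) so that a
// caller such as py4J observes the failure, mirroring what
// 'akka.jvm-exit-on-fatal-error' provides on the akka path.
object ExitOnFatalError extends Thread.UncaughtExceptionHandler {
  override def uncaughtException(t: Thread, e: Throwable): Unit = e match {
    case _: OutOfMemoryError =>
      e.printStackTrace()
      Runtime.getRuntime.halt(52)   // exit code chosen arbitrarily for the sketch
    case other =>
      other.printStackTrace()
  }
}

// Hypothetical installation point, e.g. when the netty RPC environment starts:
Thread.setDefaultUncaughtExceptionHandler(ExitOnFatalError)
{code}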
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17630) jvm-exit-on-fatal-error for spark.rpc.netty like there is available for akka

2016-09-21 Thread Mario Briggs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mario Briggs updated SPARK-17630:
-
Description: 
Hi,

I have 2 code paths in my app that result in a JVM OOM. 

In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts down 
the JVM, so the caller (py4J) gets notified with a proper stack trace.

In the 2nd code path (rpc.netty), no such handler kicks in to shut down the 
JVM, so the caller does not get notified.

Is it possible to have a JVM exit handler for the rpc.netty path?

First code path trace file - 
 

  was:
Hi,

I have 2 code paths in my app that result in a JVM OOM. 

In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts down 
the JVM, so the caller (py4J) gets notified with a proper stack trace.

In the 2nd code path (rpc.netty), no such handler kicks in to shut down the 
JVM, so the caller does not get notified.

Is it possible to have a JVM exit handler for the rpc.netty path?

First code path trace
 


> jvm-exit-on-fatal-error for spark.rpc.netty like there is available for akka
> 
>
> Key: SPARK-17630
> URL: https://issues.apache.org/jira/browse/SPARK-17630
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Mario Briggs
>
> Hi,
> I have 2 code paths in my app that result in a JVM OOM. 
> In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts 
> down the JVM, so the caller (py4J) gets notified with a proper stack trace.
> In the 2nd code path (rpc.netty), no such handler kicks in to shut down the 
> JVM, so the caller does not get notified.
> Is it possible to have a JVM exit handler for the rpc.netty path?
> First code path trace file - 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17630) jvm-exit-on-fatal-error for spark.rpc.netty like there is available for akka

2016-09-21 Thread Mario Briggs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mario Briggs updated SPARK-17630:
-
Attachment: firstCodepath.txt

> jvm-exit-on-fatal-error for spark.rpc.netty like there is available for akka
> 
>
> Key: SPARK-17630
> URL: https://issues.apache.org/jira/browse/SPARK-17630
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Mario Briggs
> Attachments: firstCodepath.txt
>
>
> Hi,
> I have 2 code paths in my app that result in a JVM OOM. 
> In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts 
> down the JVM, so the caller (py4J) gets notified with a proper stack trace.
> In the 2nd code path (rpc.netty), no such handler kicks in to shut down the 
> JVM, so the caller does not get notified.
> Is it possible to have a JVM exit handler for the rpc.netty path?
> First code path trace file - 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17630) jvm-exit-on-fatal-error for spark.rpc.netty like there is available for akka

2016-09-21 Thread Mario Briggs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mario Briggs updated SPARK-17630:
-
Attachment: SecondCodePath.txt

> jvm-exit-on-fatal-error for spark.rpc.netty like there is available for akka
> 
>
> Key: SPARK-17630
> URL: https://issues.apache.org/jira/browse/SPARK-17630
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Mario Briggs
> Attachments: SecondCodePath.txt, firstCodepath.txt
>
>
> Hi,
> I have 2 code paths in my app that result in a JVM OOM. 
> In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts 
> down the JVM, so the caller (py4J) gets notified with a proper stack trace.
> In the 2nd code path (rpc.netty), no such handler kicks in to shut down the 
> JVM, so the caller does not get notified.
> Is it possible to have a JVM exit handler for the rpc.netty path?
> First code path trace file - 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17630) jvm-exit-on-fatal-error for spark.rpc.netty like there is available for akka

2016-09-21 Thread Mario Briggs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mario Briggs updated SPARK-17630:
-
Description: 
Hi,

I have 2 code paths in my app that result in a JVM OOM. 

In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts down 
the JVM, so the caller (py4J) gets notified with a proper stack trace. 
Attached stack-trace file (firstCodepath.txt)

In the 2nd code path (rpc.netty), no such handler kicks in to shut down the 
JVM, so the caller does not get notified. 
Attached stack-trace file (SecondCodepath.txt)

Is it possible to have a JVM exit handler for the rpc.netty path?


 

  was:
Hi,

I have 2 code paths in my app that result in a JVM OOM. 

In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts down 
the JVM, so the caller (py4J) gets notified with a proper stack trace.

In the 2nd code path (rpc.netty), no such handler kicks in to shut down the 
JVM, so the caller does not get notified.

Is it possible to have a JVM exit handler for the rpc.netty path?

First code path trace file - 
 


> jvm-exit-on-fatal-error for spark.rpc.netty like there is available for akka
> 
>
> Key: SPARK-17630
> URL: https://issues.apache.org/jira/browse/SPARK-17630
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Mario Briggs
> Attachments: SecondCodePath.txt, firstCodepath.txt
>
>
> Hi,
> I have 2 code paths in my app that result in a JVM OOM. 
> In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts 
> down the JVM, so the caller (py4J) gets notified with a proper stack trace. 
> Attached stack-trace file (firstCodepath.txt)
> In the 2nd code path (rpc.netty), no such handler kicks in to shut down the 
> JVM, so the caller does not get notified. 
> Attached stack-trace file (SecondCodepath.txt)
> Is it possible to have a JVM exit handler for the rpc.netty path?
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17630) jvm-exit-on-fatal-error handler for spark.rpc.netty like there is available for akka

2016-09-21 Thread Mario Briggs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mario Briggs updated SPARK-17630:
-
Summary: jvm-exit-on-fatal-error handler for spark.rpc.netty like there is 
available for akka  (was: jvm-exit-on-fatal-error for spark.rpc.netty like 
there is available for akka)

> jvm-exit-on-fatal-error handler for spark.rpc.netty like there is available 
> for akka
> 
>
> Key: SPARK-17630
> URL: https://issues.apache.org/jira/browse/SPARK-17630
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Mario Briggs
> Attachments: SecondCodePath.txt, firstCodepath.txt
>
>
> Hi,
> I have 2 code paths in my app that result in a JVM OOM. 
> In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts 
> down the JVM, so the caller (py4J) gets notified with a proper stack trace. 
> Attached stack-trace file (firstCodepath.txt)
> In the 2nd code path (rpc.netty), no such handler kicks in to shut down the 
> JVM, so the caller does not get notified. 
> Attached stack-trace file (SecondCodepath.txt)
> Is it possible to have a JVM exit handler for the rpc.netty path?
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17631) Structured Streaming - Add Http Stream Sink

2016-09-21 Thread zhangxinyu (JIRA)
zhangxinyu created SPARK-17631:
--

 Summary: Structured Streaming - Add Http Stream Sink
 Key: SPARK-17631
 URL: https://issues.apache.org/jira/browse/SPARK-17631
 Project: Spark
  Issue Type: New Feature
  Components: SQL, Streaming
Affects Versions: 2.0.0
Reporter: zhangxinyu
Priority: Minor
 Fix For: 2.0.0


Streaming query results can be sunk to an HTTP server through HTTP POST requests.
github: https://github.com/apache/spark/pull/15194
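
As a rough illustration of the idea only (this is not the implementation in the linked pull request), a similar effect can already be sketched with the existing foreach sink; the endpoint URL, payload format and class name below are assumptions:

{code}
import java.net.{HttpURLConnection, URL}
import org.apache.spark.sql.{ForeachWriter, Row}

// Sketch: POST each streaming result row to an HTTP endpoint.
class HttpPostWriter(endpoint: String) extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(row: Row): Unit = {
    val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.getOutputStream.write(row.mkString(",").getBytes("UTF-8"))  // payload format is illustrative
    conn.getResponseCode                                             // trigger the request
    conn.disconnect()
  }

  override def close(errorOrNull: Throwable): Unit = ()
}

// Hypothetical usage on a streaming DataFrame `df`:
// df.writeStream.foreach(new HttpPostWriter("http://example.com/ingest")).start()
{code}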



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13331) AES support for over-the-wire encryption

2016-09-21 Thread Junjie Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15512340#comment-15512340
 ] 

Junjie Chen commented on SPARK-13331:
-

Hi [~vanzin]

Could you help to review the code?

> AES support for over-the-wire encryption
> 
>
> Key: SPARK-13331
> URL: https://issues.apache.org/jira/browse/SPARK-13331
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Dong Chen
>Priority: Minor
>
> In network/common, SASL with DIGEST-MD5 authentication is used for 
> negotiating a secure communication channel. When the SASL operation mode is 
> "auth-conf", the data transferred on the network is encrypted. The DIGEST-MD5 
> mechanism supports the following encryption algorithms: 3DES, DES, and RC4. 
> The negotiation procedure will select one of them to encrypt / decrypt the 
> data on the channel.
> However, 3DES and RC4 are relatively slow. We could add code to the 
> negotiation to support AES for better security and performance.
> The proposed solution is:
> When "auth-conf" is enabled, at the end of the original negotiation the 
> authentication succeeds and a secure channel is built. We could add one more 
> negotiation step: the client and server negotiate whether they both support 
> AES. If yes, the key and IV used by AES will be generated by the server and 
> sent to the client through the already secure channel. Then the encryption / 
> decryption handlers are updated to AES on both the client and server side. 
> Subsequent data transfer will use AES instead of the original encryption 
> algorithm.
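
To make the extra negotiation step concrete, a small sketch of the server-side key material generation using the JDK crypto APIs. This is illustrative only; the key size and cipher mode are assumptions, not necessarily what the proposed patch uses:

{code}
import java.security.SecureRandom
import javax.crypto.{Cipher, KeyGenerator}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// Server side (sketch): generate the AES key and IV to be sent to the client
// over the already-established SASL "auth-conf" channel.
val keyGen = KeyGenerator.getInstance("AES")
keyGen.init(128)                               // key size is an assumption
val key = keyGen.generateKey()
val iv = new Array[Byte](16)
new SecureRandom().nextBytes(iv)

// Both sides would then switch their channel handlers to AES, e.g.:
val cipher = Cipher.getInstance("AES/CTR/NoPadding")   // mode is an assumption
cipher.init(Cipher.ENCRYPT_MODE,
  new SecretKeySpec(key.getEncoded, "AES"), new IvParameterSpec(iv))
{code}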



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17625) expectedOutputAttributes should be set when converting SimpleCatalogRelation to LogicalRelation

2016-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17625:

Assignee: Zhenhua Wang

> expectedOutputAttributes should be set when converting SimpleCatalogRelation 
> to LogicalRelation
> ---
>
> Key: SPARK-17625
> URL: https://issues.apache.org/jira/browse/SPARK-17625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Minor
> Fix For: 2.1.0
>
>
> expectedOutputAttributes should be set when converting a SimpleCatalogRelation 
> to a LogicalRelation; otherwise the outputs of the LogicalRelation differ from 
> the outputs of the SimpleCatalogRelation - they have different exprIds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17625) expectedOutputAttributes should be set when converting SimpleCatalogRelation to LogicalRelation

2016-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17625.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15182
[https://github.com/apache/spark/pull/15182]

> expectedOutputAttributes should be set when converting SimpleCatalogRelation 
> to LogicalRelation
> ---
>
> Key: SPARK-17625
> URL: https://issues.apache.org/jira/browse/SPARK-17625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Zhenhua Wang
>Priority: Minor
> Fix For: 2.1.0
>
>
> expectedOutputAttributes should be set when converting a SimpleCatalogRelation 
> to a LogicalRelation; otherwise the outputs of the LogicalRelation differ from 
> the outputs of the SimpleCatalogRelation - they have different exprIds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14709) spark.ml API for linear SVM

2016-09-21 Thread Tae Jun Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15512391#comment-15512391
 ] 

Tae Jun Kim commented on SPARK-14709:
-

Cheer up, guys! I'm looking forward to a DataFrame-based SVM :-)
Thanks as always for the implementations!

> spark.ml API for linear SVM
> ---
>
> Key: SPARK-14709
> URL: https://issues.apache.org/jira/browse/SPARK-14709
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Provide an API for the SVM algorithm on DataFrames.  I would recommend using 
> OWL-QN rather than wrapping spark.mllib's SGD-based implementation.
> The API should mimic the existing spark.ml.classification APIs.
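
For context, a sketch of the existing spark.ml.classification estimator/transformer pattern the new algorithm would be expected to follow. LogisticRegression stands in here only because the DataFrame-based SVM does not exist yet; `training` and `test` are assumed DataFrames with "label"/"features" columns:

{code}
import org.apache.spark.ml.classification.LogisticRegression

// Existing pattern that a DataFrame-based SVM estimator would mimic
// (LogisticRegression is only a stand-in for the future SVM class).
val estimator = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(100)

// Hypothetical usage, given DataFrames `training` and `test`:
// val model = estimator.fit(training)
// val predictions = model.transform(test)   // adds prediction columns
{code}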



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


