[jira] [Updated] (SPARK-23766) Not able to execute multiple queries in spark structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apeksha Agnihotri updated SPARK-23766: -- Component/s: (was: Spark Core) Structured Streaming
> Not able to execute multiple queries in spark structured streaming
> --
>
> Key: SPARK-23766
> URL: https://issues.apache.org/jira/browse/SPARK-23766
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Apeksha Agnihotri
>Priority: Major
>
> I am able to receive output of the first query (i.e. the reader) only, although all the queries show up as running in the logs. No data is stored in HDFS either.
>
> {code:java}
> public class A extends D implements Serializable {
>     public Dataset<Row> getDataSet(SparkSession session) {
>         Dataset<Row> dfs = session.readStream().format("socket")
>                 .option("host", hostname).option("port", port).load();
>         publish(dfs.toDF(), "reader");
>         return dfs;
>     }
> }
>
> public class B extends D implements Serializable {
>     public Dataset<Row> execute(Dataset<Row> ds) {
>         Dataset<Row> d = ds.select(functions.explode(functions.split(ds.col("value"), "\\s+")));
>         publish(d.toDF(), "component");
>         return d;
>     }
> }
>
> public class C extends D implements Serializable {
>     public Dataset<Row> execute(Dataset<Row> ds) {
>         publish(ds.toDF(), "console");
>         ds.writeStream().format("csv").option("path", "hdfs://hostname:9000/user/abc/data1/")
>                 .option("checkpointLocation", "hdfs://hostname:9000/user/abc/cp")
>                 .outputMode("append").start();
>         return ds;
>     }
> }
>
> public class D {
>     protected String hostname; // socket source host, set elsewhere
>     protected int port;        // socket source port, set elsewhere
>
>     public void publish(Dataset<Row> dataset, String name) {
>         dataset.writeStream().format("csv").queryName(name)
>                 .option("path", "hdfs://hostname:9000/user/abc/" + name)
>                 .option("checkpointLocation", "hdfs://hostname:9000/user/abc/checkpoint/" + name)
>                 .outputMode("append").start();
>     }
> }
>
> public static void main(String[] args) {
>     SparkSession session = createSession();
>     try {
>         A a = new A();
>         Dataset<Row> records = a.getDataSet(session);
>         B b = new B();
>         Dataset<Row> ds = b.execute(records);
>         C c = new C();
>         c.execute(ds);
>         session.streams().awaitAnyTermination();
>     } catch (StreamingQueryException e) {
>         e.printStackTrace();
>     }
> }
> {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
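For reference, a minimal sketch of the multi-query pattern discussed above is shown below. This is not the reporter's application: the host, port and HDFS paths are placeholders, and the point is only that each started query needs its own queryName and, in particular, its own checkpointLocation so the sinks can make progress independently.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.streaming.StreamingQueryException;

public class MultiQueryExample {
  public static void main(String[] args) throws StreamingQueryException {
    SparkSession session = SparkSession.builder().appName("multi-query").getOrCreate();

    // Single socket input (placeholder host/port).
    Dataset<Row> lines = session.readStream()
        .format("socket").option("host", "hostname").option("port", 9999).load();

    Dataset<Row> words = lines.select(
        functions.explode(functions.split(lines.col("value"), "\\s+")).as("word"));

    // Query 1: raw lines, with its own output path and checkpoint directory.
    lines.writeStream().format("csv").queryName("reader")
        .option("path", "hdfs://hostname:9000/user/abc/reader")
        .option("checkpointLocation", "hdfs://hostname:9000/user/abc/checkpoint/reader")
        .outputMode("append").start();

    // Query 2: exploded words, with a different output path and checkpoint directory.
    words.writeStream().format("csv").queryName("component")
        .option("path", "hdfs://hostname:9000/user/abc/component")
        .option("checkpointLocation", "hdfs://hostname:9000/user/abc/checkpoint/component")
        .outputMode("append").start();

    // Block until any of the started queries terminates.
    session.streams().awaitAnyTermination();
  }
}
{code}

Note that the socket source is intended for testing only, and each running query instantiates its own source (and therefore its own socket connection), which may explain why only one query appears to receive data.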
[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem
[ https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415093#comment-16415093 ] Davide Isoardi commented on SPARK-23739: We know that the Kafka 0.8 client does not have this class, but the Druid package does. Is it possible that this issue is caused by the Druid package? If so, can one not install Druid and still use Structured Streaming to get data from Kafka? > Spark structured streaming long running problem > --- > > Key: SPARK-23739 > URL: https://issues.apache.org/jira/browse/SPARK-23739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Florencio >Priority: Critical > Labels: spark, streaming, structured > > I had a problem with long running spark structured streaming in spark 2.1. > Caused by: java.lang.ClassNotFoundException: > org.apache.kafka.common.requests.LeaveGroupResponse. > The detailed error is the following: > 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. > Metadata OffsetSeqMetadata(0,1521216656590) > 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = > Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = > \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}} > 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map() > 18/03/16 16:10:57 ERROR StreamExecution: Query [id = > a233b9ff-cc39-44d3-b953-a255986c04bf, runId = > 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error > java.util.zip.ZipException: invalid code lengths set > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45) > at > org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2101) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.map(RDD.scala:369) > at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator > java.lang.NoClassDefFoundError: > org/apache/kafka/common/requests/LeaveGroupResponse > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377) > at org.apache.kafka.clients.ClientUtils.closeQuietly(ClientUtils.java:66) > at > org.apache.kafka.clients.consu
[jira] [Updated] (SPARK-1359) SGD implementation is not efficient
[ https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-1359: --- Remaining Estimate: (was: 168h) Original Estimate: (was: 168h) > SGD implementation is not efficient > --- > > Key: SPARK-1359 > URL: https://issues.apache.org/jira/browse/SPARK-1359 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 0.9.0, 1.0.0 >Reporter: Xiangrui Meng >Priority: Major > > The SGD implementation samples a mini-batch to compute the stochastic > gradient. This is not efficient because examples are provided via an iterator > interface. We have to scan all of them to obtain a sample. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1359) SGD implementation is not efficient
[ https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-1359: --- Remaining Estimate: 168h Original Estimate: 168h > SGD implementation is not efficient > --- > > Key: SPARK-1359 > URL: https://issues.apache.org/jira/browse/SPARK-1359 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 0.9.0, 1.0.0 >Reporter: Xiangrui Meng >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > The SGD implementation samples a mini-batch to compute the stochastic > gradient. This is not efficient because examples are provided via an iterator > interface. We have to scan all of them to obtain a sample. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1359) SGD implementation is not efficient
[ https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-1359: --- Remaining Estimate: (was: 504h) Original Estimate: (was: 504h) > SGD implementation is not efficient > --- > > Key: SPARK-1359 > URL: https://issues.apache.org/jira/browse/SPARK-1359 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 0.9.0, 1.0.0 >Reporter: Xiangrui Meng >Priority: Major > > The SGD implementation samples a mini-batch to compute the stochastic > gradient. This is not efficient because examples are provided via an iterator > interface. We have to scan all of them to obtain a sample. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1359) SGD implementation is not efficient
[ https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-1359: --- Remaining Estimate: 504h Original Estimate: 504h > SGD implementation is not efficient > --- > > Key: SPARK-1359 > URL: https://issues.apache.org/jira/browse/SPARK-1359 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 0.9.0, 1.0.0 >Reporter: Xiangrui Meng >Priority: Major > Original Estimate: 504h > Remaining Estimate: 504h > > The SGD implementation samples a mini-batch to compute the stochastic > gradient. This is not efficient because examples are provided via an iterator > interface. We have to scan all of them to obtain a sample. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
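For context, the pattern being criticized is mini-batch sampling over an RDD, roughly as sketched below (a simplified illustration, not MLlib's actual GradientDescent code): every iteration calls sample() on the full data set, and because the examples are only reachable through each partition's iterator, every element is scanned just to keep a small fraction of them.

{code:java}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;

// Simplified sketch of a mini-batch SGD loop; gradient math and weight updates are elided.
public class MiniBatchSgdSketch {
  public static void run(JavaRDD<LabeledPoint> data, int numIterations, double miniBatchFraction) {
    for (int i = 1; i <= numIterations; i++) {
      // sample() still walks every partition's iterator, even though only
      // miniBatchFraction of the examples are kept for this iteration.
      JavaRDD<LabeledPoint> miniBatch = data.sample(false, miniBatchFraction, 42 + i);
      // Compute the stochastic gradient on miniBatch and update the weights here;
      // an action such as count() is what actually triggers the scan.
      long examplesUsed = miniBatch.count();
    }
  }
}
{code}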
[jira] [Commented] (SPARK-23598) WholeStageCodegen can lead to IllegalAccessError calling append for HashAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-23598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414984#comment-16414984 ] Dongjoon Hyun commented on SPARK-23598: --- Thank you, [~hvanhovell] ! > WholeStageCodegen can lead to IllegalAccessError calling append for > HashAggregateExec > -- > > Key: SPARK-23598 > URL: https://issues.apache.org/jira/browse/SPARK-23598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: David Vogelbacher >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > Got the following stacktrace for a large QueryPlan using WholeStageCodeGen: > {noformat} > java.lang.IllegalAccessError: tried to access method > org.apache.spark.sql.execution.BufferedRowIterator.append(Lorg/apache/spark/sql/catalyst/InternalRow;)V > from class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7$agg_NestedClass > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7$agg_NestedClass.agg_doAggregateWithKeysOutput$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345){noformat} > After disabling codegen, everything works. > The root cause seems to be that we are trying to call the protected _append_ > method of > [BufferedRowIterator|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/BufferedRowIterator.java#L68] > from an inner-class of a sub-class that is loaded by a different > class-loader (after codegen compilation). > [https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-5.html#jvms-5.4.4] > states that a protected method _R_ can be accessed only if one of the > following two conditions is fulfilled: > # R is protected and is declared in a class C, and D is either a subclass of > C or C itself. Furthermore, if R is not static, then the symbolic reference > to R must contain a symbolic reference to a class T, such that T is either a > subclass of D, a superclass of D, or D itself. > # R is either protected or has default access (that is, neither public nor > protected nor private), and is declared by a class in the same run-time > package as D. > 2.) doesn't apply as we have loaded the class with a different class loader > (and are in a different package) and 1.) doesn't apply because we are > apparently trying to call the method from an inner class of a subclass of > _BufferedRowIterator_. 
> Looking at the Code path of _WholeStageCodeGen_, the following happens: > # In > [WholeStageCodeGen|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L527], > we create the subclass of _BufferedRowIterator_, along with a _processNext_ > method for processing the output of the child plan. > # In the child, which is a > [HashAggregateExec|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L517], > we create the method which shows up at the top of the stack trace (called > _doAggregateWithKeysOutput_ ) > # We add this method to the compiled code invoking _addNewFunction_ of > [CodeGenerator|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L460] > In the generated function body we call the _append_ method.| > Now, the _addNewFunction_ method states that: > {noformat} > If the code for the `OuterClass` grows too large, the function will be > inlined into a new private, inner class > {noformat} > This indeed seems to happen: the _doAggregateWithKeysOutput_ method is put > into a new private inner class. Thus, it doesn't have access to the protected > _append_ method anymore but still tries to call it, which results in the > _IllegalAccessError._ > Possible fixes: > * Pass in the _inlineTo
[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.x final
[ https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414831#comment-16414831 ] ASF GitHub Bot commented on SPARK-19552: Github user robertdale commented on the issue: https://github.com/apache/tinkerpop/pull/826 Looks like netty is upgraded in Spark 2.3.0 only. https://issues.apache.org/jira/browse/SPARK-19552 > Upgrade Netty version to 4.1.x final > > > Key: SPARK-19552 > URL: https://issues.apache.org/jira/browse/SPARK-19552 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.1.0 >Reporter: Adam Roberts >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.3.0 > > > Netty 4.1.8 was recently released but isn't API compatible with previous > major versions (like Netty 4.0.x), see > http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details. > This version does include a fix for a security concern but not one we'd be > exposed to with Spark "out of the box". Let's upgrade the version we use to > be on the safe side as the security fix I'm especially interested in is not > available in the 4.0.x release line. > We should move up anyway to take on a bunch of other big fixes cited in the > release notes (and if anyone were to use Spark with netty and tcnative, they > shouldn't be exposed to the security problem) - we should be good citizens > and make this change. > As this 4.1 version involves API changes we'll need to implement a few > methods and possibly adjust the Sasl tests. This JIRA and associated pull > request starts the process which I'll work on - and any help would be much > appreciated! Currently I know: > {code} > @Override > public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise > promise) > throws Exception { > if (!foundEncryptionHandler) { > foundEncryptionHandler = > ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this > returns false and causes test failures > } > ctx.write(msg, promise); > } > {code} > Here's what changes will be required (at least): > {code} > common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code} > requires touch, retain and transferred methods > {code} > common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code} > requires the above methods too > {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code} > With "dummy" implementations so we can at least compile and test, we'll see > five new test failures to address. > These are > {code} > org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption > org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption > org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption > org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption > org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23797) SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto
Tin Vu created SPARK-23797: -- Summary: SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto Key: SPARK-23797 URL: https://issues.apache.org/jira/browse/SPARK-23797 Project: Spark Issue Type: Bug Components: Optimizer, Spark Submit, SQL Affects Versions: 2.3.0 Reporter: Tin Vu I am executing a benchmark to compare the performance of SparkSQL, Apache Drill, and Presto. My experimental setup: * TPCDS dataset with scale factor 100 (size 100GB). * Spark, Drill, and Presto have the same number of workers: 12. * Each worker has the same allocated amount of memory: 4GB. * Data is stored by Hive in ORC format. I executed a very simple SQL query: "SELECT * from table_name" The issue is that for some small tables (even tables with a few dozen records), SparkSQL still required about 7-8 seconds to finish, while Drill and Presto needed less than 1 second. For other large tables with billions of records, SparkSQL performance was reasonable: it required 20-30 seconds to scan the whole table. Do you have any idea or reasonable explanation for this issue? Thanks, -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
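For what it's worth, a measurement along the lines described could look like the sketch below (the table name is a placeholder and Hive support is assumed; this is not the reporter's actual harness). It times a full scan of one table, which is the operation being compared across engines.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Minimal timing sketch for a full-table scan through SparkSQL.
public class ScanTiming {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("scan-timing").enableHiveSupport().getOrCreate();

    long start = System.nanoTime();
    Dataset<Row> rows = spark.sql("SELECT * FROM table_name");
    long count = rows.count(); // force the scan to actually run
    double seconds = (System.nanoTime() - start) / 1e9;

    System.out.printf("scanned %d rows in %.1f s%n", count, seconds);
  }
}
{code}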
[jira] [Commented] (SPARK-22839) Refactor Kubernetes code for configuring driver/executor pods to use consistent and cleaner abstraction
[ https://issues.apache.org/jira/browse/SPARK-22839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414739#comment-16414739 ] Apache Spark commented on SPARK-22839: -- User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/20910 > Refactor Kubernetes code for configuring driver/executor pods to use > consistent and cleaner abstraction > --- > > Key: SPARK-22839 > URL: https://issues.apache.org/jira/browse/SPARK-22839 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li >Priority: Major > Fix For: 2.4.0 > > > As discussed in https://github.com/apache/spark/pull/19954, the current code > for configuring the driver pod vs the code for configuring the executor pods > are not using the same abstraction. Besides that, the current code leaves a > lot to be desired in terms of the level and cleaness of abstraction. For > example, the current code is passing around many pieces of information around > different class hierarchies, which makes code review and maintenance > challenging. We need some thorough refactoring of the current code to achieve > better, cleaner, and consistent abstraction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22839) Refactor Kubernetes code for configuring driver/executor pods to use consistent and cleaner abstraction
[ https://issues.apache.org/jira/browse/SPARK-22839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414738#comment-16414738 ] Matt Cheah commented on SPARK-22839: Design was proposed and agreed upon in [https://docs.google.com/document/d/1XPLh3E2JJ7yeJSDLZWXh_lUcjZ1P0dy9QeUEyxIlfak/edit#.] Will be posting a pull request with the refactor shortly. > Refactor Kubernetes code for configuring driver/executor pods to use > consistent and cleaner abstraction > --- > > Key: SPARK-22839 > URL: https://issues.apache.org/jira/browse/SPARK-22839 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li >Priority: Major > Fix For: 2.4.0 > > > As discussed in https://github.com/apache/spark/pull/19954, the current code > for configuring the driver pod vs the code for configuring the executor pods > are not using the same abstraction. Besides that, the current code leaves a > lot to be desired in terms of the level and cleaness of abstraction. For > example, the current code is passing around many pieces of information around > different class hierarchies, which makes code review and maintenance > challenging. We need some thorough refactoring of the current code to achieve > better, cleaner, and consistent abstraction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23776) pyspark-sql tests should display build instructions when components are missing
[ https://issues.apache.org/jira/browse/SPARK-23776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23776: Assignee: Apache Spark > pyspark-sql tests should display build instructions when components are > missing > --- > > Key: SPARK-23776 > URL: https://issues.apache.org/jira/browse/SPARK-23776 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Minor > > This is a follow up to SPARK-23417. > The pyspark-streaming tests print useful build instructions when certain > components are missing in the build. > pyspark-sql's udf and readwrite tests also have specific build requirements: > the build must compile test scala files, and the build must also create the > Hive assembly. When those class or jar files are not created, the tests throw > only partially helpful exceptions, e.g.: > {noformat} > AnalysisException: u'Can not load class > test.org.apache.spark.sql.JavaStringLength, please make sure it is on the > classpath;' > {noformat} > or > {noformat} > IllegalArgumentException: u"Error while instantiating > 'org.apache.spark.sql.hive.HiveExternalCatalog':" > {noformat} > You end up in this situation when you follow Spark's build instructions and > then attempt to run the pyspark tests. > It would be nice if pyspark-sql tests provide helpful build instructions in > these cases. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23776) pyspark-sql tests should display build instructions when components are missing
[ https://issues.apache.org/jira/browse/SPARK-23776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23776: Assignee: (was: Apache Spark) > pyspark-sql tests should display build instructions when components are > missing > --- > > Key: SPARK-23776 > URL: https://issues.apache.org/jira/browse/SPARK-23776 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > This is a follow up to SPARK-23417. > The pyspark-streaming tests print useful build instructions when certain > components are missing in the build. > pyspark-sql's udf and readwrite tests also have specific build requirements: > the build must compile test scala files, and the build must also create the > Hive assembly. When those class or jar files are not created, the tests throw > only partially helpful exceptions, e.g.: > {noformat} > AnalysisException: u'Can not load class > test.org.apache.spark.sql.JavaStringLength, please make sure it is on the > classpath;' > {noformat} > or > {noformat} > IllegalArgumentException: u"Error while instantiating > 'org.apache.spark.sql.hive.HiveExternalCatalog':" > {noformat} > You end up in this situation when you follow Spark's build instructions and > then attempt to run the pyspark tests. > It would be nice if pyspark-sql tests provide helpful build instructions in > these cases. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23776) pyspark-sql tests should display build instructions when components are missing
[ https://issues.apache.org/jira/browse/SPARK-23776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414718#comment-16414718 ] Apache Spark commented on SPARK-23776: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/20909 > pyspark-sql tests should display build instructions when components are > missing > --- > > Key: SPARK-23776 > URL: https://issues.apache.org/jira/browse/SPARK-23776 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > This is a follow up to SPARK-23417. > The pyspark-streaming tests print useful build instructions when certain > components are missing in the build. > pyspark-sql's udf and readwrite tests also have specific build requirements: > the build must compile test scala files, and the build must also create the > Hive assembly. When those class or jar files are not created, the tests throw > only partially helpful exceptions, e.g.: > {noformat} > AnalysisException: u'Can not load class > test.org.apache.spark.sql.JavaStringLength, please make sure it is on the > classpath;' > {noformat} > or > {noformat} > IllegalArgumentException: u"Error while instantiating > 'org.apache.spark.sql.hive.HiveExternalCatalog':" > {noformat} > You end up in this situation when you follow Spark's build instructions and > then attempt to run the pyspark tests. > It would be nice if pyspark-sql tests provide helpful build instructions in > these cases. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj
[ https://issues.apache.org/jira/browse/SPARK-23162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-23162. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20842 [https://github.com/apache/spark/pull/20842] > PySpark ML LinearRegressionSummary missing r2adj > > > Key: SPARK-23162 > URL: https://issues.apache.org/jira/browse/SPARK-23162 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: kevin yu >Priority: Minor > Labels: starter > Fix For: 2.4.0 > > > Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj
[ https://issues.apache.org/jira/browse/SPARK-23162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-23162: Assignee: kevin yu > PySpark ML LinearRegressionSummary missing r2adj > > > Key: SPARK-23162 > URL: https://issues.apache.org/jira/browse/SPARK-23162 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: kevin yu >Priority: Minor > Labels: starter > > Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23572) Update security.md to cover new features
[ https://issues.apache.org/jira/browse/SPARK-23572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23572. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20742 [https://github.com/apache/spark/pull/20742] > Update security.md to cover new features > > > Key: SPARK-23572 > URL: https://issues.apache.org/jira/browse/SPARK-23572 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.0 > > > I just took a look at {{security.md}} and while it is correct, it covers > functionality that is now sort of obsolete (such as SASL-based encryption > instead of the newer AES encryption support). > We should go over that document and make sure everything is up to date. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23572) Update security.md to cover new features
[ https://issues.apache.org/jira/browse/SPARK-23572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23572: -- Assignee: Marcelo Vanzin > Update security.md to cover new features > > > Key: SPARK-23572 > URL: https://issues.apache.org/jira/browse/SPARK-23572 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > > I just took a look at {{security.md}} and while it is correct, it covers > functionality that is now sort of obsolete (such as SASL-based encryption > instead of the newer AES encryption support). > We should go over that document and make sure everything is up to date. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23736) Extension of the concat function to support array columns
[ https://issues.apache.org/jira/browse/SPARK-23736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marek Novotny updated SPARK-23736: -- Description: Extend the _concat_ function to also support array columns. Example: {{concat(array(1, 2, 3), array(10, 20, 30), array(100, 200)) => [1, 2, 3, 10, 20, 30, 100, 200] }} was: Extend the _concat_ function to also support array columns. Example: concat(array(1, 2, 3), array(10, 20, 30), array(100, 200)) => [1, 2, 3, 10, 20, 30, 100, 200] > Extension of the concat function to support array columns > - > > Key: SPARK-23736 > URL: https://issues.apache.org/jira/browse/SPARK-23736 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Priority: Major > > Extend the _concat_ function to also support array columns. > Example: > {{concat(array(1, 2, 3), array(10, 20, 30), array(100, 200)) => [1, 2, 3, 10, > 20, 30, 100, 200] }} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
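Once the extension is in place, usage from the DataFrame API would presumably look like the following sketch (a hypothetical example: the column names are made up and the behaviour shown is the proposed one, not something available in released versions):

{code:java}
import static org.apache.spark.sql.functions.array;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ArrayConcatExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("array-concat").master("local[*]").getOrCreate();

    // One row with three array columns, mirroring the example in the description.
    Dataset<Row> df = spark.range(1).select(
        array(lit(1), lit(2), lit(3)).as("a"),
        array(lit(10), lit(20), lit(30)).as("b"),
        array(lit(100), lit(200)).as("c"));

    // With the proposed extension, concat accepts array columns as well as strings.
    df.select(concat(col("a"), col("b"), col("c")).as("merged")).show(false);
    // Expected output: [1, 2, 3, 10, 20, 30, 100, 200]
  }
}
{code}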
[jira] [Updated] (SPARK-23736) Extension of the concat function to support array columns
[ https://issues.apache.org/jira/browse/SPARK-23736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marek Novotny updated SPARK-23736: -- Description: Extend the _concat_ function to also support array columns. Example: concat(array(1, 2, 3), array(10, 20, 30), array(100, 200)) => [1, 2, 3,10, 20, 30,100, 200] was: Implement the _concat_arrays_ function that merges two or more array columns into one. If any of children values is null, the function should return null. {{def concat_arrays(columns : Column*): Column }} Example: [1, 2, 3], [10, 20, 30], [100, 200] => [1, 2, 3,10, 20, 30,100, 200] Summary: Extension of the concat function to support array columns (was: collection function: concat_arrays) > Extension of the concat function to support array columns > - > > Key: SPARK-23736 > URL: https://issues.apache.org/jira/browse/SPARK-23736 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Priority: Major > > Extend the _concat_ function to also support array columns. > Example: > concat(array(1, 2, 3), array(10, 20, 30), array(100, 200)) => [1, 2, 3,10, > 20, 30,100, 200] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23672) Document Support returning lists in Arrow UDFs
[ https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23672: Assignee: (was: Apache Spark) > Document Support returning lists in Arrow UDFs > -- > > Key: SPARK-23672 > URL: https://issues.apache.org/jira/browse/SPARK-23672 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Major > > Documenting the support for returning lists for individual inputs on > non-grouped data inside of PySpark UDFs to better support the wordcount > example (and other things but wordcount is the simplest I can think of). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23672) Document Support returning lists in Arrow UDFs
[ https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23672: Assignee: Apache Spark > Document Support returning lists in Arrow UDFs > -- > > Key: SPARK-23672 > URL: https://issues.apache.org/jira/browse/SPARK-23672 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Assignee: Apache Spark >Priority: Major > > Documenting the support for returning lists for individual inputs on > non-grouped data inside of PySpark UDFs to better support the wordcount > example (and other things but wordcount is the simplest I can think of). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23672) Document Support returning lists in Arrow UDFs
[ https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414348#comment-16414348 ] Apache Spark commented on SPARK-23672: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/20908 > Document Support returning lists in Arrow UDFs > -- > > Key: SPARK-23672 > URL: https://issues.apache.org/jira/browse/SPARK-23672 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Major > > Documenting the support for returning lists for individual inputs on > non-grouped data inside of PySpark UDFs to better support the wordcount > example (and other things but wordcount is the simplest I can think of). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11237) PMML export for ML KMeans
[ https://issues.apache.org/jira/browse/SPARK-11237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414344#comment-16414344 ] Apache Spark commented on SPARK-11237: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/20907 > PMML export for ML KMeans > - > > Key: SPARK-11237 > URL: https://issues.apache.org/jira/browse/SPARK-11237 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: holdenk >Priority: Major > > Add PMML export for ML KMeans -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem
[ https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414326#comment-16414326 ] Cody Koeninger commented on SPARK-23739: Ok, the OutOfMemoryError is probably a separate and unrelated issue. > Spark structured streaming long running problem > --- > > Key: SPARK-23739 > URL: https://issues.apache.org/jira/browse/SPARK-23739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Florencio >Priority: Critical > Labels: spark, streaming, structured > > I had a problem with long running spark structured streaming in spark 2.1. > Caused by: java.lang.ClassNotFoundException: > org.apache.kafka.common.requests.LeaveGroupResponse. > The detailed error is the following: > 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. > Metadata OffsetSeqMetadata(0,1521216656590) > 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = > Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = > \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}} > 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map() > 18/03/16 16:10:57 ERROR StreamExecution: Query [id = > a233b9ff-cc39-44d3-b953-a255986c04bf, runId = > 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error > java.util.zip.ZipException: invalid code lengths set > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45) > at > org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2101) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.map(RDD.scala:369) > at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator > java.lang.NoClassDefFoundError: > org/apache/kafka/common/requests/LeaveGroupResponse > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377) > at org.apache.kafka.clients.ClientUtils.closeQuietly(ClientUtils.java:66) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1383) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1364) > at org.apache.sp
[jira] [Updated] (SPARK-23599) The UUID() expression is too non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-23599: -- Fix Version/s: 2.3.1 > The UUID() expression is too non-deterministic > -- > > Key: SPARK-23599 > URL: https://issues.apache.org/jira/browse/SPARK-23599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Liang-Chi Hsieh >Priority: Critical > Fix For: 2.3.1, 2.4.0 > > > The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID > generation. There are a couple of major problems with this: > - It is non-deterministic across task retries. This breaks Spark's processing > model, and this will lead to bugs that are very hard to trace, like non-deterministic > shuffles, duplicates and missing rows. > - It uses a single secure random for UUID generation. This uses a single JVM-wide > lock, and this can lead to lock contention and other performance > problems. > We should move to something that is deterministic between retries. This can > be done by using seeded PRNGs for which we set the seed during planning. It > is important here to use a PRNG that provides enough entropy for creating a > proper UUID. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
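A minimal sketch of the "seeded PRNG" idea follows (not Spark's implementation; java.util.Random stands in for whatever PRNG is ultimately chosen). The seed is fixed up front, so replaying the same sequence after a task retry yields the same UUIDs, and the version/variant bits are set so each value is still a well-formed version-4-style UUID.

{code:java}
import java.util.Random;
import java.util.UUID;

// Deterministic UUID generation from a seed fixed at planning time (illustrative only).
public class SeededUuid {
  private final Random rng;

  public SeededUuid(long seed) {
    // A stronger seeded PRNG could be substituted here if more entropy is needed.
    this.rng = new Random(seed);
  }

  public UUID next() {
    long mostSig = rng.nextLong();
    long leastSig = rng.nextLong();
    // Set the version (4) and variant (IETF) bits so the result is a valid UUID.
    mostSig = (mostSig & ~0x000000000000F000L) | 0x0000000000004000L;
    leastSig = (leastSig & ~0xC000000000000000L) | 0x8000000000000000L;
    return new UUID(mostSig, leastSig);
  }
}
{code}

Re-running a task with the same seed then reproduces the same UUID sequence, which is the determinism property the description asks for.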
[jira] [Comment Edited] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore
[ https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414291#comment-16414291 ] Stavros Kontopoulos edited comment on SPARK-23790 at 3/26/18 6:29 PM: -- Yes that is what I am saying. The initial fix here: [https://github.com/apache/spark/pull/17333] does the trick but I want to have a similar approach with yarn:[https://github.com/apache/spark/pull/17335] that adds delegation tokens in current user's ugi. When I did that I hit the issue with HadoopRDD which fetches its delegation tokens on its own. was (Author: skonto): Yes that is what I am saying. The initial fix here: [https://github.com/apache/spark/pull/17333] does the trick but I want to have a similar approach with yarn that adds delegation tokens in current user's ugi. When I did that I hit the issue with HadoopRDD which fetches its delegation tokens on its own. > proxy-user failed connecting to a kerberos configured metastore > --- > > Key: SPARK-23790 > URL: https://issues.apache.org/jira/browse/SPARK-23790 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > > This appeared at a customer trying to integrate with a kerberized hdfs > cluster. > This can be easily fixed with the proposed fix > [here|https://github.com/apache/spark/pull/17333] and the problem was > reported first [here|https://issues.apache.org/jira/browse/SPARK-19995] for > yarn. > The other option is to add the delegation tokens to the current user's UGI as > in [here|https://github.com/apache/spark/pull/17335] . The last fixes the > problem but leads to a failure when someones uses a HadoopRDD because the > latter, uses FileInputFormat to get the splits which calls the local ticket > cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail > with: > {quote}Exception in thread "main" > org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token > can be issued only with kerberos or web authenticationat > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896) > {quote} > This implies that security mode is SIMPLE and hadoop libs there are not aware > of kerberos. > This is related to this issue the workaround decided was to > [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804] > hadoop. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore
[ https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414291#comment-16414291 ] Stavros Kontopoulos edited comment on SPARK-23790 at 3/26/18 6:29 PM: -- Yes that is what I am saying. The initial fix here: [https://github.com/apache/spark/pull/17333] does the trick but I wanted to have a similar approach with yarn:[https://github.com/apache/spark/pull/17335] that adds delegation tokens in current user's ugi. When I did that I hit the issue with HadoopRDD which fetches its delegation tokens on its own. was (Author: skonto): Yes that is what I am saying. The initial fix here: [https://github.com/apache/spark/pull/17333] does the trick but I want to have a similar approach with yarn:[https://github.com/apache/spark/pull/17335] that adds delegation tokens in current user's ugi. When I did that I hit the issue with HadoopRDD which fetches its delegation tokens on its own. > proxy-user failed connecting to a kerberos configured metastore > --- > > Key: SPARK-23790 > URL: https://issues.apache.org/jira/browse/SPARK-23790 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > > This appeared at a customer trying to integrate with a kerberized hdfs > cluster. > This can be easily fixed with the proposed fix > [here|https://github.com/apache/spark/pull/17333] and the problem was > reported first [here|https://issues.apache.org/jira/browse/SPARK-19995] for > yarn. > The other option is to add the delegation tokens to the current user's UGI as > in [here|https://github.com/apache/spark/pull/17335] . The last fixes the > problem but leads to a failure when someones uses a HadoopRDD because the > latter, uses FileInputFormat to get the splits which calls the local ticket > cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail > with: > {quote}Exception in thread "main" > org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token > can be issued only with kerberos or web authenticationat > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896) > {quote} > This implies that security mode is SIMPLE and hadoop libs there are not aware > of kerberos. > This is related to this issue the workaround decided was to > [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804] > hadoop. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore
[ https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414291#comment-16414291 ] Stavros Kontopoulos commented on SPARK-23790: - Yes, that is what I am saying. The initial fix here, [https://github.com/apache/spark/pull/17333], does the trick, but I want to take a similar approach to the YARN one and add the delegation tokens to the current user's UGI. When I did that I hit the issue with HadoopRDD, which fetches its delegation tokens on its own. > proxy-user failed connecting to a kerberos configured metastore > --- > > Key: SPARK-23790 > URL: https://issues.apache.org/jira/browse/SPARK-23790 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > > This appeared at a customer trying to integrate with a kerberized HDFS > cluster. > It can be easily fixed with the proposed fix > [here|https://github.com/apache/spark/pull/17333]; the problem was > first reported [here|https://issues.apache.org/jira/browse/SPARK-19995] for > yarn. > The other option is to add the delegation tokens to the current user's UGI as > in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the > problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD > uses FileInputFormat to get the splits, which consults the local ticket > cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail > with: > {quote}Exception in thread "main" > org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token > can be issued only with kerberos or web authentication at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896) > {quote} > This implies that the security mode is SIMPLE and the hadoop libs there are not aware > of kerberos. > Related to this, the workaround that was decided on was to > [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804] > hadoop. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
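For illustration, the "add delegation tokens to the current user's UGI" approach discussed above could be sketched as follows (simplified; the renewer principal and paths are placeholders, and error handling is omitted). The point is that tokens placed on the current UGI become visible to code such as FileInputFormat's split computation, which reads them via TokenCache.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;

// Rough sketch: obtain HDFS delegation tokens and attach them to the current user's UGI.
public class AddTokensToUgi {
  public static void addHdfsTokens(Configuration conf, String renewer, Path... paths)
      throws IOException {
    Credentials creds = new Credentials();
    for (Path p : paths) {
      FileSystem fs = p.getFileSystem(conf);
      fs.addDelegationTokens(renewer, creds); // fetch delegation tokens for this filesystem
    }
    // Make the tokens visible to downstream code that reads them from the current UGI.
    UserGroupInformation.getCurrentUser().addCredentials(creds);
  }
}
{code}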
[jira] [Updated] (SPARK-23672) Document Support returning lists in Arrow UDFs
[ https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-23672: Description: Documenting the support for returning lists for individual inputs on non-grouped data inside of PySpark UDFs to better support the wordcount example (and other things but wordcount is the simplest I can think of). (was: Consider to add support for returning lists for individual inputs on non-grouped data inside of PySpark UDFs to better support the wordcount example (and other things but wordcount is the simplest I can think of).) > Document Support returning lists in Arrow UDFs > -- > > Key: SPARK-23672 > URL: https://issues.apache.org/jira/browse/SPARK-23672 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Major > > Documenting the support for returning lists for individual inputs on > non-grouped data inside of PySpark UDFs to better support the wordcount > example (and other things but wordcount is the simplest I can think of). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23672) Document Support returning lists in Arrow UDFs
[ https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-23672: Summary: Document Support returning lists in Arrow UDFs (was: Support returning lists in Arrow UDFs) > Document Support returning lists in Arrow UDFs > -- > > Key: SPARK-23672 > URL: https://issues.apache.org/jira/browse/SPARK-23672 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Major > > Consider to add support for returning lists for individual inputs on > non-grouped data inside of PySpark UDFs to better support the wordcount > example (and other things but wordcount is the simplest I can think of). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23561) make StreamWriter not a DataSourceWriter subclass
[ https://issues.apache.org/jira/browse/SPARK-23561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23561: Assignee: (was: Apache Spark) > make StreamWriter not a DataSourceWriter subclass > - > > Key: SPARK-23561 > URL: https://issues.apache.org/jira/browse/SPARK-23561 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The inheritance makes little sense now; they've almost entirely diverged. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23561) make StreamWriter not a DataSourceWriter subclass
[ https://issues.apache.org/jira/browse/SPARK-23561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23561: Assignee: Apache Spark > make StreamWriter not a DataSourceWriter subclass > - > > Key: SPARK-23561 > URL: https://issues.apache.org/jira/browse/SPARK-23561 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Apache Spark >Priority: Major > > The inheritance makes little sense now; they've almost entirely diverged. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23561) make StreamWriter not a DataSourceWriter subclass
[ https://issues.apache.org/jira/browse/SPARK-23561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414209#comment-16414209 ] Apache Spark commented on SPARK-23561: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/20906 > make StreamWriter not a DataSourceWriter subclass > - > > Key: SPARK-23561 > URL: https://issues.apache.org/jira/browse/SPARK-23561 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The inheritance makes little sense now; they've almost entirely diverged. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore
[ https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414208#comment-16414208 ] Marcelo Vanzin commented on SPARK-23790: BTW if what you're saying is that Yuming's fix also works for the issue you're seeing, we should probably dupe this to the other bug. > proxy-user failed connecting to a kerberos configured metastore > --- > > Key: SPARK-23790 > URL: https://issues.apache.org/jira/browse/SPARK-23790 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > > This appeared at a customer trying to integrate with a kerberized hdfs > cluster. > This can be easily fixed with the proposed fix > [here|https://github.com/apache/spark/pull/17333] and the problem was > reported first [here|https://issues.apache.org/jira/browse/SPARK-19995] for > yarn. > The other option is to add the delegation tokens to the current user's UGI as > in [here|https://github.com/apache/spark/pull/17335] . The last fixes the > problem but leads to a failure when someones uses a HadoopRDD because the > latter, uses FileInputFormat to get the splits which calls the local ticket > cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail > with: > {quote}Exception in thread "main" > org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token > can be issued only with kerberos or web authenticationat > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896) > {quote} > This implies that security mode is SIMPLE and hadoop libs there are not aware > of kerberos. > This is related to this issue the workaround decided was to > [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804] > hadoop. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414190#comment-16414190 ] Nicholas Chammas commented on SPARK-22513: -- Thanks for the breakdown. This will be handy for reference. So I guess at the summary level Sean was correct. :D > Provide build profile for hadoop 2.8 > > > Key: SPARK-22513 > URL: https://issues.apache.org/jira/browse/SPARK-22513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Christine Koppelt >Priority: Major > > hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. > Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8. > [1] > https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23672) Support returning lists in Arrow UDFs
[ https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-23672: Description: Consider to add support for returning lists for individual inputs on non-grouped data inside of PySpark UDFs to better support the wordcount example (and other things but wordcount is the simplest I can think of). (was: Consider to add support for returning lists inside of PySpark UDFs to better support the wordcount example.) > Support returning lists in Arrow UDFs > - > > Key: SPARK-23672 > URL: https://issues.apache.org/jira/browse/SPARK-23672 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Major > > Consider to add support for returning lists for individual inputs on > non-grouped data inside of PySpark UDFs to better support the wordcount > example (and other things but wordcount is the simplest I can think of). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore
[ https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414176#comment-16414176 ] Marcelo Vanzin commented on SPARK-23790: I haven't had the time to see exactly what spark-cli is doing. This looks the same as SPARK-23639, and I don't like the place where the fix is being made. But I don't know enough about spark-cli yet to suggest something different. > proxy-user failed connecting to a kerberos configured metastore > --- > > Key: SPARK-23790 > URL: https://issues.apache.org/jira/browse/SPARK-23790 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > > This appeared at a customer trying to integrate with a kerberized hdfs > cluster. > This can be easily fixed with the proposed fix > [here|https://github.com/apache/spark/pull/17333] and the problem was > reported first [here|https://issues.apache.org/jira/browse/SPARK-19995] for > yarn. > The other option is to add the delegation tokens to the current user's UGI as > in [here|https://github.com/apache/spark/pull/17335] . The last fixes the > problem but leads to a failure when someones uses a HadoopRDD because the > latter, uses FileInputFormat to get the splits which calls the local ticket > cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail > with: > {quote}Exception in thread "main" > org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token > can be issued only with kerberos or web authenticationat > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896) > {quote} > This implies that security mode is SIMPLE and hadoop libs there are not aware > of kerberos. > This is related to this issue the workaround decided was to > [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804] > hadoop. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23796) There's no API to change state RDD's name
[ https://issues.apache.org/jira/browse/SPARK-23796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] István Gansperger updated SPARK-23796: -- Description: I use a few {{mapWithState}} stream oparations in my application and at some point it became a minor inconvenience that I could not figure out how to set the state RDDs name or serialization level. Searching around didn't really help and I have not come across any issues regarding this (pardon my inability to find it if there's one). It could be useful to see how much memory each state uses if the user has multiple such transformations. I have used some ugly reflection based code to be able to set the name of the state RDD and also the serialization level. I understand that the latter may be intentionally limited, but I haven't come across any issues caused by this apart from slightly degraded performance in exchange for a bit less memory usage. Are these limitations in place intentionally or is it just an oversight? Having some extra methods for these on {{StateSpec}} could be useful in my opinion. was: I use a few {{mapWithState}} stream oparations in my application and at some point it became a minor inconvenience that I could not figure out how to set the state RDDs name or serialization level. Searching around didn't really help and I have not come across any issues regarding this (pardon my inability to find it if there's one). It could be useful to see how much memory each state uses if the user has multiple such transformations. I have used some ugly reflection based code to be able to set the name of the state RDD and also the serialization level. I understand that the latter may be intentionally limited, but I haven't come across any issues caused by this apart from sightly degraded performance in exchange for a bit less memory usage. Are these limitations in place intentionally or is it just an oversight? Having some extra methods for these on {{StateSpec}} could be useful in my opinion. > There's no API to change state RDD's name > - > > Key: SPARK-23796 > URL: https://issues.apache.org/jira/browse/SPARK-23796 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: István Gansperger >Priority: Minor > > I use a few {{mapWithState}} stream oparations in my application and at some > point it became a minor inconvenience that I could not figure out how to set > the state RDDs name or serialization level. Searching around didn't really > help and I have not come across any issues regarding this (pardon my > inability to find it if there's one). It could be useful to see how much > memory each state uses if the user has multiple such transformations. > I have used some ugly reflection based code to be able to set the name of the > state RDD and also the serialization level. I understand that the latter may > be intentionally limited, but I haven't come across any issues caused by this > apart from slightly degraded performance in exchange for a bit less memory > usage. Are these limitations in place intentionally or is it just an > oversight? Having some extra methods for these on {{StateSpec}} could be > useful in my opinion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23796) There's no API to change state RDD's name
István Gansperger created SPARK-23796: - Summary: There's no API to change state RDD's name Key: SPARK-23796 URL: https://issues.apache.org/jira/browse/SPARK-23796 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.3.0 Reporter: István Gansperger I use a few {{mapWithState}} stream operations in my application and at some point it became a minor inconvenience that I could not figure out how to set the state RDD's name or serialization level. Searching around didn't really help and I have not come across any issues regarding this (pardon my inability to find it if there's one). It could be useful to see how much memory each state uses if the user has multiple such transformations. I have used some ugly reflection-based code to be able to set the name of the state RDD and also the serialization level. I understand that the latter may be intentionally limited, but I haven't come across any issues caused by this apart from slightly degraded performance in exchange for a bit less memory usage. Are these limitations in place intentionally or is it just an oversight? Having some extra methods for these on {{StateSpec}} could be useful in my opinion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
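For readers hitting the same limitation, a small sketch of the public {{StateSpec}} surface, showing that partitioning, timeout and initial state can be configured but there is no hook for the state RDD's name or storage level. The stream source, paths and numbers below are made up for illustration.
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateSketch {
  // Running word count kept in state; key/value types are illustrative.
  def trackCount(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
    val newCount = one.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(newCount)
    (word, newCount)
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("mapWithState-sketch").setMaster("local[2]"), Seconds(10))
    ssc.checkpoint("/tmp/checkpoint")   // placeholder path; mapWithState requires checkpointing

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split("\\s+")).map((_, 1))

    // Everything StateSpec exposes today: partitioning, timeout, initial state.
    // There is no setter for the state RDD's name or its storage level.
    val spec = StateSpec.function(trackCount _).numPartitions(8).timeout(Seconds(600))
    val stateful = words.mapWithState(spec)
    stateful.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}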
[jira] [Resolved] (SPARK-23795) AbstractLauncher is not extendable
[ https://issues.apache.org/jira/browse/SPARK-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23795. Resolution: Not A Problem That class is not meant to be extended by outside libraries. > AbstractLauncher is not extendable > -- > > Key: SPARK-23795 > URL: https://issues.apache.org/jira/browse/SPARK-23795 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.4.0 >Reporter: Dan Sanduleac >Priority: Minor > > The class is {{public abstract}} but because {{self()}} is package-private, > it cannot actually be implemented, which seems like an oversight. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23737) Scala API documentation leads to nonexistent pages for sources
[ https://issues.apache.org/jira/browse/SPARK-23737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414111#comment-16414111 ] Alexander Bessonov commented on SPARK-23737: [~sameerag], Making a wild guess the username in the URL is yours. > Scala API documentation leads to nonexistent pages for sources > -- > > Key: SPARK-23737 > URL: https://issues.apache.org/jira/browse/SPARK-23737 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Alexander Bessonov >Priority: Minor > > h3. Steps to reproduce: > # Go to [Scala API > homepage|[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package]]. > # Click "Source: package.scala" > h3. Result: > The link leads to nonexistent page: > [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/package.scala] > h3. Expected result: > The link leads to proper page: > [https://github.com/apache/spark/tree/v2.3.0/core/src/main/scala/org/apache/spark/package.scala] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-23737) Scala API documentation leads to nonexistent pages for sources
[ https://issues.apache.org/jira/browse/SPARK-23737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Bessonov reopened SPARK-23737: Okay. The bug isn't fixed and it affects everyone who wants to jump to the source code from ScalaDocs. > Scala API documentation leads to nonexistent pages for sources > -- > > Key: SPARK-23737 > URL: https://issues.apache.org/jira/browse/SPARK-23737 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Alexander Bessonov >Priority: Minor > > h3. Steps to reproduce: > # Go to [Scala API > homepage|[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package]]. > # Click "Source: package.scala" > h3. Result: > The link leads to nonexistent page: > [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/package.scala] > h3. Expected result: > The link leads to proper page: > [https://github.com/apache/spark/tree/v2.3.0/core/src/main/scala/org/apache/spark/package.scala] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23795) AbstractLauncher is not extendable
[ https://issues.apache.org/jira/browse/SPARK-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23795: Assignee: (was: Apache Spark) > AbstractLauncher is not extendable > -- > > Key: SPARK-23795 > URL: https://issues.apache.org/jira/browse/SPARK-23795 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.4.0 >Reporter: Dan Sanduleac >Priority: Minor > > The class is {{public abstract}} but because {{self()}} is package-private, > it cannot actually be implemented, which seems like an oversight. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23795) AbstractLauncher is not extendable
[ https://issues.apache.org/jira/browse/SPARK-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23795: Assignee: Apache Spark > AbstractLauncher is not extendable > -- > > Key: SPARK-23795 > URL: https://issues.apache.org/jira/browse/SPARK-23795 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.4.0 >Reporter: Dan Sanduleac >Assignee: Apache Spark >Priority: Minor > > The class is {{public abstract}} but because {{self()}} is package-private, > it cannot actually be implemented, which seems like an oversight. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23795) AbstractLauncher is not extendable
[ https://issues.apache.org/jira/browse/SPARK-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414092#comment-16414092 ] Apache Spark commented on SPARK-23795: -- User 'dansanduleac' has created a pull request for this issue: https://github.com/apache/spark/pull/20905 > AbstractLauncher is not extendable > -- > > Key: SPARK-23795 > URL: https://issues.apache.org/jira/browse/SPARK-23795 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.4.0 >Reporter: Dan Sanduleac >Priority: Minor > > The class is {{public abstract}} but because {{self()}} is package-private, > it cannot actually be implemented, which seems like an oversight. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23795) AbstractLauncher is not extendable
Dan Sanduleac created SPARK-23795: - Summary: AbstractLauncher is not extendable Key: SPARK-23795 URL: https://issues.apache.org/jira/browse/SPARK-23795 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0, 2.4.0 Reporter: Dan Sanduleac The class is {{public abstract}} but because {{self()}} is package-private, it cannot actually be implemented, which seems like an oversight. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
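To illustrate the pattern being described (this is a self-contained sketch, not Spark's actual launcher source), here is a self-typed builder whose package-private {{self()}} prevents subclassing from outside the package:
{code:scala}
package launcher {
  // Mirrors the shape of the reported problem: a public abstract builder whose
  // self() is package-private, so only same-package classes can implement it.
  abstract class AbstractBuilder[T <: AbstractBuilder[T]] {
    private[launcher] def self(): T
    def setConf(key: String, value: String): T = {
      println(s"$key=$value")
      self()
    }
  }

  class InPackageBuilder extends AbstractBuilder[InPackageBuilder] {
    private[launcher] override def self(): InPackageBuilder = this   // compiles: same package
  }
}

object LauncherSketch extends App {
  // A builder defined in any other package could not supply self(), so it could
  // not extend AbstractBuilder at all, which is the complaint in this ticket.
  new launcher.InPackageBuilder().setConf("spark.app.name", "demo")
}
{code}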
[jira] [Commented] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml
[ https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413704#comment-16413704 ] Apache Spark commented on SPARK-23751: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/20904 > Kolmogorov-Smirnoff test Python API in pyspark.ml > - > > Key: SPARK-23751 > URL: https://issues.apache.org/jira/browse/SPARK-23751 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Python wrapper for new DataFrame-based API for KS test -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
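For background on what the Python wrapper would expose, a hedged sketch using the long-standing RDD-based Kolmogorov-Smirnov test in spark.mllib; the new DataFrame-based API in spark.ml that this ticket wraps follows the same idea. The sample data and session settings are made up.
{code:scala}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.SparkSession

object KsTestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ks-sketch").getOrCreate()

    // Sample drawn for illustration; test it against a standard normal N(0, 1).
    val sample = spark.sparkContext.parallelize(Seq(0.1, -0.4, 1.2, 0.3, -0.9, 0.05))
    val result = Statistics.kolmogorovSmirnovTest(sample, "norm", 0.0, 1.0)

    // The result carries the KS statistic, the p-value and the null-hypothesis text.
    println(result.statistic)
    println(result.pValue)

    spark.stop()
  }
}
{code}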
[jira] [Comment Edited] (SPARK-22513) Provide build profile for hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413685#comment-16413685 ] Steve Loughran edited comment on SPARK-22513 at 3/26/18 11:13 AM: -- API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. * There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave" —the eternal losing battle of software engineering. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/Insighs, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH there, Sean will need to git log --grep for HADOOP-14660 and HADOOP-14535 as the big columnar storage speedups. AWS EMR and google dataproc are both 2.8.x —no idea about changes made. You can build spark against any Hadoop version you like on the 2.x line without problems. {code:java} mvn package -Phadoop-2.7,hadoop-cloud,yarn Dhadoop.version=2.9.0 {code} Against 3.x things compile but Hive is unhappy unless you have one of: a spark hive module with a patch to hive's version check case statement or apache hadoop trunk pretending to be a branch-2 line `-Ddeclared.hadoop.version=2.11`. That works OK for spark build & test but MUST NOT be deployed as HDFS version checking will be unhappy. Clear :)? ps: don't mention Java 9 (HADOOP-11123) 10 (HADOOP-11423) or 11 (HADOOP-15338)]. thanks. was (Author: ste...@apache.org): API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. 
* There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH
[jira] [Comment Edited] (SPARK-22513) Provide build profile for hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413685#comment-16413685 ] Steve Loughran edited comment on SPARK-22513 at 3/26/18 11:10 AM: -- API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) 3.x, which the authors would like backported to the 2.x line. Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. * There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH there, Sean will need to git log --grep for HADOOP-14660 and HADOOP-14535 as the big columnar storage speedups. Otherwise, you can build against any version you like on the 2.x line without problems. {code:java} mvn install -Phadoop-2.7,hadoop-cloud,yarn Dhadoop.version=2.9.0 {code} Against 3.x things compile but Hive is unhappy unless you have one of: a spark hive module with a patch to hive's version check case statement or apache 3.x branch pretending to be a branch-2 line `-Ddeclared.hadoop.version=2.11`, which works OK for spark build & test but MUST NOT be deployed as HDFS version checking will be unhappy. Clear :)? ps: don't mention Java 9 (HADOOP-11123) 10 (HADOOP-11423) or 11 (HADOOP-15338)]. thanks. was (Author: ste...@apache.org): API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) 3.x, which the authors would like backported to the 2.x line. Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. 
And I have the task of shading that soon. * There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP
[jira] [Assigned] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml
[ https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23751: Assignee: (was: Apache Spark) > Kolmogorov-Smirnoff test Python API in pyspark.ml > - > > Key: SPARK-23751 > URL: https://issues.apache.org/jira/browse/SPARK-23751 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Python wrapper for new DataFrame-based API for KS test -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml
[ https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23751: Assignee: Apache Spark > Kolmogorov-Smirnoff test Python API in pyspark.ml > - > > Key: SPARK-23751 > URL: https://issues.apache.org/jira/browse/SPARK-23751 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Major > > Python wrapper for new DataFrame-based API for KS test -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22513) Provide build profile for hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413685#comment-16413685 ] Steve Loughran edited comment on SPARK-22513 at 3/26/18 11:10 AM: -- API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. * There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH there, Sean will need to git log --grep for HADOOP-14660 and HADOOP-14535 as the big columnar storage speedups. Otherwise, you can build against any version you like on the 2.x line without problems. {code:java} mvn install -Phadoop-2.7,hadoop-cloud,yarn Dhadoop.version=2.9.0 {code} Against 3.x things compile but Hive is unhappy unless you have one of: a spark hive module with a patch to hive's version check case statement or apache 3.x branch pretending to be a branch-2 line `-Ddeclared.hadoop.version=2.11`, which works OK for spark build & test but MUST NOT be deployed as HDFS version checking will be unhappy. Clear :)? ps: don't mention Java 9 (HADOOP-11123) 10 (HADOOP-11423) or 11 (HADOOP-15338)]. thanks. was (Author: ste...@apache.org): API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) 3.x, which the authors would like backported to the 2.x line. Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. 
* There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH there, Sean will ne
[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413685#comment-16413685 ] Steve Loughran commented on SPARK-22513: API-wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles, where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8. There's the hadoop-aliyun module in recent 2.x+ (forthcoming 2.9?) and 3.x, which the authors would like backported to the 2.x line. Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. * There are also the inevitable changes in versions of things, jackson inevitably being the most visible. For Hadoop 3 (and 2.9+?) we've moved to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere, but you can bump it up to at least Guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave" —the eternal losing battle of software engineering. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH, HDP and Microsoft HDInsight, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HDInsight, 2.9. I don't know about CDH there; Sean will need to git log --grep for HADOOP-14660 and HADOOP-14535 as the big columnar storage speedups. Otherwise, you can build against any version you like on the 2.x line without problems. {code:java} mvn install -Phadoop-2.7,hadoop-cloud,yarn -Dhadoop.version=2.9.0 {code} Against 3.x things compile, but Hive is unhappy unless you have one of: a spark hive module with a patch to Hive's version-check case statement, or the apache 3.x branch pretending to be a branch-2 line (`-Ddeclared.hadoop.version=2.11`), which works OK for spark build & test but MUST NOT be deployed, as HDFS version checking will be unhappy. Clear :)? ps: don't mention Java 9, 10 or 11. thanks. > Provide build profile for hadoop 2.8 > > > Key: SPARK-22513 > URL: https://issues.apache.org/jira/browse/SPARK-22513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Christine Koppelt >Priority: Major > > hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. > Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] > https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22513) Provide build profile for hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413685#comment-16413685 ] Steve Loughran edited comment on SPARK-22513 at 3/26/18 11:11 AM: -- API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. * There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH there, Sean will need to git log --grep for HADOOP-14660 and HADOOP-14535 as the big columnar storage speedups. Otherwise, you can build spark against any version you like on the 2.x line without problems. {code:java} mvn install -Phadoop-2.7,hadoop-cloud,yarn Dhadoop.version=2.9.0 {code} Against 3.x things compile but Hive is unhappy unless you have one of: a spark hive module with a patch to hive's version check case statement or apache 3.x branch pretending to be a branch-2 line `-Ddeclared.hadoop.version=2.11`, which works OK for spark build & test but MUST NOT be deployed as HDFS version checking will be unhappy. Clear :)? ps: don't mention Java 9 (HADOOP-11123) 10 (HADOOP-11423) or 11 (HADOOP-15338)]. thanks. was (Author: ste...@apache.org): API wise, everything compiled against 2.6 should compile against all subsequent 2.x versions. * The profiles are similar except in the cloud profiles where later releases add more stuff (hadoop-azure in 2.7). No changes there between 2.7 and 2.8 There's the hadoop-allyun module in recent 2.x+ (forthcoming 2.9?) Hadoop 3 adds a new "hadoop-cloud-storage" profile which is intended to allow descendants to get a "cruft removed and up to date" set of FS dependencies; if new things go in, this will be updated. And I have the task of shading that soon. 
* There's also the inevitable changes in versions of things, specifically and inevitably jackson being the most visible. For Hadoop 3 (and 2.9+?) we've move to the shaded amazon-sdk-bundle so there's no need to worry about the version of jackson it builds against. As to Guava, that's the same everywhere but you can bump it up to at least guava 19.0 without problems (AFAIK, usual disclaimers, etc). * We all strive to keep the semantics of things the same, but one person's "small improvement" is always someone else's "fundamental regression in the way things behave". —the eternal losing battle of software engineeing. Best strategy there: build and test against alpha and beta releases, complain when things don't work & make sure it's fixed in the final one. FWIW, removing the 2.6 and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6 is Java 6 only; the rest of branch-2 is Java 7. Hadoop 2.7 is the foundation of CDH HDP and microsoft HD/I, albeit with a fair amount of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported, except to complain when someone broke compatibility. As to ASF 2.8.x, I'd recommend it if you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work with Azure outside HDP or HD/Insights, 2.9. I don't know about CDH there, Sean will need to git log --grep for HADOOP-14660 and HADOOP-14535
[jira] [Created] (SPARK-23794) UUID() should be stateful
Herman van Hovell created SPARK-23794: - Summary: UUID() should be stateful Key: SPARK-23794 URL: https://issues.apache.org/jira/browse/SPARK-23794 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Herman van Hovell The UUID() expression is stateful and should implement the Stateful trait instead of the Nondeterministic trait. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413620#comment-16413620 ] Apache Spark commented on SPARK-23599: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/20903 > The UUID() expression is too non-deterministic > -- > > Key: SPARK-23599 > URL: https://issues.apache.org/jira/browse/SPARK-23599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Liang-Chi Hsieh >Priority: Critical > Fix For: 2.4.0 > > > The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID > generation. There are a couple of major problems with this: > - It is non-deterministic across task retries. This breaks Spark's processing > model, and this will lead to very hard to trace bugs, like non-deterministic > shuffles, duplicates and missing rows. > - It uses a single secure random for UUID generation. This uses a single JVM-wide > lock, and this can lead to lock contention and other performance > problems. > We should move to something that is deterministic between retries. This can > be done by using seeded PRNGs for which we set the seed during planning. It > is important here to use a PRNG that provides enough entropy for creating a > proper UUID. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
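A rough sketch of the seeded-PRNG idea described above (an illustration of the approach, not the code from the linked pull request): fix a seed at planning time and re-derive the generator per partition, so a retried task reproduces exactly the same UUIDs. The class and method names are made up.
{code:scala}
import java.util.UUID
import scala.util.Random   // stand-in for Spark's internal XORShiftRandom

// Assumption-labelled sketch: planSeed would be chosen during planning and
// serialized with the expression; initialize() would be called once per partition.
class DeterministicUuidGen(planSeed: Long) extends Serializable {
  @transient private var rng: Random = _

  def initialize(partitionIndex: Int): Unit = {
    rng = new Random(planSeed + partitionIndex)
  }

  def next(): UUID = {
    // Force version 4 and the IETF variant bits so the output is a valid random UUID.
    val msb = (rng.nextLong() & ~0xF000L) | 0x4000L
    val lsb = (rng.nextLong() & 0x3FFFFFFFFFFFFFFFL) | Long.MinValue
    new UUID(msb, lsb)
  }
}

object DeterministicUuidGenDemo {
  def main(args: Array[String]): Unit = {
    val gen = new DeterministicUuidGen(planSeed = 42L)
    gen.initialize(partitionIndex = 0)
    println(gen.next())   // same seed + partition index => same UUID on a retry
  }
}
{code}
java.util.UUID.randomUUID, by contrast, draws from a single JVM-wide SecureRandom, which is both non-deterministic across retries and a lock-contention point, as the description notes.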
[jira] [Commented] (SPARK-23565) Improved error message for when the number of sources for a query changes
[ https://issues.apache.org/jira/browse/SPARK-23565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413570#comment-16413570 ] Roman Maier commented on SPARK-23565: - I can not continue to work on this task in the foreseeable future. Those who wish can take it. > Improved error message for when the number of sources for a query changes > - > > Key: SPARK-23565 > URL: https://issues.apache.org/jira/browse/SPARK-23565 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Patrick McGloin >Priority: Minor > > If you change the number of sources for a Structured Streaming query then you > will get an assertion error as the number of sources in the checkpoint does > not match the number of sources in the query that is starting. This can > happen if, for example, you add a union to the input of the query. This is > of course correct but the error is a bit cryptic and requires investigation. > Suggestion for a more informative error message => > The number of sources for this query has changed. There are [x] sources in > the checkpoint offsets and now there are [y] sources requested by the query. > Cannot continue. > This is the current message. > 02-03-2018 13:14:22 ERROR StreamExecution:91 - Query ORPositionsState to > Kafka [id = 35f71e63-dbd0-49e9-98b2-a4c72a7da80e, runId = > d4439aca-549c-4ef6-872e-29fbfde1df78] terminated with error > java.lang.AssertionError: assertion failed at > scala.Predef$.assert(Predef.scala:156) at > org.apache.spark.sql.execution.streaming.OffsetSeq.toStreamProgress(OffsetSeq.scala:38) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$populateStartOffsets(StreamExecution.scala:429) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:297) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294) > at > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
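A small sketch of the friendlier check the reporter is asking for. This is illustrative only: the real assertion lives in OffsetSeq.toStreamProgress, and the method shape below is simplified; the message text is taken from the suggestion above.
{code:scala}
// Simplified stand-in for the check that currently fails with a bare "assertion failed".
object SourceCountCheck {
  def requireSameSourceCount(checkpointedSources: Int, querySources: Int): Unit = {
    if (checkpointedSources != querySources) {
      throw new IllegalStateException(
        s"The number of sources for this query has changed. There are $checkpointedSources " +
        s"sources in the checkpoint offsets and now there are $querySources sources " +
        "requested by the query. Cannot continue.")
    }
  }

  def main(args: Array[String]): Unit = {
    // Simulates restarting from a one-source checkpoint after a union added a second source.
    try requireSameSourceCount(checkpointedSources = 1, querySources = 2)
    catch { case e: IllegalStateException => println(e.getMessage) }
  }
}
{code}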
[jira] [Commented] (SPARK-23793) Handle database names in spark.udf.register()
[ https://issues.apache.org/jira/browse/SPARK-23793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413542#comment-16413542 ] Takeshi Yamamuro commented on SPARK-23793: -- [~smilegator] This behaviour is expected? > Handle database names in spark.udf.register() > - > > Key: SPARK-23793 > URL: https://issues.apache.org/jira/browse/SPARK-23793 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > spark.udf.register currently ignores database names in function names; > {code} > scala> sql("create database testdb") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.udf.register("testdb.testfunc", (a: Int) => a + 1) > res1: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,IntegerType,Some(List(IntegerType))) > scala> sql("select testdb.testfunc(1)").show > org.apache.spark.sql.AnalysisException: Undefined function: 'testfunc'. This > function is neither a registered temporary function nor a permanent function > registered in the database 'testdb'.; line 1 pos 7 > at > org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1198) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1198) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23793) Handle database names in spark.udf.register()
Takeshi Yamamuro created SPARK-23793: Summary: Handle database names in spark.udf.register() Key: SPARK-23793 URL: https://issues.apache.org/jira/browse/SPARK-23793 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Takeshi Yamamuro spark.udf.register currently ignores database names in function names; {code} scala> sql("create database testdb") res0: org.apache.spark.sql.DataFrame = [] scala> spark.udf.register("testdb.testfunc", (a: Int) => a + 1) res1: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(,IntegerType,Some(List(IntegerType))) scala> sql("select testdb.testfunc(1)").show org.apache.spark.sql.AnalysisException: Undefined function: 'testfunc'. This function is neither a registered temporary function nor a permanent function registered in the database 'testdb'.; line 1 pos 7 at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1198) at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1198) at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
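For comparison, a hedged sketch of the two registration paths: {{spark.udf.register}} creates a session-scoped temporary function whose name is looked up as-is (so the database qualifier is effectively ignored), while a database-qualified permanent function is created through SQL against a UDF class on the classpath. The class name and jar path below are placeholders.
{code:scala}
import org.apache.spark.sql.SparkSession

object UdfRegisterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()

    // Temporary function: only the bare, unqualified name resolves.
    spark.udf.register("testfunc", (a: Int) => a + 1)
    spark.sql("SELECT testfunc(1)").show()

    // Permanent, database-qualified function: registered through SQL against a
    // Hive UDF implementation (class and jar are hypothetical).
    // spark.sql("CREATE FUNCTION testdb.testfunc AS 'com.example.TestFunc' USING JAR '/path/to/udf.jar'")

    spark.stop()
  }
}
{code}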
[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem
[ https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413530#comment-16413530 ] Florencio commented on SPARK-23739: --- Thanks Cody. The Kafka version is 0.10.0.1, and I checked that the class org.apache.kafka.common.requests.LeaveGroupResponse is inside the assembly. Additionally, we found that Kafka 0.8 had been installed in the cluster for use by Druid; after uninstalling that Kafka version the application no longer hits the ClassNotFoundException, but now I get java.lang.OutOfMemoryError: Java heap space. I attach part of the error: _18/03/23 15:20:08 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:20:12 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM_ _18/03/23 15:20:12 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:20:16 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:20:21 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:20:25 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:20:28 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException_ _java.util.concurrent.TimeoutException_ _at java.util.concurrent.FutureTask.get(FutureTask.java:205)_ _at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)_ _18/03/23 15:20:33 WARN TransportChannelHandler: Exception in connection from NODE/NODE_IP:33018_ _java.io.IOException: Connection reset by peer_ _at sun.nio.ch.FileDispatcherImpl.read0(Native Method)_ _at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)_ _at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)_ _at sun.nio.ch.IOUtil.read(IOUtil.java:192)_ _at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)_ _at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)_ _at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)_ _at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)_ _at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)_ _at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)_ _at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)_ _at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)_ _at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)_ _at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)_ _at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)_ _at java.lang.Thread.run(Thread.java:745)_ _18/03/23 15:20:47 ERROR TransportResponseHandler: Still have 4 requests outstanding when connection from NODE/NODE_IP:33018 is closed_ _18/03/23 15:20:47 ERROR OneForOneBlockFetcher: Failed while starting block fetches_ _java.io.IOException: Connection reset by peer_ _at sun.nio.ch.FileDispatcherImpl.read0(Native Method)_ _at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)_ _at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)_ _at sun.nio.ch.IOUtil.read(IOUtil.java:192)_ _at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)_ _at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)_ _18/03/23 15:21:02 
INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms_ _18/03/23 15:21:01 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:21:05 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again._ _18/03/23 15:21:05 ERROR TransportRequestHandler: Error sending result RpcResponse\{requestId=5287967802569519759, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=13 cap=13]}} to /172.20.13.58:33820; closing connection_ _io.netty.handler.codec.EncoderException: java.lang.OutOfMemoryError: Java heap space_ _at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)_ _at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743)_ _at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735)_ _at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:820)_ _at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:728)_ _at io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284)_ _at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743)_ _at io.netty.channel.AbstractChannelHa