[jira] [Resolved] (SPARK-11326) Support for authentication and encryption in standalone mode

2016-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11326.
---
Resolution: Won't Fix

> Support for authentication and encryption in standalone mode
> 
>
> Key: SPARK-11326
> URL: https://issues.apache.org/jira/browse/SPARK-11326
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> h3.The idea
> Currently, in standalone mode, all components must use the same secret token 
> for all network connections if any security is to be ensured. This ticket is 
> intended to split the communication in standalone mode, making it more like 
> YARN mode: application-internal communication and scheduler communication.
> Such refactoring will allow the scheduler (master, workers) to use a distinct 
> secret, which remains unknown to users. Similarly, it will allow for better 
> security in applications, because each application will be able to use a 
> distinct secret as well.
> By providing SASL authentication/encryption for connections between a client 
> (Client or AppClient) and the Spark Master, it becomes possible to introduce 
> pluggable authentication for the standalone deployment mode.
> h3.Improvements introduced by this patch
> This patch introduces the following changes:
> * The Spark driver or submission client does not have to use the same secret 
> that workers use to communicate with the Master
> * The Master is able to authenticate individual clients according to the 
> following rules:
> ** When connecting to the Master, the client needs to specify 
> {{spark.authenticate.secret}}, which is the authentication token for the user 
> specified by {{spark.authenticate.user}} ({{sparkSaslUser}} by default)
> ** The Master configuration may include additional 
> {{spark.authenticate.secrets.}} entries specifying authentication tokens for 
> particular users, or {{spark.authenticate.authenticatorClass}}, which 
> specifies an implementation of an external credentials provider (able to 
> retrieve the authentication token for a given user).
> ** Workers authenticate with the Master as the default user {{sparkSaslUser}}.
> * The authorization rules are as follows:
> ** A regular user is able to manage only their own applications (the 
> applications they submitted)
> ** A regular user is not able to register or manage workers
> ** The Spark default user {{sparkSaslUser}} can manage all applications
> h3.User-facing changes when running an application
> h4.General principles:
> - conf: {{spark.authenticate.secret}} is *never sent* over the wire
> - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire
> - In all situations, the env variable overrides the conf variable if present.
> - Whenever a user has to pass a secret, it is better (safer) to do so through 
> the env variable
> - In work modes with multiple secrets, we assume encrypted communication 
> between client and master, between driver and master, and between master and 
> workers
> 
> h4.Work modes and descriptions
> h5.Client mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is running locally
> - The driver will send neither env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> {{spark.authenticate.secret}} for the connection to the master
> - _ExecutorRunner_ will not find any secret in _ApplicationDescription_, so it 
> will look for it in the worker configuration, where it will find it (its 
> presence is implied). 
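> Below is a minimal Scala sketch of the single-secret configuration described 
> above (the master URL and app name are placeholders; the secret value matches 
> the configuration listed):
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> 
> // One shared secret: the same value must be known to the driver, master and workers.
> val conf = new SparkConf()
>   .setMaster("spark://master-host:7077")       // placeholder master URL
>   .setAppName("single-secret-example")
>   .set("spark.authenticate", "true")
>   .set("spark.authenticate.secret", "secret")  // or export SPARK_AUTH_SECRET=secret instead
> 
> val sc = new SparkContext(conf)
> {code}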
> 
> h5.Client mode, multiple secrets
> h6.Configuration
> - env: {{SPARK_APP_AUTH_SECRET=app_secret}} or conf: 
> {{spark.app.authenticate.secret=app_secret}}
> - env: {{SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret}} or conf: 
> {{spark.submission.authenticate.secret=scheduler_secret}}
> h6.Description
> - The driver is running locally
> - The driver will use either env: {{SPARK_SUBMISSION_AUTH_SECRET}} or conf: 
> {{spark.submission.authenticate.secret}} to connect to the master
> - The driver will send neither env: {{SPARK_SUBMISSION_AUTH_SECRET}} nor 
> conf: {{spark.submission.authenticate.secret}}
> - The driver will use either env: {{SPARK_APP_AUTH_SECRET}} or conf: 
> {{spark.app.authenticate.secret}} for communication with the executors
> - The driver will send {{spark.executorEnv.SPARK_AUTH_SECRET=app_secret}} so 
> that the executors can use it to communicate with the driver
> - _ExecutorRunner_ will find that secret in _ApplicationDescription_ and it 
> will set it in env: {{SPARK_AUTH_SECRET}} which will be read by 
> _ExecutorBackend_ afterwards and used for all the connections (with driver, 
> other executors and external 
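> For illustration only: {{spark.submission.authenticate.secret}} and 
> {{spark.app.authenticate.secret}} are settings proposed by this patch (the 
> issue is resolved as Won't Fix), so the sketch below merely mirrors the 
> description above rather than options available in Spark:
> {code}
> import org.apache.spark.SparkConf
> 
> // Two distinct secrets: one for reaching the master (scheduler), one for the application itself.
> val conf = new SparkConf()
>   .setMaster("spark://master-host:7077")                            // placeholder master URL
>   .setAppName("multiple-secrets-example")
>   .set("spark.authenticate", "true")
>   .set("spark.submission.authenticate.secret", "scheduler_secret")  // proposed: only for the master connection
>   .set("spark.app.authenticate.secret", "app_secret")               // proposed: shared between driver and executors
> {code}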

[jira] [Resolved] (SPARK-17287) PySpark sc.AddFile method does not support the recursive keyword argument

2016-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17287.
---
Resolution: Duplicate

> PySpark sc.AddFile method does not support the recursive keyword argument
> -
>
> Key: SPARK-17287
> URL: https://issues.apache.org/jira/browse/SPARK-17287
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Jason Piper
>Priority: Minor
>
> The Scala Spark API implements a "recursive" argument for sc.addFile that 
> allows an entire directory to be added; however, the corresponding option 
> hasn't been added to PySpark.
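> For reference, a minimal Scala sketch of the API being compared against (the 
> directory path and app name are placeholders); the {{recursive}} flag is what 
> is missing from PySpark's {{sc.addFile}}:
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> 
> val sc = new SparkContext(new SparkConf().setAppName("addfile-example").setMaster("local[*]"))
> 
> // Scala API: recursive = true allows an entire directory (on a Hadoop-supported
> // filesystem) to be added.
> sc.addFile("hdfs:///path/to/dir", true)
> 
> // PySpark currently exposes only sc.addFile(path), with no recursive option.
> {code}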






[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-10-09 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559597#comment-15559597
 ] 

Vincent commented on SPARK-17219:
-

No problem. I will try to submit another PR based on the discussion above.

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>Assignee: Vincent
> Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached Titanic CSV data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either, because a bucket that is [NaN, -Inf] doesn't 
> make much sense. I'm not sure whether the NaN bucket should count toward 
> numBins or not. I do think it should always be there, though, in case future 
> data has nulls even though the fit data did not. Thoughts?
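> For context, a minimal sketch of the kind of call that produces splits like 
> the above, assuming nulls have already been replaced by Double.NaN (the data, 
> column names and app name are placeholders):
> {code}
> import org.apache.spark.ml.feature.QuantileDiscretizer
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder().appName("qd-nan-example").master("local[*]").getOrCreate()
> import spark.implicits._
> 
> // Numeric column where nulls have been replaced by Double.NaN, as in the report.
> val df = Seq(22.0, 38.0, Double.NaN, 35.0, Double.NaN, 54.0).toDF("age")
> 
> val discretizer = new QuantileDiscretizer()
>   .setInputCol("age")
>   .setOutputCol("ageBucket")
>   .setNumBuckets(10)
> 
> val bucketizer = discretizer.fit(df)
> // With NaNs present, the fitted splits can include NaN entries before Infinity.
> println(bucketizer.getSplits.mkString(", "))
> {code}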






[jira] [Commented] (SPARK-17820) Spark sqlContext.sql() performs only first insert for HiveQL "FROM target INSERT INTO dest" command to insert into multiple target tables from same source

2016-10-09 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559565#comment-15559565
 ] 

Jiang Xingbo commented on SPARK-17820:
--

[~kmbeyond] I tried this on Spark 1.6.0 and was unable to reproduce the problem:
{code:sql}
spark-sql> create database sqoop_import;
spark-sql> use sqoop_import;
spark-sql> create table names_count1 (department_name String, count Int);
spark-sql> create table names_count2 (department_name String, count Int);
spark-sql> create table departments(department_name String, department_id Int);
spark-sql> insert into departments select * from (select "dept2", 2) t;
spark-sql> insert into departments select * from (select "dept4", 4) t;
spark-sql> from sqoop_import.departments insert into sqoop_import.names_count1 
select department_name, count(1) where department_id=2 group by department_name 
insert into sqoop_import.names_count2 select department_name, count(1) where 
department_id=4 group by department_name;
spark-sql> select * from sqoop_import.names_count1;
dept2 1
spark-sql> select * from sqoop_import.names_count2;
dept4 1
{code}

May I ask whether your database contains rows that satisfy `department_id=4`?

> Spark sqlContext.sql() performs only first insert for HiveQL "FROM target 
> INSERT INTO dest" command to insert into multiple target tables from same 
> source
> --
>
> Key: SPARK-17820
> URL: https://issues.apache.org/jira/browse/SPARK-17820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Cloudera Quickstart VM 5.7
>Reporter: Kiran Miryala
>
> I am executing a HiveQL statement in spark-shell, intending to insert records 
> into 2 destination tables from the same source table using a single statement. 
> But it inserts into only the first destination table. My statement:
> {noformat}
> scala>val departmentsData = sqlContext.sql("from sqoop_import.departments 
> insert into sqoop_import.names_count1 select department_name, count(1) where 
> department_id=2 group by department_name insert into 
> sqoop_import.names_count2 select department_name, count(1) where 
> department_id=4 group by department_name")
> {noformat}
> Same query inserts into both destination tables on hive shell:
> {noformat}
> from sqoop_import.departments 
> insert into sqoop_import.names_count1 
> select department_name, count(1) 
> where department_id=2 group by department_name 
> insert into sqoop_import.names_count2 
> select department_name, count(1) 
> where department_id=4 group by department_name;
> {noformat}
> Both target table definitions are:
> {noformat}
> hive>use sqoop_import;
> hive> create table names_count1 (department_name String, count Int);
> hive> create table names_count2 (department_name String, count Int);
> {noformat}
> Not sure why it is skipping the next one.






[jira] [Commented] (SPARK-17815) Report committed offsets

2016-10-09 Thread Ofir Manor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559549#comment-15559549
 ] 

Ofir Manor commented on SPARK-17815:


Good, we are on the same page - no argument that SPARK-16963 blocks this issue. 
My only points were about the current ticket: that reporting committed offsets 
should be done by default rather than based on a non-default parameter, and that 
setting group.id (or a prefix of it) is a great suggestion but is currently blocked.

> Report committed offsets
> 
>
> Key: SPARK-17815
> URL: https://issues.apache.org/jira/browse/SPARK-17815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Since we manage our own offsets, we have turned off auto-commit.  However, 
> this means that external tools are not able to report on how far behind a 
> given streaming job is.  When the user manually gives us a group.id, we 
> should report back to it.






[jira] [Commented] (SPARK-14136) Spark 2.0 can't start with yarn mode with ClassNotFoundException: org.apache.spark.deploy.yarn.history.YarnHistoryService

2016-10-09 Thread Mike Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559540#comment-15559540
 ] 

Mike Fang commented on SPARK-14136:
---

To be more specific, it should be SPARK_CONF_DIR instead of SPARK_CONF.

> Spark 2.0 can't start with yarn mode with ClassNotFoundException: 
> org.apache.spark.deploy.yarn.history.YarnHistoryService
> -
>
> Key: SPARK-14136
> URL: https://issues.apache.org/jira/browse/SPARK-14136
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, YARN
>Affects Versions: 2.0.0
> Environment: Hortonworks Hadoop 2.7.1, HDP 2.3.2, Java 1.8.40
>Reporter: Qi Dai
>
> For the recent Spark nightly master builds (I tried the current build and many 
> builds from the last couple of weeks), spark-shell/pyspark can't start in yarn 
> mode, failing with ClassNotFoundException: 
> org.apache.spark.deploy.yarn.history.YarnHistoryService
> The full stack trace is:
> java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.history.YarnHistoryService
>   at 
> scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:177)
>   at 
> org.apache.spark.scheduler.cluster.SchedulerExtensionServices$$anonfun$start$5.apply(SchedulerExtensionService.scala:109)
>   at 
> org.apache.spark.scheduler.cluster.SchedulerExtensionServices$$anonfun$start$5.apply(SchedulerExtensionService.scala:108)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.scheduler.cluster.SchedulerExtensionServices.start(SchedulerExtensionService.scala:108)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.start(YarnSchedulerBackend.scala:81)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
>   at org.apache.spark.repl.Main$.createSparkContext(Main.scala:89)
>   ... 48 elided
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.SQLContext$.createListenerAndUI(SQLContext.scala:1020)
>   at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:91)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at org.apache.spark.repl.Main$.createSQLContext(Main.scala:99)
>   ... 48 elided
> <console>:13: error: not found: value sqlContext
>import sqlContext.implicits._
>   ^
> <console>:13: error: not found: value sqlContext
>import sqlContext.sql
>   ^






[jira] [Commented] (SPARK-17815) Report committed offsets

2016-10-09 Thread Ofir Manor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559541#comment-15559541
 ] 

Ofir Manor commented on SPARK-17815:


As far as I understand, there is a clear single point of truth: the Structured 
Streaming "commit log" - the checkpoint. It atomically holds both the source 
state (offsets) and the Spark state (aggregations) of successfully finished 
batches, and it is what is used during recovery to identify the correct 
beginning offset in the source.
The structured WAL is a technical, internal implementation detail that stores 
an intention to process a range of offsets before they are actually read. 
Spark uses it during recovery to repeat the same source end boundary for a 
failed batch.
The data in the downstream store is about Spark output - which [version, spark 
partition] pairs have landed - not about source state. Of course, it is used 
during Spark recovery / retry, but not as a basis for choosing offsets in the 
source (it is used to skip specific output version-partitions that were 
already written).
As this ticket states, updating the Kafka consumer group offsets in Kafka is 
only for easier progress monitoring with Kafka-specific tools. So it should 
be considered informational, after-the-fact updating just for convenience, as it 
won't be used for Spark recovery. If users want to recover manually, they should 
rely on the Spark checkpoint offsets.
In other words, updating Kafka offsets after a batch has successfully committed 
means that the offsets in Kafka represent which messages have been successfully 
processed and landed in the sink, not which messages have been read. 
[~marmbrus] Is my understanding correct?
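To make the moving parts concrete, here is a minimal sketch (assuming the 
Structured Streaming Kafka source; broker, topic, app name and paths are 
placeholders). The checkpoint location is the "commit log" referred to above, 
while writing offsets back to a Kafka group.id - the subject of this ticket - 
would be informational only:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("offsets-example").getOrCreate()

// Spark manages offsets itself (auto-commit is off), tracking them in the checkpoint.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "events")                     // placeholder topic
  .load()

val query = stream.writeStream
  .format("parquet")
  .option("path", "/data/out")                        // placeholder sink path
  .option("checkpointLocation", "/data/checkpoints")  // offset log used for recovery
  .start()
{code}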

> Report committed offsets
> 
>
> Key: SPARK-17815
> URL: https://issues.apache.org/jira/browse/SPARK-17815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Since we manage our own offsets, we have turned off auto-commit.  However, 
> this means that external tools are not able to report on how far behind a 
> given streaming job is.  When the user manually gives us a group.id, we 
> should report back to it.






[jira] [Created] (SPARK-17838) Strict type checking for arguments with better messages across APIs.

2016-10-09 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-17838:


 Summary: Strict type checking for arguments with better messages 
across APIs.
 Key: SPARK-17838
 URL: https://issues.apache.org/jira/browse/SPARK-17838
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Hyukjin Kwon


It seems there should be stricter type checking for arguments in SparkR 
APIs. This was discussed in several PRs.

https://github.com/apache/spark/pull/15239#discussion_r82445435

Roughly, there seem to be three cases, as below.

The first case was described in 
https://github.com/apache/spark/pull/15239#discussion_r82445435

- Check for {{zero-length variable name}}

Some of the other cases were handled in 
https://github.com/apache/spark/pull/15231#discussion_r80417904

- Catch exceptions from the JVM and format them into readable messages

- Strictly check argument types before calling into the JVM from SparkR







[jira] [Commented] (SPARK-10501) support UUID as an atomic type

2016-10-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559405#comment-15559405
 ] 

Hyukjin Kwon commented on SPARK-10501:
--

Ah, it was the type, not the function. I rushed through the JIRA. Thanks for correcting.

> support UUID as an atomic type
> --
>
> Key: SPARK-10501
> URL: https://issues.apache.org/jira/browse/SPARK-10501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jon Haddad
>Priority: Minor
>
> It's pretty common to use UUIDs instead of integers in order to avoid 
> distributed counters.  
> I've added this, which at least lets me load dataframes that use UUIDs that I 
> can cast to strings:
> {code}
> # Note: _type_mappings and _atomic_types are PySpark-internal (private) names,
> # as used in the original snippet.
> from uuid import UUID
> from pyspark.sql.types import AtomicType, _type_mappings, _atomic_types
> class UUIDType(AtomicType):
>     pass
> _type_mappings[UUID] = UUIDType
> _atomic_types.append(UUIDType)
> {code}
> But if I try to do anything else with the UUIDs, like this:
> {code}
> ratings.select("userid").distinct().collect()
> {code}
> I get this pile of fun: 
> {code}
> scala.MatchError: UUIDType (of class 
> org.apache.spark.sql.cassandra.types.UUIDType$)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


