[jira] [Commented] (SPARK-9690) Adding the possibility to set the seed of the rand in the CrossValidator fold

2015-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660270#comment-14660270
 ] 

Apache Spark commented on SPARK-9690:
-

User 'mmenestret' has created a pull request for this issue:
https://github.com/apache/spark/pull/7997

 Adding the possibility to set the seed of the rand in the CrossValidator fold
 -

 Key: SPARK-9690
 URL: https://issues.apache.org/jira/browse/SPARK-9690
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.1
Reporter: Martin Menestret
Priority: Minor
 Fix For: 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 The fold in the ML CrossValidator depends on a rand whose seed is set to 0, 
 which leads the sql.functions rand to call sc._jvm.functions.rand() with no 
 seed.
 In order to be able to unit test a cross validation, it would be a good idea 
 to be able to set this seed so that the output of the cross validation (with a 
 featureSubsetStrategy set to "all") would always be the same.
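
 A minimal sketch of the idea (hypothetical helper; {{df}} stands for any 
 DataFrame of training data): deriving the fold ids from a *seeded* rand 
 column makes the k-fold split reproducible across runs.
 {code}
 import org.apache.spark.sql.DataFrame
 import org.apache.spark.sql.functions.rand

 // Sketch only: a seeded rand column gives a deterministic fold assignment.
 def withFoldColumn(df: DataFrame, numFolds: Int, seed: Long): DataFrame =
   df.withColumn("fold", (rand(seed) * numFolds).cast("int"))
 {code}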






[jira] [Assigned] (SPARK-9690) Adding the possibility to set the seed of the rand in the CrossValidator fold

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9690:
---

Assignee: Apache Spark

 Adding the possibility to set the seed of the rand in the CrossValidator fold
 -

 Key: SPARK-9690
 URL: https://issues.apache.org/jira/browse/SPARK-9690
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.1
Reporter: Martin Menestret
Assignee: Apache Spark
Priority: Minor
 Fix For: 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 The fold in the ML CrossValidator depends on a rand whose seed is set to 0, 
 which leads the sql.functions rand to call sc._jvm.functions.rand() with no 
 seed.
 In order to be able to unit test a cross validation, it would be a good idea 
 to be able to set this seed so that the output of the cross validation (with a 
 featureSubsetStrategy set to "all") would always be the same.






[jira] [Assigned] (SPARK-9690) Adding the possibility to set the seed of the rand in the CrossValidator fold

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9690:
---

Assignee: (was: Apache Spark)

 Adding the possibility to set the seed of the rand in the CrossValidator fold
 -

 Key: SPARK-9690
 URL: https://issues.apache.org/jira/browse/SPARK-9690
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.1
Reporter: Martin Menestret
Priority: Minor
 Fix For: 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 The fold in the ML CrossValidator depends on a rand whose seed is set to 0, 
 which leads the sql.functions rand to call sc._jvm.functions.rand() with no 
 seed.
 In order to be able to unit test a cross validation, it would be a good idea 
 to be able to set this seed so that the output of the cross validation (with a 
 featureSubsetStrategy set to "all") would always be the same.






[jira] [Commented] (SPARK-9182) filter and groupBy on DataFrames are not passed through to jdbc source

2015-08-06 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660278#comment-14660278
 ] 

Cheng Lian commented on SPARK-9182:
---

Hey [~grahn], sorry for the late reply, I somehow missed your last two comments.

Thanks for the detailed information. I'm able to reproduce this issue locally 
now. Confirmed that it's related to NUMERIC. Trying to deliver a fix for this.

 filter and groupBy on DataFrames are not passed through to jdbc source
 --

 Key: SPARK-9182
 URL: https://issues.apache.org/jira/browse/SPARK-9182
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: Greg Rahn

 When running all of these API calls, the only one whose filter is passed 
 through to the backend JDBC source is equality. All of the filters in these 
 commands should be able to be passed through to the JDBC database source.
 {code}
 val url = "jdbc:postgresql:grahn"
 val prop = new java.util.Properties
 val emp = sqlContext.read.jdbc(url, "emp", prop)
 emp.filter(emp("sal") === 5000).show()
 emp.filter(emp("sal") > 5000).show()
 emp.filter("sal = 3000").show()
 emp.filter("sal > 2500").show()
 emp.filter("sal >= 2500").show()
 emp.filter("sal < 2500").show()
 emp.filter("sal <= 2500").show()
 emp.filter("sal != 3000").show()
 emp.filter("sal between 3000 and 5000").show()
 emp.filter("ename in ('SCOTT','BLAKE')").show()
 {code}
 We can see from the PostgreSQL query log that the following is run, and that 
 only equality predicates are passed through.
 {code}
 LOG:  execute <unnamed>: SET extra_float_digits = 3
 LOG:  execute <unnamed>: SELECT 
 empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp WHERE 
 sal = 5000
 LOG:  execute <unnamed>: SET extra_float_digits = 3
 LOG:  execute <unnamed>: SELECT 
 empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
 LOG:  execute <unnamed>: SET extra_float_digits = 3
 LOG:  execute <unnamed>: SELECT 
 empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp WHERE 
 sal = 3000
 LOG:  execute <unnamed>: SET extra_float_digits = 3
 LOG:  execute <unnamed>: SELECT 
 empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
 LOG:  execute <unnamed>: SET extra_float_digits = 3
 LOG:  execute <unnamed>: SELECT 
 empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
 LOG:  execute <unnamed>: SET extra_float_digits = 3
 LOG:  execute <unnamed>: SELECT 
 empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
 LOG:  execute <unnamed>: SET extra_float_digits = 3
 LOG:  execute <unnamed>: SELECT 
 empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
 LOG:  execute <unnamed>: SET extra_float_digits = 3
 LOG:  execute <unnamed>: SELECT 
 empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
 LOG:  execute <unnamed>: SET extra_float_digits = 3
 LOG:  execute <unnamed>: SELECT 
 empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
 LOG:  execute <unnamed>: SET extra_float_digits = 3
 LOG:  execute <unnamed>: SELECT 
 empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
 {code}
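
 For reference, pushing these predicates down means translating Spark SQL's 
 data source filters into SQL fragments. A rough sketch of what such a 
 translation could look like (illustrative only, not the actual JDBC source 
 code; the naive value quoting is purely for demonstration):
 {code}
 import org.apache.spark.sql.sources._

 // Sketch only: map data source Filter objects to WHERE-clause fragments;
 // unsupported filters return None and stay evaluated inside Spark.
 def compileFilter(f: Filter): Option[String] = f match {
   case EqualTo(attr, value)            => Some(s"$attr = '$value'")
   case GreaterThan(attr, value)        => Some(s"$attr > '$value'")
   case GreaterThanOrEqual(attr, value) => Some(s"$attr >= '$value'")
   case LessThan(attr, value)           => Some(s"$attr < '$value'")
   case LessThanOrEqual(attr, value)    => Some(s"$attr <= '$value'")
   case _                               => None
 }
 {code}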






[jira] [Updated] (SPARK-9690) Adding the possibility to set the seed of the rand in the CrossValidator fold

2015-08-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9690:
-
Fix Version/s: (was: 1.5.0)

[~Mmenestret] Don't set the fix version; see 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

 Adding the possibility to set the seed of the rand in the CrossValidator fold
 -

 Key: SPARK-9690
 URL: https://issues.apache.org/jira/browse/SPARK-9690
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.1
Reporter: Martin Menestret
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 The fold in the ML CrossValidator depends on a rand whose seed is set to 0, 
 which leads the sql.functions rand to call sc._jvm.functions.rand() with no 
 seed.
 In order to be able to unit test a cross validation, it would be a good idea 
 to be able to set this seed so that the output of the cross validation (with a 
 featureSubsetStrategy set to "all") would always be the same.






[jira] [Commented] (SPARK-9515) Creating JavaSparkContext with yarn-cluster mode throws NPE

2015-08-06 Thread nirav patel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660286#comment-14660286
 ] 

nirav patel commented on SPARK-9515:


[~srowen] I just gave a reason why I can't use the spark-submit script, as you asked 
in the first comment. I agree this should be more of a forum question, but I thought 
the NPE is something you may want to handle better.

 Creating JavaSparkContext with yarn-cluster mode throws NPE
 ---

 Key: SPARK-9515
 URL: https://issues.apache.org/jira/browse/SPARK-9515
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.3.1
Reporter: nirav patel

 I have a Spark application that runs against a YARN cluster. I run the Spark 
 application as part of my web application, so I can't use the spark-submit 
 script. The way I run it is `java -cp myApp.jar com.myapp.Application`, which 
 in turn initiates a JavaSparkContext. It used to work with Spark 1.0.2 and a 
 standalone cluster, but now with 1.3.1 and YARN it's failing.
 Caused by: java.lang.NullPointerException
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$.sparkContextInitialized(ApplicationMaster.scala:580)
   at 
 org.apache.spark.scheduler.cluster.YarnClusterScheduler.postStartHook(YarnClusterScheduler.scala:32)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:541)
   at 
 org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
 EDIT:
 I got it working with yarn-client mode; however, I want to test it out with 
 yarn-cluster mode as well.
 The application design is: we create a singleton SparkContext object and 
 preload a few RDDs in memory when our spring-boot application (tomcat 
 container) starts. That allows us to submit subsequent Spark jobs without the 
 overhead of creating a new SparkContext and RDDs. It performs excellently for 
 our SLA. We are serving real-time GLM in ms with that. I hope this is reason 
 enough why we can't use the spark-submit script to submit a job.
 The code is pretty simple. This is how we create the SparkContext:
 SparkConf conf = new 
 SparkConf().setAppName(appName.toString()).setMaster("yarn-client");
 conf.set("spark.eventLog.enabled", "true");
 conf.set("spark.executor.extraClassPath", 
 "/opt/mapr/hbase/hbase-0.98.12/lib/*");
 conf.set("spark.cores.max", sparkCoreMax);
 conf.set("spark.executor.memory", sparkExecMem);
 conf.set("spark.executor.extraJavaOptions", executorJavaOPts);
 conf.set("spark.akka.threads", sparkDriverThreads);
 JavaSparkContext sparkContext = new JavaSparkContext(conf);
 This is how we actually run the spring-boot app:
 java 
 -Dloader.path=myspringbootapp.jar,/spark/spark-1.3.1/lib,/opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop,/opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/yarn
  -XX:PermSize=512m -XX:MaxPermSize=512m -Xms1024m -jar myspringbootapp.jar






[jira] [Resolved] (SPARK-8978) Implement the DirectKafkaRateController

2015-08-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-8978.
--
Resolution: Fixed
  Assignee: Iulian Dragos

 Implement the DirectKafkaRateController
 ---

 Key: SPARK-8978
 URL: https://issues.apache.org/jira/browse/SPARK-8978
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Iulian Dragos
Assignee: Iulian Dragos
 Fix For: 1.5.0


 Based on this [design 
 doc|https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing].
 The DirectKafkaInputDStream should use the rate estimate to control how many 
 records/partition to put in the next batch.
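
 As a rough illustration of the arithmetic involved (hypothetical function, 
 not the actual implementation): a rate estimate in records/second turns into 
 a per-partition cap for the next batch like this.
 {code}
 // Sketch only: convert an estimated rate (records/sec) into a cap on how
 // many records each Kafka partition may contribute to the next batch,
 // assuming records are spread evenly across partitions.
 def maxMessagesPerPartition(ratePerSec: Long, batchIntervalMs: Long, numPartitions: Int): Long = {
   val totalForBatch = ratePerSec * batchIntervalMs / 1000
   math.max(1L, totalForBatch / numPartitions) // always make some progress
 }
 {code}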






[jira] [Commented] (SPARK-7550) Support setting the right schema & serde when writing to Hive metastore

2015-08-06 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660726#comment-14660726
 ] 

Reynold Xin commented on SPARK-7550:


[~lian cheng] shouldn't this also be merged into branch-1.5?


 Support setting the right schema & serde when writing to Hive metastore
 ---

 Key: SPARK-7550
 URL: https://issues.apache.org/jira/browse/SPARK-7550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Reynold Xin
Assignee: Cheng Hao
Priority: Blocker

 As of 1.4, Spark SQL does not properly set the table schema and serde when 
 writing a table to Hive's metastore. Would be great to do that properly so 
 users can use non-Spark SQL systems to read those tables.






[jira] [Created] (SPARK-9702) Repartition operator should use Exchange to perform its shuffle

2015-08-06 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-9702:
-

 Summary: Repartition operator should use Exchange to perform its 
shuffle
 Key: SPARK-9702
 URL: https://issues.apache.org/jira/browse/SPARK-9702
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen


Spark SQL's {{Repartition}} operator is implemented in terms of Spark Core's 
repartition operator, which means that it has to perform lots of unnecessary 
row copying and inefficient row serialization. Instead, it would be better if 
this was implemented using some of Exchange's internals so that it can avoid 
row format conversions and generic getters / hashcodes.






[jira] [Updated] (SPARK-7549) Support aggregating over nested fields

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7549:
---
Parent Issue: SPARK-9576  (was: SPARK-6116)

 Support aggregating over nested fields
 --

 Key: SPARK-7549
 URL: https://issues.apache.org/jira/browse/SPARK-7549
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 Would be nice to be able to run sum, avg, min, max (and other numeric 
 aggregate expressions) on arrays.






[jira] [Updated] (SPARK-7160) Support converting DataFrames to typed RDDs.

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7160:
---
Parent Issue: SPARK-9576  (was: SPARK-6116)

 Support converting DataFrames to typed RDDs.
 

 Key: SPARK-7160
 URL: https://issues.apache.org/jira/browse/SPARK-7160
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.1
Reporter: Ray Ortigas
Assignee: Ray Ortigas
Priority: Critical

 As a Spark user still working with RDDs, I'd like the ability to convert a 
 DataFrame to a typed RDD.
 For example, if I've converted RDDs to DataFrames so that I could save them 
 as Parquet or CSV files, I would like to rebuild the RDD from those files 
 automatically rather than writing the row-to-type conversion myself.
 {code}
 case class Food(name: String, amount: Int) // assumed definition for this example
 val rdd0 = sc.parallelize(Seq(Food("apple", 1), Food("banana", 2), 
 Food("cherry", 3)))
 val df0 = rdd0.toDF()
 df0.save("foods.parquet")
 val df1 = sqlContext.load("foods.parquet")
 val rdd1 = df1.toTypedRDD[Food]()
 // rdd0 and rdd1 should have the same elements
 {code}
 I originally submitted a smaller PR for spark-csv 
 https://github.com/databricks/spark-csv/pull/52, but Reynold Xin suggested 
 that converting a DataFrame to a typed RDD wasn't something specific to 
 spark-csv.






[jira] [Resolved] (SPARK-9493) Chain logistic regression with isotonic regression under the pipeline API

2015-08-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9493.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7952
[https://github.com/apache/spark/pull/7952]

 Chain logistic regression with isotonic regression under the pipeline API
 -

 Key: SPARK-9493
 URL: https://issues.apache.org/jira/browse/SPARK-9493
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.5.0


 One use case of isotonic regression is to calibrate the probabilities output 
 by logistic regression. We should make this easier in the pipeline API.






[jira] [Updated] (SPARK-6116) DataFrame API improvement umbrella ticket (Spark 1.5)

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6116:
---
Description: 
An umbrella ticket for DataFrame API improvements for Spark 1.5.

SPARK-9576 is the ticket for Spark 1.6.

  was:An umbrella ticket to track improvements and changes needed to make 
DataFrame API non-experimental.


 DataFrame API improvement umbrella ticket (Spark 1.5)
 -

 Key: SPARK-6116
 URL: https://issues.apache.org/jira/browse/SPARK-6116
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
  Labels: DataFrame

 An umbrella ticket for DataFrame API improvements for Spark 1.5.
 SPARK-9576 is the ticket for Spark 1.6.






[jira] [Updated] (SPARK-9363) SortMergeJoin operator should support UnsafeRow

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9363:
---
Sprint: Spark 1.5 release

 SortMergeJoin operator should support UnsafeRow
 ---

 Key: SPARK-9363
 URL: https://issues.apache.org/jira/browse/SPARK-9363
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Assignee: Josh Rosen

 The SortMergeJoin operator should implement the supportsUnsafeRow and 
 outputsUnsafeRow settings when appropriate.






[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2015-08-06 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660796#comment-14660796
 ] 

Patrick Wendell commented on SPARK-1517:


Hey Ryan,

IIRC - the Apache snapshot repository won't let us publish binaries that do not 
have SNAPSHOT in the version number. The reason is that it expects to see 
timestamped snapshots so that its garbage collection mechanism can work. We could 
look at adding sha1 hashes before SNAPSHOT, but I think there is some chance 
this would break their cleanup.

In terms of posting more binaries - I can look at whether Databricks or 
Berkeley might be able to donate S3 resources for this, but it would have to be 
clearly maintained by those organizations and not branded as official Apache 
releases or anything like that.

 Publish nightly snapshots of documentation, maven artifacts, and binary builds
 --

 Key: SPARK-1517
 URL: https://issues.apache.org/jira/browse/SPARK-1517
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Critical

 Should be pretty easy to do with Jenkins. The only thing I can think of that 
 would be tricky is to set up credentials so that jenkins can publish this 
 stuff somewhere on apache infra.
 Ideally we don't want to have to put a private key on every jenkins box 
 (since they are otherwise pretty stateless). One idea is to encrypt these 
 credentials with a passphrase and post them somewhere publicly visible. Then 
 the jenkins build can download the credentials provided we set a passphrase 
 in an environment variable in jenkins. There may be simpler solutions as well.






[jira] [Updated] (SPARK-5180) Data source API improvement

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5180:
---
Sprint: Spark 1.5 release

 Data source API improvement
 ---

 Key: SPARK-5180
 URL: https://issues.apache.org/jira/browse/SPARK-5180
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Blocker








[jira] [Updated] (SPARK-9691) PySpark SQL rand function treats seed 0 as no seed

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9691:
---
Sprint: Spark 1.5 doc/QA sprint

 PySpark SQL rand function treats seed 0 as no seed
 --

 Key: SPARK-9691
 URL: https://issues.apache.org/jira/browse/SPARK-9691
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.5.0
Reporter: Joseph K. Bradley
Assignee: Yin Huai

 PySpark SQL's rand() function tests for a seed in such a way that seed 
 0 is treated as no seed, leading to non-deterministic results when a user 
 would expect deterministic ones.
 See: 
 [https://github.com/apache/spark/blob/98e69467d4fda2c26a951409b5b7c6f1e9345ce4/python/pyspark/sql/functions.py#L271]






[jira] [Updated] (SPARK-9683) deep copy UTF8String when convert unsafe row to safe row

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9683:
---
Target Version/s: 1.5.0
Priority: Critical  (was: Major)

 deep copy UTF8String when convert unsafe row to safe row
 

 Key: SPARK-9683
 URL: https://issues.apache.org/jira/browse/SPARK-9683
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Priority: Critical








[jira] [Closed] (SPARK-9675) GenerateUnsafeProjection seems to corrupt MapType data

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-9675.
--
   Resolution: Duplicate
Fix Version/s: 1.5.0

 GenerateUnsafeProjection seems to corrupt MapType data
 --

 Key: SPARK-9675
 URL: https://issues.apache.org/jira/browse/SPARK-9675
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Wenchen Fan
Priority: Blocker
 Fix For: 1.5.0


 See https://github.com/apache/spark/pull/7981#issuecomment-128208233






[jira] [Updated] (SPARK-9596) Avoid reloading Hadoop classes like UserGroupInformation

2015-08-06 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9596:

Shepherd: Michael Armbrust

 Avoid reloading Hadoop classes like UserGroupInformation
 

 Key: SPARK-9596
 URL: https://issues.apache.org/jira/browse/SPARK-9596
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Tao Wang
Assignee: Tao Wang

 Some Hadoop classes contain global information, such as the authentication 
 state in UserGroupInformation. If we load them again in `IsolatedClientLoader`, 
 the information they carry will be dropped.
 So we should treat Hadoop classes as shared too.
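
 A minimal sketch of the "shared" treatment (illustrative class, not the real 
 IsolatedClientLoader): delegate Hadoop packages to the parent loader instead 
 of re-loading them in the isolated loader.
 {code}
 import java.net.{URL, URLClassLoader}

 // Sketch only: classes under org.apache.hadoop are delegated to the parent
 // loader, so global state such as UserGroupInformation is loaded only once.
 class IsolatedLoaderSketch(urls: Array[URL], parent: ClassLoader)
     extends URLClassLoader(urls, null) {
   private def isShared(name: String) = name.startsWith("org.apache.hadoop.")
   override def loadClass(name: String, resolve: Boolean): Class[_] =
     if (isShared(name)) parent.loadClass(name)
     else super.loadClass(name, resolve)
 }
 {code}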






[jira] [Updated] (SPARK-9665) ML 1.5 QA: API: Experimental, DeveloperApi, final audit

2015-08-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9665:
-
Description: 
We should make a pass through the items marked as Experimental or DeveloperApi 
and see if any are stable enough to be unmarked.  This will probably not 
include the Pipeline APIs yet since some parts (e.g., feature attributes) are 
still under flux.

We should also check for items marked final or sealed to see if they are stable 
enough to be opened up as APIs.

  was:We should make a pass through the items marked as Experimental or 
DeveloperApi and see if any are stable enough to be unmarked.  This will 
probably not include the Pipeline APIs yet since some parts (e.g., feature 
attributes) are still under flux.


 ML 1.5 QA: API: Experimental, DeveloperApi, final audit
 ---

 Key: SPARK-9665
 URL: https://issues.apache.org/jira/browse/SPARK-9665
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Joseph K. Bradley
Priority: Minor

 We should make a pass through the items marked as Experimental or 
 DeveloperApi and see if any are stable enough to be unmarked.  This will 
 probably not include the Pipeline APIs yet since some parts (e.g., feature 
 attributes) are still under flux.
 We should also check for items marked final or sealed to see if they are 
 stable enough to be opened up as APIs.






[jira] [Created] (SPARK-9707) Test sort-based fallback mode for dynamic partition insert

2015-08-06 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9707:
--

 Summary: Test sort-based fallback mode for dynamic partition insert
 Key: SPARK-9707
 URL: https://issues.apache.org/jira/browse/SPARK-9707
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin









[jira] [Updated] (SPARK-9665) ML 1.5 QA: API: Experimental, DeveloperApi, final, sealed audit

2015-08-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9665:
-
Summary: ML 1.5 QA: API: Experimental, DeveloperApi, final, sealed audit  
(was: ML 1.5 QA: API: Experimental, DeveloperApi, final audit)

 ML 1.5 QA: API: Experimental, DeveloperApi, final, sealed audit
 ---

 Key: SPARK-9665
 URL: https://issues.apache.org/jira/browse/SPARK-9665
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Joseph K. Bradley
Priority: Minor

 We should make a pass through the items marked as Experimental or 
 DeveloperApi and see if any are stable enough to be unmarked.  This will 
 probably not include the Pipeline APIs yet since some parts (e.g., feature 
 attributes) are still under flux.
 We should also check for items marked final or sealed to see if they are 
 stable enough to be opened up as APIs.






[jira] [Resolved] (SPARK-9211) HiveComparisonTest generates incorrect file name for golden answer files on Windows

2015-08-06 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-9211.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7563
[https://github.com/apache/spark/pull/7563]

 HiveComparisonTest generates incorrect file name for golden answer files on 
 Windows
 ---

 Key: SPARK-9211
 URL: https://issues.apache.org/jira/browse/SPARK-9211
 Project: Spark
  Issue Type: Test
  Components: SQL, Windows
Affects Versions: 1.4.1
 Environment: Windows
Reporter: Christian Kadner
Priority: Minor
  Labels: hive, sql, test, windows
 Fix For: 1.5.0


 The names of the golden answer files for the Hive test cases (test suites 
 based on {{HiveComparisonTest}}) are generated using an MD5 hash of the query 
 text. When the query text contains line breaks then the generated MD5 hash 
 differs between Windows and Linux/OSX ({{\r\n}} vs {{\n}}).
 This results in erroneously created golden answer files from just running a 
 Hive comparison test and makes it impossible to modify or add new test cases 
 with correctly named golden answer files on Windows.
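
 One plausible shape of a fix (a sketch with hypothetical names, not 
 necessarily what the merged patch does): normalize line endings before 
 hashing, so the generated file name is identical on every platform.
 {code}
 import java.security.MessageDigest

 // Sketch only: hash the query text with line endings normalized, so the
 // golden answer file name no longer depends on \r\n vs \n.
 def goldenFileName(query: String): String = {
   val normalized = query.replaceAll("\r\n", "\n")
   MessageDigest.getInstance("MD5")
     .digest(normalized.getBytes("UTF-8"))
     .map("%02x".format(_))
     .mkString
 }
 {code}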






[jira] [Commented] (SPARK-9453) Support records larger than default page size in UnsafeShuffleExternalSorter

2015-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660830#comment-14660830
 ] 

Apache Spark commented on SPARK-9453:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8005

 Support records larger than default page size in UnsafeShuffleExternalSorter
 

 Key: SPARK-9453
 URL: https://issues.apache.org/jira/browse/SPARK-9453
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Josh Rosen
Assignee: Davies Liu








[jira] [Resolved] (SPARK-9624) Make RateControllerSuite faster and more robust

2015-08-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-9624.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 Make RateControllerSuite faster and more robust
 ---

 Key: SPARK-9624
 URL: https://issues.apache.org/jira/browse/SPARK-9624
 Project: Spark
  Issue Type: Test
  Components: Streaming, Tests
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Minor
 Fix For: 1.5.0


 Tests in RateControllerSuite run with a 1 second batch interval and take 
 almost 10 seconds for the whole test suite. If we reduce the batch interval 
 to 100 ms, then the test "multiple publish rates reach receivers" becomes 
 flaky, as multiple rate updates may get applied before the rate is polled. 






[jira] [Resolved] (SPARK-9619) Rename Receiver.executor to Receiver.supervisor

2015-08-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-9619.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 Rename Receiver.executor to Receiver.supervisor 
 

 Key: SPARK-9619
 URL: https://issues.apache.org/jira/browse/SPARK-9619
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Minor
 Fix For: 1.5.0









[jira] [Resolved] (SPARK-9556) Make all BlockGenerators subscribe to rate limit updates

2015-08-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-9556.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 Make all BlockGenerators subscribe to rate limit updates
 

 Key: SPARK-9556
 URL: https://issues.apache.org/jira/browse/SPARK-9556
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
 Fix For: 1.5.0









[jira] [Assigned] (SPARK-9710) RPackageUtilsSuite fails if R is not installer

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9710:
---

Assignee: Apache Spark

 RPackageUtilsSuite fails if R is not installer
 --

 Key: SPARK-9710
 URL: https://issues.apache.org/jira/browse/SPARK-9710
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin
Assignee: Apache Spark

 That's because there's a bug in RUtils.scala. PR soon.






[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660871#comment-14660871
 ] 

Apache Spark commented on SPARK-8167:
-

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/8007

 Tasks that fail due to YARN preemption can cause job failure
 

 Key: SPARK-8167
 URL: https://issues.apache.org/jira/browse/SPARK-8167
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, YARN
Affects Versions: 1.3.1
Reporter: Patrick Woody
Assignee: Matt Cheah
Priority: Blocker

 Tasks that are running on preempted executors will count as FAILED with an 
 ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if 
 a large resource shift is occurring, and the tasks get scheduled to executors 
 that immediately get preempted as well.
 The current workaround is to increase spark.task.maxFailures very high, but 
 that can cause delays in true failures. We should ideally differentiate these 
 task statuses so that they don't count towards the failure limit.
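
 Illustrative pseudologic for the differentiation being proposed (hypothetical 
 types, not Spark's actual scheduler API):
 {code}
 // Sketch only: model failure reasons so that preemption-driven executor loss
 // does not advance the counter compared against spark.task.maxFailures.
 sealed trait TaskEndReason
 case object Preempted extends TaskEndReason        // executor taken back by YARN
 case object ExecutorLost extends TaskEndReason     // other executor loss
 case object ExceptionFailure extends TaskEndReason // genuine task error

 def countsTowardFailureLimit(reason: TaskEndReason): Boolean = reason match {
   case Preempted => false // don't punish tasks for cluster-level resource shifts
   case _         => true
 }
 {code}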






[jira] [Updated] (SPARK-9710) RPackageUtilsSuite fails if R is not installed

2015-08-06 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-9710:
--
Summary: RPackageUtilsSuite fails if R is not installed  (was: 
RPackageUtilsSuite fails if R is not installer)

 RPackageUtilsSuite fails if R is not installed
 --

 Key: SPARK-9710
 URL: https://issues.apache.org/jira/browse/SPARK-9710
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin

 That's because there's a bug in RUtils.scala. PR soon.






[jira] [Resolved] (SPARK-9630) Cleanup Hybrid Aggregate Operator.

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9630.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Cleanup Hybrid Aggregate Operator.
 --

 Key: SPARK-9630
 URL: https://issues.apache.org/jira/browse/SPARK-9630
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker
 Fix For: 1.5.0


 This is the follow-up of SPARK-9240 to address review comments and clean up 
 code.






[jira] [Assigned] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8167:
---

Assignee: Apache Spark  (was: Matt Cheah)

 Tasks that fail due to YARN preemption can cause job failure
 

 Key: SPARK-8167
 URL: https://issues.apache.org/jira/browse/SPARK-8167
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, YARN
Affects Versions: 1.3.1
Reporter: Patrick Woody
Assignee: Apache Spark
Priority: Blocker

 Tasks that are running on preempted executors will count as FAILED with an 
 ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if 
 a large resource shift is occurring, and the tasks get scheduled to executors 
 that immediately get preempted as well.
 The current workaround is to increase spark.task.maxFailures very high, but 
 that can cause delays in true failures. We should ideally differentiate these 
 task statuses so that they don't count towards the failure limit.






[jira] [Assigned] (SPARK-9710) RPackageUtilsSuite fails if R is not installer

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9710:
---

Assignee: (was: Apache Spark)

 RPackageUtilsSuite fails if R is not installer
 --

 Key: SPARK-9710
 URL: https://issues.apache.org/jira/browse/SPARK-9710
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin

 That's because there's a bug in RUtils.scala. PR soon.






[jira] [Commented] (SPARK-9710) RPackageUtilsSuite fails if R is not installer

2015-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660870#comment-14660870
 ] 

Apache Spark commented on SPARK-9710:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8008

 RPackageUtilsSuite fails if R is not installer
 --

 Key: SPARK-9710
 URL: https://issues.apache.org/jira/browse/SPARK-9710
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin

 That's because there's a bug in RUtils.scala. PR soon.






[jira] [Assigned] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8167:
---

Assignee: Matt Cheah  (was: Apache Spark)

 Tasks that fail due to YARN preemption can cause job failure
 

 Key: SPARK-8167
 URL: https://issues.apache.org/jira/browse/SPARK-8167
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, YARN
Affects Versions: 1.3.1
Reporter: Patrick Woody
Assignee: Matt Cheah
Priority: Blocker

 Tasks that are running on preempted executors will count as FAILED with an 
 ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if 
 a large resource shift is occurring, and the tasks get scheduled to executors 
 that immediately get preempted as well.
 The current workaround is to increase spark.task.maxFailures very high, but 
 that can cause delays in true failures. We should ideally differentiate these 
 task statuses so that they don't count towards the failure limit.






[jira] [Commented] (SPARK-8890) Reduce memory consumption for dynamic partition insert

2015-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660893#comment-14660893
 ] 

Apache Spark commented on SPARK-8890:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/8010

 Reduce memory consumption for dynamic partition insert
 --

 Key: SPARK-8890
 URL: https://issues.apache.org/jira/browse/SPARK-8890
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Michael Armbrust
Priority: Critical

 Currently, InsertIntoHadoopFsRelation can run out of memory if the number of 
 table partitions is large. The problem is that we open one output writer for 
 each partition, and when data are randomized and when the number of 
 partitions is large, we open a large number of output writers, leading to OOM.
 The solution here is to inject a sorting operation once the number of active 
 partitions is beyond a certain point (e.g. 50?)
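
 A minimal sketch of why sorting helps (hypothetical writer API, illustration 
 only): once rows arrive grouped by their partition key, only one output 
 writer needs to be open at a time.
 {code}
 import java.io.Writer

 // Sketch only: stream rows sorted by partition key, closing each writer as
 // soon as its partition is exhausted, so memory stays bounded by one writer.
 def writeSorted(rows: Iterator[(String, String)], open: String => Writer): Unit = {
   var currentKey: String = null
   var writer: Writer = null
   for ((key, value) <- rows) {
     if (key != currentKey) {
       if (writer != null) writer.close() // previous partition is finished
       writer = open(key)
       currentKey = key
     }
     writer.write(value + "\n")
   }
   if (writer != null) writer.close()
 }
 {code}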






[jira] [Updated] (SPARK-5180) Data source API improvement

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5180:
---
Issue Type: Story  (was: Umbrella)

 Data source API improvement
 ---

 Key: SPARK-5180
 URL: https://issues.apache.org/jira/browse/SPARK-5180
 Project: Spark
  Issue Type: Story
  Components: SQL
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Blocker








[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker

2015-08-06 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-9706:
-
Attachment: compat_reports.zip

 List Public API Compatibility Issues with japi-compliance checker
 -

 Key: SPARK-9706
 URL: https://issues.apache.org/jira/browse/SPARK-9706
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Reporter: Feynman Liang
 Attachments: compat_reports.zip


 Using command:
 {{japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
 spark-mllib_2.10-1.5.0-SNAPSHOT.jar}}
 Report result attached.






[jira] [Created] (SPARK-9706) Check Public API Compatibility with japi-compliance checker

2015-08-06 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-9706:


 Summary: Check Public API Compatibility with japi-compliance 
checker
 Key: SPARK-9706
 URL: https://issues.apache.org/jira/browse/SPARK-9706
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Reporter: Feynman Liang


Using command:

{{japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
spark-mllib_2.10-1.5.0-SNAPSHOT.jar}}

Report result attached.






[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker

2015-08-06 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-9706:
-
Description: 
To identify potential API issues, list the public API changes which affect binary 
and source compatibility, using the command:

{code}
japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
spark-mllib_2.10-1.5.0-SNAPSHOT.jar
{code}

Report result attached.

  was:
Using command:

{code}
japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
spark-mllib_2.10-1.5.0-SNAPSHOT.jar
{code}

Report result attached.


 List Public API Compatibility Issues with japi-compliance checker
 -

 Key: SPARK-9706
 URL: https://issues.apache.org/jira/browse/SPARK-9706
 Project: Spark
  Issue Type: Task
  Components: ML, MLlib
Reporter: Feynman Liang
 Attachments: compat_reports.zip


 To identify potential API issues, list the public API changes which affect binary 
 and source compatibility, using the command:
 {code}
 japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
 spark-mllib_2.10-1.5.0-SNAPSHOT.jar
 {code}
 Report result attached.






[jira] [Updated] (SPARK-9705) outdated Python 3 and IPython information

2015-08-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9705:
--
Target Version/s: 1.4.1, 1.4.0, 1.5.0  (was: 1.4.0, 1.4.1)

 outdated Python 3 and IPython information
 -

 Key: SPARK-9705
 URL: https://issues.apache.org/jira/browse/SPARK-9705
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, PySpark
Affects Versions: 1.4.0, 1.4.1, 1.5.0
Reporter: Piotr Migdał
  Labels: documentation
   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 https://issues.apache.org/jira/browse/SPARK-4897 adds Python 3.4 support to 
 1.4.0 and above, but the official docs (1.4.1, though the same holds for 1.4.0) 
 say explicitly:
 "Spark 1.4.1 works with Python 2.6 or higher (but not Python 3)."
 Affected:
 https://spark.apache.org/docs/1.4.0/programming-guide.html
 https://spark.apache.org/docs/1.4.1/programming-guide.html
 There are some other Python-related things which are outdated, e.g. this 
 line:
 "For example, to launch the IPython Notebook with PyLab plot support:"
 (At least since IPython 3.0, PyLab/Matplotlib support happens inside the 
 notebook, and the --pylab inline option has already been removed.)






[jira] [Resolved] (SPARK-9639) JobHandler may throw NPE if JobScheduler has been stopped

2015-08-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-9639.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 1.5.0

 JobHandler may throw NPE if JobScheduler has been stopped
 -

 Key: SPARK-9639
 URL: https://issues.apache.org/jira/browse/SPARK-9639
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
 Fix For: 1.5.0


 Because `JobScheduler.stop(false)` may set `eventLoop` to null while a 
 `JobHandler` is running, it's possible that when `post` is called, 
 `eventLoop` happens to be null.
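
 A self-contained sketch of the guard that avoids this NPE (hypothetical 
 names, not the actual JobScheduler code): read the possibly-null field once, 
 then test the local copy before using it.
 {code}
 // Sketch only: stop() may null the field while a handler thread is mid-flight,
 // so post() must not dereference the field twice.
 class SchedulerSketch {
   @volatile private var eventLoop: String => Unit = s => println(s"posted: $s")
   def stop(): Unit = { eventLoop = null }
   def post(event: String): Unit = {
     val loop = eventLoop          // single read of the volatile field
     if (loop != null) loop(event) // otherwise drop the event safely
   }
 }
 {code}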






[jira] [Created] (SPARK-9710) RPackageUtilsSuite fails if R is not installer

2015-08-06 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-9710:
-

 Summary: RPackageUtilsSuite fails if R is not installer
 Key: SPARK-9710
 URL: https://issues.apache.org/jira/browse/SPARK-9710
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin


That's because there's a bug in RUtils.scala. PR soon.






[jira] [Assigned] (SPARK-9709) Avoid starving an unsafe operator in a sort

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9709:
---

Assignee: Andrew Or  (was: Apache Spark)

 Avoid starving an unsafe operator in a sort
 ---

 Key: SPARK-9709
 URL: https://issues.apache.org/jira/browse/SPARK-9709
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical

 This concerns mainly TungstenSort.






[jira] [Updated] (SPARK-7398) Add back-pressure to Spark Streaming (umbrella JIRA)

2015-08-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-7398:
-
Target Version/s:   (was: 1.5.0)

 Add back-pressure to Spark Streaming (umbrella JIRA)
 

 Key: SPARK-7398
 URL: https://issues.apache.org/jira/browse/SPARK-7398
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.1
Reporter: François Garillot
Priority: Critical
  Labels: streams

 Spark Streaming has trouble dealing with situations where 
  batch processing time > batch interval
 meaning a high throughput of input data w.r.t. Spark's ability to remove data 
 from the queue.
 If this throughput is sustained for long enough, it leads to an unstable 
 situation where the memory of the Receiver's Executor is overflowed.
 This aims at transmitting a back-pressure signal back to data ingestion to 
 help deal with that high throughput, in a backwards-compatible way.
 The original design doc can be found here:
 https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing
 The second design doc, focusing [on the first 
 sub-task|https://issues.apache.org/jira/browse/SPARK-8834] (without all the 
 background info, and more centered on the implementation) can be found here:
 https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing
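
 The core of the mechanism can be sketched with a dynamically adjustable rate 
 limiter (illustrative names; Guava's RateLimiter is used here only for the 
 sketch, not because Spark's implementation is specified to use it this way):
 {code}
 import com.google.common.util.concurrent.RateLimiter

 // Sketch only: ingestion blocks when it outpaces the rate the backend can
 // sustain; a rate estimator updates the limit as processing times change.
 class ThrottledIngestSketch(initialRatePerSec: Double) {
   private val limiter = RateLimiter.create(initialRatePerSec)
   def updateRate(newRatePerSec: Double): Unit = limiter.setRate(newRatePerSec)
   def push(record: Array[Byte])(store: Array[Byte] => Unit): Unit = {
     limiter.acquire() // waits if we are over the current allowed rate
     store(record)
   }
 }
 {code}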






[jira] [Assigned] (SPARK-9703) EnsureRequirements should not add unnecessary shuffles when only ordering requirements are unsatisfied

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9703:
---

Assignee: Josh Rosen  (was: Apache Spark)

 EnsureRequirements should not add unnecessary shuffles when only ordering 
 requirements are unsatisfied
 --

 Key: SPARK-9703
 URL: https://issues.apache.org/jira/browse/SPARK-9703
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.4.0, 1.5.0
Reporter: Josh Rosen
Assignee: Josh Rosen

 Consider SortMergeJoin, which requires a sorted, clustered distribution of 
 its input rows. Say that both of SMJ's children produce unsorted output but 
 are both single partition. In this case, we will need to inject sort 
 operators but should not need to inject exchanges. Unfortunately, it looks 
 like the Exchange unnecessarily repartitions using a hash partitioning.
 We should update Exchange so that it does not unnecessarily repartition 
 children when only the ordering requirements are unsatisfied.
 I'd like to fix this for Spark 1.5 since it makes certain types of unit tests 
 easier to write.
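
 Illustrative pseudologic for the proposed behavior (hypothetical types, not 
 Spark's planner API): repartition only when the distribution is unsatisfied, 
 and add a sort alone when just the ordering is missing.
 {code}
 // Sketch only: decouple the two requirement checks so a missing ordering
 // never forces a shuffle on its own.
 trait PlanSketch { def distributionOk: Boolean; def orderingOk: Boolean }
 case class ExchangeSketch(child: PlanSketch) extends PlanSketch {
   val distributionOk = true; val orderingOk = false
 }
 case class SortSketch(child: PlanSketch) extends PlanSketch {
   val distributionOk = child.distributionOk; val orderingOk = true
 }

 def ensureRequirements(child: PlanSketch): PlanSketch = {
   val distributed = if (child.distributionOk) child else ExchangeSketch(child)
   if (distributed.orderingOk) distributed else SortSketch(distributed)
 }
 {code}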






[jira] [Updated] (SPARK-9228) Combine unsafe and codegen into a single option

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9228:
---
Sprint: Spark 1.5 release  (was: Spark 1.5 doc/QA sprint)

 Combine unsafe and codegen into a single option
 ---

 Key: SPARK-9228
 URL: https://issues.apache.org/jira/browse/SPARK-9228
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Davies Liu
Priority: Blocker

 Before QA, let's flip on features and consolidate unsafe and codegen.






[jira] [Updated] (SPARK-9228) Combine unsafe and codegen into a single option

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9228:
---
Sprint: Spark 1.5 doc/QA sprint

 Combine unsafe and codegen into a single option
 ---

 Key: SPARK-9228
 URL: https://issues.apache.org/jira/browse/SPARK-9228
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Davies Liu
Priority: Blocker

 Before QA, let's flip on the features and consolidate unsafe and codegen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9703) EnsureRequirements should not add unnecessary shuffles when only ordering requirements are unsatisfied

2015-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660743#comment-14660743
 ] 

Apache Spark commented on SPARK-9703:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7988

 EnsureRequirements should not add unnecessary shuffles when only ordering 
 requirements are unsatisfied
 --

 Key: SPARK-9703
 URL: https://issues.apache.org/jira/browse/SPARK-9703
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.4.0, 1.5.0
Reporter: Josh Rosen
Assignee: Josh Rosen

 Consider SortMergeJoin, which requires a sorted, clustered distribution of 
 its input rows. Say that both of SMJ's children produce unsorted output but 
 each consists of a single partition. In this case, we need to inject sort 
 operators but should not need to inject exchanges. Unfortunately, it looks 
 like Exchange unnecessarily repartitions using hash partitioning.
 We should update Exchange so that it does not unnecessarily repartition 
 children when only the ordering requirements are unsatisfied.
 I'd like to fix this for Spark 1.5 since it makes certain types of unit tests 
 easier to write.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9703) EnsureRequirements should not add unnecessary shuffles when only ordering requirements are unsatisfied

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9703:
---

Assignee: Apache Spark  (was: Josh Rosen)

 EnsureRequirements should not add unnecessary shuffles when only ordering 
 requirements are unsatisfied
 --

 Key: SPARK-9703
 URL: https://issues.apache.org/jira/browse/SPARK-9703
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.4.0, 1.5.0
Reporter: Josh Rosen
Assignee: Apache Spark

 Consider SortMergeJoin, which requires a sorted, clustered distribution of 
 its input rows. Say that both of SMJ's children produce unsorted output but 
 each consists of a single partition. In this case, we need to inject sort 
 operators but should not need to inject exchanges. Unfortunately, it looks 
 like Exchange unnecessarily repartitions using hash partitioning.
 We should update Exchange so that it does not unnecessarily repartition 
 children when only the ordering requirements are unsatisfied.
 I'd like to fix this for Spark 1.5 since it makes certain types of unit tests 
 easier to write.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9683) deep copy UTF8String when converting unsafe row to safe row

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9683:
---
Sprint: Spark 1.5 release

 deep copy UTF8String when converting unsafe row to safe row
 

 Key: SPARK-9683
 URL: https://issues.apache.org/jira/browse/SPARK-9683
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker

2015-08-06 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-9706:
-
Description: 
Using command:

{code}
japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
spark-mllib_2.10-1.5.0-SNAPSHOT.jar
{code}

Report result attached.

  was:
Using command:

{code}
japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
spark-mllib_2.10-1.5.0-SNAPSHOT.jar
{/code}

Report result attached.


 List Public API Compatibility Issues with japi-compliance checker
 -

 Key: SPARK-9706
 URL: https://issues.apache.org/jira/browse/SPARK-9706
 Project: Spark
  Issue Type: Task
  Components: ML, MLlib
Reporter: Feynman Liang
 Attachments: compat_reports.zip


 Using command:
 {code}
 japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
 spark-mllib_2.10-1.5.0-SNAPSHOT.jar
 {code}
 Report result attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6902) Row() object can be mutated even though it should be immutable

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6902:
---

Assignee: Davies Liu  (was: Apache Spark)

 Row() object can be mutated even though it should be immutable
 --

 Key: SPARK-6902
 URL: https://issues.apache.org/jira/browse/SPARK-6902
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Jonathan Arfa
Assignee: Davies Liu

 See the code snippet below; IMHO it shouldn't let you assign {{x.c = 5}} and 
 should instead raise an error.
 {quote}
 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
    /__ / .__/\_,_/_/ /_/\_\   version 1.2.0-SNAPSHOT
       /_/
 Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36)
 SparkContext available as sc.
 >>> from pyspark.sql import *
 >>> x = Row(a=1, b=2, c=3)
 >>> x
 Row(a=1, b=2, c=3)
 >>> x.__dict__
 {'__FIELDS__': ['a', 'b', 'c']}
 >>> x.c
 3
 >>> x.c = 5
 >>> x
 Row(a=1, b=2, c=3)
 >>> x.__dict__
 {'__FIELDS__': ['a', 'b', 'c'], 'c': 5}
 >>> x.c
 5
 {quote}
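 For contrast, a hypothetical sketch of the expected behaviour (this is not 
 PySpark's actual Row implementation):
 {code}
 class ImmutableRow(tuple):
     """Toy Row-like type that rejects attribute assignment."""
     def __new__(cls, **kwargs):
         row = super(ImmutableRow, cls).__new__(cls, tuple(kwargs.values()))
         # object.__setattr__ bypasses our own guard during construction.
         object.__setattr__(row, "__FIELDS__", list(kwargs))
         return row

     def __setattr__(self, name, value):
         raise AttributeError("Row is immutable; cannot set %r" % name)

 x = ImmutableRow(a=1, b=2, c=3)
 x.c = 5  # raises AttributeError instead of silently shadowing the field
 {code}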



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6902) Row() object can be mutated even though it should be immutable

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6902:
---

Assignee: Apache Spark  (was: Davies Liu)

 Row() object can be mutated even though it should be immutable
 --

 Key: SPARK-6902
 URL: https://issues.apache.org/jira/browse/SPARK-6902
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Jonathan Arfa
Assignee: Apache Spark

 See the code snippet below; IMHO it shouldn't let you assign {{x.c = 5}} and 
 should instead raise an error.
 {quote}
 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
    /__ / .__/\_,_/_/ /_/\_\   version 1.2.0-SNAPSHOT
       /_/
 Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36)
 SparkContext available as sc.
 >>> from pyspark.sql import *
 >>> x = Row(a=1, b=2, c=3)
 >>> x
 Row(a=1, b=2, c=3)
 >>> x.__dict__
 {'__FIELDS__': ['a', 'b', 'c']}
 >>> x.c
 3
 >>> x.c = 5
 >>> x
 Row(a=1, b=2, c=3)
 >>> x.__dict__
 {'__FIELDS__': ['a', 'b', 'c'], 'c': 5}
 >>> x.c
 5
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6902) Row() object can be mutated even though it should be immutable

2015-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660873#comment-14660873
 ] 

Apache Spark commented on SPARK-6902:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8009

 Row() object can be mutated even though it should be immutable
 --

 Key: SPARK-6902
 URL: https://issues.apache.org/jira/browse/SPARK-6902
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Jonathan Arfa
Assignee: Davies Liu

 See the code snippet below; IMHO it shouldn't let you assign {{x.c = 5}} and 
 should instead raise an error.
 {quote}
 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
    /__ / .__/\_,_/_/ /_/\_\   version 1.2.0-SNAPSHOT
       /_/
 Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36)
 SparkContext available as sc.
 >>> from pyspark.sql import *
 >>> x = Row(a=1, b=2, c=3)
 >>> x
 Row(a=1, b=2, c=3)
 >>> x.__dict__
 {'__FIELDS__': ['a', 'b', 'c']}
 >>> x.c
 3
 >>> x.c = 5
 >>> x
 Row(a=1, b=2, c=3)
 >>> x.__dict__
 {'__FIELDS__': ['a', 'b', 'c'], 'c': 5}
 >>> x.c
 5
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9211) HiveComparisonTest generates incorrect file name for golden answer files on Windows

2015-08-06 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9211:

Assignee: Christian Kadner

 HiveComparisonTest generates incorrect file name for golden answer files on 
 Windows
 ---

 Key: SPARK-9211
 URL: https://issues.apache.org/jira/browse/SPARK-9211
 Project: Spark
  Issue Type: Test
  Components: SQL, Windows
Affects Versions: 1.4.1
 Environment: Windows
Reporter: Christian Kadner
Assignee: Christian Kadner
Priority: Minor
  Labels: hive, sql, test, windows
 Fix For: 1.5.0


 The names of the golden answer files for the Hive test cases (test suites 
 based on {{HiveComparisonTest}}) are generated using an MD5 hash of the query 
 text. When the query text contains line breaks then the generated MD5 hash 
 differs between Windows and Linux/OSX ({{\r\n}} vs {{\n}}).
 This results in erroneously created golden answer files from just running a 
 Hive comparison test and makes it impossible to modify or add new test cases 
 with correctly named golden answer files on Windows.
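 A minimal sketch of the needed normalization (Python for illustration; 
 HiveComparisonTest itself is Scala, and the helper name is made up):
 {code}
 import hashlib

 def golden_file_name(query):
     # Normalize line endings before hashing so the resulting file name
     # is identical on Windows (\r\n) and Linux/OSX (\n).
     normalized = query.replace("\r\n", "\n")
     return hashlib.md5(normalized.encode("utf-8")).hexdigest()

 assert golden_file_name("select 1\r\nfrom t") == golden_file_name("select 1\nfrom t")
 {code}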



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9709) Avoid starving an unsafe operator in a sort

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9709:
---

Assignee: Apache Spark  (was: Andrew Or)

 Avoid starving an unsafe operator in a sort
 ---

 Key: SPARK-9709
 URL: https://issues.apache.org/jira/browse/SPARK-9709
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Apache Spark
Priority: Critical

 This concerns mainly TungstenSort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9709) Avoid starving an unsafe operator in a sort

2015-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660901#comment-14660901
 ] 

Apache Spark commented on SPARK-9709:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/8011

 Avoid starving an unsafe operator in a sort
 ---

 Key: SPARK-9709
 URL: https://issues.apache.org/jira/browse/SPARK-9709
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical

 This concerns mainly TungstenSort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9618) SQLContext.read.schema().parquet() ignores the supplied schema

2015-08-06 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660734#comment-14660734
 ] 

Reynold Xin commented on SPARK-9618:


[~lian cheng] this was not merged into branch-1.5. I just cherry-picked it.


 SQLContext.read.schema().parquet() ignores the supplied schema
 --

 Key: SPARK-9618
 URL: https://issues.apache.org/jira/browse/SPARK-9618
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1
Reporter: Nathan Howell
Assignee: Nathan Howell
Priority: Minor
 Fix For: 1.5.0


 If a user supplies a schema when loading a Parquet file it is ignored and the 
 schema is read off disk instead.
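 For reference, the affected call pattern (assuming a {{sqlContext}} in scope; 
 the path and schema below are made-up examples):
 {code}
 from pyspark.sql.types import StructType, StructField, LongType

 schema = StructType([StructField("id", LongType(), True)])
 df = sqlContext.read.schema(schema).parquet("/path/to/events.parquet")
 # Expected after the fix: df.schema == schema, rather than the schema
 # read back from the Parquet footers on disk.
 {code}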



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5180) Data source API improvement

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5180:
---
Issue Type: Umbrella  (was: Improvement)

 Data source API improvement
 ---

 Key: SPARK-5180
 URL: https://issues.apache.org/jira/browse/SPARK-5180
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5180) Data source API improvement

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5180:
---
Target Version/s: 1.5.0  (was: 1.6.0)

 Data source API improvement
 ---

 Key: SPARK-5180
 URL: https://issues.apache.org/jira/browse/SPARK-5180
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9705) outdated Python 3 and IPython information

2015-08-06 Thread JIRA
Piotr Migdał created SPARK-9705:
---

 Summary: outdated Python 3 and IPython information
 Key: SPARK-9705
 URL: https://issues.apache.org/jira/browse/SPARK-9705
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, PySpark
Affects Versions: 1.4.1, 1.4.0
Reporter: Piotr Migdał


https://issues.apache.org/jira/browse/SPARK-4897 adds Python 3.4 support to 
1.4.0 and above, but the official docs (1.4.1; the same holds for 1.4.0) say 
explicitly:
"Spark 1.4.1 works with Python 2.6 or higher (but not Python 3)."

Affected:
https://spark.apache.org/docs/1.4.0/programming-guide.html
https://spark.apache.org/docs/1.4.1/programming-guide.html

There are some other Python-related things that are outdated too, e.g. this line:
"For example, to launch the IPython Notebook with PyLab plot support:"
(At least since IPython 3.0, PyLab/Matplotlib support happens inside the 
notebook itself, and the --pylab inline option has been removed.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9704) Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9704:
---

Assignee: Joseph K. Bradley  (was: Apache Spark)

 Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier
 --

 Key: SPARK-9704
 URL: https://issues.apache.org/jira/browse/SPARK-9704
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 This JIRA is for making several ML APIs public to make it easier for users to 
 write their own Pipeline stages.
 Issue brought up by [~eronwright].  Descriptions below copied from 
 [http://apache-spark-developers-list.1001551.n3.nabble.com/Make-ML-Developer-APIs-public-post-1-4-td13583.html].
 We plan to make these APIs public in Spark 1.5.  However, they will be marked 
 DeveloperApi and are *very likely* to be broken in the future.
 * VectorUDT: To define a relation with a vector field, VectorUDT must be 
 instantiated (see the sketch below).
 * Identifiable trait: The trait generates a unique identifier for the 
 associated pipeline component.  Nice to have a consistent format by reusing 
 the trait.
 * ProbabilisticClassifier.  Third-party components should leverage the 
 complex logic around computing only selected columns.
 We will not yet make these public:
 * SchemaUtils: Third-party pipeline components have a need for checking 
 column types and appending columns.
 ** This will probably be moved into Spark SQL.  Users can copy the methods 
 into their own code as needed.
 * Shared Params (HasLabel, HasFeatures): This is covered in [SPARK-7146] but 
 reiterating it here.
 ** We need to discuss whether these should be standardized public APIs.  
 Users can copy the traits into their own code as needed.
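 As a sketch of the VectorUDT point (PySpark analogue for illustration; the 
 JIRA itself concerns the Scala API, and the column name is made up):
 {code}
 from pyspark.mllib.linalg import VectorUDT
 from pyspark.sql.types import StructType, StructField

 # Defining a relation with a vector-valued column requires
 # instantiating VectorUDT directly, hence the need for a public API.
 schema = StructType([StructField("features", VectorUDT(), False)])
 {code}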



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9704) Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier

2015-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660769#comment-14660769
 ] 

Apache Spark commented on SPARK-9704:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/8004

 Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier
 --

 Key: SPARK-9704
 URL: https://issues.apache.org/jira/browse/SPARK-9704
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 This JIRA is for making several ML APIs public to make it easier for users to 
 write their own Pipeline stages.
 Issue brought up by [~eronwright].  Descriptions below copied from 
 [http://apache-spark-developers-list.1001551.n3.nabble.com/Make-ML-Developer-APIs-public-post-1-4-td13583.html].
 We plan to make these APIs public in Spark 1.5.  However, they will be marked 
 DeveloperApi and are *very likely* to be broken in the future.
 * VectorUDT: To define a relation with a vector field, VectorUDT must be 
 instantiated.
 * Identifiable trait: The trait generates a unique identifier for the 
 associated pipeline component.  Nice to have a consistent format by reusing 
 the trait.
 * ProbabilisticClassifier.  Third-party components should leverage the 
 complex logic around computing only selected columns.
 We will not yet make these public:
 * SchemaUtils: Third-party pipeline components have a need for checking 
 column types and appending columns.
 ** This will probably be moved into Spark SQL.  Users can copy the methods 
 into their own code as needed.
 * Shared Params (HasLabel, HasFeatures): This is covered in [SPARK-7146] but 
 reiterating it here.
 ** We need to discuss whether these should be standardized public APIs.  
 Users can copy the traits into their own code as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9565) Spark SQL 1.5.0 QA/testing umbrella

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9565:
---
Issue Type: Story  (was: Test)

 Spark SQL 1.5.0 QA/testing umbrella
 ---

 Key: SPARK-9565
 URL: https://issues.apache.org/jira/browse/SPARK-9565
 Project: Spark
  Issue Type: Story
  Components: SQL
Reporter: Reynold Xin
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9704) Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9704:
---

Assignee: Apache Spark  (was: Joseph K. Bradley)

 Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier
 --

 Key: SPARK-9704
 URL: https://issues.apache.org/jira/browse/SPARK-9704
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Apache Spark

 This JIRA is for making several ML APIs public to make it easier for users to 
 write their own Pipeline stages.
 Issue brought up by [~eronwright].  Descriptions below copied from 
 [http://apache-spark-developers-list.1001551.n3.nabble.com/Make-ML-Developer-APIs-public-post-1-4-td13583.html].
 We plan to make these APIs public in Spark 1.5.  However, they will be marked 
 DeveloperApi and are *very likely* to be broken in the future.
 * VectorUDT: To define a relation with a vector field, VectorUDT must be 
 instantiated.
 * Identifiable trait: The trait generates a unique identifier for the 
 associated pipeline component.  Nice to have a consistent format by reusing 
 the trait.
 * ProbabilisticClassifier.  Third-party components should leverage the 
 complex logic around computing only selected columns.
 We will not yet make these public:
 * SchemaUtils: Third-party pipeline components have a need for checking 
 column types and appending columns.
 ** This will probably be moved into Spark SQL.  Users can copy the methods 
 into their own code as needed.
 * Shared Params (HasLabel, HasFeatures): This is covered in [SPARK-7146] but 
 reiterating it here.
 ** We need to discuss whether these should be standardized public APIs.  
 Users can copy the traits into their own code as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9704) Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier

2015-08-06 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660791#comment-14660791
 ] 

Eron Wright  commented on SPARK-9704:
-

Thanks for accepting the suggestions, and I agree with the workarounds 
suggested for SchemaUtils and shared params.

 Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier
 --

 Key: SPARK-9704
 URL: https://issues.apache.org/jira/browse/SPARK-9704
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 This JIRA is for making several ML APIs public to make it easier for users to 
 write their own Pipeline stages.
 Issue brought up by [~eronwright].  Descriptions below copied from 
 [http://apache-spark-developers-list.1001551.n3.nabble.com/Make-ML-Developer-APIs-public-post-1-4-td13583.html].
 We plan to make these APIs public in Spark 1.5.  However, they will be marked 
 DeveloperApi and are *very likely* to be broken in the future.
 * VectorUDT: To define a relation with a vector field, VectorUDT must be 
 instantiated.
 * Identifiable trait: The trait generates a unique identifier for the 
 associated pipeline component.  Nice to have a consistent format by reusing 
 the trait.
 * ProbabilisticClassifier.  Third-party components should leverage the 
 complex logic around computing only selected columns.
 We will not yet make these public:
 * SchemaUtils: Third-party pipeline components have a need for checking 
 column types and appending columns.
 ** This will probably be moved into Spark SQL.  Users can copy the methods 
 into their own code as needed.
 * Shared Params (HasLabel, HasFeatures): This is covered in [SPARK-7146] but 
 reiterating it here.
 ** We need to discuss whether these should be standardized public APIs.  
 Users can copy the traits into their own code as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9709) Avoid starving an unsafe operator in a sort

2015-08-06 Thread Andrew Or (JIRA)
Andrew Or created SPARK-9709:


 Summary: Avoid starving an unsafe operator in a sort
 Key: SPARK-9709
 URL: https://issues.apache.org/jira/browse/SPARK-9709
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical


This concerns mainly TungstenSort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9548) BytesToBytesMap could have a destructive iterator

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9548.

   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 1.5.0

 BytesToBytesMap could have a destructive iterator
 -

 Key: SPARK-9548
 URL: https://issues.apache.org/jira/browse/SPARK-9548
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen
Assignee: Liang-Chi Hsieh
Priority: Blocker
 Fix For: 1.5.0


 BytesToBytesMap.iterator() could be destructive, freeing each page as it 
 moves onto the next one.  There are some circumstances where we don't want a 
 destructive iterator (such as when we're building a KV sorter from a map), so 
 there should be a flag to control this.
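 A toy sketch of the idea (Python for illustration; the real BytesToBytesMap 
 is Java and frees memory pages rather than list slots):
 {code}
 class PagedMap(object):
     """Holds records in pages; iteration can optionally free pages."""
     def __init__(self, pages):
         self.pages = pages

     def iterator(self, destructive=False):
         for i, page in enumerate(self.pages):
             for record in page:
                 yield record
             if destructive:
                 self.pages[i] = None  # release each page once consumed

 m = PagedMap([[1, 2], [3, 4]])
 assert list(m.iterator(destructive=True)) == [1, 2, 3, 4]
 assert m.pages == [None, None]  # pages were freed during iteration
 {code}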



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries

2015-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660860#comment-14660860
 ] 

Apache Spark commented on SPARK-4561:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8006

 PySparkSQL's Row.asDict() should convert nested rows to dictionaries
 

 Key: SPARK-4561
 URL: https://issues.apache.org/jira/browse/SPARK-4561
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Josh Rosen
Assignee: Davies Liu

 In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a 
 dictionary.  Unfortunately, though, this does not convert nested rows to 
 dictionaries.  For example:
 {code}
 >>> sqlContext.sql("select results from results").first()
 Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), 
 Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), 
 Row(time=3.276), Row(time=3.239), Row(time=3.149)])
 >>> sqlContext.sql("select results from results").first().asDict()
 {u'results': [(3.762,),
   (3.47,),
   (3.559,),
   (3.458,),
   (3.229,),
   (3.21,),
   (3.166,),
   (3.276,),
   (3.239,),
   (3.149,)]}
 {code}
 Actually, it looks like the nested fields are just left as Rows (IPython's 
 fancy display logic obscured this in my first example):
 {code}
 >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
 {'results': [Row(time=1), Row(time=2)]}
 {code}
 Here's the output I'd expect:
 {code}
 >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
 {'results' : [{'time': 1}, {'time': 2}]}
 {code}
 I ran into this issue when trying to use Pandas dataframes to display nested 
 data that I queried from Spark SQL.
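 A recursive conversion along the requested lines, as a hypothetical helper 
 (not PySpark's actual Row.asDict):
 {code}
 def as_dict_deep(value):
     # Recursively convert Row-like objects (anything exposing asDict)
     # and lists of them into plain dicts.
     if hasattr(value, "asDict"):
         value = value.asDict()
     if isinstance(value, dict):
         return dict((k, as_dict_deep(v)) for k, v in value.items())
     if isinstance(value, list):
         return [as_dict_deep(v) for v in value]
     return value
 {code}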



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9645) External shuffle service does not work with kerberos on

2015-08-06 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-9645.
---
   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 1.5.0

 External shuffle service does not work with kerberos on
 ---

 Key: SPARK-9645
 URL: https://issues.apache.org/jira/browse/SPARK-9645
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Blocker
 Fix For: 1.5.0


 Lots of errors like this when running apps with the external shuffle service 
 and kerberos enabled:
 {noformat}
 15/08/05 06:26:18 WARN TaskSetManager: Lost task 2.0 in stage 2.0 (TID 12, 
 spark-nightly-2.vpc.cloudera.com): FetchFailed(BlockManagerId(2, 
 spark-nightly-2.vpc.cloudera.com, 7337), shuffleId=0, mapId=0, reduceId=2, 
 message=
 org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: 
 Failed to open file: 
 /yarn/nm/usercache/systest/appcache/application_1438780049118_0008/blockmgr-7178b106-6902-4082-8792-1c3e34b80d15/38/shuffle_0_0_0.index
   at 
 org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:203)
   at 
 org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:113)
   at 
 org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:80)
   at 
 org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:68)
   at 
 org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:114)
 {noformat}
 This is caused by commit c4830598 (SPARK-6287), which modified the 
 permissions of the directory storing the shuffle files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9363) SortMergeJoin operator should support UnsafeRow

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9363:
---

Assignee: Apache Spark  (was: Josh Rosen)

 SortMergeJoin operator should support UnsafeRow
 ---

 Key: SPARK-9363
 URL: https://issues.apache.org/jira/browse/SPARK-9363
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Assignee: Apache Spark

 The SortMergeJoin operator should implement the supportsUnsafeRow and 
 outputsUnsafeRow settings when appropriate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9363) SortMergeJoin operator should support UnsafeRow

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9363:
---

Assignee: Josh Rosen  (was: Apache Spark)

 SortMergeJoin operator should support UnsafeRow
 ---

 Key: SPARK-9363
 URL: https://issues.apache.org/jira/browse/SPARK-9363
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Assignee: Josh Rosen

 The SortMergeJoin operator should implement the supportsUnsafeRow and 
 outputsUnsafeRow settings when appropriate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9363) SortMergeJoin operator should support UnsafeRow

2015-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660720#comment-14660720
 ] 

Apache Spark commented on SPARK-9363:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7904

 SortMergeJoin operator should support UnsafeRow
 ---

 Key: SPARK-9363
 URL: https://issues.apache.org/jira/browse/SPARK-9363
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Assignee: Josh Rosen

 The SortMergeJoin operator should implement the supportsUnsafeRow and 
 outputsUnsafeRow settings when appropriate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9691) PySpark SQL rand function treats seed 0 as no seed

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9691:
---
Sprint: Spark 1.5 release  (was: Spark 1.5 doc/QA sprint)

 PySpark SQL rand function treats seed 0 as no seed
 --

 Key: SPARK-9691
 URL: https://issues.apache.org/jira/browse/SPARK-9691
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.5.0
Reporter: Joseph K. Bradley
Assignee: Yin Huai

 In PySpark SQL's rand() function, the seed argument is tested for truthiness, 
 so seed 0 is treated as no seed, leading to non-deterministic results where a 
 user would expect deterministic ones.
 See: 
 [https://github.com/apache/spark/blob/98e69467d4fda2c26a951409b5b7c6f1e9345ce4/python/pyspark/sql/functions.py#L271]
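 The root cause is a plain truthiness check; a minimal sketch of the bug and 
 the fix (the real code builds a JVM expression rather than a string):
 {code}
 def rand_buggy(seed=None):
     if seed:                  # 0 is falsy, so seed 0 is dropped
         return "rand(%d)" % seed
     return "rand()"           # unseeded, non-deterministic

 def rand_fixed(seed=None):
     if seed is not None:      # explicit None check keeps seed 0
         return "rand(%d)" % seed
     return "rand()"

 assert rand_buggy(0) == "rand()"   # seed silently ignored
 assert rand_fixed(0) == "rand(0)"  # deterministic, as the user expects
 {code}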



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9704) Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier

2015-08-06 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-9704:


 Summary: Make some ML APIs public: VectorUDT, Identifiable, 
ProbabilisticClassifier
 Key: SPARK-9704
 URL: https://issues.apache.org/jira/browse/SPARK-9704
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


This JIRA is for making several ML APIs public to make it easier for users to 
write their own Pipeline stages.

Issue brought up by [~eronwright].  Descriptions below copied from 
[http://apache-spark-developers-list.1001551.n3.nabble.com/Make-ML-Developer-APIs-public-post-1-4-td13583.html].

We plan to make these APIs public in Spark 1.5.  However, they will be marked 
DeveloperApi and are *very likely* to be broken in the future.
* VectorUDT: To define a relation with a vector field, VectorUDT must be 
instantiated.
* Identifiable trait: The trait generates a unique identifier for the 
associated pipeline component.  Nice to have a consistent format by reusing the 
trait.
* ProbabilisticClassifier.  Third-party components should leverage the complex 
logic around computing only selected columns.

We will not yet make these public:
* SchemaUtils: Third-party pipeline components have a need for checking column 
types and appending columns.
** This will probably be moved into Spark SQL.  Users can copy the methods into 
their own code as needed.
* Shared Params (HasLabel, HasFeatures): This is covered in [SPARK-7146] but 
reiterating it here.
** We need to discuss whether these should be standardized public APIs.  Users 
can copy the traits into their own code as needed.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker

2015-08-06 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-9706:
-
Issue Type: Task  (was: Bug)

 List Public API Compatibility Issues with japi-compliance checker
 -

 Key: SPARK-9706
 URL: https://issues.apache.org/jira/browse/SPARK-9706
 Project: Spark
  Issue Type: Task
  Components: ML, MLlib
Reporter: Feynman Liang
 Attachments: compat_reports.zip


 Using command:
 {{japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
 spark-mllib_2.10-1.5.0-SNAPSHOT.jar}}
 Report result attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker

2015-08-06 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-9706:
-
Description: 
Using command:

{code}
japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
spark-mllib_2.10-1.5.0-SNAPSHOT.jar
{/code}

Report result attached.

  was:
Using command:

{{code}}
japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
spark-mllib_2.10-1.5.0-SNAPSHOT.jar
{{/code}}

Report result attached.


 List Public API Compatibility Issues with japi-compliance checker
 -

 Key: SPARK-9706
 URL: https://issues.apache.org/jira/browse/SPARK-9706
 Project: Spark
  Issue Type: Task
  Components: ML, MLlib
Reporter: Feynman Liang
 Attachments: compat_reports.zip


 Using command:
 {code}
 japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
 spark-mllib_2.10-1.5.0-SNAPSHOT.jar
 {/code}
 Report result attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker

2015-08-06 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-9706:
-
Description: 
Using command:

{{code}}
japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
spark-mllib_2.10-1.5.0-SNAPSHOT.jar
{{/code}}

Report result attached.

  was:
Using command:

{{japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
spark-mllib_2.10-1.5.0-SNAPSHOT.jar}}

Report result attached.


 List Public API Compatibility Issues with japi-compliance checker
 -

 Key: SPARK-9706
 URL: https://issues.apache.org/jira/browse/SPARK-9706
 Project: Spark
  Issue Type: Task
  Components: ML, MLlib
Reporter: Feynman Liang
 Attachments: compat_reports.zip


 Using command:
 {{code}}
 japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
 spark-mllib_2.10-1.5.0-SNAPSHOT.jar
 {{/code}}
 Report result attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8848) Write Parquet LISTs and MAPs conforming to Parquet format spec

2015-08-06 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8848:

Target Version/s: 1.5.0  (was: 1.6.0)

 Write Parquet LISTs and MAPs conforming to Parquet format spec
 --

 Key: SPARK-8848
 URL: https://issues.apache.org/jira/browse/SPARK-8848
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 [Parquet format PR #17|https://github.com/apache/parquet-format/pull/17] 
 standardized structures of Parquet complex types (LIST & MAP). Spark SQL 
 should follow this spec and write Parquet data conforming to the standard.
 Note that although the Parquet files currently written by Spark SQL are 
 non-standard (because the Parquet format spec wasn't clear about this part when 
 Spark SQL Parquet support was authored), they are still compatible with the most 
 recent Parquet format spec, because the format we use is covered by the 
 backwards-compatibility rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9381) Migrate JSON data source to the new partitioning data source

2015-08-06 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660735#comment-14660735
 ] 

Reynold Xin commented on SPARK-9381:


This was also not merged into branch-1.5. I cherry-picked it.


 Migrate JSON data source to the new partitioning data source
 

 Key: SPARK-9381
 URL: https://issues.apache.org/jira/browse/SPARK-9381
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7550) Support setting the right schema & serde when writing to Hive metastore

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7550:
---
Fix Version/s: 1.5.0

 Support setting the right schema & serde when writing to Hive metastore
 ---

 Key: SPARK-7550
 URL: https://issues.apache.org/jira/browse/SPARK-7550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Reynold Xin
Assignee: Cheng Hao
Priority: Blocker
 Fix For: 1.5.0


 As of 1.4, Spark SQL does not properly set the table schema and serde when 
 writing a table to Hive's metastore. Would be great to do that properly so 
 users can use non-Spark SQL systems to read those tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6923) Spark SQL CLI does not read Data Source schema correctly

2015-08-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6923:
---
Fix Version/s: 1.5.0

 Spark SQL CLI does not read Data Source schema correctly
 

 Key: SPARK-6923
 URL: https://issues.apache.org/jira/browse/SPARK-6923
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: pin_zhang
Assignee: Cheng Hao
Priority: Critical
 Fix For: 1.5.0


 {code:java}
 HiveContext hctx = new HiveContext(sc);
 List<String> sample = new ArrayList<String>();
 sample.add("{\"id\": \"id_1\", \"age\":1}");
 RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
 DataFrame df = hctx.jsonRDD(sampleRDD);
 String table = "test";
 df.saveAsTable(table, "json", SaveMode.Overwrite);
 Table t = hctx.catalog().client().getTable(table);
 System.out.println(t.getCols());
 {code}
 --
 With the code above saving a DataFrame to a Hive table, getting the table 
 cols returns a single column named 'col':
 [FieldSchema(name:col, type:array<string>, comment:from deserializer)]
 The expected fields schema is: id, age.
 As a result, the JDBC API cannot retrieve the table columns via ResultSet 
 DatabaseMetaData.getColumns(String catalog, String schemaPattern, String 
 tableNamePattern, String columnNamePattern),
 but the result set metadata for the query "select * from test" does contain 
 the fields id and age.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7550) Support setting the right schema & serde when writing to Hive metastore

2015-08-06 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660739#comment-14660739
 ] 

Reynold Xin commented on SPARK-7550:


I cherry-picked it into branch-1.5.


 Support setting the right schema & serde when writing to Hive metastore
 ---

 Key: SPARK-7550
 URL: https://issues.apache.org/jira/browse/SPARK-7550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Reynold Xin
Assignee: Cheng Hao
Priority: Blocker
 Fix For: 1.5.0


 As of 1.4, Spark SQL does not properly set the table schema and serde when 
 writing a table to Hive's metastore. Would be great to do that properly so 
 users can use non-Spark SQL systems to read those tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker

2015-08-06 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang resolved SPARK-9706.
--
Resolution: Fixed

 List Public API Compatibility Issues with japi-compliance checker
 -

 Key: SPARK-9706
 URL: https://issues.apache.org/jira/browse/SPARK-9706
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Reporter: Feynman Liang
 Attachments: compat_reports.zip


 Using command:
 {{japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
 spark-mllib_2.10-1.5.0-SNAPSHOT.jar}}
 Report result attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker

2015-08-06 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-9706:
-
Summary: List Public API Compatibility Issues with japi-compliance checker  
(was: Check Public API Compatibility with japi-compliance checker)

 List Public API Compatibility Issues with japi-compliance checker
 -

 Key: SPARK-9706
 URL: https://issues.apache.org/jira/browse/SPARK-9706
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Reporter: Feynman Liang
 Attachments: compat_reports.zip


 Using command:
 {{japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar 
 spark-mllib_2.10-1.5.0-SNAPSHOT.jar}}
 Report result attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2015-08-06 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660816#comment-14660816
 ] 

Ryan Williams commented on SPARK-1517:
--

That all makes sense, thanks.

I guess I was imagining we could publish the SHA'd snapshots to a Maven 
repository other than the Apache snapshot repository, especially if the latter 
has rules that make this inconvenient.

Understood that the binaries (and Maven artifacts) would have to be clearly 
branded as not official Apache releases.

If I came up with a URL that binaries could be uploaded to, what would have to 
change to make it happen? Likewise if I found a Maven repository that could 
host these artifacts?

 Publish nightly snapshots of documentation, maven artifacts, and binary builds
 --

 Key: SPARK-1517
 URL: https://issues.apache.org/jira/browse/SPARK-1517
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Critical

 Should be pretty easy to do with Jenkins. The only thing I can think of that 
 would be tricky is to set up credentials so that jenkins can publish this 
 stuff somewhere on apache infra.
 Ideally we don't want to have to put a private key on every jenkins box 
 (since they are otherwise pretty stateless). One idea is to encrypt these 
 credentials with a passphrase and post them somewhere publicly visible. Then 
 the jenkins build can download the credentials provided we set a passphrase 
 in an environment variable in jenkins. There may be simpler solutions as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9711) spark-submit fails after restarting cluster with spark-ec2

2015-08-06 Thread Guangyang Li (JIRA)
Guangyang Li created SPARK-9711:
---

 Summary: spark-submit fails after restarting cluster with spark-ec2
 Key: SPARK-9711
 URL: https://issues.apache.org/jira/browse/SPARK-9711
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.1
Reporter: Guangyang Li


With Spark 1.4.1 in YARN client mode, my job works the first time the 
cluster is built. If I stop and start the cluster, the same /bin/spark-submit 
command fails and keeps trying to connect to the master node:

INFO Client: Retrying connect to server: 
ec2-54-174-232-129.compute-1.amazonaws.com/172.31.36.29:8032. Already tried 0 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9683) deep copy UTF8String when converting unsafe row to safe row

2015-08-06 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-9683:
--
Assignee: Wenchen Fan

 deep copy UTF8String when converting unsafe row to safe row
 

 Key: SPARK-9683
 URL: https://issues.apache.org/jira/browse/SPARK-9683
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9676) Execute GraphX library from R

2015-08-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9676.
--
Resolution: Invalid

Please use u...@spark.apache.org to ask questions.

 Execute GraphX library from R
 -

 Key: SPARK-9676
 URL: https://issues.apache.org/jira/browse/SPARK-9676
 Project: Spark
  Issue Type: Brainstorming
  Components: R
Affects Versions: 1.6.0
Reporter: Sudhindra

 I wanted to use GraphX from SparkR; is there a way to do it? I think that as 
 of now it is not possible. I was wondering whether one could write a wrapper 
 in R that calls the Scala GraphX libraries.
 Any thoughts on this, please?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9685) Unsupported dataType: char(X) in Hive

2015-08-06 Thread JIRA
Ángel Álvarez created SPARK-9685:


 Summary: Unsupported dataType: char(X) in Hive
 Key: SPARK-9685
 URL: https://issues.apache.org/jira/browse/SPARK-9685
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Ángel Álvarez


I'm getting the following error when I try to read a Hive table with char(X) 
fields:

{code}
15/08/06 11:38:51 INFO parse.ParseDriver: Parse Completed
org.apache.spark.sql.types.DataTypeException: Unsupported dataType: char(8). If 
you have a struct and a field name of it has any special characters, please use 
backticks (`) to quote that field name, e.g. `x+y`. Please note that backtick 
itself is not supported in a field name.
at 
org.apache.spark.sql.types.DataTypeParser$class.toDataType(DataTypeParser.scala:95)
at 
org.apache.spark.sql.types.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:107)
at 
org.apache.spark.sql.types.DataTypeParser$.parse(DataTypeParser.scala:111)
at 
org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:769)
at 
org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:742)
at 
org.apache.spark.sql.hive.MetastoreRelation$$anonfun$44.apply(HiveMetastoreCatalog.scala:752)
at 
org.apache.spark.sql.hive.MetastoreRelation$$anonfun$44.apply(HiveMetastoreCatalog.scala:752)
{code}

It seems there is no char DataType defined in the DataTypeParser class
{code}
  protected lazy val primitiveType: Parser[DataType] =
    "(?i)string".r ^^^ StringType |
    "(?i)float".r ^^^ FloatType |
    "(?i)(?:int|integer)".r ^^^ IntegerType |
    "(?i)tinyint".r ^^^ ByteType |
    "(?i)smallint".r ^^^ ShortType |
    "(?i)double".r ^^^ DoubleType |
    "(?i)(?:bigint|long)".r ^^^ LongType |
    "(?i)binary".r ^^^ BinaryType |
    "(?i)boolean".r ^^^ BooleanType |
    fixedDecimalType |
    "(?i)decimal".r ^^^ DecimalType.USER_DEFAULT |
    "(?i)date".r ^^^ DateType |
    "(?i)timestamp".r ^^^ TimestampType |
    varchar
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9641) spark.shuffle.service.port is not documented

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9641:
---

Assignee: (was: Apache Spark)

 spark.shuffle.service.port is not documented
 

 Key: SPARK-9641
 URL: https://issues.apache.org/jira/browse/SPARK-9641
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Shuffle
Reporter: Thomas Graves
Priority: Minor

 Looking at the code I see spark.shuffle.service.port being used but I can't 
 find any documentation on it.   I don't see a reason for this to be an 
 internal config so we should document it.
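 For reference, the setting as one would use it (7337 appears to be the 
 default, matching the port in the external shuffle service logs above):
 {code}
 from pyspark import SparkConf

 # Must match the port the external shuffle service listens on.
 conf = SparkConf().set("spark.shuffle.service.port", "7337")
 {code}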



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9684) sbt/sbt assembly Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar

2015-08-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9684.
--
Resolution: Duplicate

[~prabeeshk] please search JIRA first. Also, you're reporting against an ancient 
version of Spark; you probably need to evaluate whether it's fixed in master 
first.

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

 sbt/sbt assembly Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar
 

 Key: SPARK-9684
 URL: https://issues.apache.org/jira/browse/SPARK-9684
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0, 1.1.1
Reporter: Prabeesh K

 While running sbt/sbt assembly, I got the following error:
 Attempting to fetch sbt
 Launching sbt from sbt/sbt-launch-0.13.6.jar
 Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9641) spark.shuffle.service.port is not documented

2015-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659778#comment-14659778
 ] 

Apache Spark commented on SPARK-9641:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7991

 spark.shuffle.service.port is not documented
 

 Key: SPARK-9641
 URL: https://issues.apache.org/jira/browse/SPARK-9641
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Shuffle
Reporter: Thomas Graves
Priority: Minor

 Looking at the code I see spark.shuffle.service.port being used, but I can't 
 find any documentation on it. I don't see a reason for this to be an internal 
 config, so we should document it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9684) sbt/sbt assembly Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar

2015-08-06 Thread Prabeesh K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659779#comment-14659779
 ] 

Prabeesh K commented on SPARK-9684:
---

[~pwendell] Please have a look.

 sbt/sbt assembly Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar
 

 Key: SPARK-9684
 URL: https://issues.apache.org/jira/browse/SPARK-9684
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0, 1.1.1
Reporter: Prabeesh K

 While running sbt/sbt assembly, I got the following error:
 Attempting to fetch sbt
 Launching sbt from sbt/sbt-launch-0.13.6.jar
 Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9641) spark.shuffle.service.port is not documented

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9641:
---

Assignee: Apache Spark

 spark.shuffle.service.port is not documented
 

 Key: SPARK-9641
 URL: https://issues.apache.org/jira/browse/SPARK-9641
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Shuffle
Reporter: Thomas Graves
Assignee: Apache Spark
Priority: Minor

 Looking at the code I see spark.shuffle.service.port being used, but I can't 
 find any documentation on it. I don't see a reason for this to be an internal 
 config, so we should document it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9684) sbt/sbt assembly Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar

2015-08-06 Thread Prabeesh K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659785#comment-14659785
 ] 

Prabeesh K commented on SPARK-9684:
---

 It is fixed in master, but I can't find a Spark 1.x release that includes the fix.

 sbt/sbt assembly Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar
 

 Key: SPARK-9684
 URL: https://issues.apache.org/jira/browse/SPARK-9684
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0, 1.1.1
Reporter: Prabeesh K

 While running sbt/sbt assembly, I got the following error:
 Attempting to fetch sbt
 Launching sbt from sbt/sbt-launch-0.13.6.jar
 Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9681) Support R feature interactions in RFormula

2015-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9681:
---

Assignee: Apache Spark

 Support R feature interactions in RFormula
 --

 Key: SPARK-9681
 URL: https://issues.apache.org/jira/browse/SPARK-9681
 Project: Spark
  Issue Type: Improvement
  Components: ML, SparkR
Reporter: Eric Liang
Assignee: Apache Spark

 Support the interaction (:) operator in the RFormula feature transformer, so 
 that it is available for use in SparkR's glm.
 Umbrella design doc for RFormula integration: 
 https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?pli=1
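 A hypothetical sketch of what the transformer call would look like once the 
 operator is supported ({{training}} is an assumed DataFrame with columns y, a 
 and b):
{code}
import org.apache.spark.ml.feature.RFormula

// y ~ a + b + a:b adds the product of columns a and b as an extra feature,
// matching R's model-formula semantics for the interaction operator.
val formula = new RFormula()
  .setFormula("y ~ a + b + a:b")
  .setFeaturesCol("features")
  .setLabelCol("label")
val prepared = formula.fit(training).transform(training)
{code}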



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9683) deep copy UTF8String when convert unsafe row to safe row

2015-08-06 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-9683:
--

 Summary: deep copy UTF8String when convert unsafe row to safe row
 Key: SPARK-9683
 URL: https://issues.apache.org/jira/browse/SPARK-9683
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


