[jira] [Commented] (SPARK-9690) Adding the possibility to set the seed of the rand in the CrossValidator fold
[ https://issues.apache.org/jira/browse/SPARK-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660270#comment-14660270 ]

Apache Spark commented on SPARK-9690:

User 'mmenestret' has created a pull request for this issue:
https://github.com/apache/spark/pull/7997

Adding the possibility to set the seed of the rand in the CrossValidator fold

Key: SPARK-9690
URL: https://issues.apache.org/jira/browse/SPARK-9690
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.4.1
Reporter: Martin Menestret
Priority: Minor
Fix For: 1.5.0
Original Estimate: 1h
Remaining Estimate: 1h

The fold in the ML CrossValidator depends on a rand whose seed is set to 0, which leads sql.functions.rand to call sc._jvm.functions.rand() with no seed. To be able to unit test a cross validation, it would be a good idea to be able to set this seed so that the output of the cross validation (with a featureSubsetStrategy set to "all") would always be the same.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
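The determinism the reporter asks for can be illustrated outside Spark. The sketch below is a minimal Python stand-in for CrossValidator's fold split (it is not Spark's implementation; `assign_folds` is a hypothetical helper): once the fold-assigning RNG takes an explicit seed, the split — and therefore the cross-validation output — becomes reproducible in a unit test.

```python
import random

def assign_folds(n_rows, num_folds, seed):
    # Hypothetical stand-in for CrossValidator's fold assignment:
    # each row is mapped to one of num_folds folds by a seeded RNG.
    rng = random.Random(seed)
    return [rng.randrange(num_folds) for _ in range(n_rows)]

# With an explicit seed the split is reproducible, which is exactly
# what a unit test of cross validation needs.
assert assign_folds(100, 3, seed=0) == assign_folds(100, 3, seed=0)
```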
[jira] [Assigned] (SPARK-9690) Adding the possibility to set the seed of the rand in the CrossValidator fold
[ https://issues.apache.org/jira/browse/SPARK-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9690:

Assignee: Apache Spark

Adding the possibility to set the seed of the rand in the CrossValidator fold

Key: SPARK-9690
URL: https://issues.apache.org/jira/browse/SPARK-9690
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.4.1
Reporter: Martin Menestret
Assignee: Apache Spark
Priority: Minor
Fix For: 1.5.0
Original Estimate: 1h
Remaining Estimate: 1h

The fold in the ML CrossValidator depends on a rand whose seed is set to 0, which leads sql.functions.rand to call sc._jvm.functions.rand() with no seed. To be able to unit test a cross validation, it would be a good idea to be able to set this seed so that the output of the cross validation (with a featureSubsetStrategy set to "all") would always be the same.
[jira] [Assigned] (SPARK-9690) Adding the possibility to set the seed of the rand in the CrossValidator fold
[ https://issues.apache.org/jira/browse/SPARK-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9690:

Assignee: (was: Apache Spark)

Adding the possibility to set the seed of the rand in the CrossValidator fold

Key: SPARK-9690
URL: https://issues.apache.org/jira/browse/SPARK-9690
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.4.1
Reporter: Martin Menestret
Priority: Minor
Fix For: 1.5.0
Original Estimate: 1h
Remaining Estimate: 1h

The fold in the ML CrossValidator depends on a rand whose seed is set to 0, which leads sql.functions.rand to call sc._jvm.functions.rand() with no seed. To be able to unit test a cross validation, it would be a good idea to be able to set this seed so that the output of the cross validation (with a featureSubsetStrategy set to "all") would always be the same.
[jira] [Commented] (SPARK-9182) filter and groupBy on DataFrames are not passed through to jdbc source
[ https://issues.apache.org/jira/browse/SPARK-9182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660278#comment-14660278 ]

Cheng Lian commented on SPARK-9182:

Hey [~grahn], sorry for the late reply, I somehow missed your last two comments. Thanks for the detailed information. I'm able to reproduce this issue locally now. Confirmed that it's related to NUMERIC. Trying to deliver a fix for this.

filter and groupBy on DataFrames are not passed through to jdbc source

Key: SPARK-9182
URL: https://issues.apache.org/jira/browse/SPARK-9182
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.1
Reporter: Greg Rahn

When running all of these API calls, the only one that passes the filter through to the backend jdbc source is equality. All filters in these commands should be able to be passed through to the jdbc database source.

{code}
val url = "jdbc:postgresql:grahn"
val prop = new java.util.Properties
val emp = sqlContext.read.jdbc(url, "emp", prop)
emp.filter(emp("sal") === 5000).show()
emp.filter(emp("sal") > 5000).show()
emp.filter("sal = 3000").show()
emp.filter("sal > 2500").show()
emp.filter("sal >= 2500").show()
emp.filter("sal < 2500").show()
emp.filter("sal <= 2500").show()
emp.filter("sal != 3000").show()
emp.filter("sal between 3000 and 5000").show()
emp.filter("ename in ('SCOTT','BLAKE')").show()
{code}

We see from the PostgreSQL query log that the following is run, and that only equality predicates are passed through.
{code}
LOG: execute <unnamed>: SET extra_float_digits = 3
LOG: execute <unnamed>: SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp WHERE sal = 5000
LOG: execute <unnamed>: SET extra_float_digits = 3
LOG: execute <unnamed>: SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
LOG: execute <unnamed>: SET extra_float_digits = 3
LOG: execute <unnamed>: SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp WHERE sal = 3000
LOG: execute <unnamed>: SET extra_float_digits = 3
LOG: execute <unnamed>: SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
LOG: execute <unnamed>: SET extra_float_digits = 3
LOG: execute <unnamed>: SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
LOG: execute <unnamed>: SET extra_float_digits = 3
LOG: execute <unnamed>: SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
LOG: execute <unnamed>: SET extra_float_digits = 3
LOG: execute <unnamed>: SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
LOG: execute <unnamed>: SET extra_float_digits = 3
LOG: execute <unnamed>: SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
LOG: execute <unnamed>: SET extra_float_digits = 3
LOG: execute <unnamed>: SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
LOG: execute <unnamed>: SET extra_float_digits = 3
LOG: execute <unnamed>: SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp
{code}
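The pushdown gap above is easiest to see as a predicate-compilation problem. The sketch below is a hypothetical Python analogue of what a JDBC source's filter compiler must do (it is not Spark's JDBCRDD code; `compile_filter` and its operator table are illustrative assumptions): any predicate it cannot render as a SQL WHERE fragment falls back to a full table scan, which is what the query log shows for everything except equality.

```python
def compile_filter(column, op, value):
    # Hypothetical filter compiler for a JDBC source: return a SQL WHERE
    # fragment for predicates we can push down, or None to signal that
    # the predicate must be evaluated in Spark after a full table scan.
    supported = {"=", ">", ">=", "<", "<=", "<>"}
    if op not in supported:
        return None
    rendered = "'{}'".format(value) if isinstance(value, str) else str(value)
    return "{} {} {}".format(column, op, rendered)

assert compile_filter("sal", "=", 5000) == "sal = 5000"
assert compile_filter("sal", ">=", 2500) == "sal >= 2500"
# An unsupported predicate is not pushed down:
assert compile_filter("ename", "in", ("SCOTT", "BLAKE")) is None
```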
[jira] [Updated] (SPARK-9690) Adding the possibility to set the seed of the rand in the CrossValidator fold
[ https://issues.apache.org/jira/browse/SPARK-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-9690:

Fix Version/s: (was: 1.5.0)

[~Mmenestret] Don't set fix version: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Adding the possibility to set the seed of the rand in the CrossValidator fold

Key: SPARK-9690
URL: https://issues.apache.org/jira/browse/SPARK-9690
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.4.1
Reporter: Martin Menestret
Priority: Minor
Original Estimate: 1h
Remaining Estimate: 1h

The fold in the ML CrossValidator depends on a rand whose seed is set to 0, which leads sql.functions.rand to call sc._jvm.functions.rand() with no seed. To be able to unit test a cross validation, it would be a good idea to be able to set this seed so that the output of the cross validation (with a featureSubsetStrategy set to "all") would always be the same.
[jira] [Commented] (SPARK-9515) Creating JavaSparkContext with yarn-cluster mode throws NPE
[ https://issues.apache.org/jira/browse/SPARK-9515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660286#comment-14660286 ]

nirav patel commented on SPARK-9515:

[~srowen] I just gave a reason why I can't use the spark-submit script, as you asked in the first comment. I agree this should be more of a forum question, but I thought the NPE is something you may want to handle better.

Creating JavaSparkContext with yarn-cluster mode throws NPE

Key: SPARK-9515
URL: https://issues.apache.org/jira/browse/SPARK-9515
Project: Spark
Issue Type: Bug
Components: Java API
Affects Versions: 1.3.1
Reporter: nirav patel

I have a Spark application that runs against a YARN cluster. I run the Spark application as part of my web application and can't use the spark-submit script. The way I run it is `java -cp myApp.jar com.myapp.Application`, which in turn initiates the JavaSparkContext. It used to work with Spark 1.0.2 and a standalone cluster, but now with 1.3.1 and YARN it's failing.

{code}
Caused by: java.lang.NullPointerException
    at org.apache.spark.deploy.yarn.ApplicationMaster$.sparkContextInitialized(ApplicationMaster.scala:580)
    at org.apache.spark.scheduler.cluster.YarnClusterScheduler.postStartHook(YarnClusterScheduler.scala:32)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:541)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
{code}

EDIT: I got it working with yarn-client mode, however I want to test it out with yarn-cluster mode as well. The application design is: we create a singleton SparkContext object and preload a few RDDs in memory when our spring-boot application (Tomcat container) starts. That allows us to submit subsequent Spark jobs without the overhead of creating a new SparkContext and RDDs. It performs excellently for our SLA; we are serving real-time GLM in ms with that. I hope this is reason enough why we can't use the spark-submit script to submit a job. The code is pretty simple.
This is how we create the SparkContext:

{code}
SparkConf conf = new SparkConf().setAppName(appName.toString()).setMaster("yarn-client");
conf.set("spark.eventLog.enabled", "true");
conf.set("spark.executor.extraClassPath", "/opt/mapr/hbase/hbase-0.98.12/lib/*");
conf.set("spark.cores.max", sparkCoreMax);
conf.set("spark.executor.memory", sparkExecMem);
conf.set("spark.executor.extraJavaOptions", executorJavaOPts);
conf.set("spark.akka.threads", sparkDriverThreads);
JavaSparkContext sparkContext = new JavaSparkContext(conf);
{code}

This is how we actually run the spring-boot app:

{code}
java -Dloader.path=myspringbootapp.jar,/spark/spark-1.3.1/lib,/opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop,/opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/yarn -XX:PermSize=512m -XX:MaxPermSize=512m -Xms1024m -jar myspringbootapp.jar
{code}
[jira] [Resolved] (SPARK-8978) Implement the DirectKafkaRateController
[ https://issues.apache.org/jira/browse/SPARK-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das resolved SPARK-8978.

Resolution: Fixed
Assignee: Iulian Dragos

Implement the DirectKafkaRateController

Key: SPARK-8978
URL: https://issues.apache.org/jira/browse/SPARK-8978
Project: Spark
Issue Type: Sub-task
Components: Streaming
Reporter: Iulian Dragos
Assignee: Iulian Dragos
Fix For: 1.5.0

Based on this [design doc|https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing], the DirectKafkaInputDStream should use the rate estimate to control how many records/partition to put in the next batch.
[jira] [Commented] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660726#comment-14660726 ]

Reynold Xin commented on SPARK-7550:

[~lian cheng] shouldn't this also be merged into branch-1.5?

Support setting the right schema serde when writing to Hive metastore

Key: SPARK-7550
URL: https://issues.apache.org/jira/browse/SPARK-7550
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.4.0
Reporter: Reynold Xin
Assignee: Cheng Hao
Priority: Blocker

As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. It would be great to do that properly so users can use non-Spark SQL systems to read those tables.
[jira] [Created] (SPARK-9702) Repartition operator should use Exchange to perform its shuffle
Josh Rosen created SPARK-9702:

Summary: Repartition operator should use Exchange to perform its shuffle
Key: SPARK-9702
URL: https://issues.apache.org/jira/browse/SPARK-9702
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Josh Rosen

Spark SQL's {{Repartition}} operator is implemented in terms of Spark Core's repartition operator, which means that it has to perform lots of unnecessary row copying and inefficient row serialization. Instead, it would be better if this was implemented using some of Exchange's internals so that it can avoid row format conversions and generic getters / hashcodes.
[jira] [Updated] (SPARK-7549) Support aggregating over nested fields
[ https://issues.apache.org/jira/browse/SPARK-7549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7549:

Parent Issue: SPARK-9576 (was: SPARK-6116)

Support aggregating over nested fields

Key: SPARK-7549
URL: https://issues.apache.org/jira/browse/SPARK-7549
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin

It would be nice to be able to run sum, avg, min, max (and other numeric aggregate expressions) on arrays.
[jira] [Updated] (SPARK-7160) Support converting DataFrames to typed RDDs.
[ https://issues.apache.org/jira/browse/SPARK-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7160:

Parent Issue: SPARK-9576 (was: SPARK-6116)

Support converting DataFrames to typed RDDs.

Key: SPARK-7160
URL: https://issues.apache.org/jira/browse/SPARK-7160
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 1.3.1
Reporter: Ray Ortigas
Assignee: Ray Ortigas
Priority: Critical

As a Spark user still working with RDDs, I'd like the ability to convert a DataFrame to a typed RDD. For example, if I've converted RDDs to DataFrames so that I could save them as Parquet or CSV files, I would like to rebuild the RDD from those files automatically rather than writing the row-to-type conversion myself.

{code}
val rdd0 = sc.parallelize(Seq(Food("apple", 1), Food("banana", 2), Food("cherry", 3)))
val df0 = rdd0.toDF()
df0.save("foods.parquet")
val df1 = sqlContext.load("foods.parquet")
val rdd1 = df1.toTypedRDD[Food]()
// rdd0 and rdd1 should have the same elements
{code}

I originally submitted a smaller PR for spark-csv (https://github.com/databricks/spark-csv/pull/52), but Reynold Xin suggested that converting a DataFrame to a typed RDD wasn't something specific to spark-csv.
[jira] [Resolved] (SPARK-9493) Chain logistic regression with isotonic regression under the pipeline API
[ https://issues.apache.org/jira/browse/SPARK-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-9493.

Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7952: https://github.com/apache/spark/pull/7952

Chain logistic regression with isotonic regression under the pipeline API

Key: SPARK-9493
URL: https://issues.apache.org/jira/browse/SPARK-9493
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Fix For: 1.5.0

One use case of isotonic regression is to calibrate the probabilities output by logistic regression. We should make this easier in the pipeline API.
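The calibration use case above rests on isotonic regression's core fit. As a language-neutral illustration (a from-scratch Python sketch of the pool-adjacent-violators algorithm, not MLlib's implementation), it replaces any downward violation in a sequence of scores with a pooled average, yielding a non-decreasing fit:

```python
def isotonic_fit(y):
    # Pool-adjacent-violators: merge neighboring blocks until the block
    # means are non-decreasing; each input point takes its block's mean.
    blocks = []  # list of [mean, weight]
    for v in y:
        blocks.append([float(v), 1.0])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    fitted = []
    for mean, weight in blocks:
        fitted.extend([mean] * int(weight))
    return fitted

# A downward violation (3 followed by 2) is pooled into its average:
assert isotonic_fit([1, 3, 2]) == [1.0, 2.5, 2.5]
```

In the calibration setting, `y` would be the observed labels ordered by the classifier's raw scores, so the fitted values serve as monotone, calibrated probabilities.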
[jira] [Updated] (SPARK-6116) DataFrame API improvement umbrella ticket (Spark 1.5)
[ https://issues.apache.org/jira/browse/SPARK-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6116:

Description: An umbrella ticket for DataFrame API improvements for Spark 1.5. SPARK-9576 is the ticket for Spark 1.6.
(was: An umbrella ticket to track improvements and changes needed to make DataFrame API non-experimental.)

DataFrame API improvement umbrella ticket (Spark 1.5)

Key: SPARK-6116
URL: https://issues.apache.org/jira/browse/SPARK-6116
Project: Spark
Issue Type: Umbrella
Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
Labels: DataFrame

An umbrella ticket for DataFrame API improvements for Spark 1.5. SPARK-9576 is the ticket for Spark 1.6.
[jira] [Updated] (SPARK-9363) SortMergeJoin operator should support UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9363:

Sprint: Spark 1.5 release

SortMergeJoin operator should support UnsafeRow

Key: SPARK-9363
URL: https://issues.apache.org/jira/browse/SPARK-9363
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Assignee: Josh Rosen

The SortMergeJoin operator should implement the supportsUnsafeRow and outputsUnsafeRow settings when appropriate.
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660796#comment-14660796 ]

Patrick Wendell commented on SPARK-1517:

Hey Ryan,

IIRC, the Apache snapshot repository won't let us publish binaries that do not have SNAPSHOT in the version number. The reason is that it expects to see timestamped snapshots so its garbage collection mechanism can work. We could look at adding sha1 hashes before SNAPSHOT, but I think there is some chance this would break their cleanup.

In terms of posting more binaries, I can look at whether Databricks or Berkeley might be able to donate S3 resources for this, but it would have to be clearly maintained by those organizations and not branded as official Apache releases or anything like that.

Publish nightly snapshots of documentation, maven artifacts, and binary builds

Key: SPARK-1517
URL: https://issues.apache.org/jira/browse/SPARK-1517
Project: Spark
Issue Type: Improvement
Components: Build, Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Critical

Should be pretty easy to do with Jenkins. The only thing I can think of that would be tricky is to set up credentials so that Jenkins can publish this stuff somewhere on Apache infra. Ideally we don't want to have to put a private key on every Jenkins box (since they are otherwise pretty stateless). One idea is to encrypt these credentials with a passphrase and post them somewhere publicly visible. Then the Jenkins build can download the credentials, provided we set a passphrase in an environment variable in Jenkins. There may be simpler solutions as well.
[jira] [Updated] (SPARK-5180) Data source API improvement
[ https://issues.apache.org/jira/browse/SPARK-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-5180:

Sprint: Spark 1.5 release

Data source API improvement

Key: SPARK-5180
URL: https://issues.apache.org/jira/browse/SPARK-5180
Project: Spark
Issue Type: Umbrella
Components: SQL
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Blocker
[jira] [Updated] (SPARK-9691) PySpark SQL rand function treats seed 0 as no seed
[ https://issues.apache.org/jira/browse/SPARK-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9691:

Sprint: Spark 1.5 doc/QA sprint

PySpark SQL rand function treats seed 0 as no seed

Key: SPARK-9691
URL: https://issues.apache.org/jira/browse/SPARK-9691
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.5.0
Reporter: Joseph K. Bradley
Assignee: Yin Huai

In PySpark SQL's rand() function, the test for a seed is written in such a way that seed 0 is treated as no seed, leading to non-deterministic results when a user would expect deterministic results. See: https://github.com/apache/spark/blob/98e69467d4fda2c26a951409b5b7c6f1e9345ce4/python/pyspark/sql/functions.py#L271
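The bug class described above is a classic Python truthiness pitfall: `if seed:` is False when `seed == 0`. The sketch below is an illustrative reduction (hypothetical function names, not PySpark's actual code) showing the buggy pattern and the explicit `is not None` check that fixes it:

```python
def rand_with_buggy_seed_check(seed=None):
    # Buggy pattern from the report: `if seed:` is False for seed == 0,
    # so 0 silently falls through to the unseeded path.
    if seed:
        return "seeded(%d)" % seed
    return "unseeded"

def rand_with_fixed_seed_check(seed=None):
    # Fix: test for None explicitly so 0 is an honest seed.
    if seed is not None:
        return "seeded(%d)" % seed
    return "unseeded"

assert rand_with_buggy_seed_check(0) == "unseeded"   # the bug
assert rand_with_fixed_seed_check(0) == "seeded(0)"  # the fix
```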
[jira] [Updated] (SPARK-9683) deep copy UTF8String when convert unsafe row to safe row
[ https://issues.apache.org/jira/browse/SPARK-9683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9683:

Target Version/s: 1.5.0
Priority: Critical (was: Major)

deep copy UTF8String when convert unsafe row to safe row

Key: SPARK-9683
URL: https://issues.apache.org/jira/browse/SPARK-9683
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Wenchen Fan
Priority: Critical
[jira] [Closed] (SPARK-9675) GenerateUnsafeProjection seems to corrupt MapType data
[ https://issues.apache.org/jira/browse/SPARK-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin closed SPARK-9675.

Resolution: Duplicate
Fix Version/s: 1.5.0

GenerateUnsafeProjection seems to corrupt MapType data

Key: SPARK-9675
URL: https://issues.apache.org/jira/browse/SPARK-9675
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Reynold Xin
Assignee: Wenchen Fan
Priority: Blocker
Fix For: 1.5.0

See https://github.com/apache/spark/pull/7981#issuecomment-128208233
[jira] [Updated] (SPARK-9596) Avoid reloading Hadoop classes like UserGroupInformation
[ https://issues.apache.org/jira/browse/SPARK-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-9596:

Shepherd: Michael Armbrust

Avoid reloading Hadoop classes like UserGroupInformation

Key: SPARK-9596
URL: https://issues.apache.org/jira/browse/SPARK-9596
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Tao Wang
Assignee: Tao Wang

Some Hadoop classes contain global information, such as the authentication state in UserGroupInformation. If we load them again in `IsolatedClientLoader`, the information they carry will be dropped, so we should treat Hadoop classes as shared too.
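The shared-versus-isolated decision above boils down to a predicate over class names. The sketch below is a hypothetical Python analogue of such a predicate (illustrative prefixes, not the exact list in IsolatedClientLoader): classes whose state must stay process-global, like Hadoop's UserGroupInformation, are delegated to the already-loaded copy instead of being loaded a second time in isolation.

```python
def is_shared_class(name):
    # Hypothetical predicate in the spirit of IsolatedClientLoader:
    # "shared" classes are resolved by the parent classloader so their
    # global state (e.g. Hadoop authentication info) is loaded only once;
    # everything else gets a fresh, isolated copy.
    shared_prefixes = ("java.", "scala.", "org.apache.hadoop.")
    return name.startswith(shared_prefixes)

assert is_shared_class("org.apache.hadoop.security.UserGroupInformation")
assert not is_shared_class("org.apache.hive.SomeClientClass")
```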
[jira] [Updated] (SPARK-9665) ML 1.5 QA: API: Experimental, DeveloperApi, final audit
[ https://issues.apache.org/jira/browse/SPARK-9665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-9665:

Description: We should make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. This will probably not include the Pipeline APIs yet since some parts (e.g., feature attributes) are still in flux. We should also check for items marked final or sealed to see if they are stable enough to be opened up as APIs.
(was: We should make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. This will probably not include the Pipeline APIs yet since some parts (e.g., feature attributes) are still in flux.)

ML 1.5 QA: API: Experimental, DeveloperApi, final audit

Key: SPARK-9665
URL: https://issues.apache.org/jira/browse/SPARK-9665
Project: Spark
Issue Type: Sub-task
Components: ML, MLlib
Reporter: Joseph K. Bradley
Priority: Minor

We should make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. This will probably not include the Pipeline APIs yet since some parts (e.g., feature attributes) are still in flux. We should also check for items marked final or sealed to see if they are stable enough to be opened up as APIs.
[jira] [Created] (SPARK-9707) Test sort-based fallback mode for dynamic partition insert
Reynold Xin created SPARK-9707:

Summary: Test sort-based fallback mode for dynamic partition insert
Key: SPARK-9707
URL: https://issues.apache.org/jira/browse/SPARK-9707
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
[jira] [Updated] (SPARK-9665) ML 1.5 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-9665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-9665:

Summary: ML 1.5 QA: API: Experimental, DeveloperApi, final, sealed audit (was: ML 1.5 QA: API: Experimental, DeveloperApi, final audit)

ML 1.5 QA: API: Experimental, DeveloperApi, final, sealed audit

Key: SPARK-9665
URL: https://issues.apache.org/jira/browse/SPARK-9665
Project: Spark
Issue Type: Sub-task
Components: ML, MLlib
Reporter: Joseph K. Bradley
Priority: Minor

We should make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. This will probably not include the Pipeline APIs yet since some parts (e.g., feature attributes) are still in flux. We should also check for items marked final or sealed to see if they are stable enough to be opened up as APIs.
[jira] [Resolved] (SPARK-9211) HiveComparisonTest generates incorrect file name for golden answer files on Windows
[ https://issues.apache.org/jira/browse/SPARK-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-9211.

Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7563: https://github.com/apache/spark/pull/7563

HiveComparisonTest generates incorrect file name for golden answer files on Windows

Key: SPARK-9211
URL: https://issues.apache.org/jira/browse/SPARK-9211
Project: Spark
Issue Type: Test
Components: SQL, Windows
Affects Versions: 1.4.1
Environment: Windows
Reporter: Christian Kadner
Priority: Minor
Labels: hive, sql, test, windows
Fix For: 1.5.0

The names of the golden answer files for the Hive test cases (test suites based on {{HiveComparisonTest}}) are generated using an MD5 hash of the query text. When the query text contains line breaks, the generated MD5 hash differs between Windows and Linux/OSX ({{\r\n}} vs {{\n}}). This results in erroneously created golden answer files from just running a Hive comparison test, and makes it impossible to modify or add new test cases with correctly named golden answer files on Windows.
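The failure mode above is easy to demonstrate in a few lines of Python: hashing the raw query text makes the golden-file name depend on the platform's line endings, while normalizing `\r\n` to `\n` first (an assumed fix for illustration, not necessarily the one the PR merged) makes the hash stable across Windows and Linux/OSX.

```python
import hashlib

def golden_file_name(query):
    # Normalize Windows line endings before hashing so the MD5-based
    # golden answer file name is identical on every platform.
    # (Illustrative helper; HiveComparisonTest's real code differs.)
    normalized = query.replace("\r\n", "\n")
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

unix_query = "SELECT 1\nFROM t"
windows_query = "SELECT 1\r\nFROM t"

# Raw hashes differ across platforms...
assert (hashlib.md5(unix_query.encode("utf-8")).hexdigest()
        != hashlib.md5(windows_query.encode("utf-8")).hexdigest())
# ...but the normalized names agree.
assert golden_file_name(unix_query) == golden_file_name(windows_query)
```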
[jira] [Commented] (SPARK-9453) Support records larger than default page size in UnsafeShuffleExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-9453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660830#comment-14660830 ]

Apache Spark commented on SPARK-9453:

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8005

Support records larger than default page size in UnsafeShuffleExternalSorter

Key: SPARK-9453
URL: https://issues.apache.org/jira/browse/SPARK-9453
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Josh Rosen
Assignee: Davies Liu
[jira] [Resolved] (SPARK-9624) Make RateControllerSuite faster and more robust
[ https://issues.apache.org/jira/browse/SPARK-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das resolved SPARK-9624.

Resolution: Fixed
Fix Version/s: 1.5.0

Make RateControllerSuite faster and more robust

Key: SPARK-9624
URL: https://issues.apache.org/jira/browse/SPARK-9624
Project: Spark
Issue Type: Test
Components: Streaming, Tests
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Minor
Fix For: 1.5.0

The tests in RateControllerSuite run with a 1 second batch interval, and the whole suite takes almost 10 seconds. If we reduce the batch interval to 100 ms, then the test "multiple publish rates reach receivers" becomes flaky, as multiple rate updates may get applied before the rate is polled.
[jira] [Resolved] (SPARK-9619) Rename Receiver.executor to Receiver.supervisor
[ https://issues.apache.org/jira/browse/SPARK-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das resolved SPARK-9619.

Resolution: Fixed
Fix Version/s: 1.5.0

Rename Receiver.executor to Receiver.supervisor

Key: SPARK-9619
URL: https://issues.apache.org/jira/browse/SPARK-9619
Project: Spark
Issue Type: Bug
Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Minor
Fix For: 1.5.0
[jira] [Resolved] (SPARK-9556) Make all BlockGenerators subscribe to rate limit updates
[ https://issues.apache.org/jira/browse/SPARK-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-9556. -- Resolution: Fixed Fix Version/s: 1.5.0 Make all BlockGenerators subscribe to rate limit updates Key: SPARK-9556 URL: https://issues.apache.org/jira/browse/SPARK-9556 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9710) RPackageUtilsSuite fails if R is not installer
[ https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9710: --- Assignee: Apache Spark RPackageUtilsSuite fails if R is not installer -- Key: SPARK-9710 URL: https://issues.apache.org/jira/browse/SPARK-9710 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Reporter: Marcelo Vanzin Assignee: Apache Spark That's because there's a bug in RUtils.scala. PR soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660871#comment-14660871 ] Apache Spark commented on SPARK-8167: - User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/8007 Tasks that fail due to YARN preemption can cause job failure Key: SPARK-8167 URL: https://issues.apache.org/jira/browse/SPARK-8167 Project: Spark Issue Type: Bug Components: Scheduler, YARN Affects Versions: 1.3.1 Reporter: Patrick Woody Assignee: Matt Cheah Priority: Blocker Tasks that are running on preempted executors will count as FAILED with an ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if a large resource shift is occurring, and the tasks get scheduled to executors that immediately get preempted as well. The current workaround is to increase spark.task.maxFailures very high, but that can cause delays in true failures. We should ideally differentiate these task statuses so that they don't count towards the failure limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
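The proposal above, not counting preemption-caused task failures toward the spark.task.maxFailures limit, can be sketched in a few lines. This is a hypothetical illustration in plain Python, not Spark's actual scheduler code; the function name and status strings are invented for the example.

```python
# Hypothetical sketch (not Spark's scheduler): count only genuine task
# failures toward the abort limit, so that a burst of YARN preemptions
# cannot exhaust spark.task.maxFailures on its own.
def should_abort_task(end_reasons, max_failures=4):
    """end_reasons: list of strings such as 'FAILED' or 'PREEMPTED'."""
    real_failures = sum(1 for r in end_reasons if r == "FAILED")
    return real_failures >= max_failures

# Many preemptions plus one real failure stay under a limit of 4,
# while four real failures trigger the abort.
assert not should_abort_task(["PREEMPTED"] * 10 + ["FAILED"])
assert should_abort_task(["FAILED"] * 4)
```

Raising spark.task.maxFailures, as the ticket notes, only trades one problem for another: with this kind of status differentiation the limit can stay small and still catch true failures quickly.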
[jira] [Updated] (SPARK-9710) RPackageUtilsSuite fails if R is not installed
[ https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-9710: -- Summary: RPackageUtilsSuite fails if R is not installed (was: RPackageUtilsSuite fails if R is not installer) RPackageUtilsSuite fails if R is not installed -- Key: SPARK-9710 URL: https://issues.apache.org/jira/browse/SPARK-9710 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Reporter: Marcelo Vanzin That's because there's a bug in RUtils.scala. PR soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9630) Cleanup Hybrid Aggregate Operator.
[ https://issues.apache.org/jira/browse/SPARK-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9630. Resolution: Fixed Fix Version/s: 1.5.0 Cleanup Hybrid Aggregate Operator. -- Key: SPARK-9630 URL: https://issues.apache.org/jira/browse/SPARK-9630 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.5.0 This is the follow-up of SPARK-9240 to address review comments and clean up code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8167: --- Assignee: Apache Spark (was: Matt Cheah) Tasks that fail due to YARN preemption can cause job failure Key: SPARK-8167 URL: https://issues.apache.org/jira/browse/SPARK-8167 Project: Spark Issue Type: Bug Components: Scheduler, YARN Affects Versions: 1.3.1 Reporter: Patrick Woody Assignee: Apache Spark Priority: Blocker Tasks that are running on preempted executors will count as FAILED with an ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if a large resource shift is occurring, and the tasks get scheduled to executors that immediately get preempted as well. The current workaround is to increase spark.task.maxFailures very high, but that can cause delays in true failures. We should ideally differentiate these task statuses so that they don't count towards the failure limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9710) RPackageUtilsSuite fails if R is not installer
[ https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9710: --- Assignee: (was: Apache Spark) RPackageUtilsSuite fails if R is not installer -- Key: SPARK-9710 URL: https://issues.apache.org/jira/browse/SPARK-9710 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Reporter: Marcelo Vanzin That's because there's a bug in RUtils.scala. PR soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9710) RPackageUtilsSuite fails if R is not installer
[ https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660870#comment-14660870 ] Apache Spark commented on SPARK-9710: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/8008 RPackageUtilsSuite fails if R is not installer -- Key: SPARK-9710 URL: https://issues.apache.org/jira/browse/SPARK-9710 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Reporter: Marcelo Vanzin That's because there's a bug in RUtils.scala. PR soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8167: --- Assignee: Matt Cheah (was: Apache Spark) Tasks that fail due to YARN preemption can cause job failure Key: SPARK-8167 URL: https://issues.apache.org/jira/browse/SPARK-8167 Project: Spark Issue Type: Bug Components: Scheduler, YARN Affects Versions: 1.3.1 Reporter: Patrick Woody Assignee: Matt Cheah Priority: Blocker Tasks that are running on preempted executors will count as FAILED with an ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if a large resource shift is occurring, and the tasks get scheduled to executors that immediately get preempted as well. The current workaround is to increase spark.task.maxFailures very high, but that can cause delays in true failures. We should ideally differentiate these task statuses so that they don't count towards the failure limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8890) Reduce memory consumption for dynamic partition insert
[ https://issues.apache.org/jira/browse/SPARK-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660893#comment-14660893 ] Apache Spark commented on SPARK-8890: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/8010 Reduce memory consumption for dynamic partition insert -- Key: SPARK-8890 URL: https://issues.apache.org/jira/browse/SPARK-8890 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Michael Armbrust Priority: Critical Currently, InsertIntoHadoopFsRelation can run out of memory if the number of table partitions is large. The problem is that we open one output writer for each partition, and when data are randomized and when the number of partitions is large, we open a large number of output writers, leading to OOM. The solution here is to inject a sorting operation once the number of active partitions is beyond a certain point (e.g. 50?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
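The sort-based fallback proposed above can be sketched as follows. This is a simplified pure-Python model of the idea, not InsertIntoHadoopFsRelation itself; writer handles are modeled as lists and the function name is invented.

```python
# Hypothetical sketch of the idea above: keep at most `max_open`
# writers; once the limit is hit, buffer the remaining rows, sort them
# by partition key, and write them one partition at a time.
def write_dynamic_partitions(rows, partition_of, max_open=50):
    output = {}           # partition key -> rows written (stands in for files)
    open_writers = set()
    buffered = []
    for row in rows:
        p = partition_of(row)
        if p in open_writers or len(open_writers) < max_open:
            open_writers.add(p)
            output.setdefault(p, []).append(row)
        else:
            buffered.append(row)
    # After sorting, rows of the same partition are adjacent, so each
    # remaining partition needs only one writer open at a time.
    for row in sorted(buffered, key=partition_of):
        output.setdefault(partition_of(row), []).append(row)
    return output

out = write_dynamic_partitions(range(10), lambda r: r % 2, max_open=1)
assert out[0] == [0, 2, 4, 6, 8] and out[1] == [1, 3, 5, 7, 9]
```

The memory bound comes from the cap on concurrently open writers; the cost is one extra sort of the overflow rows.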
[jira] [Updated] (SPARK-5180) Data source API improvement
[ https://issues.apache.org/jira/browse/SPARK-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5180: --- Issue Type: Story (was: Umbrella) Data source API improvement --- Key: SPARK-5180 URL: https://issues.apache.org/jira/browse/SPARK-5180 Project: Spark Issue Type: Story Components: SQL Reporter: Yin Huai Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker
[ https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-9706: - Attachment: compat_reports.zip List Public API Compatibility Issues with japi-compliance checker - Key: SPARK-9706 URL: https://issues.apache.org/jira/browse/SPARK-9706 Project: Spark Issue Type: Bug Components: ML, MLlib Reporter: Feynman Liang Attachments: compat_reports.zip Using command: {{japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar}} Report result attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9706) Check Public API Compatibility with japi-compliance checker
Feynman Liang created SPARK-9706: Summary: Check Public API Compatibility with japi-compliance checker Key: SPARK-9706 URL: https://issues.apache.org/jira/browse/SPARK-9706 Project: Spark Issue Type: Bug Components: ML, MLlib Reporter: Feynman Liang Using command: {{japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar}} Report result attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker
[ https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-9706: - Description: To identify potential API issues, list public API changes which affect binary and source incompatibility by using command: {code} japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar {code} Report result attached. was: Using command: {code} japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar {code} Report result attached. List Public API Compatibility Issues with japi-compliance checker - Key: SPARK-9706 URL: https://issues.apache.org/jira/browse/SPARK-9706 Project: Spark Issue Type: Task Components: ML, MLlib Reporter: Feynman Liang Attachments: compat_reports.zip To identify potential API issues, list public API changes which affect binary and source incompatibility by using command: {code} japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar {code} Report result attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9705) outdated Python 3 and IPython information
[ https://issues.apache.org/jira/browse/SPARK-9705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9705: -- Target Version/s: 1.4.1, 1.4.0, 1.5.0 (was: 1.4.0, 1.4.1) outdated Python 3 and IPython information - Key: SPARK-9705 URL: https://issues.apache.org/jira/browse/SPARK-9705 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.4.0, 1.4.1, 1.5.0 Reporter: Piotr Migdał Labels: documentation Original Estimate: 0.25h Remaining Estimate: 0.25h https://issues.apache.org/jira/browse/SPARK-4897 adds Python 3.4 support to 1.4.0 and above, but the official docs (1.4.1, but the same holds for 1.4.0) say explicitly: "Spark 1.4.1 works with Python 2.6 or higher (but not Python 3)." Affected: https://spark.apache.org/docs/1.4.0/programming-guide.html https://spark.apache.org/docs/1.4.1/programming-guide.html There are some other Python-related things which are outdated, e.g. this line: "For example, to launch the IPython Notebook with PyLab plot support:" (At least since IPython 3.0, PyLab/Matplotlib support happens inside a notebook, and the --pylab inline option has already been removed.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9639) JobHandler may throw NPE if JobScheduler has been stopped
[ https://issues.apache.org/jira/browse/SPARK-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-9639. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 1.5.0 JobHandler may throw NPE if JobScheduler has been stopped - Key: SPARK-9639 URL: https://issues.apache.org/jira/browse/SPARK-9639 Project: Spark Issue Type: Bug Components: Streaming Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.5.0 Because `JobScheduler.stop(false)` may set `eventLoop` to null while `JobHandler` is running, it's possible that when `post` is called, `eventLoop` happens to be null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
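The NPE described above is a classic check-then-act race: the `eventLoop` field can be nulled between a null check and its use. One standard fix is to read the field into a local variable exactly once. The sketch below illustrates the pattern in plain Python (Spark's real code is Scala; the class and method names here are illustrative only).

```python
# Illustrative sketch of the race fix: `post` captures the field into a
# local before use, so a concurrent `stop` cannot null it out between
# the check and the call.
class Scheduler:
    def __init__(self):
        self.event_loop = []      # stands in for the real EventLoop

    def stop(self):
        self.event_loop = None    # may run concurrently with post()

    def post(self, event):
        loop = self.event_loop    # capture once; never re-read the field
        if loop is None:
            return False          # scheduler already stopped: drop event
        loop.append(event)
        return True

s = Scheduler()
assert s.post("job-start")
s.stop()
assert not s.post("job-end")      # no NPE-equivalent, just a no-op
```

Reading the field twice (once in the check, once in the call) is exactly what leaves the window for the NPE.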
[jira] [Created] (SPARK-9710) RPackageUtilsSuite fails if R is not installer
Marcelo Vanzin created SPARK-9710: - Summary: RPackageUtilsSuite fails if R is not installer Key: SPARK-9710 URL: https://issues.apache.org/jira/browse/SPARK-9710 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Reporter: Marcelo Vanzin That's because there's a bug in RUtils.scala. PR soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9709) Avoid starving an unsafe operator in a sort
[ https://issues.apache.org/jira/browse/SPARK-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9709: --- Assignee: Andrew Or (was: Apache Spark) Avoid starving an unsafe operator in a sort --- Key: SPARK-9709 URL: https://issues.apache.org/jira/browse/SPARK-9709 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical This concerns mainly TungstenSort. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7398) Add back-pressure to Spark Streaming (umbrella JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7398: - Target Version/s: (was: 1.5.0) Add back-pressure to Spark Streaming (umbrella JIRA) Key: SPARK-7398 URL: https://issues.apache.org/jira/browse/SPARK-7398 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.1 Reporter: François Garillot Priority: Critical Labels: streams Spark Streaming has trouble dealing with situations where batch processing time > batch interval, meaning a throughput of input data higher than Spark's ability to remove data from the queue. If this throughput is sustained for long enough, it leads to an unstable situation where the memory of the Receiver's Executor is overflowed. This aims at transmitting a back-pressure signal back to data ingestion to help with dealing with that high throughput, in a backwards-compatible way. The original design doc can be found here: https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing The second design doc, focusing [on the first sub-task|https://issues.apache.org/jira/browse/SPARK-8834] (without all the background info, and more centered on the implementation) can be found here: https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
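The essence of the back-pressure signal described above is: when a batch's processing time exceeds the batch interval, scale the ingestion rate down so the queue stops growing. The linked design docs propose a more sophisticated (PID-based) rate estimator; the sketch below is a deliberately simpler proportional rule, with an invented function name, just to illustrate the feedback signal.

```python
# Hypothetical proportional back-pressure rule (simpler than the actual
# design): if processing falls behind the batch interval, reduce the
# ingestion rate by the same ratio so the receiver's queue stabilizes.
def next_rate(current_rate, batch_interval_ms, processing_delay_ms):
    if processing_delay_ms <= batch_interval_ms:
        return current_rate                      # keeping up: no change
    return current_rate * batch_interval_ms / processing_delay_ms

# Processing takes twice the interval -> halve the ingestion rate.
assert next_rate(1000.0, 500, 1000) == 500.0
# Processing keeps up -> rate is left alone.
assert next_rate(1000.0, 500, 400) == 1000.0
```

The backwards-compatible part of the proposal is that receivers which ignore the signal simply keep their current rate.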
[jira] [Assigned] (SPARK-9703) EnsureRequirements should not add unnecessary shuffles when only ordering requirements are unsatisfied
[ https://issues.apache.org/jira/browse/SPARK-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9703: --- Assignee: Josh Rosen (was: Apache Spark) EnsureRequirements should not add unnecessary shuffles when only ordering requirements are unsatisfied -- Key: SPARK-9703 URL: https://issues.apache.org/jira/browse/SPARK-9703 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.4.0, 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Consider SortMergeJoin, which requires a sorted, clustered distribution of its input rows. Say that both of SMJ's children produce unsorted output but are both single partition. In this case, we will need to inject sort operators but should not need to inject exchanges. Unfortunately, it looks like the Exchange unnecessarily repartitions using a hash partitioning. We should update Exchange so that it does not unnecessarily repartition children when only the ordering requirements are unsatisfied. I'd like to fix this for Spark 1.5 since it makes certain types of unit tests easier to write. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
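The planning rule argued for above can be stated compactly: inject an exchange only when the distribution requirement is unsatisfied, and inject a sort when only the ordering requirement is unsatisfied. A minimal sketch, with invented names and booleans standing in for the real requirement checks:

```python
# Hypothetical sketch of the EnsureRequirements decision described
# above: an unmet ordering requirement alone should add a Sort, never
# an Exchange; repartitioning is only for unmet distribution.
def required_operators(distribution_ok, ordering_ok):
    ops = []
    if not distribution_ok:
        ops.append("Exchange")
    if not ordering_ok:
        ops.append("Sort")
    return ops

# Single-partition children that are merely unsorted need only a sort:
assert required_operators(distribution_ok=True, ordering_ok=False) == ["Sort"]
```

The bug in the ticket is equivalent to this function returning `["Exchange", "Sort"]` even when `distribution_ok` is true.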
[jira] [Updated] (SPARK-9228) Combine unsafe and codegen into a single option
[ https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9228: --- Sprint: Spark 1.5 release (was: Spark 1.5 doc/QA sprint) Combine unsafe and codegen into a single option --- Key: SPARK-9228 URL: https://issues.apache.org/jira/browse/SPARK-9228 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Davies Liu Priority: Blocker Before QA, let's flip on these features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9228) Combine unsafe and codegen into a single option
[ https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9228: --- Sprint: Spark 1.5 doc/QA sprint Combine unsafe and codegen into a single option --- Key: SPARK-9228 URL: https://issues.apache.org/jira/browse/SPARK-9228 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Davies Liu Priority: Blocker Before QA, let's flip on these features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9703) EnsureRequirements should not add unnecessary shuffles when only ordering requirements are unsatisfied
[ https://issues.apache.org/jira/browse/SPARK-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660743#comment-14660743 ] Apache Spark commented on SPARK-9703: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7988 EnsureRequirements should not add unnecessary shuffles when only ordering requirements are unsatisfied -- Key: SPARK-9703 URL: https://issues.apache.org/jira/browse/SPARK-9703 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.4.0, 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Consider SortMergeJoin, which requires a sorted, clustered distribution of its input rows. Say that both of SMJ's children produce unsorted output but are both single partition. In this case, we will need to inject sort operators but should not need to inject exchanges. Unfortunately, it looks like the Exchange unnecessarily repartitions using a hash partitioning. We should update Exchange so that it does not unnecessarily repartition children when only the ordering requirements are unsatisfied. I'd like to fix this for Spark 1.5 since it makes certain types of unit tests easier to write. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9703) EnsureRequirements should not add unnecessary shuffles when only ordering requirements are unsatisfied
[ https://issues.apache.org/jira/browse/SPARK-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9703: --- Assignee: Apache Spark (was: Josh Rosen) EnsureRequirements should not add unnecessary shuffles when only ordering requirements are unsatisfied -- Key: SPARK-9703 URL: https://issues.apache.org/jira/browse/SPARK-9703 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.4.0, 1.5.0 Reporter: Josh Rosen Assignee: Apache Spark Consider SortMergeJoin, which requires a sorted, clustered distribution of its input rows. Say that both of SMJ's children produce unsorted output but are both single partition. In this case, we will need to inject sort operators but should not need to inject exchanges. Unfortunately, it looks like the Exchange unnecessarily repartitions using a hash partitioning. We should update Exchange so that it does not unnecessarily repartition children when only the ordering requirements are unsatisfied. I'd like to fix this for Spark 1.5 since it makes certain types of unit tests easier to write. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9683) deep copy UTF8String when convert unsafe row to safe row
[ https://issues.apache.org/jira/browse/SPARK-9683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9683: --- Sprint: Spark 1.5 release deep copy UTF8String when convert unsafe row to safe row Key: SPARK-9683 URL: https://issues.apache.org/jira/browse/SPARK-9683 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker
[ https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-9706: - Description: Using command: {code} japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar {code} Report result attached. was: Using command: {code} japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar {/code} Report result attached. List Public API Compatibility Issues with japi-compliance checker - Key: SPARK-9706 URL: https://issues.apache.org/jira/browse/SPARK-9706 Project: Spark Issue Type: Task Components: ML, MLlib Reporter: Feynman Liang Attachments: compat_reports.zip Using command: {code} japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar {code} Report result attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6902) Row() object can be mutated even though it should be immutable
[ https://issues.apache.org/jira/browse/SPARK-6902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6902: --- Assignee: Davies Liu (was: Apache Spark) Row() object can be mutated even though it should be immutable -- Key: SPARK-6902 URL: https://issues.apache.org/jira/browse/SPARK-6902 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.2.0 Reporter: Jonathan Arfa Assignee: Davies Liu See the below code snippet, IMHO it shouldn't let you assign {{x.c = 5}} and should just give you an error. {quote} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.2.0-SNAPSHOT /_/ Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36) SparkContext available as sc. from pyspark.sql import * x = Row(a=1, b=2, c=3) x Row(a=1, b=2, c=3) x.__dict__ \{'__FIELDS__': ['a', 'b', 'c']\} x.c 3 x.c = 5 x Row(a=1, b=2, c=3) x.__dict__ \{'__FIELDS__': ['a', 'b', 'c'], 'c': 5\} x.c 5 {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6902) Row() object can be mutated even though it should be immutable
[ https://issues.apache.org/jira/browse/SPARK-6902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6902: --- Assignee: Apache Spark (was: Davies Liu) Row() object can be mutated even though it should be immutable -- Key: SPARK-6902 URL: https://issues.apache.org/jira/browse/SPARK-6902 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.2.0 Reporter: Jonathan Arfa Assignee: Apache Spark See the below code snippet, IMHO it shouldn't let you assign {{x.c = 5}} and should just give you an error. {quote} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.2.0-SNAPSHOT /_/ Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36) SparkContext available as sc. from pyspark.sql import * x = Row(a=1, b=2, c=3) x Row(a=1, b=2, c=3) x.__dict__ \{'__FIELDS__': ['a', 'b', 'c']\} x.c 3 x.c = 5 x Row(a=1, b=2, c=3) x.__dict__ \{'__FIELDS__': ['a', 'b', 'c'], 'c': 5\} x.c 5 {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6902) Row() object can be mutated even though it should be immutable
[ https://issues.apache.org/jira/browse/SPARK-6902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660873#comment-14660873 ] Apache Spark commented on SPARK-6902: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/8009 Row() object can be mutated even though it should be immutable -- Key: SPARK-6902 URL: https://issues.apache.org/jira/browse/SPARK-6902 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.2.0 Reporter: Jonathan Arfa Assignee: Davies Liu See the below code snippet, IMHO it shouldn't let you assign {{x.c = 5}} and should just give you an error. {quote} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.2.0-SNAPSHOT /_/ Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36) SparkContext available as sc. from pyspark.sql import * x = Row(a=1, b=2, c=3) x Row(a=1, b=2, c=3) x.__dict__ \{'__FIELDS__': ['a', 'b', 'c']\} x.c 3 x.c = 5 x Row(a=1, b=2, c=3) x.__dict__ \{'__FIELDS__': ['a', 'b', 'c'], 'c': 5\} x.c 5 {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
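The immutability the reporter expects for Row can be illustrated in plain Python with a namedtuple, which rejects attribute assignment outright instead of silently shadowing the field the way the snippet above shows. This is a standalone illustration, not pyspark's actual Row implementation.

```python
# Plain-Python illustration of the expected behavior: a namedtuple-
# backed row raises AttributeError on assignment rather than letting
# `x.c = 5` silently diverge from the underlying tuple.
from collections import namedtuple

Row = namedtuple("Row", ["a", "b", "c"])
x = Row(a=1, b=2, c=3)
assert x.c == 3

try:
    x.c = 5                      # should fail, unlike the report above
except AttributeError:
    mutated = False
else:
    mutated = True

assert not mutated and x.c == 3  # the row is unchanged
```

The bug report's `__dict__` output shows exactly the failure mode this avoids: an instance attribute `c` shadowing the tuple field while the tuple itself stays `(1, 2, 3)`.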
[jira] [Updated] (SPARK-9211) HiveComparisonTest generates incorrect file name for golden answer files on Windows
[ https://issues.apache.org/jira/browse/SPARK-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9211: Assignee: Christian Kadner HiveComparisonTest generates incorrect file name for golden answer files on Windows --- Key: SPARK-9211 URL: https://issues.apache.org/jira/browse/SPARK-9211 Project: Spark Issue Type: Test Components: SQL, Windows Affects Versions: 1.4.1 Environment: Windows Reporter: Christian Kadner Assignee: Christian Kadner Priority: Minor Labels: hive, sql, test, windows Fix For: 1.5.0 The names of the golden answer files for the Hive test cases (test suites based on {{HiveComparisonTest}}) are generated using an MD5 hash of the query text. When the query text contains line breaks then the generated MD5 hash differs between Windows and Linux/OSX ({{\r\n}} vs {{\n}}). This results in erroneously created golden answer files from just running a Hive comparison test and makes it impossible to modify or add new test cases with correctly named golden answer files on Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
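The line-ending sensitivity described above is easy to reproduce with hashlib, and normalizing `\r\n` to `\n` before hashing is one plausible fix. The helper name below is illustrative, not Spark's actual code.

```python
# Reproduce the Windows-vs-Unix golden-file-name mismatch: MD5 over the
# raw query text differs because of \r\n vs \n, while hashing a
# normalized copy agrees across platforms.
import hashlib

def golden_answer_name(query):
    normalized = query.replace("\r\n", "\n")  # normalize line endings first
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

unix_query = "SELECT key\nFROM src"
windows_query = "SELECT key\r\nFROM src"

# Raw hashes differ, which is exactly the bug...
assert (hashlib.md5(unix_query.encode("utf-8")).hexdigest()
        != hashlib.md5(windows_query.encode("utf-8")).hexdigest())
# ...while normalized hashes agree, giving one golden file name.
assert golden_answer_name(unix_query) == golden_answer_name(windows_query)
```

Any normalization works as long as it is applied consistently when golden files are both written and looked up.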
[jira] [Assigned] (SPARK-9709) Avoid starving an unsafe operator in a sort
[ https://issues.apache.org/jira/browse/SPARK-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9709: --- Assignee: Apache Spark (was: Andrew Or) Avoid starving an unsafe operator in a sort --- Key: SPARK-9709 URL: https://issues.apache.org/jira/browse/SPARK-9709 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Apache Spark Priority: Critical This concerns mainly TungstenSort. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9709) Avoid starving an unsafe operator in a sort
[ https://issues.apache.org/jira/browse/SPARK-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660901#comment-14660901 ] Apache Spark commented on SPARK-9709: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/8011 Avoid starving an unsafe operator in a sort --- Key: SPARK-9709 URL: https://issues.apache.org/jira/browse/SPARK-9709 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical This concerns mainly TungstenSort. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9618) SQLContext.read.schema().parquet() ignores the supplied schema
[ https://issues.apache.org/jira/browse/SPARK-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660734#comment-14660734 ] Reynold Xin commented on SPARK-9618: [~lian cheng] this was not merged into branch-1.5. I just cherry-picked it. SQLContext.read.schema().parquet() ignores the supplied schema -- Key: SPARK-9618 URL: https://issues.apache.org/jira/browse/SPARK-9618 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.1 Reporter: Nathan Howell Assignee: Nathan Howell Priority: Minor Fix For: 1.5.0 If a user supplies a schema when loading a Parquet file it is ignored and the schema is read off disk instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
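The fix amounts to a precedence rule: if the caller supplied a schema, use it; fall back to the schema read from the Parquet footer only when none was given. A toy sketch of that rule (a hypothetical helper, not Spark's actual resolution code):

```python
def resolve_schema(user_schema, footer_schema):
    # Honor an explicitly supplied schema; only read the schema off disk
    # (the Parquet footer) when the user did not provide one.
    return user_schema if user_schema is not None else footer_schema

assert resolve_schema("user-schema", "footer-schema") == "user-schema"
assert resolve_schema(None, "footer-schema") == "footer-schema"
```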
[jira] [Updated] (SPARK-5180) Data source API improvement
[ https://issues.apache.org/jira/browse/SPARK-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5180: --- Issue Type: Umbrella (was: Improvement) Data source API improvement --- Key: SPARK-5180 URL: https://issues.apache.org/jira/browse/SPARK-5180 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Yin Huai Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5180) Data source API improvement
[ https://issues.apache.org/jira/browse/SPARK-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5180: --- Target Version/s: 1.5.0 (was: 1.6.0) Data source API improvement --- Key: SPARK-5180 URL: https://issues.apache.org/jira/browse/SPARK-5180 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9705) outdated Python 3 and IPython information
Piotr Migdał created SPARK-9705: --- Summary: outdated Python 3 and IPython information Key: SPARK-9705 URL: https://issues.apache.org/jira/browse/SPARK-9705 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.4.1, 1.4.0 Reporter: Piotr Migdał https://issues.apache.org/jira/browse/SPARK-4897 adds Python 3.4 support to 1.4.0 and above, but the official docs (1.4.1, but the same is for 1.4.0) says explicitly: Spark 1.4.1 works with Python 2.6 or higher (but not Python 3). Affected: https://spark.apache.org/docs/1.4.0/programming-guide.html https://spark.apache.org/docs/1.4.1/programming-guide.html There are some other Python-related things, which are outdated, e.g. this line: For example, to launch the IPython Notebook with PyLab plot support: (At least since IPython 3.0 PyLab/Matplotlib support happens inside a notebook; and the line --pylab inline is already removed.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9704) Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier
[ https://issues.apache.org/jira/browse/SPARK-9704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9704: --- Assignee: Joseph K. Bradley (was: Apache Spark) Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier -- Key: SPARK-9704 URL: https://issues.apache.org/jira/browse/SPARK-9704 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley This JIRA is for making several ML APIs public to make it easier for users to write their own Pipeline stages. Issue brought up by [~eronwright]. Descriptions below copied from [http://apache-spark-developers-list.1001551.n3.nabble.com/Make-ML-Developer-APIs-public-post-1-4-td13583.html]. We plan to make these APIs public in Spark 1.5. However, they will be marked DeveloperApi and are *very likely* to be broken in the future. * VectorUDT: To define a relation with a vector field, VectorUDT must be instantiated. * Identifiable trait: The trait generates a unique identifier for the associated pipeline component. Nice to have a consistent format by reusing the trait. * ProbabilisticClassifier. Third-party components should leverage the complex logic around computing only selected columns. We will not yet make these public: * SchemaUtils: Third-party pipeline components have a need for checking column types and appending columns. ** This will probably be moved into Spark SQL. Users can copy the methods into their own code as needed. * Shared Params (HasLabel, HasFeatures): This is covered in [SPARK-7146] but reiterating it here. ** We need to discuss whether these should be standardized public APIs. Users can copy the traits into their own code as needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9704) Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier
[ https://issues.apache.org/jira/browse/SPARK-9704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660769#comment-14660769 ] Apache Spark commented on SPARK-9704: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/8004 Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier -- Key: SPARK-9704 URL: https://issues.apache.org/jira/browse/SPARK-9704 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley This JIRA is for making several ML APIs public to make it easier for users to write their own Pipeline stages. Issue brought up by [~eronwright]. Descriptions below copied from [http://apache-spark-developers-list.1001551.n3.nabble.com/Make-ML-Developer-APIs-public-post-1-4-td13583.html]. We plan to make these APIs public in Spark 1.5. However, they will be marked DeveloperApi and are *very likely* to be broken in the future. * VectorUDT: To define a relation with a vector field, VectorUDT must be instantiated. * Identifiable trait: The trait generates a unique identifier for the associated pipeline component. Nice to have a consistent format by reusing the trait. * ProbabilisticClassifier. Third-party components should leverage the complex logic around computing only selected columns. We will not yet make these public: * SchemaUtils: Third-party pipeline components have a need for checking column types and appending columns. ** This will probably be moved into Spark SQL. Users can copy the methods into their own code as needed. * Shared Params (HasLabel, HasFeatures): This is covered in [SPARK-7146] but reiterating it here. ** We need to discuss whether these should be standardized public APIs. Users can copy the traits into their own code as needed. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9565) Spark SQL 1.5.0 QA/testing umbrella
[ https://issues.apache.org/jira/browse/SPARK-9565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9565: --- Issue Type: Story (was: Test) Spark SQL 1.5.0 QA/testing umbrella --- Key: SPARK-9565 URL: https://issues.apache.org/jira/browse/SPARK-9565 Project: Spark Issue Type: Story Components: SQL Reporter: Reynold Xin Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9704) Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier
[ https://issues.apache.org/jira/browse/SPARK-9704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9704: --- Assignee: Apache Spark (was: Joseph K. Bradley) Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier -- Key: SPARK-9704 URL: https://issues.apache.org/jira/browse/SPARK-9704 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Apache Spark This JIRA is for making several ML APIs public to make it easier for users to write their own Pipeline stages. Issue brought up by [~eronwright]. Descriptions below copied from [http://apache-spark-developers-list.1001551.n3.nabble.com/Make-ML-Developer-APIs-public-post-1-4-td13583.html]. We plan to make these APIs public in Spark 1.5. However, they will be marked DeveloperApi and are *very likely* to be broken in the future. * VectorUDT: To define a relation with a vector field, VectorUDT must be instantiated. * Identifiable trait: The trait generates a unique identifier for the associated pipeline component. Nice to have a consistent format by reusing the trait. * ProbabilisticClassifier. Third-party components should leverage the complex logic around computing only selected columns. We will not yet make these public: * SchemaUtils: Third-party pipeline components have a need for checking column types and appending columns. ** This will probably be moved into Spark SQL. Users can copy the methods into their own code as needed. * Shared Params (HasLabel, HasFeatures): This is covered in [SPARK-7146] but reiterating it here. ** We need to discuss whether these should be standardized public APIs. Users can copy the traits into their own code as needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9704) Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier
[ https://issues.apache.org/jira/browse/SPARK-9704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660791#comment-14660791 ] Eron Wright commented on SPARK-9704: - Thanks for accepting the suggestions, and I agree with the workarounds suggested for SchemaUtils and shared params. Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier -- Key: SPARK-9704 URL: https://issues.apache.org/jira/browse/SPARK-9704 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley This JIRA is for making several ML APIs public to make it easier for users to write their own Pipeline stages. Issue brought up by [~eronwright]. Descriptions below copied from [http://apache-spark-developers-list.1001551.n3.nabble.com/Make-ML-Developer-APIs-public-post-1-4-td13583.html]. We plan to make these APIs public in Spark 1.5. However, they will be marked DeveloperApi and are *very likely* to be broken in the future. * VectorUDT: To define a relation with a vector field, VectorUDT must be instantiated. * Identifiable trait: The trait generates a unique identifier for the associated pipeline component. Nice to have a consistent format by reusing the trait. * ProbabilisticClassifier. Third-party components should leverage the complex logic around computing only selected columns. We will not yet make these public: * SchemaUtils: Third-party pipeline components have a need for checking column types and appending columns. ** This will probably be moved into Spark SQL. Users can copy the methods into their own code as needed. * Shared Params (HasLabel, HasFeatures): This is covered in [SPARK-7146] but reiterating it here. ** We need to discuss whether these should be standardized public APIs. Users can copy the traits into their own code as needed. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9709) Avoid starving an unsafe operator in a sort
Andrew Or created SPARK-9709: Summary: Avoid starving an unsafe operator in a sort Key: SPARK-9709 URL: https://issues.apache.org/jira/browse/SPARK-9709 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical This concerns mainly TungstenSort. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9548) BytesToBytesMap could have a destructive iterator
[ https://issues.apache.org/jira/browse/SPARK-9548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9548. Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 1.5.0 BytesToBytesMap could have a destructive iterator - Key: SPARK-9548 URL: https://issues.apache.org/jira/browse/SPARK-9548 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Liang-Chi Hsieh Priority: Blocker Fix For: 1.5.0 BytesToBytesMap.iterator() could be destructive, freeing each page as it moves onto the next one. There are some circumstances where we don't want a destructive iterator (such as when we're building a KV sorter from a map), so there should be a flag to control this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
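The idea can be modeled in a few lines: a destructive iterator releases each page as soon as it has been consumed, while the non-destructive variant (needed when building a KV sorter from the map) leaves all pages allocated. This toy model is illustrative only; `PagedMap` is not the BytesToBytesMap API:

```python
class PagedMap:
    """Toy model of a paged map with a destructive iterator (illustrative only)."""

    def __init__(self, pages):
        self.pages = pages  # each page is a list of (key, value) records

    def iterator(self):
        # Non-destructive: pages remain allocated after iteration.
        for page in self.pages:
            yield from page

    def destructive_iterator(self):
        # Destructive: drop each page once iteration moves past it,
        # so its memory can be reclaimed immediately.
        while self.pages:
            page = self.pages.pop(0)
            yield from page

m = PagedMap([[("a", 1)], [("b", 2), ("c", 3)]])
assert list(m.destructive_iterator()) == [("a", 1), ("b", 2), ("c", 3)]
assert m.pages == []  # every page was released during iteration
```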
[jira] [Commented] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries
[ https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660860#comment-14660860 ] Apache Spark commented on SPARK-4561: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/8006 PySparkSQL's Row.asDict() should convert nested rows to dictionaries Key: SPARK-4561 URL: https://issues.apache.org/jira/browse/SPARK-4561 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Davies Liu In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a dictionary. Unfortunately, though, this does not convert nested rows to dictionaries. For example: {code}
sqlContext.sql("select results from results").first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), Row(time=3.239), Row(time=3.149)])
sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,), (3.47,), (3.559,), (3.458,), (3.229,), (3.21,), (3.166,), (3.276,), (3.239,), (3.149,)]}
{code} Actually, it looks like the nested fields are just left as Rows (IPython's fancy display logic obscured this in my first example): {code}
Row(results=[Row(time=1), Row(time=2)]).asDict()
{'results': [Row(time=1), Row(time=2)]}
{code} Here's the output I'd expect: {code}
Row(results=[Row(time=1), Row(time=2)]).asDict()
{'results': [{'time': 1}, {'time': 2}]}
{code} I ran into this issue when trying to use Pandas dataframes to display nested data that I queried from Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
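The behavior the issue asks for can be sketched with a recursive conversion. The `Row` stand-in below is a plain namedtuple, not PySpark's actual Row class, and `as_dict_recursive` is a hypothetical helper showing the recursion:

```python
from collections import namedtuple

Row = namedtuple("Row", ["time"])          # stand-in for pyspark.sql.Row
Outer = namedtuple("Outer", ["results"])   # a row holding a nested list of rows

def as_dict_recursive(value):
    # Recurse into anything Row-like (namedtuples expose _fields) and into
    # lists, converting nested rows to plain dictionaries.
    if hasattr(value, "_fields"):
        return {f: as_dict_recursive(getattr(value, f)) for f in value._fields}
    if isinstance(value, list):
        return [as_dict_recursive(v) for v in value]
    return value

assert as_dict_recursive(Outer(results=[Row(time=1), Row(time=2)])) == \
    {"results": [{"time": 1}, {"time": 2}]}
```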
[jira] [Resolved] (SPARK-9645) External shuffle service does not work with kerberos on
[ https://issues.apache.org/jira/browse/SPARK-9645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-9645. --- Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 1.5.0 External shuffle service does not work with kerberos on --- Key: SPARK-9645 URL: https://issues.apache.org/jira/browse/SPARK-9645 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.5.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.5.0 Lots of errors like this when running apps with the external shuffle service and kerberos enabled: {noformat} 15/08/05 06:26:18 WARN TaskSetManager: Lost task 2.0 in stage 2.0 (TID 12, spark-nightly-2.vpc.cloudera.com): FetchFailed(BlockManagerId(2, spark-nightly-2.vpc.cloudera.com, 7337), shuffleId=0, mapId=0, reduceId=2, message= org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: Failed to open file: /yarn/nm/usercache/systest/appcache/application_1438780049118_0008/blockmgr-7178b106-6902-4082-8792-1c3e34b80d15/38/shuffle_0_0_0.index at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:203) at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:113) at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:80) at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:68) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:114) {noformat} This is caused by commit c4830598 (SPARK-6287), which modified the permissions of the directory storing the shuffle files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9363) SortMergeJoin operator should support UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9363: --- Assignee: Apache Spark (was: Josh Rosen) SortMergeJoin operator should support UnsafeRow --- Key: SPARK-9363 URL: https://issues.apache.org/jira/browse/SPARK-9363 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Apache Spark The SortMergeJoin operator should implement the supportsUnsafeRow and outputsUnsafeRow settings when appropriate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9363) SortMergeJoin operator should support UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9363: --- Assignee: Josh Rosen (was: Apache Spark) SortMergeJoin operator should support UnsafeRow --- Key: SPARK-9363 URL: https://issues.apache.org/jira/browse/SPARK-9363 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen The SortMergeJoin operator should implement the supportsUnsafeRow and outputsUnsafeRow settings when appropriate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9363) SortMergeJoin operator should support UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660720#comment-14660720 ] Apache Spark commented on SPARK-9363: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7904 SortMergeJoin operator should support UnsafeRow --- Key: SPARK-9363 URL: https://issues.apache.org/jira/browse/SPARK-9363 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen The SortMergeJoin operator should implement the supportsUnsafeRow and outputsUnsafeRow settings when appropriate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9691) PySpark SQL rand function treats seed 0 as no seed
[ https://issues.apache.org/jira/browse/SPARK-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9691: --- Sprint: Spark 1.5 release (was: Spark 1.5 doc/QA sprint) PySpark SQL rand function treats seed 0 as no seed -- Key: SPARK-9691 URL: https://issues.apache.org/jira/browse/SPARK-9691 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.5.0 Reporter: Joseph K. Bradley Assignee: Yin Huai In PySpark SQL's rand() function, it tests for a seed in a way such that seed 0 is treated as no seed, leading to non-deterministic results when a user would expect deterministic results. See: [https://github.com/apache/spark/blob/98e69467d4fda2c26a951409b5b7c6f1e9345ce4/python/pyspark/sql/functions.py#L271] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
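The bug is a classic truthiness check: testing `if seed:` makes `seed=0` indistinguishable from `seed=None`. The sketch below reproduces the pattern with Python's standard `random` module rather than PySpark's actual `rand()` implementation:

```python
import random

def rand_buggy(seed=None):
    # Buggy pattern: `if seed` is False for both None and 0, so a user
    # who passes seed=0 silently gets a non-deterministic generator.
    rng = random.Random(seed) if seed else random.Random()
    return rng.random()

def rand_fixed(seed=None):
    # Fixed pattern: only None means "no seed"; 0 is a valid seed.
    rng = random.Random(seed) if seed is not None else random.Random()
    return rng.random()

assert rand_fixed(0) == rand_fixed(0)  # deterministic, as the user expects
```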
[jira] [Created] (SPARK-9704) Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier
Joseph K. Bradley created SPARK-9704: Summary: Make some ML APIs public: VectorUDT, Identifiable, ProbabilisticClassifier Key: SPARK-9704 URL: https://issues.apache.org/jira/browse/SPARK-9704 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley This JIRA is for making several ML APIs public to make it easier for users to write their own Pipeline stages. Issue brought up by [~eronwright]. Descriptions below copied from [http://apache-spark-developers-list.1001551.n3.nabble.com/Make-ML-Developer-APIs-public-post-1-4-td13583.html]. We plan to make these APIs public in Spark 1.5. However, they will be marked DeveloperApi and are *very likely* to be broken in the future. * VectorUDT: To define a relation with a vector field, VectorUDT must be instantiated. * Identifiable trait: The trait generates a unique identifier for the associated pipeline component. Nice to have a consistent format by reusing the trait. * ProbabilisticClassifier. Third-party components should leverage the complex logic around computing only selected columns. We will not yet make these public: * SchemaUtils: Third-party pipeline components have a need for checking column types and appending columns. ** This will probably be moved into Spark SQL. Users can copy the methods into their own code as needed. * Shared Params (HasLabel, HasFeatures): This is covered in [SPARK-7146] but reiterating it here. ** We need to discuss whether these should be standardized public APIs. Users can copy the traits into their own code as needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker
[ https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-9706: - Issue Type: Task (was: Bug) List Public API Compatibility Issues with japi-compliance checker - Key: SPARK-9706 URL: https://issues.apache.org/jira/browse/SPARK-9706 Project: Spark Issue Type: Task Components: ML, MLlib Reporter: Feynman Liang Attachments: compat_reports.zip Using command: {{japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar}} Report result attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker
[ https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-9706: - Description: Using command: {code} japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar {/code} Report result attached. was: Using command: {{code}} japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar {{/code}} Report result attached. List Public API Compatibility Issues with japi-compliance checker - Key: SPARK-9706 URL: https://issues.apache.org/jira/browse/SPARK-9706 Project: Spark Issue Type: Task Components: ML, MLlib Reporter: Feynman Liang Attachments: compat_reports.zip Using command: {code} japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar {/code} Report result attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker
[ https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-9706: - Description: Using command: {{code}} japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar {{/code}} Report result attached. was: Using command: {{japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar}} Report result attached. List Public API Compatibility Issues with japi-compliance checker - Key: SPARK-9706 URL: https://issues.apache.org/jira/browse/SPARK-9706 Project: Spark Issue Type: Task Components: ML, MLlib Reporter: Feynman Liang Attachments: compat_reports.zip Using command: {{code}} japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar {{/code}} Report result attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8848) Write Parquet LISTs and MAPs conforming to Parquet format spec
[ https://issues.apache.org/jira/browse/SPARK-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8848: Target Version/s: 1.5.0 (was: 1.6.0) Write Parquet LISTs and MAPs conforming to Parquet format spec -- Key: SPARK-8848 URL: https://issues.apache.org/jira/browse/SPARK-8848 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian [Parquet format PR #17|https://github.com/apache/parquet-format/pull/17] standardized structures of Parquet complex types (LIST MAP). Spark SQL should follow this spec and write Parquet data conforming to the standard. Note that although currently Parquet files written by Spark SQL is non-standard (because Parquet format spec wasn't clear about this part when Spark SQL Parquet support was authored), it's still compatible with the most recent Parquet format spec, because the format we use is covered by the backwards-compatibility rules. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9381) Migrate JSON data source to the new partitioning data source
[ https://issues.apache.org/jira/browse/SPARK-9381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660735#comment-14660735 ] Reynold Xin commented on SPARK-9381: This was also not merged into branch-1.5. I cherry-picked it. Migrate JSON data source to the new partitioning data source Key: SPARK-9381 URL: https://issues.apache.org/jira/browse/SPARK-9381 Project: Spark Issue Type: New Feature Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7550: --- Fix Version/s: 1.5.0 Support setting the right schema serde when writing to Hive metastore --- Key: SPARK-7550 URL: https://issues.apache.org/jira/browse/SPARK-7550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Reynold Xin Assignee: Cheng Hao Priority: Blocker Fix For: 1.5.0 As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. Would be great to do that properly so users can use non-Spark SQL systems to read those tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6923) Spark SQL CLI does not read Data Source schema correctly
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6923: --- Fix Version/s: 1.5.0 Spark SQL CLI does not read Data Source schema correctly Key: SPARK-6923 URL: https://issues.apache.org/jira/browse/SPARK-6923 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: pin_zhang Assignee: Cheng Hao Priority: Critical Fix For: 1.5.0 {code:java}
HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());
{code} -- With the code above to save a DataFrame to a Hive table, getting the table cols returns a single column named 'col': [FieldSchema(name:col, type:array<string>, comment:from deserializer)]. The expected return is the fields schema (id, age). As a result, the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern), even though the result set metadata for the query select * from test does contain the fields id and age. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660739#comment-14660739 ] Reynold Xin commented on SPARK-7550: I cherry-picked it into branch-1.5. Support setting the right schema serde when writing to Hive metastore --- Key: SPARK-7550 URL: https://issues.apache.org/jira/browse/SPARK-7550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Reynold Xin Assignee: Cheng Hao Priority: Blocker Fix For: 1.5.0 As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. Would be great to do that properly so users can use non-Spark SQL systems to read those tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker
[ https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang resolved SPARK-9706. -- Resolution: Fixed List Public API Compatibility Issues with japi-compliance checker - Key: SPARK-9706 URL: https://issues.apache.org/jira/browse/SPARK-9706 Project: Spark Issue Type: Bug Components: ML, MLlib Reporter: Feynman Liang Attachments: compat_reports.zip Using command: {{japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar}} Report result attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9706) List Public API Compatibility Issues with japi-compliance checker
[ https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-9706: - Summary: List Public API Compatibility Issues with japi-compliance checker (was: Check Public API Compatibility with japi-compliance checker) List Public API Compatibility Issues with japi-compliance checker - Key: SPARK-9706 URL: https://issues.apache.org/jira/browse/SPARK-9706 Project: Spark Issue Type: Bug Components: ML, MLlib Reporter: Feynman Liang Attachments: compat_reports.zip Using command: {{japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar spark-mllib_2.10-1.5.0-SNAPSHOT.jar}} Report result attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660816#comment-14660816 ] Ryan Williams commented on SPARK-1517: -- That all makes sense, thanks. I guess I was imagining we could publish the SHA'd snapshots to a Maven repository other than the Apache snapshot repository, especially if the latter has rules that make this inconvenient. Understood that the binaries (and Maven artifacts) would have to be clearly branded as not official Apache releases. If I came up with a URL that binaries could be uploaded to, what would have to change to make it happen? Likewise if I found a Maven repository that could host these artifacts? Publish nightly snapshots of documentation, maven artifacts, and binary builds -- Key: SPARK-1517 URL: https://issues.apache.org/jira/browse/SPARK-1517 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Critical Should be pretty easy to do with Jenkins. The only thing I can think of that would be tricky is to set up credentials so that jenkins can publish this stuff somewhere on apache infra. Ideally we don't want to have to put a private key on every jenkins box (since they are otherwise pretty stateless). One idea is to encrypt these credentials with a passphrase and post them somewhere publicly visible. Then the jenkins build can download the credentials provided we set a passphrase in an environment variable in jenkins. There may be simpler solutions as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9711) spark-submit fails after restarting cluster with spark-ec2
Guangyang Li created SPARK-9711: --- Summary: spark-submit fails after restarting cluster with spark-ec2 Key: SPARK-9711 URL: https://issues.apache.org/jira/browse/SPARK-9711 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.1 Reporter: Guangyang Li With Spark 1.4.1 and YARN client mode. My job works the first time the cluster is built. If I stop and start the cluster, the same /bin/spark-submit command fails and keeps trying to connect to the master node: INFO Client: Retrying connect to server: ec2-54-174-232-129.compute-1.amazonaws.com/172.31.36.29:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9683) deep copy UTF8String when convert unsafe row to safe row
[ https://issues.apache.org/jira/browse/SPARK-9683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-9683: -- Assignee: Wenchen Fan deep copy UTF8String when convert unsafe row to safe row Key: SPARK-9683 URL: https://issues.apache.org/jira/browse/SPARK-9683 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9676) Execute GraphX library from R
[ https://issues.apache.org/jira/browse/SPARK-9676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9676. -- Resolution: Invalid Please use u...@spark.apache.org to ask questions. Execute GraphX library from R - Key: SPARK-9676 URL: https://issues.apache.org/jira/browse/SPARK-9676 Project: Spark Issue Type: Brainstorming Components: R Affects Versions: 1.6.0 Reporter: Sudhindra I wanted to use GraphX from SparkR. Is there a way to do it? I think as of now it is not possible. I was wondering whether one could write a wrapper in R that calls the Scala GraphX libraries. Any thoughts on this, please. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9685) Unsupported dataType: char(X) in Hive
Ángel Álvarez created SPARK-9685: Summary: Unsupported dataType: char(X) in Hive Key: SPARK-9685 URL: https://issues.apache.org/jira/browse/SPARK-9685 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Ángel Álvarez I'm getting the following error when I try to read a Hive table with char(X) fields: {code} 15/08/06 11:38:51 INFO parse.ParseDriver: Parse Completed org.apache.spark.sql.types.DataTypeException: Unsupported dataType: char(8). If you have a struct and a field name of it has any special characters, please use backticks (`) to quote that field name, e.g. `x+y`. Please note that backtick itself is not supported in a field name. at org.apache.spark.sql.types.DataTypeParser$class.toDataType(DataTypeParser.scala:95) at org.apache.spark.sql.types.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:107) at org.apache.spark.sql.types.DataTypeParser$.parse(DataTypeParser.scala:111) at org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:769) at org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:742) at org.apache.spark.sql.hive.MetastoreRelation$$anonfun$44.apply(HiveMetastoreCatalog.scala:752) at org.apache.spark.sql.hive.MetastoreRelation$$anonfun$44.apply(HiveMetastoreCatalog.scala:752) {code} It seems there is no char DataType defined in the DataTypeParser class {code}
protected lazy val primitiveType: Parser[DataType] =
  "(?i)string".r ^^^ StringType |
  "(?i)float".r ^^^ FloatType |
  "(?i)(?:int|integer)".r ^^^ IntegerType |
  "(?i)tinyint".r ^^^ ByteType |
  "(?i)smallint".r ^^^ ShortType |
  "(?i)double".r ^^^ DoubleType |
  "(?i)(?:bigint|long)".r ^^^ LongType |
  "(?i)binary".r ^^^ BinaryType |
  "(?i)boolean".r ^^^ BooleanType |
  fixedDecimalType |
  "(?i)decimal".r ^^^ DecimalType.USER_DEFAULT |
  "(?i)date".r ^^^ DateType |
  "(?i)timestamp".r ^^^ TimestampType |
  varchar
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
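The gap described in SPARK-9685 can be illustrated outside of Spark. The sketch below is a hypothetical, regex-based type mapper (not Spark's actual DataTypeParser, which uses Scala parser combinators) showing how a char(X) metastore type could be normalized to a string type, the same treatment varchar already receives:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HiveTypeMapper {
    // Matches parameterized character types such as char(8) or varchar(32).
    private static final Pattern CHAR_LIKE =
        Pattern.compile("(?i)(?:var)?char\\s*\\(\\s*(\\d+)\\s*\\)");

    // Map a Hive metastore type string to a simplified Spark-SQL-style type name.
    public static String toDataType(String hiveType) {
        Matcher m = CHAR_LIKE.matcher(hiveType.trim());
        if (m.matches()) {
            // Treat both fixed- and variable-length character types as strings,
            // which is what the JIRA effectively asks for char(X).
            return "StringType";
        }
        switch (hiveType.trim().toLowerCase()) {
            case "string":  return "StringType";
            case "int":
            case "integer": return "IntegerType";
            case "bigint":
            case "long":    return "LongType";
            case "double":  return "DoubleType";
            default:
                throw new IllegalArgumentException("Unsupported dataType: " + hiveType);
        }
    }

    public static void main(String[] args) {
        System.out.println(toDataType("char(8)"));     // StringType
        System.out.println(toDataType("varchar(32)")); // StringType
    }
}
```

In the real parser, the analogous fix would be adding a char alternative alongside the existing varchar production shown above.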
[jira] [Assigned] (SPARK-9641) spark.shuffle.service.port is not documented
[ https://issues.apache.org/jira/browse/SPARK-9641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9641: --- Assignee: (was: Apache Spark) spark.shuffle.service.port is not documented Key: SPARK-9641 URL: https://issues.apache.org/jira/browse/SPARK-9641 Project: Spark Issue Type: Improvement Components: Documentation, Shuffle Reporter: Thomas Graves Priority: Minor Looking at the code I see spark.shuffle.service.port being used but I can't find any documentation on it. I don't see a reason for this to be an internal config so we should document it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
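For context on SPARK-9641, the setting controls the port the external shuffle service listens on. A typical spark-defaults.conf entry might look like the following sketch (7337 is the code's default value; the pairing with spark.shuffle.service.enabled is shown purely as an illustration):

```
# Enable the external shuffle service and pin its listening port
# (7337 is the built-in default).
spark.shuffle.service.enabled  true
spark.shuffle.service.port     7337
```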
[jira] [Resolved] (SPARK-9684) sbt/sbt assembly Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar
[ https://issues.apache.org/jira/browse/SPARK-9684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9684. -- Resolution: Duplicate [~prabeeshk] please search JIRA first. Also, you're reporting vs an ancient version of Spark; you probably need to evaluate whether it's fixed in master first. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark sbt/sbt assembly Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar Key: SPARK-9684 URL: https://issues.apache.org/jira/browse/SPARK-9684 Project: Spark Issue Type: Bug Affects Versions: 1.1.0, 1.1.1 Reporter: Prabeesh K While running sbt/sbt assembly . Got following error. Attempting to fetch sbt Launching sbt from sbt/sbt-launch-0.13.6.jar Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9641) spark.shuffle.service.port is not documented
[ https://issues.apache.org/jira/browse/SPARK-9641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659778#comment-14659778 ] Apache Spark commented on SPARK-9641: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/7991 spark.shuffle.service.port is not documented Key: SPARK-9641 URL: https://issues.apache.org/jira/browse/SPARK-9641 Project: Spark Issue Type: Improvement Components: Documentation, Shuffle Reporter: Thomas Graves Priority: Minor Looking at the code I see spark.shuffle.service.port being used but I can't find any documentation on it. I don't see a reason for this to be an internal config so we should document it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9684) sbt/sbt assembly Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar
[ https://issues.apache.org/jira/browse/SPARK-9684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659779#comment-14659779 ] Prabeesh K commented on SPARK-9684: --- [~pwendell] Please have a look sbt/sbt assembly Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar Key: SPARK-9684 URL: https://issues.apache.org/jira/browse/SPARK-9684 Project: Spark Issue Type: Bug Affects Versions: 1.1.0, 1.1.1 Reporter: Prabeesh K While running sbt/sbt assembly . Got following error. Attempting to fetch sbt Launching sbt from sbt/sbt-launch-0.13.6.jar Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9641) spark.shuffle.service.port is not documented
[ https://issues.apache.org/jira/browse/SPARK-9641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9641: --- Assignee: Apache Spark spark.shuffle.service.port is not documented Key: SPARK-9641 URL: https://issues.apache.org/jira/browse/SPARK-9641 Project: Spark Issue Type: Improvement Components: Documentation, Shuffle Reporter: Thomas Graves Assignee: Apache Spark Priority: Minor Looking at the code I see spark.shuffle.service.port being used but I can't find any documentation on it. I don't see a reason for this to be an internal config so we should document it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9684) sbt/sbt assembly Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar
[ https://issues.apache.org/jira/browse/SPARK-9684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659785#comment-14659785 ] Prabeesh K commented on SPARK-9684: --- It is fixed in master, but I can't find a Spark 1.x release that includes the fix. sbt/sbt assembly Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar Key: SPARK-9684 URL: https://issues.apache.org/jira/browse/SPARK-9684 Project: Spark Issue Type: Bug Affects Versions: 1.1.0, 1.1.1 Reporter: Prabeesh K While running sbt/sbt assembly . Got following error. Attempting to fetch sbt Launching sbt from sbt/sbt-launch-0.13.6.jar Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9681) Support R feature interactions in RFormula
[ https://issues.apache.org/jira/browse/SPARK-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9681: --- Assignee: Apache Spark Support R feature interactions in RFormula -- Key: SPARK-9681 URL: https://issues.apache.org/jira/browse/SPARK-9681 Project: Spark Issue Type: Improvement Components: ML, SparkR Reporter: Eric Liang Assignee: Apache Spark Support the interaction (:) operator RFormula feature transformer, so that it is available for use in SparkR's glm. Umbrella design doc for RFormula integration: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?pli=1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9683) deep copy UTF8String when convert unsafe row to safe row
Wenchen Fan created SPARK-9683: -- Summary: deep copy UTF8String when convert unsafe row to safe row Key: SPARK-9683 URL: https://issues.apache.org/jira/browse/SPARK-9683 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org