[jira] [Commented] (SPARK-10437) Support aggregation expressions in Order By
[ https://issues.apache.org/jira/browse/SPARK-10437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730542#comment-14730542 ]

Apache Spark commented on SPARK-10437:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/8599

> Support aggregation expressions in Order By
> -------------------------------------------
>
>                 Key: SPARK-10437
>                 URL: https://issues.apache.org/jira/browse/SPARK-10437
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Harish Butani
>
> Follow-up on SPARK-6583. The following still fails:
> {code}
> val df = sqlContext.read.json("examples/src/main/resources/people.json")
> df.registerTempTable("t")
> val df2 = sqlContext.sql("select age, count(*) from t group by age order by count(*)")
> df2.show()
> {code}
> {code:title=StackTrace}
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: Count, tree: COUNT(1)
>   at org.apache.spark.sql.catalyst.expressions.AggregateExpression.eval(aggregates.scala:41)
>   at org.apache.spark.sql.catalyst.expressions.RowOrdering.compare(rows.scala:219)
> {code}
> In 1.4 the issue seemed to be that BindReferences.bindReference didn't handle this case. I haven't looked at the 1.5 code, but I don't see a change to bindReference in this patch.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
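[Editor's note] For readers outside Spark internals, a minimal plain-Python sketch (not Spark code, and not how Catalyst evaluates anything) of the semantics the failing query asks for: group, count, then order the groups by the aggregate. The sample ages are made up.

```python
from collections import Counter

# Toy rows standing in for people.json's `age` column; None models a null age.
ages = [30, None, 19, 30, 30, 19]

# GROUP BY age with COUNT(*) ...
counts = Counter(ages)

# ... then ORDER BY count(*): sort the groups by the aggregate value.
result = sorted(counts.items(), key=lambda kv: kv[1])
print(result)  # [(None, 1), (19, 2), (30, 3)]
```

The bug is that Spark 1.5's RowOrdering tries to evaluate the raw COUNT(1) expression per row instead of ordering by the already-computed aggregate, as this sketch does.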
[jira] [Updated] (SPARK-10445) Extend maven version range (enforcer)
[ https://issues.apache.org/jira/browse/SPARK-10445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-10445:
------------------------------
    Priority: Minor  (was: Major)

> Extend maven version range (enforcer)
> -------------------------------------
>
>                 Key: SPARK-10445
>                 URL: https://issues.apache.org/jira/browse/SPARK-10445
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>            Reporter: Jean-Baptiste Onofré
>            Priority: Minor
>
> Currently, the pom.xml forces (via an enforcer rule) the use of Maven 3.3.x. The build actually works fine with Maven 3.2.x as well. I propose to extend the Maven version range.
[jira] [Assigned] (SPARK-10446) Support to specify join type when calling join with usingColumns
[ https://issues.apache.org/jira/browse/SPARK-10446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10446:
------------------------------------
    Assignee: Apache Spark

> Support to specify join type when calling join with usingColumns
> ----------------------------------------------------------------
>
>                 Key: SPARK-10446
>                 URL: https://issues.apache.org/jira/browse/SPARK-10446
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Liang-Chi Hsieh
>            Assignee: Apache Spark
>
> Currently the method join(right: DataFrame, usingColumns: Seq[String]) only supports inner joins. It would be more convenient if it supported other join types as well.
[jira] [Resolved] (SPARK-10445) Extend maven version range (enforcer)
[ https://issues.apache.org/jira/browse/SPARK-10445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-10445.
-------------------------------
    Resolution: Won't Fix

See https://issues.apache.org/jira/browse/SPARK-9521 -- we need Maven 3.3+, but the build system downloads it for you.
[jira] [Created] (SPARK-10446) Support to specify join type when calling join with usingColumns
Liang-Chi Hsieh created SPARK-10446:
---------------------------------------

             Summary: Support to specify join type when calling join with usingColumns
                 Key: SPARK-10446
                 URL: https://issues.apache.org/jira/browse/SPARK-10446
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Liang-Chi Hsieh

Currently the method join(right: DataFrame, usingColumns: Seq[String]) only supports inner joins. It would be more convenient if it supported other join types as well.
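[Editor's note] A rough pure-Python sketch of the "join on using-columns with a selectable join type" semantics this ticket asks for. The function name, the `how` values, and the dict-row representation are all illustrative, not Spark's API:

```python
def join_using(left, right, using, how="inner"):
    """Join two lists of dict-rows on the named columns.

    `how` is "inner" or "left_outer"; the using-columns appear once in
    the output, as in SQL's JOIN ... USING (...). Assumes `right` is
    non-empty when padding left-outer rows.
    """
    key = lambda row: tuple(row[c] for c in using)
    index = {}
    for r in right:
        index.setdefault(key(r), []).append(r)
    out = []
    for l in left:
        matches = index.get(key(l), [])
        if matches:
            for r in matches:
                merged = dict(l)
                merged.update({k: v for k, v in r.items() if k not in using})
                out.append(merged)
        elif how == "left_outer":
            # Unmatched left row: pad the right-side columns with None.
            padded = dict(l)
            padded.update({k: None for k in right[0] if k not in using})
            out.append(padded)
    return out

left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 1, "b": "p"}]
print(join_using(left, right, ["id"]))                    # inner: only id=1 survives
print(join_using(left, right, ["id"], how="left_outer"))  # id=2 kept, b padded with None
```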
[jira] [Commented] (SPARK-10442) select cast('false' as boolean) returns true
[ https://issues.apache.org/jira/browse/SPARK-10442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730657#comment-14730657 ]

Cheng Lian commented on SPARK-10442:
------------------------------------

The reason is that all non-empty strings are converted to {{true}} when casting to boolean. This behavior isn't intuitive, but it is consistent with Hive, and I'm not sure whether we want to change it. PostgreSQL only allows the string literals {{'true'}} and {{'false'}} (case-insensitive) to be cast to boolean; casting any other string to boolean raises an error.

> select cast('false' as boolean) returns true
> --------------------------------------------
>
>                 Key: SPARK-10442
>                 URL: https://issues.apache.org/jira/browse/SPARK-10442
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Critical
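[Editor's note] The two casting behaviors compared in the comment, sketched in plain Python. The PostgreSQL variant follows the comment's description (only 'true'/'false', case-insensitive); the function names are made up for illustration:

```python
def hive_style_cast(s):
    # Hive-compatible: any non-empty string becomes True, 'false' included.
    return len(s) > 0

def pg_style_cast(s):
    # PostgreSQL-style per the comment: only true/false (case-insensitive)
    # are valid; anything else is an error.
    lowered = s.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(f"invalid input for boolean: {s!r}")

print(hive_style_cast("false"))  # True -- the surprising result in this ticket
print(pg_style_cast("false"))    # False
```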
[jira] [Updated] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter
[ https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-10310:
-------------------------------
    Description:

There is a real case using a Python stream script in a Spark SQL query. We found that all result records were written in ONE line as input from the "select" pipeline to the Python script, so the script cannot identify individual records. Also, the field separator in Spark SQL is '^A' ('\001'), which is incompatible with the '\t' used in the Hive implementation.

Key query:
{code:sql}
CREATE VIEW temp1 AS
SELECT *
FROM
(
  FROM
  (
    SELECT
      c.wcs_user_sk,
      w.wp_type,
      (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
    FROM web_clickstreams c, web_page w
    WHERE c.wcs_web_page_sk = w.wp_web_page_sk
    AND   c.wcs_web_page_sk IS NOT NULL
    AND   c.wcs_user_sk     IS NOT NULL
    AND   c.wcs_sales_sk    IS NULL --abandoned implies: no sale
    DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
  ) clicksAnWebPageType
  REDUCE
    wcs_user_sk,
    tstamp_inSec,
    wp_type
  USING 'python sessionize.py 3600'
  AS (
    wp_type STRING,
    tstamp BIGINT,
    sessionid STRING)
) sessionized
{code}

Key Python script:
{noformat}
for line in sys.stdin:
    user_sk, tstamp_str, value = line.strip().split("\t")
{noformat}

Sample SELECT result:
{noformat}
^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
{noformat}

Expected result:
{noformat}
31 3237764860 feedback
31 3237769106 dynamic
31 3237779027 review
{noformat}
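[Editor's note] The delimiter mismatch described above, reproduced in plain Python (the sample values come from the ticket; the strings are hand-built, not produced by Spark):

```python
# What sessionize.py expects on stdin: one tab-delimited record per line.
expected = "31\t3237764860\tfeedback\n31\t3237769106\tdynamic\n"

# What Spark SQL's script transform actually emitted: '\x01' (^A) field
# separators and no record delimiters, so everything arrives as "one line".
actual = "31\x013237764860\x01feedback31\x013237769106\x01dynamic"

assert len(expected.splitlines()) == 2  # two parseable records
assert len(actual.splitlines()) == 1    # a single unparseable blob

# The script's parsing loop only works on the expected form:
for line in expected.splitlines():
    user_sk, tstamp_str, value = line.strip().split("\t")
```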
[jira] [Commented] (SPARK-10446) Support to specify join type when calling join with usingColumns
[ https://issues.apache.org/jira/browse/SPARK-10446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730600#comment-14730600 ]

Apache Spark commented on SPARK-10446:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/8600
[jira] [Assigned] (SPARK-10446) Support to specify join type when calling join with usingColumns
[ https://issues.apache.org/jira/browse/SPARK-10446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10446:
------------------------------------
    Assignee:  (was: Apache Spark)
[jira] [Commented] (SPARK-9235) PYSPARK_DRIVER_PYTHON env variable is not set on the YARN Node manager acting as driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730680#comment-14730680 ]

Aaron Glahe commented on SPARK-9235:
------------------------------------

You set it in spark-env.sh; e.g., since we use conda as our "python env":

SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/srv/software/anaconda/bin/python"

> PYSPARK_DRIVER_PYTHON env variable is not set on the YARN Node Manager acting as driver in yarn-cluster mode
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-9235
>                 URL: https://issues.apache.org/jira/browse/SPARK-9235
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.1, 1.5.0
>        Environment: CentOS 6.6, Python 2.7, Spark 1.4.1 tagged version, YARN cluster manager, CDH 5.4.1 (Hadoop 2.6.0++), Java 1.7
>            Reporter: Aaron Glahe
>            Priority: Minor
>
> Relates to SPARK-9229.
> Env: Spark on YARN, Java 1.7, CentOS 6.6, CDH 5.4.1 (Hadoop 2.6.0++), Anaconda Python 2.7.10 installed in the /srv/software directory.
> On a client/submitting machine, we set the PYSPARK_DRIVER_PYTHON env var in spark-env.sh to point at the Anaconda Python executable, which is present on every YARN node:
> export PYSPARK_DRIVER_PYTHON='/srv/software/anaconda/bin/python'
> As a side note, export PYSPARK_PYTHON='/srv/software/anaconda/bin/python' was set in spark-env.sh as well.
> Run the command:
> spark-submit test.py --master yarn --deploy-mode cluster
> It appears that the Node Manager running the driver does not use the PYSPARK_DRIVER_PYTHON python, but instead uses the CentOS system default (which in this case is Python 2.6).
> A workaround appears to be setting the python path in SPARK_YARN_USER_ENV.
[jira] [Assigned] (SPARK-10437) Support aggregation expressions in Order By
[ https://issues.apache.org/jira/browse/SPARK-10437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10437:
------------------------------------
    Assignee:  (was: Apache Spark)
[jira] [Assigned] (SPARK-10437) Support aggregation expressions in Order By
[ https://issues.apache.org/jira/browse/SPARK-10437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10437:
------------------------------------
    Assignee: Apache Spark
[jira] [Updated] (SPARK-10298) PySpark can't JSON serialize a DataFrame with DecimalType columns.
[ https://issues.apache.org/jira/browse/SPARK-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-10298:
------------------------------
    Assignee: Michael Armbrust

> PySpark can't JSON serialize a DataFrame with DecimalType columns.
> ------------------------------------------------------------------
>
>                 Key: SPARK-10298
>                 URL: https://issues.apache.org/jira/browse/SPARK-10298
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Kevin Cox
>            Assignee: Michael Armbrust
>             Fix For: 1.5.0
>
> {code}
> In [8]: sc.sql.createDataFrame([[Decimal(123)]], types.StructType([types.StructField("a", types.DecimalType())]))
> Out[8]: DataFrame[a: decimal(10,0)]
>
> In [9]: _.write.json("foo")
> 15/08/26 14:26:21 ERROR DefaultWriterContainer: Aborting task.
> scala.MatchError: (DecimalType(10,0),123) (of class scala.Tuple2)
> 	at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
> 	at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
> 	at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
> 	at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
> 	at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133)
> 	at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:191)
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:224)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:88)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> 15/08/26 14:26:21 ERROR DefaultWriterContainer: Task attempt attempt_201508261426__m_00_0 aborted.
> 15/08/26 14:26:21 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> org.apache.spark.SparkException: Task failed while writing rows.
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:232)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:88)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: (DecimalType(10,0),123) (of class scala.Tuple2)
> 	at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
> 	at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
> 	at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
> 	at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
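[Editor's note] The Scala error above is a missing match case for DecimalType in JacksonGenerator's value writer. The same class of failure, and the usual shape of the fix, can be shown with Python's standard json module, which likewise has no built-in case for Decimal:

```python
import json
from decimal import Decimal

row = {"a": Decimal("123")}

# json.dumps has no case for Decimal, much like the missing DecimalType
# case in valWriter that produces the MatchError above:
try:
    json.dumps(row)
except TypeError as err:
    print("serialization failed:", err)

# One generic fix: supply a fallback encoder for the unhandled type.
encoded = json.dumps(row, default=str)
print(encoded)  # {"a": "123"}
```

(The actual Spark fix adds a proper DecimalType case rather than stringifying; this just illustrates the failure mode.)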
[jira] [Updated] (SPARK-10159) Hive 1.3.x GenericUDFDate NPE issue
[ https://issues.apache.org/jira/browse/SPARK-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-10159:
------------------------------
    Assignee: Michael Armbrust

> Hive 1.3.x GenericUDFDate NPE issue
> -----------------------------------
>
>                 Key: SPARK-10159
>                 URL: https://issues.apache.org/jira/browse/SPARK-10159
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Alex Liu
>            Assignee: Michael Armbrust
>             Fix For: 1.5.0
>
> When running the following SQL query with HiveContext, Hive's GenericUDFDate throws a NullPointerException. The query and log follow.
> {code}
> SELECT a.stationid AS stationid,
>        a.month AS month,
>        a.year AS year,
>        AVG(a.mean) AS mean,
>        MIN(a.min) AS min,
>        MAX(a.max) AS max
> FROM
>   (SELECT *,
>           YEAR(date) AS year,
>           MONTH(date) AS month,
>           FROM_UNIXTIME(UNIX_TIMESTAMP(TO_DATE(date), 'yyyy-MM-dd'), 'E') AS weekday
>    FROM weathercql.daily) a
> WHERE ((a.weekday = 'Mon'))
>   AND (a.metric = 'temperature')
> GROUP BY a.stationid, a.month, a.year
> ORDER BY stationid, year, month
> LIMIT 100
> {code}
> log
> {code}
> Filter ((HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDate(date#81),yyyy-MM-dd),E) = Mon) && (metric#80 = temperature))
> ERROR 2015-08-20 15:39:06 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation: Error executing query:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 208, 127.0.0.1): java.lang.NullPointerException
> 	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFDate.evaluate(GenericUDFDate.java:119)
> 	at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:188)
> 	at org.apache.spark.sql.hive.HiveGenericUdf$$anonfun$eval$2.apply(hiveUdfs.scala:184)
> 	at org.apache.spark.sql.hive.DeferredObjectAdapter.get(hiveUdfs.scala:138)
> 	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFToUnixTimeStamp.evaluate(GenericUDFToUnixTimeStamp.java:121)
> 	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp.evaluate(GenericUDFUnixTimeStamp.java:52)
> 	at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:188)
> 	at org.apache.spark.sql.hive.HiveSimpleUdf$$anonfun$eval$1.apply(hiveUdfs.scala:121)
> 	at org.apache.spark.sql.hive.HiveSimpleUdf$$anonfun$eval$1.apply(hiveUdfs.scala:121)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.immutable.List.foreach(List.scala:318)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> 	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> 	at org.apache.spark.sql.hive.HiveSimpleUdf.eval(hiveUdfs.scala:121)
> 	at org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:191)
> 	at org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:130)
> 	at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$1.apply(predicates.scala:30)
> 	at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$1.apply(predicates.scala:30)
> 	at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 	at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:154)
> 	at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:70)
[jira] [Created] (SPARK-10447) Upgrade pyspark to use py4j 0.9
Justin Uang created SPARK-10447:
-----------------------------------

             Summary: Upgrade pyspark to use py4j 0.9
                 Key: SPARK-10447
                 URL: https://issues.apache.org/jira/browse/SPARK-10447
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 1.4.1
            Reporter: Justin Uang

This was recently released, and it has many improvements, especially the following:

{quote}
Python side: IDEs and interactive interpreters such as IPython can now get help text/autocompletion for Java classes, objects, and members. This makes Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). Thanks to @jonahkichwacoders
{quote}

Normally we wrap all the APIs in Spark, but for the ones that aren't wrapped, this would make it easier to go off-road by using the Java proxy objects directly.
[jira] [Commented] (SPARK-4940) Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730911#comment-14730911 ]

Martin Tapp commented on SPARK-4940:
------------------------------------

My principal use case is to cram as much as possible onto the same cluster, and some of our apps would benefit from these different strategies. For instance, we use a library that starts lots of threads, so the round-robin strategy is a really good match for that workload: it prevents too many tasks landing on the same executor. Another example is a pure Spark pipeline, where it's fine to fill each slave first because not many outside resources are being used. This would let us maximize our cluster resource utilization.

> Support more evenly distributing cores for Mesos mode
> -----------------------------------------------------
>
>                 Key: SPARK-4940
>                 URL: https://issues.apache.org/jira/browse/SPARK-4940
>             Project: Spark
>          Issue Type: Improvement
>          Components: Mesos
>            Reporter: Timothy Chen
>         Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
> Currently, in coarse-grained mode the Spark scheduler simply takes all the resources it can on each node, which can cause uneven distribution depending on the resources available on each slave.
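[Editor's note] A toy sketch of the two placement strategies being discussed: the current fill-first behavior versus a round-robin spread across slave offers. The function names and the offer representation are made up for illustration; this is not Spark's or Mesos's scheduler code:

```python
def fill_first(offers, cores_needed):
    """Current coarse-grained behavior: take as many cores as possible
    from each slave in turn until satisfied."""
    placement = {}
    for slave, free in offers.items():
        take = min(free, cores_needed)
        if take:
            placement[slave] = take
            cores_needed -= take
    return placement

def round_robin(offers, cores_needed):
    """Proposed spread: take one core per slave per pass, distributing
    the load evenly across all offers."""
    placement = {s: 0 for s in offers}
    free = dict(offers)
    while cores_needed > 0 and any(free.values()):
        for slave in offers:
            if cores_needed > 0 and free[slave] > 0:
                placement[slave] += 1
                free[slave] -= 1
                cores_needed -= 1
    return {s: n for s, n in placement.items() if n}

offers = {"slave1": 8, "slave2": 8, "slave3": 8}
print(fill_first(offers, 10))   # {'slave1': 8, 'slave2': 2}
print(round_robin(offers, 10))  # {'slave1': 4, 'slave2': 3, 'slave3': 3}
```

Fill-first suits self-contained pipelines (better locality, fewer executors); round-robin suits thread-heavy apps, as the comment notes.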
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730990#comment-14730990 ]

Sean Owen commented on SPARK-10447:
-----------------------------------

I bet there are some upsides to updating, but the question is: do we know whether anything breaks or changes? It's worth at least running the tests with this change, and also skimming the release notes to understand any breaking changes.
[jira] [Created] (SPARK-10448) Parquet schema merging should NOT merge UDT
Cheng Lian created SPARK-10448:
----------------------------------

             Summary: Parquet schema merging should NOT merge UDT
                 Key: SPARK-10448
                 URL: https://issues.apache.org/jira/browse/SPARK-10448
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.1, 1.3.1, 1.5.0
            Reporter: Cheng Lian

For example, we may have a UDT {{U}} that maps to a Catalyst {{StructType}} with two fields {{a}} and {{b}}. Later on, we update {{U}} to {{U'}} by removing {{a}} and adding {{c}}. In this case, Parquet schema merging will give a {{StructType}} with all three fields, but such a {{StructType}} can be mapped to neither {{U}} nor {{U'}}. We probably shouldn't allow schema merging over UDT types.
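[Editor's note] The failure mode in miniature, using dicts of field names to types as stand-ins for Catalyst StructTypes (the merge function is a simplification of field-wise schema merging, not Spark's implementation):

```python
def merge_struct(a, b):
    """Field-wise schema merge: union of field names, a's fields first."""
    merged = dict(a)
    merged.update(b)
    return merged

u = {"a": "int", "b": "string"}        # UDT U maps to struct<a, b>
u_prime = {"b": "string", "c": "int"}  # U' dropped `a` and added `c`

merged = merge_struct(u, u_prime)
print(merged)  # {'a': 'int', 'b': 'string', 'c': 'int'}

# The merged struct matches neither U's nor U''s layout, so it cannot be
# mapped back to either UDT -- hence merging should be refused for UDTs.
assert merged != u and merged != u_prime
```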
[jira] [Assigned] (SPARK-10450) Minor SQL style, format, typo, readability fixes
[ https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10450:
------------------------------------
    Assignee: Andrew Or  (was: Apache Spark)

> Minor SQL style, format, typo, readability fixes
> ------------------------------------------------
>
>                 Key: SPARK-10450
>                 URL: https://issues.apache.org/jira/browse/SPARK-10450
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>            Priority: Minor
>
> This JIRA isn't exactly tied to one particular patch. Like SPARK-10003, it's more of a continuous process.
[jira] [Commented] (SPARK-10450) Minor SQL style, format, typo, readability fixes
[ https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731211#comment-14731211 ] Apache Spark commented on SPARK-10450: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/8603 > Minor SQL style, format, typo, readability fixes > > > Key: SPARK-10450 > URL: https://issues.apache.org/jira/browse/SPARK-10450 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's > more of a continuous process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan
[ https://issues.apache.org/jira/browse/SPARK-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10451: Assignee: Apache Spark > Prevent unnecessary serializations in InMemoryColumnarTableScan > --- > > Key: SPARK-10451 > URL: https://issues.apache.org/jira/browse/SPARK-10451 > Project: Spark > Issue Type: Improvement >Reporter: Yash Datta >Assignee: Apache Spark > > In InMemoryColumnarTableScan, serialization of certain fields like > buildFilter, InMemoryRelation, etc. can be avoided during task execution by > carefully managing the closure of mapPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan
[ https://issues.apache.org/jira/browse/SPARK-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731283#comment-14731283 ] Apache Spark commented on SPARK-10451: -- User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/8604 > Prevent unnecessary serializations in InMemoryColumnarTableScan > --- > > Key: SPARK-10451 > URL: https://issues.apache.org/jira/browse/SPARK-10451 > Project: Spark > Issue Type: Improvement >Reporter: Yash Datta > > In InMemoryColumnarTableScan, serialization of certain fields like > buildFilter, InMemoryRelation, etc. can be avoided during task execution by > carefully managing the closure of mapPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan
[ https://issues.apache.org/jira/browse/SPARK-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10451: Assignee: (was: Apache Spark) > Prevent unnecessary serializations in InMemoryColumnarTableScan > --- > > Key: SPARK-10451 > URL: https://issues.apache.org/jira/browse/SPARK-10451 > Project: Spark > Issue Type: Improvement >Reporter: Yash Datta > > In InMemoryColumnarTableScan, serialization of certain fields like > buildFilter, InMemoryRelation, etc. can be avoided during task execution by > carefully managing the closure of mapPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8951) support CJK characters in collect()
[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731151#comment-14731151 ] Shivaram Venkataraman commented on SPARK-8951: -- Ah I should have retested this before merging - I'll send a PR to fix this now > support CJK characters in collect() > --- > > Key: SPARK-8951 > URL: https://issues.apache.org/jira/browse/SPARK-8951 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Jaehong Choi >Assignee: Jaehong Choi >Priority: Minor > Fix For: 1.6.0 > > Attachments: SerDe.scala.diff > > > Spark gives an error message and does not show the output when a field of the > result DataFrame contains characters in CJK. > I found out that SerDe in R API only supports ASCII format for strings right > now as commented in source code. > So, I fixed SerDe.scala a little to support CJK as the file attached. > I did not care efficiency, but just wanted to see if it works. > {noformat} > people.json > {"name":"가나"} > {"name":"테스트123", "age":30} > {"name":"Justin", "age":19} > df <- read.df(sqlContext, "./people.json", "json") > head(df) > Error in rawtochar(string) : embedded nul in string : '\0 \x98' > {noformat} > {code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala} > // NOTE: Only works for ASCII right now > def writeString(out: DataOutputStream, value: String): Unit = { > val len = value.length > out.writeInt(len + 1) // For the \0 > out.writeBytes(value) > out.writeByte(0) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10449) StructType.merge shouldn't merge DecimalTypes with different precisions and/or scales
Cheng Lian created SPARK-10449: -- Summary: StructType.merge shouldn't merge DecimalTypes with different precisions and/or scales Key: SPARK-10449 URL: https://issues.apache.org/jira/browse/SPARK-10449 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1, 1.3.1, 1.5.0 Reporter: Cheng Lian Schema merging should only handle struct fields. But currently we also reconcile decimal precision and scale information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
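The proposed behavior can be sketched as follows (plain Python, illustrative names only, not Spark's actual StructType.merge): merging should fail fast on DecimalTypes whose precision or scale differ, rather than reconciling them into a third type.

```python
# Hypothetical sketch: reject, rather than reconcile, mismatched decimal types.
from collections import namedtuple

DecimalType = namedtuple("DecimalType", ["precision", "scale"])

def merge_decimal(left, right):
    """Merge two DecimalTypes; only identical precision/scale are compatible."""
    if (left.precision, left.scale) != (right.precision, right.scale):
        raise ValueError(
            f"cannot merge decimal({left.precision},{left.scale}) "
            f"with decimal({right.precision},{right.scale})"
        )
    return left

# Identical types merge trivially.
assert merge_decimal(DecimalType(10, 2), DecimalType(10, 2)) == DecimalType(10, 2)

# Differing precision/scale should raise instead of being reconciled.
try:
    merge_decimal(DecimalType(10, 2), DecimalType(12, 4))
except ValueError as e:
    print(e)  # cannot merge decimal(10,2) with decimal(12,4)
```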
[jira] [Commented] (SPARK-9666) ML 1.5 QA: model save/load audit
[ https://issues.apache.org/jira/browse/SPARK-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731299#comment-14731299 ] Joseph K. Bradley commented on SPARK-9666: -- Thanks for checking. Shall I mark this complete? > ML 1.5 QA: model save/load audit > > > Key: SPARK-9666 > URL: https://issues.apache.org/jira/browse/SPARK-9666 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > We should check to make sure no changes broke model import/export in > spark.mllib. > * If a model's name, data members, or constructors have changed _at all_, > then we likely need to support a new save/load format version. Different > versions must be tested in unit tests to ensure backwards compatibility > (i.e., verify we can load old model formats). > * Examples in the programming guide should include save/load when available. > It's important to try running each example in the guide whenever it is > modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8951) support CJK characters in collect()
[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731155#comment-14731155 ] Shivaram Venkataraman commented on SPARK-8951: -- Sent https://github.com/apache/spark/pull/8601 to fix this > support CJK characters in collect() > --- > > Key: SPARK-8951 > URL: https://issues.apache.org/jira/browse/SPARK-8951 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Jaehong Choi >Assignee: Jaehong Choi >Priority: Minor > Fix For: 1.6.0 > > Attachments: SerDe.scala.diff > > > Spark gives an error message and does not show the output when a field of the > result DataFrame contains characters in CJK. > I found out that SerDe in R API only supports ASCII format for strings right > now as commented in source code. > So, I fixed SerDe.scala a little to support CJK as the file attached. > I did not care efficiency, but just wanted to see if it works. > {noformat} > people.json > {"name":"가나"} > {"name":"테스트123", "age":30} > {"name":"Justin", "age":19} > df <- read.df(sqlContext, "./people.json", "json") > head(df) > Error in rawtochar(string) : embedded nul in string : '\0 \x98' > {noformat} > {code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala} > // NOTE: Only works for ASCII right now > def writeString(out: DataOutputStream, value: String): Unit = { > val len = value.length > out.writeInt(len + 1) // For the \0 > out.writeBytes(value) > out.writeByte(0) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
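The root of the bug quoted above is the length prefix: writeString emits value.length (a character count) but then writes bytes, and for CJK text the encoded byte count is larger than the character count, so the R side reads short. A minimal Python sketch of the mismatch (simplifying the JVM's UTF-16/byte handling to UTF-8 for illustration):

```python
# Sketch of the length-prefix mismatch behind the SerDe bug: character count
# vs. encoded byte count agree for ASCII but diverge for CJK strings.

def broken_length(value: str) -> int:
    return len(value) + 1                   # char count + 1 for the trailing \0

def correct_length(value: str) -> int:
    return len(value.encode("utf-8")) + 1   # byte count + 1 for the trailing \0

ascii_name = "Justin"
cjk_name = "가나"   # from the people.json example in the issue

# ASCII: 6 chars == 6 bytes, so the broken prefix happens to be right.
assert broken_length(ascii_name) == correct_length(ascii_name) == 7

# CJK: 2 chars but 6 UTF-8 bytes, so the broken prefix under-counts.
assert broken_length(cjk_name) == 3
assert correct_length(cjk_name) == 7
```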
[jira] [Created] (SPARK-10450) Minor SQL style, format, typo, readability fixes
Andrew Or created SPARK-10450: - Summary: Minor SQL style, format, typo, readability fixes Key: SPARK-10450 URL: https://issues.apache.org/jira/browse/SPARK-10450 Project: Spark Issue Type: Improvement Components: SQL Reporter: Andrew Or Assignee: Andrew Or Priority: Minor This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's more of a continuous process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan
Yash Datta created SPARK-10451: -- Summary: Prevent unnecessary serializations in InMemoryColumnarTableScan Key: SPARK-10451 URL: https://issues.apache.org/jira/browse/SPARK-10451 Project: Spark Issue Type: Improvement Reporter: Yash Datta In InMemoryColumnarTableScan, serialization of certain fields like buildFilter, InMemoryRelation, etc. can be avoided during task execution by carefully managing the closure of mapPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
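The closure-hygiene pattern the issue describes can be sketched in Python (illustrative, not Spark's Scala code): a closure that references the enclosing object drags the whole object into serialization, while copying just the needed field into a local first keeps the shipped payload small.

```python
# Sketch: a lambda capturing self forces the whole object to be serialized
# with the task; capturing a local copy of one field avoids that.
import pickle

class Scan:
    def __init__(self):
        self.huge_cache = list(range(100_000))   # expensive state not needed on executors
        self.threshold = 10

    def filter_fn_bad(self):
        return lambda x: x > self.threshold      # closure captures all of self

    def filter_fn_good(self):
        threshold = self.threshold               # local copy of the one needed field
        return lambda x: x > threshold           # closure captures only the int

scan = Scan()
assert scan.filter_fn_bad()(11) == scan.filter_fn_good()(11) == True

# Plain lambdas don't pickle, so compare the size of what each closure must
# carry: the whole object vs. just the field.
bad_payload = pickle.dumps(scan)             # what capturing self would ship
good_payload = pickle.dumps(scan.threshold)  # what the local copy ships
assert len(good_payload) < len(bad_payload)
```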
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731164#comment-14731164 ] Justin Uang commented on SPARK-10447: - Agreed, I'm pretty sure that this will break some APIs and we'll have to fix those as we do the upgrade =). > Upgrade pyspark to use py4j 0.9 > --- > > Key: SPARK-10447 > URL: https://issues.apache.org/jira/browse/SPARK-10447 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.1 >Reporter: Justin Uang > > This was recently released, and it has many improvements, especially the > following: > {quote} > Python side: IDEs and interactive interpreters such as IPython can now get > help text/autocompletion for Java classes, objects, and members. This makes > Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). > Thanks to @jonahkichwacoders > {quote} > Normally we wrap all the APIs in spark, but for the ones that aren't, this > would make it easier to offroad by using the java proxy objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10450) Minor SQL style, format, typo, readability fixes
[ https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10450: Assignee: Apache Spark (was: Andrew Or) > Minor SQL style, format, typo, readability fixes > > > Key: SPARK-10450 > URL: https://issues.apache.org/jira/browse/SPARK-10450 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Minor > > This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's > more of a continuous process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10452) Pyspark worker security issue
Michael Procopio created SPARK-10452: Summary: Pyspark worker security issue Key: SPARK-10452 URL: https://issues.apache.org/jira/browse/SPARK-10452 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: Spark 1.4.0 running on hadoop 2.5.2. Reporter: Michael Procopio Priority: Critical The python worker launched by the executor is given the credentials used to launch yarn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10453) There's no way to use spark.dynamicAllocation.enabled with pyspark
Michael Procopio created SPARK-10453: Summary: There's no way to use spark.dynamicAllocation.enabled with pyspark Key: SPARK-10453 URL: https://issues.apache.org/jira/browse/SPARK-10453 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: When using spark.dynamicAllocation.enabled, the assumption is that memory/core resources will be mediated by the yarn resource manager. Unfortunately, whatever value is used for spark.executor.memory is consumed as JVM heap space by the executor. There's no way to account for the memory requirements of the pyspark worker. Executor JVM heap space should be decoupled from spark.executor.memory. Reporter: Michael Procopio -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10453) There's no way to use spark.dynamicAllocation.enabled with pyspark
[ https://issues.apache.org/jira/browse/SPARK-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-10453. Resolution: Not A Problem From http://spark.apache.org/docs/latest/running-on-yarn.html: {noformat} spark.yarn.executor.memoryOverhead executorMemory * 0.10, with minimum of 384 The amount of off heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%). {noformat} That also encompasses the python workers. > There's no way to use spark.dynamicAllocation.enabled with pyspark > -- > > Key: SPARK-10453 > URL: https://issues.apache.org/jira/browse/SPARK-10453 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0 > Environment: When using spark.dynamicAllocation.enabled, the > assumption is that memory/core resources will be mediated by the yarn > resource manager. Unfortunately, whatever value is used for > spark.executor.memory is consumed as JVM heap space by the executor. There's > no way to account for the memory requirements of the pyspark worker. > Executor JVM heap space should be decoupled from spark.executor.memory. >Reporter: Michael Procopio > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
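The default quoted from the YARN docs above, as arithmetic: the overhead is 10% of executor memory with a 384 MB floor, and this head-room is where the Python workers live.

```python
# Default spark.yarn.executor.memoryOverhead per the docs quoted above:
# executorMemory * 0.10, with a minimum of 384 MB.

def default_memory_overhead_mb(executor_memory_mb: int) -> int:
    return max(int(executor_memory_mb * 0.10), 384)

assert default_memory_overhead_mb(1024) == 384   # 10% is 102 MB, so the floor wins
assert default_memory_overhead_mb(8192) == 819   # 10% of an 8 GB executor
```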
[jira] [Commented] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731380#comment-14731380 ] Apache Spark commented on SPARK-10454: -- User 'robbinspg' has created a pull request for this issue: https://github.com/apache/spark/pull/8605 > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Priority: Minor > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10454: Assignee: (was: Apache Spark) > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Priority: Minor > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10454: Assignee: Apache Spark > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Assignee: Apache Spark >Priority: Minor > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp updated SPARK-10456: Description: our java 7 installation is really old (from last september). update this to the latest java 7 jdk. please assign this to me. was:our java 7 installation is really old (from last september). update this to the latest java 7 jdk > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest java 7 jdk. > please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10452) Pyspark worker security issue
[ https://issues.apache.org/jira/browse/SPARK-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-10452. Resolution: Not A Problem If you need your workers to run as your user, you need to configure YARN to use Kerberos. > Pyspark worker security issue > - > > Key: SPARK-10452 > URL: https://issues.apache.org/jira/browse/SPARK-10452 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0 > Environment: Spark 1.4.0 running on hadoop 2.5.2. >Reporter: Michael Procopio >Priority: Critical > > The python worker launched by the executor is given the credentials used to > launch yarn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731377#comment-14731377 ] Pete Robbins commented on SPARK-10454: -- This is another case of not waiting for events to drain from the listenerBus. > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Priority: Minor > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10455) install java 8 on amplab jenkins workers
shane knapp created SPARK-10455: --- Summary: install java 8 on amplab jenkins workers Key: SPARK-10455 URL: https://issues.apache.org/jira/browse/SPARK-10455 Project: Spark Issue Type: Task Components: Build Reporter: shane knapp install java 8 on all jenkins workers. and just for clarification: we want the 64-bit version, yes? please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731436#comment-14731436 ] shane knapp commented on SPARK-10456: - looks like we'll be installing 7u79 (we're at 7u51 currently). > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest java 7 jdk. > please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731462#comment-14731462 ] Joseph K. Bradley commented on SPARK-9963: -- Sorry for the slow response! (I've been traveling.) Option 2 sounds best. It can resemble the current predictImpl, but can use the version of shouldGoLeft taking binned feature values. > ML RandomForest cleanup: replace predictNodeIndex with predictImpl > -- > > Key: SPARK-9963 > URL: https://issues.apache.org/jira/browse/SPARK-9963 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. > This should be straightforward, but please ping me if anything is unclear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731415#comment-14731415 ] Imran Rashid commented on SPARK-4105: - [~mvherweg] Do you know if the error occurred after there was already a stage retry? If so, then this might just be a symptom of SPARK-8029. You would know if earlier in the logs, you see a FetchFailedException which is *not* related to snappy exceptions. I think that is the first report of this bug since SPARK-7660, which we were really hoping fixed this issue, so it would be great to capture more information about it. [~mmitsuto] Can you do the same check, and also tell us which version of Spark you are using? > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Attachments: JavaObjectToSerialize.java, > SparkFailedToUncompressGenerator.scala > > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. 
Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at >
[jira] [Created] (SPARK-10456) upgrade java 7 on amplab jenkins workers
shane knapp created SPARK-10456: --- Summary: upgrade java 7 on amplab jenkins workers Key: SPARK-10456 URL: https://issues.apache.org/jira/browse/SPARK-10456 Project: Spark Issue Type: Task Components: Build Reporter: shane knapp our java 7 installation is really old (from last september). update this to the latest java 7 jdk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-10456: --- Assignee: shane knapp > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest java 7 jdk. > please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-10455: --- Assignee: shane knapp > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731440#comment-14731440 ] Josh Rosen commented on SPARK-10455: Yep, I think we want the 64-bit version. > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
Pete Robbins created SPARK-10454: Summary: Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage Key: SPARK-10454 URL: https://issues.apache.org/jira/browse/SPARK-10454 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 1.5.1 Reporter: Pete Robbins Priority: Minor The test case fails intermittently in Jenkins. For example, see the following builds: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values
[ https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731392#comment-14731392 ] Apache Spark commented on SPARK-10439: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/8606 > Catalyst should check for overflow / underflow of date and timestamp values > --- > > Key: SPARK-10439 > URL: https://issues.apache.org/jira/browse/SPARK-10439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Marcelo Vanzin >Priority: Minor > > While testing some code, I noticed that a few methods in {{DateTimeUtils}} > are prone to overflow and underflow. > For example, {{millisToDays}} can overflow the return type ({{Int}}) if a > large enough input value is provided. > Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which > can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the > negative case). > There might be others but these were the ones that caught my eye. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values
[ https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10439: Assignee: (was: Apache Spark) > Catalyst should check for overflow / underflow of date and timestamp values > --- > > Key: SPARK-10439 > URL: https://issues.apache.org/jira/browse/SPARK-10439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Marcelo Vanzin >Priority: Minor > > While testing some code, I noticed that a few methods in {{DateTimeUtils}} > are prone to overflow and underflow. > For example, {{millisToDays}} can overflow the return type ({{Int}}) if a > large enough input value is provided. > Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which > can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the > negative case). > There might be others but these were the ones that caught my eye. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values
[ https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10439: Assignee: Apache Spark > Catalyst should check for overflow / underflow of date and timestamp values > --- > > Key: SPARK-10439 > URL: https://issues.apache.org/jira/browse/SPARK-10439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > While testing some code, I noticed that a few methods in {{DateTimeUtils}} > are prone to overflow and underflow. > For example, {{millisToDays}} can overflow the return type ({{Int}}) if a > large enough input value is provided. > Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which > can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the > negative case). > There might be others but these were the ones that caught my eye. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
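The overflow and underflow modes described in SPARK-10439 are easy to reproduce outside of Spark. A minimal Python sketch that simulates JVM two's-complement wraparound (illustrative only; this is not Spark's actual DateTimeUtils code):

```python
MILLIS_PER_DAY = 24 * 60 * 60 * 1000
LONG_MAX = (1 << 63) - 1

def to_signed(x, bits):
    """Interpret x as a two's-complement integer of the given bit width."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >= (1 << (bits - 1)) else x

# millisToDays-style overflow: the day count for a very large millisecond
# timestamp does not fit in a 32-bit Int, so a narrowing cast wraps.
days = LONG_MAX // MILLIS_PER_DAY
print(days > (1 << 31) - 1)      # the true day count exceeds Int.MaxValue
print(to_signed(days, 32))       # what a JVM (int) cast would silently produce

# fromJavaTimestamp-style overflow: converting milliseconds to microseconds
# wraps once millis > Long.MAX_VALUE / 1000 (and underflows symmetrically
# for large negative inputs).
millis = LONG_MAX // 1000 + 1
print(to_signed(millis * 1000, 64))  # wraps to a negative microsecond value
```

The fix proposed in the pull request is to detect these ranges and fail (or saturate) rather than wrap silently.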
[jira] [Commented] (SPARK-10433) Gradient boosted trees
[ https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731398#comment-14731398 ] Joseph K. Bradley commented on SPARK-10433: --- Has this been reported on 1.5? I've seen reports for 1.4, but was told by [~dbtsai] that 1.5 seems to have fixed this issue. I believe that the caching (and optional checkpointing) added in 1.5 fix this issue, but it would be great to get confirmation. > Gradient boosted trees > -- > > Key: SPARK-10433 > URL: https://issues.apache.org/jira/browse/SPARK-10433 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.1, 1.5.0 >Reporter: Sean Owen > > (Sorry to say I don't have any leads on a fix, but this was reported by three > different people and I confirmed it at fairly close range, so think it's > legitimate:) > This is probably best explained in the words from the mailing list thread at > http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E > . Matt Forbes says: > {quote} > I am training a boosted trees model on a couple million input samples (with > around 300 features) and am noticing that the input size of each stage is > increasing each iteration. For each new tree, the first step seems to be > building the decision tree metadata, which does a .count() on the input data, > so this is the step I've been using to track the input size changing. Here is > what I'm seeing: > {quote} > {code} > count at DecisionTreeMetadata.scala:111 > 1. Input Size / Records: 726.1 MB / 1295620 > 2. Input Size / Records: 106.9 GB / 64780816 > 3. Input Size / Records: 160.3 GB / 97171224 > 4. Input Size / Records: 214.8 GB / 129680959 > 5. Input Size / Records: 268.5 GB / 162533424 > > Input Size / Records: 1912.6 GB / 1382017686 > > {code} > {quote} > This step goes from taking less than 10s up to 5 minutes by the 15th or so > iteration. I'm not quite sure what could be causing this. 
I am passing a > memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train > {quote} > Johannes Bauer showed me a very similar problem. > Peter Rudenko offers this sketch of a reproduction: > {code} > val boostingStrategy = BoostingStrategy.defaultParams("Classification") > boostingStrategy.setNumIterations(30) > boostingStrategy.setLearningRate(1.0) > boostingStrategy.treeStrategy.setMaxDepth(3) > boostingStrategy.treeStrategy.setMaxBins(128) > boostingStrategy.treeStrategy.setSubsamplingRate(1.0) > boostingStrategy.treeStrategy.setMinInstancesPerNode(1) > boostingStrategy.treeStrategy.setUseNodeIdCache(true) > boostingStrategy.treeStrategy.setCategoricalFeaturesInfo( > > mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, > java.lang.Integer]]) > val model = GradientBoostedTrees.train(instances, boostingStrategy) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731428#comment-14731428 ] shane knapp commented on SPARK-10455: - looks like i'll be installing java 8u60. > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731468#comment-14731468 ] Joseph K. Bradley commented on SPARK-9963: -- Yep, that first case in the if-else is for the right-most bin with range [maxSplitValue, +inf] > ML RandomForest cleanup: replace predictNodeIndex with predictImpl > -- > > Key: SPARK-9963 > URL: https://issues.apache.org/jira/browse/SPARK-9963 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. > This should be straightforward, but please ping me if anything is unclear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10414) DenseMatrix gives different hashcode even though equals returns true
[ https://issues.apache.org/jira/browse/SPARK-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731757#comment-14731757 ] Vinod KC commented on SPARK-10414: -- Thanks Got the JIRA id https://issues.apache.org/jira/browse/SPARK-9919 > DenseMatrix gives different hashcode even though equals returns true > > > Key: SPARK-10414 > URL: https://issues.apache.org/jira/browse/SPARK-10414 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Vinod KC >Priority: Minor > > hashcode implementation in DenseMatrix gives different result for same input > val dm = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > val dm1 = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > assert(dm1 === dm) // passed > assert(dm1.hashCode === dm.hashCode) // Failed > This violates the hashCode/equals contract. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
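The fix (tracked under SPARK-9919, linked above) is to derive the hash code from exactly the fields that equality compares. An illustrative Python sketch of the contract — a toy stand-in, not Spark's DenseMatrix implementation:

```python
class DenseMatrix:
    """Toy stand-in for a dense matrix with content-based equality."""

    def __init__(self, num_rows, num_cols, values):
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.values = tuple(values)  # immutable, so it is safely hashable

    def __eq__(self, other):
        return (isinstance(other, DenseMatrix)
                and self.num_rows == other.num_rows
                and self.num_cols == other.num_cols
                and self.values == other.values)

    # Contract: objects that compare equal MUST hash equal, so hash
    # precisely the fields that __eq__ compares -- no more, no less.
    def __hash__(self):
        return hash((self.num_rows, self.num_cols, self.values))

dm = DenseMatrix(2, 2, [0.0, 1.0, 2.0, 3.0])
dm1 = DenseMatrix(2, 2, [0.0, 1.0, 2.0, 3.0])
assert dm1 == dm and hash(dm1) == hash(dm)  # both assertions now hold
```

The original bug arises whenever equality is content-based but the hash is identity-based (or mixes in a field that equality ignores), which breaks hash-based collections.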
[jira] [Commented] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731810#comment-14731810 ] George Dittmar commented on SPARK-9961: --- Can you expand on what you mean by Evaluator? Just looking for something to eval how good predictions are? > ML prediction abstractions should have defaultEvaluator fields > -- > > Key: SPARK-9961 > URL: https://issues.apache.org/jira/browse/SPARK-9961 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > Predictor and PredictionModel should have abstract defaultEvaluator methods > which return Evaluators. Subclasses like Regressor, Classifier, etc. should > all provide natural evaluators, set to use the correct input columns and > metrics. Concrete classes may later be modified to > The initial implementation should be marked as DeveloperApi since we may need > to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731810#comment-14731810 ] George Dittmar edited comment on SPARK-9961 at 9/5/15 5:23 AM: --- Can you expand on what you mean by Evaluator? Just looking for something to eval how good predictions are? was (Author: georgedittmar): Can you expand on what you mean by Evaluator? Just looking for something to eval how good predictions are? > ML prediction abstractions should have defaultEvaluator fields > -- > > Key: SPARK-9961 > URL: https://issues.apache.org/jira/browse/SPARK-9961 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > Predictor and PredictionModel should have abstract defaultEvaluator methods > which return Evaluators. Subclasses like Regressor, Classifier, etc. should > all provide natural evaluators, set to use the correct input columns and > metrics. Concrete classes may later be modified to > The initial implementation should be marked as DeveloperApi since we may need > to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10459) PythonUDF could process UnsafeRow
Davies Liu created SPARK-10459: -- Summary: PythonUDF could process UnsafeRow Key: SPARK-10459 URL: https://issues.apache.org/jira/browse/SPARK-10459 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Currently, a ConvertToSafe is inserted for PythonUDF; that's not actually needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-8632: - Assignee: Davies Liu > Poor Python UDF performance because of RDD caching > -- > > Key: SPARK-8632 > URL: https://issues.apache.org/jira/browse/SPARK-8632 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0 >Reporter: Justin Uang >Assignee: Davies Liu > > {quote} > We have been running into performance problems using Python UDFs with > DataFrames at large scale. > From the implementation of BatchPythonEvaluation, it looks like the goal was > to reuse the PythonRDD code. It caches the entire child RDD so that it can do > two passes over the data. One to give to the PythonRDD, then one to join the > python lambda results with the original row (which may have java objects that > should be passed through). > In addition, it caches all the columns, even the ones that don't need to be > processed by the Python UDF. In the cases I was working with, I had a 500 > column table, and i wanted to use a python UDF for one column, and it ended > up caching all 500 columns. > {quote} > http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8951) support CJK characters in collect()
[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731140#comment-14731140 ] Jihong MA commented on SPARK-8951: -- This commit cause R style check failure. Running R style checks Loading required package: methods Attaching package: 'SparkR' The following objects are masked from 'package:stats': filter, na.omit The following objects are masked from 'package:base': intersect, rbind, sample, subset, summary, table, transform Attaching package: 'testthat' The following object is masked from 'package:SparkR': describe R/deserialize.R:63:9: style: Trailing whitespace is superfluous. string ^ lintr checks failed. [error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; received return code 1 Archiving unit tests logs... > No log files found. Attempting to post to Github... > Post successful. Build step 'Execute shell' marked build as failure Archiving artifacts Recording test results ERROR: Publisher 'Publish JUnit test result report' failed: No test report files were found. Configuration error? Test FAILed. Refer to this link for build results (access rights to CI server needed): > support CJK characters in collect() > --- > > Key: SPARK-8951 > URL: https://issues.apache.org/jira/browse/SPARK-8951 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Jaehong Choi >Assignee: Jaehong Choi >Priority: Minor > Fix For: 1.6.0 > > Attachments: SerDe.scala.diff > > > Spark gives an error message and does not show the output when a field of the > result DataFrame contains characters in CJK. > I found out that SerDe in R API only supports ASCII format for strings right > now as commented in source code. > So, I fixed SerDe.scala a little to support CJK as the file attached. > I did not care efficiency, but just wanted to see if it works. 
> {noformat} > people.json > {"name":"가나"} > {"name":"테스트123", "age":30} > {"name":"Justin", "age":19} > df <- read.df(sqlContext, "./people.json", "json") > head(df) > Error in rawtochar(string) : embedded nul in string : '\0 \x98' > {noformat} > {code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala} > // NOTE: Only works for ASCII right now > def writeString(out: DataOutputStream, value: String): Unit = { > val len = value.length > out.writeInt(len + 1) // For the \0 > out.writeBytes(value) > out.writeByte(0) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
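The root cause quoted above is that {{writeString}} sends {{value.length}} (the character count) as the byte count and {{writeBytes}} drops the high byte of each char, which corrupts any non-ASCII string — "가나" is 2 characters but 6 UTF-8 bytes. A minimal Python sketch of length-prefixed framing that survives CJK input (illustrative of the approach only; it is not the attached SerDe.scala patch):

```python
import io
import struct

def write_string(out, value):
    # Encode first, then frame by *byte* length (+1 for the trailing NUL),
    # instead of assuming one byte per character.
    data = value.encode("utf-8")
    out.write(struct.pack(">i", len(data) + 1))  # big-endian Int, like DataOutputStream
    out.write(data)
    out.write(b"\x00")

def read_string(inp):
    (n,) = struct.unpack(">i", inp.read(4))
    data = inp.read(n - 1)
    inp.read(1)  # consume the NUL terminator
    return data.decode("utf-8")

buf = io.BytesIO()
write_string(buf, "가나")          # 2 characters, 6 UTF-8 bytes on the wire
buf.seek(0)
assert read_string(buf) == "가나"  # round-trips instead of raising
```

With the buggy char-count framing, the reader would pull too few bytes and leave stray bytes (including embedded NULs) in the stream, which is exactly the {{embedded nul in string}} failure R reports.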
[jira] [Commented] (SPARK-9925) Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests
[ https://issues.apache.org/jira/browse/SPARK-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731170#comment-14731170 ] Apache Spark commented on SPARK-9925: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/8602 > Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests > -- > > Key: SPARK-9925 > URL: https://issues.apache.org/jira/browse/SPARK-9925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > Right now, in our TestSQLContext/TestHiveContext, we use {{override def > numShufflePartitions: Int = this.getConf(SQLConf.SHUFFLE_PARTITIONS, 5)}} to > set {{SHUFFLE_PARTITIONS}}. However, we never put it to SQLConf. So, after we > use {{withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "number")}}, the number > of shuffle partitions will be set back to 200. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
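The failure mode described above is subtle: reading a conf key with a fallback default does not persist that default, so a {{withSQLConf}}-style save-and-restore removes the key afterwards and later reads fall back to the hard-coded default (200) rather than the intended test default (5). A small Python sketch of the pattern, with hypothetical names — this is not Spark's SQLConf code:

```python
class SQLConf:
    def __init__(self):
        self._settings = {}

    def get(self, key, default):
        # Reading with a fallback does NOT store the fallback.
        return self._settings.get(key, default)

    def set(self, key, value):
        self._settings[key] = value

    def unset(self, key):
        self._settings.pop(key, None)

def with_conf(conf, key, value, body):
    """Set key for the duration of body, then restore the previous state."""
    had_key = key in conf._settings
    previous = conf._settings.get(key)
    conf.set(key, value)
    try:
        body()
    finally:
        if had_key:
            conf.set(key, previous)
        else:
            conf.unset(key)  # key was never truly set, so it is removed

conf = SQLConf()
# The test harness "sets" 5 only as a read-time fallback, never via set():
assert conf.get("spark.sql.shuffle.partitions", 5) == 5

with_conf(conf, "spark.sql.shuffle.partitions", 10, lambda: None)

# After restore the key is gone, so code that reads it with the production
# default sees 200 -- not the test value 5 the harness intended.
assert conf.get("spark.sql.shuffle.partitions", 200) == 200
```

The fix in the JIRA is to actually put the test value into the conf up front, so the restore path has a real previous value to reinstate.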
[jira] [Comment Edited] (SPARK-10452) Pyspark worker security issue
[ https://issues.apache.org/jira/browse/SPARK-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731332#comment-14731332 ] Marcelo Vanzin edited comment on SPARK-10452 at 9/4/15 9:43 PM: If you need your workers to run as your user, you need to configure YARN to use Kerberos. was (Author: vanzin): If you need your workers to run as you user, you need to configure YARN to use Kerberos. > Pyspark worker security issue > - > > Key: SPARK-10452 > URL: https://issues.apache.org/jira/browse/SPARK-10452 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0 > Environment: Spark 1.4.0 running on hadoop 2.5.2. >Reporter: Michael Procopio >Priority: Critical > > The python worker launched by the executor is given the credentials used to > launch yarn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp resolved SPARK-10455. - Resolution: Done > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp closed SPARK-10455. --- FIN! > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10457) Unable to connect to MySQL with the DataFrame API
Mariano Simone created SPARK-10457: -- Summary: Unable to connect to MySQL with the DataFrame API Key: SPARK-10457 URL: https://issues.apache.org/jira/browse/SPARK-10457 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Environment: Linux singularity 3.13.0-63-generic #103-Ubuntu SMP Fri Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) "org.apache.spark" %% "spark-core"% "1.4.1" % "provided", "org.apache.spark" % "spark-sql_2.10"% "1.4.1" % "provided", "org.apache.spark" % "spark-streaming_2.10" % "1.4.1" % "provided", "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1", "mysql"% "mysql-connector-java" % "5.1.36" Reporter: Mariano Simone I'm getting this error everytime I try to create a dataframe using jdbc: java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test What I have so far: standart sbt project. Added the dep. on mysql-connector to build.sbt like this: "mysql"% "mysql-connector-java" % "5.1.36" The code that creates the df: val url = "jdbc:mysql://localhost:3306/test" val table = "test_table" val properties = new Properties properties.put("user", "123") properties.put("password", "123") properties.put("driver", "com.mysql.jdbc.Driver") val tiers = sqlContext.read.jdbc(url, table, properties) I also loaded the jar like this: streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar") This is the back trace of the exception being thrown: 15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job 144140266 ms.0 java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118) at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128) at 
org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) at com.playtika.etl.Application$.processRDD(Application.scala:69) at com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:52) at com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:51) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10457) Unable to connect to MySQL with the DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mariano Simone updated SPARK-10457: --- Description: I'm getting this error everytime I try to create a dataframe using jdbc: java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test What I have so far: standart sbt project. Added the dep. on mysql-connector to build.sbt like this: "mysql"% "mysql-connector-java" % "5.1.36" The code that creates the df: val url = "jdbc:mysql://localhost:3306/test" val table = "test_table" val properties = new Properties properties.put("user", "123") properties.put("password", "123") properties.put("driver", "com.mysql.jdbc.Driver") val tiers = sqlContext.read.jdbc(url, table, properties) I also loaded the jar like this: streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar") This is the back trace of the exception being thrown: 15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job 144140266 ms.0 java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118) at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) at com.playtika.etl.Application$.processRDD(Application.scala:69) at com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:52) at com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:51) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Let me know if more data is needed. was: I'm getting this error everytime I try to create a dataframe using jdbc: java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test What I have so far: standart sbt project. Added the dep. 
on mysql-connector to build.sbt like this: "mysql"% "mysql-connector-java" % "5.1.36" The code that creates the df: val url = "jdbc:mysql://localhost:3306/test" val table = "test_table" val properties = new Properties properties.put("user", "123") properties.put("password", "123") properties.put("driver", "com.mysql.jdbc.Driver") val tiers = sqlContext.read.jdbc(url, table, properties) I also loaded the jar like this: streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar") This is the back trace of the exception being thrown: 15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job 144140266 ms.0 java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118) at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128) at
[jira] [Closed] (SPARK-10457) Unable to connect to MySQL with the DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mariano Simone closed SPARK-10457. -- Resolution: Fixed Found the solution. spark.executor.extraClassPath needed configuration. > Unable to connect to MySQL with the DataFrame API > - > > Key: SPARK-10457 > URL: https://issues.apache.org/jira/browse/SPARK-10457 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Linux singularity 3.13.0-63-generic #103-Ubuntu SMP Fri > Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux > Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) > "org.apache.spark" %% "spark-core"% "1.4.1" % "provided", > "org.apache.spark" % "spark-sql_2.10"% "1.4.1" % "provided", > "org.apache.spark" % "spark-streaming_2.10" % "1.4.1" % "provided", > "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1", > "mysql"% "mysql-connector-java" % "5.1.36" >Reporter: Mariano Simone > > I'm getting this error everytime I try to create a dataframe using jdbc: > java.sql.SQLException: No suitable driver found for > jdbc:mysql://localhost:3306/test > What I have so far: > standart sbt project. > Added the dep. 
on mysql-connector to build.sbt like this: > "mysql"% "mysql-connector-java" % "5.1.36" > The code that creates the df: > val url = "jdbc:mysql://localhost:3306/test" > val table = "test_table" > val properties = new Properties > properties.put("user", "123") > properties.put("password", "123") > properties.put("driver", "com.mysql.jdbc.Driver") > val tiers = sqlContext.read.jdbc(url, table, properties) > I also loaded the jar like this: > streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar") > This is the back trace of the exception being thrown: > 15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job > 144140266 ms.0 > java.sql.SQLException: No suitable driver found for > jdbc:mysql://localhost:3306/test > at java.sql.DriverManager.getConnection(DriverManager.java:689) > at java.sql.DriverManager.getConnection(DriverManager.java:208) > at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118) > at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) > at com.playtika.etl.Application$.processRDD(Application.scala:69) > at > com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:52) > at > com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:51) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40) > 
at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Let me know if more data is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10311) In cluster mode, AppId and AttemptId should be updated when ApplicationMaster is new
[ https://issues.apache.org/jira/browse/SPARK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10311: -- Affects Version/s: 1.5.0 1.4.1 > In cluster mode, AppId and AttemptId should be updated when ApplicationMaster > is new > --- > > Key: SPARK-10311 > URL: https://issues.apache.org/jira/browse/SPARK-10311 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.0 >Reporter: meiyoula > > When I start a streaming app with checkpoint data in yarn-cluster mode, the > appId and attemptId are stale (they come from the app that first created the > checkpoint data), and the event log is written under the old file name.
[jira] [Updated] (SPARK-10311) In cluster mode, AppId and AttemptId should be updated when ApplicationMaster is new
[ https://issues.apache.org/jira/browse/SPARK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10311: -- Target Version/s: 1.6.0, 1.5.1 > In cluster mode, AppId and AttemptId should be updated when ApplicationMaster > is new > --- > > Key: SPARK-10311 > URL: https://issues.apache.org/jira/browse/SPARK-10311 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.0 >Reporter: meiyoula > > When I start a streaming app with checkpoint data in yarn-cluster mode, the > appId and attemptId are stale (they come from the app that first created the > checkpoint data), and the event log is written under the old file name.
[jira] [Commented] (SPARK-10433) Gradient boosted trees
[ https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731533#comment-14731533 ] DB Tsai commented on SPARK-10433: - [~sowen] I can confirm that this should be fixed in 1.5 > Gradient boosted trees > -- > > Key: SPARK-10433 > URL: https://issues.apache.org/jira/browse/SPARK-10433 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.1, 1.5.0 >Reporter: Sean Owen > > (Sorry to say I don't have any leads on a fix, but this was reported by three > different people and I confirmed it at fairly close range, so think it's > legitimate:) > This is probably best explained in the words from the mailing list thread at > http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E > . Matt Forbes says: > {quote} > I am training a boosted trees model on a couple million input samples (with > around 300 features) and am noticing that the input size of each stage is > increasing each iteration. For each new tree, the first step seems to be > building the decision tree metadata, which does a .count() on the input data, > so this is the step I've been using to track the input size changing. Here is > what I'm seeing: > {quote} > {code} > count at DecisionTreeMetadata.scala:111 > 1. Input Size / Records: 726.1 MB / 1295620 > 2. Input Size / Records: 106.9 GB / 64780816 > 3. Input Size / Records: 160.3 GB / 97171224 > 4. Input Size / Records: 214.8 GB / 129680959 > 5. Input Size / Records: 268.5 GB / 162533424 > > Input Size / Records: 1912.6 GB / 1382017686 > > {code} > {quote} > This step goes from taking less than 10s up to 5 minutes by the 15th or so > iteration. I'm not quite sure what could be causing this. I am passing a > memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train > {quote} > Johannes Bauer showed me a very similar problem. 
> Peter Rudenko offers this sketch of a reproduction: > {code} > val boostingStrategy = BoostingStrategy.defaultParams("Classification") > boostingStrategy.setNumIterations(30) > boostingStrategy.setLearningRate(1.0) > boostingStrategy.treeStrategy.setMaxDepth(3) > boostingStrategy.treeStrategy.setMaxBins(128) > boostingStrategy.treeStrategy.setSubsamplingRate(1.0) > boostingStrategy.treeStrategy.setMinInstancesPerNode(1) > boostingStrategy.treeStrategy.setUseNodeIdCache(true) > boostingStrategy.treeStrategy.setCategoricalFeaturesInfo( > > mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, > java.lang.Integer]]) > val model = GradientBoostedTrees.train(instances, boostingStrategy) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10420) Implementing Reactive Streams based Spark Streaming Receiver
[ https://issues.apache.org/jira/browse/SPARK-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10420: -- Target Version/s: 1.6.0 (was: ) > Implementing Reactive Streams based Spark Streaming Receiver > > > Key: SPARK-10420 > URL: https://issues.apache.org/jira/browse/SPARK-10420 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Nilanjan Raychaudhuri >Priority: Minor > > Hello TD, > This is probably the last bit of the back-pressure story, implementing > ReactiveStreams based Spark streaming receivers. After discussing about this > with my Typesafe team we came up with the following design document > https://docs.google.com/document/d/1lGQKXfNznd5SPuQigvCdLsudl-gcvWKuHWr0Bpn3y30/edit?usp=sharing > Could you please take a look at this when you get a chance? > Thanks > Nilanjan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731592#comment-14731592 ] Justin Uang commented on SPARK-10447: - Sure, I wouldn't mind doing the code review. Can you add me? > Upgrade pyspark to use py4j 0.9 > --- > > Key: SPARK-10447 > URL: https://issues.apache.org/jira/browse/SPARK-10447 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.1 >Reporter: Justin Uang > > This was recently released, and it has many improvements, especially the > following: > {quote} > Python side: IDEs and interactive interpreters such as IPython can now get > help text/autocompletion for Java classes, objects, and members. This makes > Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). > Thanks to @jonahkichwacoders > {quote} > Normally we wrap all the APIs in spark, but for the ones that aren't, this > would make it easier to offroad by using the java proxy objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731591#comment-14731591 ] holdenk commented on SPARK-10447: - I can give this a shot if no one else is interested in doing this (I've been wrangling some py4j bits with Sparkling Pandas). > Upgrade pyspark to use py4j 0.9 > --- > > Key: SPARK-10447 > URL: https://issues.apache.org/jira/browse/SPARK-10447 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.1 >Reporter: Justin Uang > > This was recently released, and it has many improvements, especially the > following: > {quote} > Python side: IDEs and interactive interpreters such as IPython can now get > help text/autocompletion for Java classes, objects, and members. This makes > Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). > Thanks to @jonahkichwacoders > {quote} > Normally we wrap all the APIs in spark, but for the ones that aren't, this > would make it easier to offroad by using the java proxy objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731597#comment-14731597 ] holdenk commented on SPARK-10447: - Sure, I'll ping you when I've got the PR ready (probably sometime this long weekend) if that's good for you? > Upgrade pyspark to use py4j 0.9 > --- > > Key: SPARK-10447 > URL: https://issues.apache.org/jira/browse/SPARK-10447 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.1 >Reporter: Justin Uang > > This was recently released, and it has many improvements, especially the > following: > {quote} > Python side: IDEs and interactive interpreters such as IPython can now get > help text/autocompletion for Java classes, objects, and members. This makes > Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). > Thanks to @jonahkichwacoders > {quote} > Normally we wrap all the APIs in spark, but for the ones that aren't, this > would make it easier to offroad by using the java proxy objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731598#comment-14731598 ] Justin Uang commented on SPARK-10447: - Sounds good > Upgrade pyspark to use py4j 0.9 > --- > > Key: SPARK-10447 > URL: https://issues.apache.org/jira/browse/SPARK-10447 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.1 >Reporter: Justin Uang > > This was recently released, and it has many improvements, especially the > following: > {quote} > Python side: IDEs and interactive interpreters such as IPython can now get > help text/autocompletion for Java classes, objects, and members. This makes > Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). > Thanks to @jonahkichwacoders > {quote} > Normally we wrap all the APIs in spark, but for the ones that aren't, this > would make it easier to offroad by using the java proxy objects.
[jira] [Commented] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"
[ https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731618#comment-14731618 ] Apache Spark commented on SPARK-10397: -- User 'alexrovner' has created a pull request for this issue: https://github.com/apache/spark/pull/8608 > Make Python's SparkContext self-descriptive on "print sc" > - > > Key: SPARK-10397 > URL: https://issues.apache.org/jira/browse/SPARK-10397 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.0 >Reporter: Sergey Tryuber >Priority: Trivial > > When I execute in Python shell: > {code} > print sc > {code} > I receive something like: > {noformat} > > {noformat} > But this is very inconvenient, especially if a user wants to create a > good-looking and self-descriptive IPython Notebook. He would like to see some > information about his Spark cluster. > In contrast, H2O context does have this feature and it is very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"
[ https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10397: Assignee: (was: Apache Spark) > Make Python's SparkContext self-descriptive on "print sc" > - > > Key: SPARK-10397 > URL: https://issues.apache.org/jira/browse/SPARK-10397 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.0 >Reporter: Sergey Tryuber >Priority: Trivial > > When I execute in Python shell: > {code} > print sc > {code} > I receive something like: > {noformat} > > {noformat} > But this is very inconvenient, especially if a user wants to create a > good-looking and self-descriptive IPython Notebook. He would like to see some > information about his Spark cluster. > In contrast, H2O context does have this feature and it is very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"
[ https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10397: Assignee: Apache Spark > Make Python's SparkContext self-descriptive on "print sc" > - > > Key: SPARK-10397 > URL: https://issues.apache.org/jira/browse/SPARK-10397 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.0 >Reporter: Sergey Tryuber >Assignee: Apache Spark >Priority: Trivial > > When I execute in Python shell: > {code} > print sc > {code} > I receive something like: > {noformat} > > {noformat} > But this is very inconvenient, especially if a user wants to create a > good-looking and self-descriptive IPython Notebook. He would like to see some > information about his Spark cluster. > In contrast, H2O context does have this feature and it is very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"
[ https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731619#comment-14731619 ] Alex Rovner commented on SPARK-10397: - Pull: https://github.com/apache/spark/pull/8608 {noformat} >>> sc {'_accumulatorServer': , '_batchSize': 0, '_callsite': CallSite(function='', file='/Users/alex.rovner/git/spark/python/pyspark/shell.py', linenum=43), '_conf': {'_jconf': JavaObject id=o0}, '_javaAccumulator': JavaObject id=o11, '_jsc': JavaObject id=o8, '_pickled_broadcast_vars': set([]), '_python_includes': [], '_temp_dir': u'/private/var/folders/hj/v4zb0_f159q8mt4w3j8m2_mrgp/T/spark-a9cc47a9-db90-49a3-a82e-263f0b56268c/pyspark-773c7490-2b2d-4418-a030-256a5b9c1fe1', '_unbatched_serializer': PickleSerializer(), 'appName': u'PySparkShell', 'environment': {}, 'master': u'local[*]', 'profiler_collector': None, 'pythonExec': 'python2.7', 'pythonVer': '2.7', 'serializer': AutoBatchedSerializer(PickleSerializer()), 'sparkHome': None} >>> print sc {'_accumulatorServer': , '_batchSize': 0, '_callsite': CallSite(function='', file='/Users/alex.rovner/git/spark/python/pyspark/shell.py', linenum=43), '_conf': {'_jconf': JavaObject id=o0}, '_javaAccumulator': JavaObject id=o11, '_jsc': JavaObject id=o8, '_pickled_broadcast_vars': set([]), '_python_includes': [], '_temp_dir': u'/private/var/folders/hj/v4zb0_f159q8mt4w3j8m2_mrgp/T/spark-a9cc47a9-db90-49a3-a82e-263f0b56268c/pyspark-773c7490-2b2d-4418-a030-256a5b9c1fe1', '_unbatched_serializer': PickleSerializer(), 'appName': u'PySparkShell', 'environment': {}, 'master': u'local[*]', 'profiler_collector': None, 'pythonExec': 'python2.7', 'pythonVer': '2.7', 'serializer': AutoBatchedSerializer(PickleSerializer()), 'sparkHome': None} >>> {noformat} > Make Python's SparkContext self-descriptive on "print sc" > - > > Key: SPARK-10397 > URL: https://issues.apache.org/jira/browse/SPARK-10397 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.0 
>Reporter: Sergey Tryuber >Priority: Trivial > > When I execute in Python shell: > {code} > print sc > {code} > I receive something like: > {noformat} > > {noformat} > But this is very inconvenient, especially if a user wants to create a > good-looking and self-descriptive IPython Notebook. He would like to see some > information about his Spark cluster. > In contrast, H2O context does have this feature and it is very helpful.
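The improvement requested above can be sketched in plain Python: give the context class a `__repr__` that surfaces useful fields, so `print sc` shows cluster information instead of the default `<... object at 0x...>` form. The `DemoContext` class below is a hypothetical stand-in, not PySpark's actual implementation:

```python
class DemoContext:
    """Hypothetical stand-in for a SparkContext-like object."""

    def __init__(self, master, app_name, version):
        self.master = master
        self.app_name = app_name
        self.version = version

    def __repr__(self):
        # Used by the interactive prompt and by `print`, replacing the
        # default "<DemoContext object at 0x...>" form.
        return ("DemoContext(master=%r, appName=%r, version=%r)"
                % (self.master, self.app_name, self.version))

ctx = DemoContext("local[*]", "PySparkShell", "1.5.0")
print(ctx)
# -> DemoContext(master='local[*]', appName='PySparkShell', version='1.5.0')
```

PR 8608 takes a similar route for the real `pyspark.SparkContext`; the sketch only shows the mechanism.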
[jira] [Created] (SPARK-10458) Would like to know if a given Spark Context is stopped or currently stopping
Matt Cheah created SPARK-10458: -- Summary: Would like to know if a given Spark Context is stopped or currently stopping Key: SPARK-10458 URL: https://issues.apache.org/jira/browse/SPARK-10458 Project: Spark Issue Type: Improvement Reporter: Matt Cheah Priority: Minor I ran into a case where a thread stopped a Spark Context, specifically when I hit the "kill" link from the Spark standalone UI. There was no real way for another thread to know that the context had stopped and thus should have handled that accordingly. Checking that the SparkEnv is null is one way, but that doesn't handle the case where the context is in the midst of stopping, and stopping the context may actually not be instantaneous - in my case for some reason the DAGScheduler was taking a non-trivial amount of time to stop. Implementation-wise, I'm more or less requesting that the boolean value returned from SparkContext.stopped.get() be visible in some way. As long as we return the value and not the AtomicBoolean itself (we wouldn't want anyone to be setting this, after all!) it would help client applications check the context's liveness.
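The request above boils down to exposing the stop flag's value without exposing the mutable flag itself. A minimal Python sketch of that pattern (names are hypothetical; the real field is the `SparkContext.stopped` AtomicBoolean mentioned in the description, and a full solution might also distinguish "stopping" from "stopped"):

```python
import threading

class StoppableContext:
    """Hypothetical sketch: expose the stop flag's value, not the flag itself."""

    def __init__(self):
        # Plays the role of the private SparkContext.stopped AtomicBoolean.
        self._stopped = threading.Event()

    def stop(self):
        self._stopped.set()

    @property
    def is_stopped(self):
        # Only the boolean value escapes; callers cannot flip the flag.
        return self._stopped.is_set()

ctx = StoppableContext()
assert not ctx.is_stopped
ctx.stop()
assert ctx.is_stopped
```

Returning the value through a read-only property keeps the invariant the reporter asks for: clients can poll liveness, but only the context itself can set the flag.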
[jira] [Resolved] (SPARK-10402) Add scaladoc for default values of params in ML
[ https://issues.apache.org/jira/browse/SPARK-10402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-10402. --- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Issue resolved by pull request 8591 [https://github.com/apache/spark/pull/8591] > Add scaladoc for default values of params in ML > --- > > Key: SPARK-10402 > URL: https://issues.apache.org/jira/browse/SPARK-10402 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: holdenk >Assignee: holdenk >Priority: Minor > Fix For: 1.6.0, 1.5.1 > > > We should make sure the scaladoc for params includes their default values > through the models in ml/
[jira] [Resolved] (SPARK-9925) Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests
[ https://issues.apache.org/jira/browse/SPARK-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-9925. -- Resolution: Fixed Fix Version/s: 1.6.0 > Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests > -- > > Key: SPARK-9925 > URL: https://issues.apache.org/jira/browse/SPARK-9925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.6.0 > > > Right now, in our TestSQLContext/TestHiveContext, we use {{override def > numShufflePartitions: Int = this.getConf(SQLConf.SHUFFLE_PARTITIONS, 5)}} to > set {{SHUFFLE_PARTITIONS}}. However, we never put it to SQLConf. So, after we > use {{withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "number")}}, the number > of shuffle partitions will be set back to 200.
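The bug described above hinges on a set-and-restore pattern: a `withSQLConf`-style helper saves the current value, applies the override, and restores the saved value afterwards, so if the override was never written into the conf in the first place, "restoring" falls back to the 200 default. A rough Python sketch of the pattern (the `conf` dict and `with_conf` helper are hypothetical, not Spark's SQLConf API):

```python
from contextlib import contextmanager

# Hypothetical config store standing in for SQLConf.
conf = {"spark.sql.shuffle.partitions": "200"}

@contextmanager
def with_conf(settings):
    """Temporarily apply settings, restoring the previous values on exit."""
    saved = {key: conf.get(key) for key in settings}
    conf.update(settings)
    try:
        yield
    finally:
        for key, value in saved.items():
            if value is None:
                conf.pop(key, None)  # key did not exist before the override
            else:
                conf[key] = value

with with_conf({"spark.sql.shuffle.partitions": "5"}):
    assert conf["spark.sql.shuffle.partitions"] == "5"
# After the block, the pre-override value is back.
assert conf["spark.sql.shuffle.partitions"] == "200"
```

If the test harness only overrides a getter and never writes the value into `conf`, the helper's restore step reinstates whatever `conf` held, which is the failure mode the ticket fixes.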
[jira] [Commented] (SPARK-10414) DenseMatrix gives different hashcode even though equals returns true
[ https://issues.apache.org/jira/browse/SPARK-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731745#comment-14731745 ] Vinod KC commented on SPARK-10414: -- [~josephkb] Could you please share that existing JIRA id with me so I can review the PR? Thanks > DenseMatrix gives different hashcode even though equals returns true > > > Key: SPARK-10414 > URL: https://issues.apache.org/jira/browse/SPARK-10414 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Vinod KC >Priority: Minor > > The hashCode implementation in DenseMatrix gives different results for the same input: > val dm = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > val dm1 = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > assert(dm1 === dm) // passed > assert(dm1.hashCode === dm.hashCode) // Failed > This violates the hashCode/equals contract.
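The contract referenced above says that whenever `x == y`, the two hash codes must also be equal; in practice that means deriving the hash from exactly the fields that equality compares. A small Python illustration (`DenseMat` is a toy class, not Spark's `DenseMatrix`):

```python
class DenseMat:
    """Toy matrix illustrating the equals/hashCode contract."""

    def __init__(self, rows, cols, values):
        self.rows, self.cols = rows, cols
        self.values = tuple(values)  # tuples are hashable, lists are not

    def __eq__(self, other):
        return (isinstance(other, DenseMat)
                and (self.rows, self.cols, self.values)
                    == (other.rows, other.cols, other.values))

    def __hash__(self):
        # Hash the same fields that __eq__ compares, so any two matrices
        # that compare equal are guaranteed to hash equal.
        return hash((self.rows, self.cols, self.values))

a = DenseMat(2, 2, [0.0, 1.0, 2.0, 3.0])
b = DenseMat(2, 2, [0.0, 1.0, 2.0, 3.0])
assert a == b and hash(a) == hash(b)
```

The bug in the report is the mirror image: two matrices compare equal but hash from state that differs between the instances, breaking hash-based collections.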
[jira] [Commented] (SPARK-7257) Find nearest neighbor satisfying predicate
[ https://issues.apache.org/jira/browse/SPARK-7257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731744#comment-14731744 ] Luvsandondov Lkhamsuren commented on SPARK-7257: This sounds very interesting! If I understood correctly, given multiple vertices satisfying the predicate (call the set P, a subset of V), we want to find the vertices in P that are closest. Is it guaranteed that |P| << |V|? What is the use case you had in mind [~josephkb]? > Find nearest neighbor satisfying predicate > -- > > Key: SPARK-7257 > URL: https://issues.apache.org/jira/browse/SPARK-7257 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Joseph K. Bradley >Priority: Minor > > It would be useful to be able to find nearest neighbors satisfying > predicates. E.g.: > * Given one or more starting vertices, plus a predicate. > * Find the closest vertex or vertices satisfying the predicate. > This is different from ShortestPaths in that ShortestPaths searches for a > fixed (small) set of vertices, rather than all vertices satisfying a > predicate (which could be a large set). > It could be implemented using BFS from the initial vertex/vertices, though > faster implementations might also search from vertices satisfying the > predicate.
[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save
[ https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731753#comment-14731753 ] Vinod KC commented on SPARK-10199: -- [~mengxr] Thanks for the suggestion. Shall I close the PR? > Avoid using reflections for parquet model save > -- > > Key: SPARK-10199 > URL: https://issues.apache.org/jira/browse/SPARK-10199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Feynman Liang >Priority: Minor > > These items are not high priority since the overhead writing to Parquest is > much greater than for runtime reflections. > Multiple model save/load in MLlib use case classes to infer a schema for the > data frame saved to Parquet. However, inferring a schema from case classes or > tuples uses [runtime > reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361] > which is unnecessary since the types are already known at the time `save` is > called. > It would be better to just specify the schema for the data frame directly > using {{sqlContext.createDataFrame(dataRDD, schema)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
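The suggestion above is to pass a schema that is already known statically instead of re-deriving it by reflection on every save. A toy Python sketch of the idea (plain dicts standing in for rows and schemas; not Spark's actual API):

```python
# Hypothetical sketch: contrast inferring a schema at runtime with
# supplying one that is already known when save() is called.
def infer_schema(record):
    # Runtime inspection of a sample record -- the reflective approach.
    return {name: type(value).__name__ for name, value in record.items()}

# The model writer already knows its field types, so it can state them
# up front (analogous to sqlContext.createDataFrame(dataRDD, schema)).
EXPLICIT_SCHEMA = {"clusterId": "int", "weight": "float"}

row = {"clusterId": 1, "weight": 0.5}
# Both routes agree on the result, but the explicit schema skips the
# per-save reflection work entirely.
assert infer_schema(row) == EXPLICIT_SCHEMA
```

As the ticket notes, this is low priority because the Parquet write itself dominates the reflection cost; the sketch only shows why the reflection step is redundant.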
[jira] [Updated] (SPARK-10402) Add scaladoc for default values of params in ML
[ https://issues.apache.org/jira/browse/SPARK-10402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10402: -- Shepherd: Joseph K. Bradley Assignee: holdenk Target Version/s: 1.6.0, 1.5.1 > Add scaladoc for default values of params in ML > --- > > Key: SPARK-10402 > URL: https://issues.apache.org/jira/browse/SPARK-10402 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: holdenk >Assignee: holdenk >Priority: Minor > > We should make sure the scaladoc for params includes their default values > through the models in ml/
[jira] [Commented] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731480#comment-14731480 ] shane knapp commented on SPARK-10456: - ok, 79 is installed but i will wait until downtime to switch the symlinks over. here's the command i will be running when that time comes: pssh -h jenkins_workers.txt "cd /usr/java; rm -f latest; rm -f default; ln -s jdk1.7.0_79 latest; ln -s latest default" > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest java 7 jdk. > please assign this to me. -- 
[jira] [Commented] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731478#comment-14731478 ] shane knapp commented on SPARK-10455: - it's installed in: /usr/java/jdk1.8.0_60 i'll email the dev@ list and let everyone know. > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks!
[jira] [Resolved] (SPARK-10450) Minor SQL style, format, typo, readability fixes
[ https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10450. --- Resolution: Fixed Fix Version/s: 1.6.0 > Minor SQL style, format, typo, readability fixes > > > Key: SPARK-10450 > URL: https://issues.apache.org/jira/browse/SPARK-10450 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 1.6.0 > > > This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's > more of a continuous process.
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10304: -- Target Version/s: 1.6.0, 1.5.1 > Partition discovery does not throw an exception if the dir structure is > invalid > --- > > Key: SPARK-10304 > URL: https://issues.apache.org/jira/browse/SPARK-10304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Zhan Zhang >Priority: Critical > > I have a dir structure like {{/path/table1/partition_column=1/}}. When I try > to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if > it is stored as ORC, there will be the following NPE. But, if it is Parquet, > we even can return rows. We should complain to users about the dir struct > because {{table1}} does not meet our format. > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in > stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 > (TID 3504, 10.0.195.227): java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) > at > org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) > at > org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10304: -- Target Version/s: 1.6.0, 1.5.1 (was: 1.5.1,1.6.0) > Partition discovery does not throw an exception if the dir structure is > invalid > --- > > Key: SPARK-10304 > URL: https://issues.apache.org/jira/browse/SPARK-10304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Zhan Zhang >Priority: Critical > > I have a dir structure like {{/path/table1/partition_column=1/}}. When I try > to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if > it is stored as ORC, I get the following NPE. But if it is stored as Parquet, > rows are even returned. We should raise an error telling users the dir > structure is invalid, because {{table1}} does not follow the expected > partition layout. > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in > stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 > (TID 3504, 10.0.195.227): java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) > at > org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) > at > org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
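The invalid layout in the report could be detected before a scan ever runs: every directory component between the table root and the data files should match the Hive-style {{column=value}} pattern, and {{table1}} does not. A minimal sketch of such a check, in plain Java with no Spark dependencies (the class and method names are hypothetical; this is an illustration, not Spark's actual partition-discovery code):

```java
import java.util.regex.Pattern;

// Sketch: validate that every directory component below the table root
// follows the Hive-style `column=value` partition layout. Illustration
// only -- not Spark's real implementation; names are made up.
class PartitionLayoutCheck {
    private static final Pattern PARTITION_DIR = Pattern.compile("[^=/]+=[^=/]+");

    /**
     * Returns true when each path component between {@code root} and the
     * leaf directory matches `column=value`.
     */
    public static boolean isValidPartitionPath(String root, String leafDir) {
        String rel = leafDir.startsWith(root) ? leafDir.substring(root.length()) : leafDir;
        rel = rel.replaceAll("^/+", "").replaceAll("/+$", "");
        if (rel.isEmpty()) {
            return true; // the table root itself is fine
        }
        for (String component : rel.split("/")) {
            if (!PARTITION_DIR.matcher(component).matches()) {
                return false; // e.g. a bare `table1` directory level
            }
        }
        return true;
    }
}
```

Under this check, {{/path/table1/partition_column=1/}} is valid relative to the root {{/path/table1}}, but invalid relative to {{/path}}, because the intervening {{table1}} component is not of the form {{column=value}} -- which is exactly the case the bug report describes.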
[jira] [Commented] (SPARK-10013) Remove Java assert from Java unit tests
[ https://issues.apache.org/jira/browse/SPARK-10013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731585#comment-14731585 ] Apache Spark commented on SPARK-10013: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/8607 > Remove Java assert from Java unit tests > --- > > Key: SPARK-10013 > URL: https://issues.apache.org/jira/browse/SPARK-10013 > Project: Spark > Issue Type: Test > Components: ML, MLlib >Reporter: Joseph K. Bradley > > We should use assertTrue, etc. instead, so that the assertions are not > silently skipped when the JVM runs without the -ea flag.
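The hazard behind this issue: a plain Java `assert` statement is compiled in but skipped at runtime unless the JVM is started with `-ea`, so a failing condition can pass silently. JUnit's `assertTrue`, by contrast, is an ordinary method call that always executes. A minimal sketch (the helper below mimics what `org.junit.Assert.assertTrue` does internally; it is illustrative, not the real JUnit code):

```java
// Sketch of why SPARK-10013 matters: a plain `assert` is elided at
// runtime unless the JVM runs with -ea, while an explicit check (what
// JUnit's assertTrue does internally) always executes. Names here are
// illustrative, not JUnit's actual source.
class AssertDemo {
    /** Always-on check, analogous to JUnit's assertTrue. */
    public static void assertTrue(boolean condition) {
        if (!condition) {
            throw new AssertionError("assertion failed");
        }
    }

    /** Returns true if the always-on check fired for a false condition. */
    public static boolean firesOnFalse() {
        try {
            assertTrue(1 + 1 == 3); // deliberately false
            return false;           // a disabled `assert 1 + 1 == 3;` would fall through here
        } catch (AssertionError expected) {
            return true;            // the explicit check cannot be disabled
        }
    }
}
```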
[jira] [Comment Edited] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731436#comment-14731436 ] shane knapp edited comment on SPARK-10456 at 9/4/15 9:46 PM: - looks like we'll be installing 7u79 (we're at 7u71 currently). was (Author: shaneknapp): looks like we'll be installing 7u79 (we're at 7u51 currently). > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest java 7 jdk. > please assign this to me.
[jira] [Updated] (SPARK-10176) Show partially analyzed plan when checkAnswer df fails to resolve
[ https://issues.apache.org/jira/browse/SPARK-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10176: -- Target Version/s: 1.6.0 (was: 1.5.0) > Show partially analyzed plan when checkAnswer df fails to resolve > - > > Key: SPARK-10176 > URL: https://issues.apache.org/jira/browse/SPARK-10176 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.6.0 > > > It would be much easier to debug test failures if we could see the failed > plan instead of just the user-friendly error message.
[jira] [Updated] (SPARK-10176) Show partially analyzed plan when checkAnswer df fails to resolve
[ https://issues.apache.org/jira/browse/SPARK-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10176: -- Fix Version/s: (was: 1.5.0) 1.6.0 > Show partially analyzed plan when checkAnswer df fails to resolve > - > > Key: SPARK-10176 > URL: https://issues.apache.org/jira/browse/SPARK-10176 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.6.0 > > > It would be much easier to debug test failures if we could see the failed > plan instead of just the user-friendly error message.
[jira] [Resolved] (SPARK-10176) Show partially analyzed plan when checkAnswer df fails to resolve
[ https://issues.apache.org/jira/browse/SPARK-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10176. --- Resolution: Fixed Fix Version/s: 1.5.0 > Show partially analyzed plan when checkAnswer df fails to resolve > - > > Key: SPARK-10176 > URL: https://issues.apache.org/jira/browse/SPARK-10176 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.5.0 > > > It would be much easier to debug test failures if we could see the failed > plan instead of just the user-friendly error message.
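The spirit of this fix can be sketched generically: when resolution fails, attach a dump of the partially analyzed plan to the failure instead of surfacing only the friendly message. The types and names below are hypothetical stand-ins, not Spark's actual test utilities:

```java
// Sketch of the SPARK-10176 idea: on a resolution failure, include the
// partially analyzed plan in the diagnostic output rather than only a
// user-friendly message. Plan and the method names are hypothetical.
class CheckAnswerDemo {
    /** Minimal stand-in for a query plan that may fail to resolve. */
    interface Plan {
        boolean resolved();
        String describe(); // textual dump of the (partial) plan
    }

    /**
     * Returns "ok" for a resolved plan; otherwise returns a failure
     * message that embeds the plan dump, so test logs show what was
     * actually analyzed instead of only "unresolved".
     */
    public static String explainFailure(Plan plan) {
        if (plan.resolved()) {
            return "ok";
        }
        return "failed to resolve; partially analyzed plan:\n" + plan.describe();
    }
}
```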
[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save
[ https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731529#comment-14731529 ] Xiangrui Meng commented on SPARK-10199: --- The improvement numbers also depend on the model size. In unit tests, the model sizes are usually very small, so the overhead of reflection becomes significant. With real models, either the model itself is small, or the model is large and the overhead of reflection becomes insignificant. Keeping the code simple and easy to understand is also quite important. +[~josephkb] > Avoid using reflections for parquet model save > -- > > Key: SPARK-10199 > URL: https://issues.apache.org/jira/browse/SPARK-10199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Feynman Liang >Priority: Minor > > These items are not high priority, since the overhead of writing to Parquet is > much greater than that of runtime reflection. > Multiple model save/load implementations in MLlib use case classes to infer a schema for the > data frame saved to Parquet. However, inferring a schema from case classes or > tuples uses [runtime > reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361] > which is unnecessary, since the types are already known at the time `save` is > called. > It would be better to specify the schema for the data frame directly > using {{sqlContext.createDataFrame(dataRDD, schema)}}
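The point of the proposal can be shown without Spark: when the field names and types are known at compile time, spelling the schema out explicitly avoids a runtime-reflection pass over the class. A pure-Java analogue (all names here are hypothetical; the actual Spark call the issue proposes is {{sqlContext.createDataFrame(dataRDD, schema)}} with an explicitly constructed schema):

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Pure-Java analogue of "schema inferred by runtime reflection" versus
// "schema written out explicitly". ModelRecord stands in for a model's
// on-disk record type; the schema here is just name:type strings.
class SchemaDemo {
    static class ModelRecord {
        public long id;
        public double weight;
    }

    /** Schema derived by runtime reflection (what case-class inference does). */
    public static List<String> reflectedSchema(Class<?> cls) {
        List<String> schema = new ArrayList<>();
        for (Field f : cls.getFields()) {
            schema.add(f.getName() + ":" + f.getType().getSimpleName());
        }
        return schema;
    }

    /**
     * The same schema written out explicitly -- no reflection needed,
     * because the types are already known where save() would be called.
     */
    public static List<String> explicitSchema() {
        return Arrays.asList("id:long", "weight:double");
    }
}
```

Both paths describe the same fields; the explicit version simply skips the per-call reflective walk, which is the overhead the comment says only matters for very small models.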
[jira] [Resolved] (SPARK-9669) Support PySpark with Mesos Cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-9669. -- Resolution: Fixed Fix Version/s: 1.6.0 > Support PySpark with Mesos Cluster mode > --- > > Key: SPARK-9669 > URL: https://issues.apache.org/jira/browse/SPARK-9669 > Project: Spark > Issue Type: New Feature > Components: Mesos, PySpark >Affects Versions: 1.5.0 >Reporter: Timothy Chen >Assignee: Timothy Chen > Fix For: 1.6.0 > > > PySpark in cluster mode on Mesos is not yet supported. > We need to enable it and make sure it is able to launch PySpark jobs.
[jira] [Resolved] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10454. --- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Target Version/s: 1.6.0, 1.5.1 > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Assignee: Pete Robbins >Priority: Critical > Labels: flaky-test > Fix For: 1.6.0, 1.5.1 > > > The test case fails intermittently on Jenkins. > For example, see the following builds: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/