[jira] [Commented] (SPARK-10437) Support aggregation expressions in Order By

2015-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730542#comment-14730542
 ] 

Apache Spark commented on SPARK-10437:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/8599

> Support aggregation expressions in Order By
> ---
>
> Key: SPARK-10437
> URL: https://issues.apache.org/jira/browse/SPARK-10437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Harish Butani
>
> Followup on SPARK-6583
> The following still fails. 
> {code}
> val df = sqlContext.read.json("examples/src/main/resources/people.json")
> df.registerTempTable("t")
> val df2 = sqlContext.sql("select age, count(*) from t group by age order by 
> count(*)")
> df2.show()
> {code}
> {code:title=StackTrace}
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No 
> function to evaluate expression. type: Count, tree: COUNT(1)
>   at 
> org.apache.spark.sql.catalyst.expressions.AggregateExpression.eval(aggregates.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.RowOrdering.compare(rows.scala:219)
> {code}
> In 1.4 the issue seemed to be that BindReferences.bindReference didn't handle 
> this case.
> I haven't looked at the 1.5 code, but I don't see a change to bindReference in 
> this patch.
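A possible interim workaround is to name the aggregate in the SELECT list and sort by that alias instead of repeating count(*) in ORDER BY. This is an untested sketch; whether the alias resolves in ORDER BY on 1.5 is an assumption, not something confirmed in this issue:

{code}
// Untested workaround sketch: alias the aggregate and order by the alias.
val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.registerTempTable("t")
val df2 = sqlContext.sql("select age, count(*) as cnt from t group by age order by cnt")
df2.show()
{code}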






[jira] [Updated] (SPARK-10445) Extend maven version range (enforcer)

2015-09-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10445:
--
Priority: Minor  (was: Major)

> Extend maven version range (enforcer)
> -
>
> Key: SPARK-10445
> URL: https://issues.apache.org/jira/browse/SPARK-10445
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Jean-Baptiste Onofré
>Priority: Minor
>
> Currently, the pom.xml "forces" (via enforcer rule) the usage of Maven 3.3.x.
> Actually, the build works fine with Maven 3.2.x as well.
> I propose to extend the Maven version range.






[jira] [Assigned] (SPARK-10446) Support to specify join type when calling join with usingColumns

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10446:


Assignee: Apache Spark

> Support to specify join type when calling join with usingColumns
> 
>
> Key: SPARK-10446
> URL: https://issues.apache.org/jira/browse/SPARK-10446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Currently the method join(right: DataFrame, usingColumns: Seq[String]) only 
> supports inner join. It is more convenient to have it support other join 
> types.






[jira] [Resolved] (SPARK-10445) Extend maven version range (enforcer)

2015-09-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10445.
---
Resolution: Won't Fix

See https://issues.apache.org/jira/browse/SPARK-9521 -- we need 3.3+ but the 
build system downloads it for you.

> Extend maven version range (enforcer)
> -
>
> Key: SPARK-10445
> URL: https://issues.apache.org/jira/browse/SPARK-10445
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Jean-Baptiste Onofré
>
> Currently, the pom.xml "forces" (via enforcer rule) the usage of Maven 3.3.x.
> Actually, the build works fine with Maven 3.2.x as well.
> I propose to extend the Maven version range.






[jira] [Created] (SPARK-10446) Support to specify join type when calling join with usingColumns

2015-09-04 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-10446:
---

 Summary: Support to specify join type when calling join with 
usingColumns
 Key: SPARK-10446
 URL: https://issues.apache.org/jira/browse/SPARK-10446
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


Currently the method join(right: DataFrame, usingColumns: Seq[String]) only 
supports inner join. It is more convenient to have it support other join types.
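Until such an overload exists, a rough workaround sketch is to build the join condition from the shared column names and use the existing join(right, joinExprs, joinType) overload. The helper name below is hypothetical, and unlike usingColumns this keeps both copies of the join columns:

{code}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: emulate "join using columns" with an explicit join type
// on top of the existing join(right, joinExprs, joinType) API.
def joinUsing(left: DataFrame, right: DataFrame,
              usingColumns: Seq[String], joinType: String): DataFrame = {
  // AND together one equality predicate per shared column name.
  val condition = usingColumns.map(c => left(c) === right(c)).reduce(_ && _)
  // Note: unlike join(right, usingColumns), this does not deduplicate the join columns.
  left.join(right, condition, joinType)
}

// Example usage: joinUsing(people, orders, Seq("user_id"), "left_outer")
{code}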






[jira] [Commented] (SPARK-10442) select cast('false' as boolean) returns true

2015-09-04 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730657#comment-14730657
 ] 

Cheng Lian commented on SPARK-10442:


The reason is that all non-empty strings are converted to {{true}} when casting 
to boolean. This behavior isn't intuitive, but it is consistent with Hive. I'm 
not sure whether we want to change this. 

PostgreSQL only allows the string literals {{'true'}} and {{'false'}} to be cast 
to boolean (case-insensitively); casting any other string to boolean results in 
an error.
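A quick way to see the behavior (a sketch; the commented results follow the Hive-compatible rule described above):

{code}
// Under the current Hive-compatible rule, any non-empty string casts to true,
// so even the literal 'false' comes back as true.
sqlContext.sql("select cast('false' as boolean)").show() // true (the reported bug)
sqlContext.sql("select cast('true' as boolean)").show()  // true
sqlContext.sql("select cast('abc' as boolean)").show()   // true (non-empty string)
{code}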

> select cast('false' as boolean) returns true
> 
>
> Key: SPARK-10442
> URL: https://issues.apache.org/jira/browse/SPARK-10442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>







[jira] [Updated] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter

2015-09-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10310:
---
Description: 
There is a real case of using a Python streaming script in a Spark SQL query. We 
found that all result records were written as ONE line of input from the "select" 
pipeline to the Python script, so the script cannot identify each record. In 
addition, the field separator in Spark SQL is '^A' ('\001'), which is 
inconsistent/incompatible with the '\t' used in the Hive implementation.

Key query:
{code:sql}
CREATE VIEW temp1 AS
SELECT *
FROM
(
  FROM
  (
SELECT
  c.wcs_user_sk,
  w.wp_type,
  (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
FROM web_clickstreams c, web_page w
WHERE c.wcs_web_page_sk = w.wp_web_page_sk
AND   c.wcs_web_page_sk IS NOT NULL
AND   c.wcs_user_sk IS NOT NULL
AND   c.wcs_sales_skIS NULL --abandoned implies: no sale
DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
  ) clicksAnWebPageType
  REDUCE
wcs_user_sk,
tstamp_inSec,
wp_type
  USING 'python sessionize.py 3600'
  AS (
wp_type STRING,
tstamp BIGINT, 
sessionid STRING)
) sessionized
{code}
Key Python script:
{noformat}
for line in sys.stdin:
 user_sk,  tstamp_str, value  = line.strip().split("\t")
{noformat}
Sample SELECT result:
{noformat}
^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
{noformat}
Expected result:
{noformat}
31   3237764860   feedback
31   3237769106   dynamic
31   3237779027   review
{noformat}


  was:
There is a real case of using a Python streaming script in a Spark SQL query. We 
found that all result records were written as ONE line of input from the "select" 
pipeline to the Python script, so the script cannot identify each record. In 
addition, the field separator in Spark SQL is '^A' ('\001'), which is 
inconsistent/incompatible with the '\t' used in the Hive implementation.

#Key  Query:
CREATE VIEW temp1 AS
SELECT *
FROM
(
  FROM
  (
SELECT
  c.wcs_user_sk,
  w.wp_type,
  (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
FROM web_clickstreams c, web_page w
WHERE c.wcs_web_page_sk = w.wp_web_page_sk
AND   c.wcs_web_page_sk IS NOT NULL
AND   c.wcs_user_sk IS NOT NULL
AND   c.wcs_sales_skIS NULL --abandoned implies: no sale
DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
  ) clicksAnWebPageType
  REDUCE
wcs_user_sk,
tstamp_inSec,
wp_type
  USING 'python sessionize.py 3600'
  AS (
wp_type STRING,
tstamp BIGINT, 
sessionid STRING)
) sessionized

#Key Python Script#
for line in sys.stdin:
 user_sk,  tstamp_str, value  = line.strip().split("\t")

Result Records example from 'select' ##
^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
Result Records example in format##
31   3237764860   feedback
31   3237769106   dynamic
31   3237779027   review



> [Spark SQL] All result records will be populated into ONE line during the 
> script transform due to missing the correct line/field delimiter
> --
>
> Key: SPARK-10310
> URL: https://issues.apache.org/jira/browse/SPARK-10310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yi Zhou
>Priority: Critical
>
> There is a real case of using a Python streaming script in a Spark SQL query. 
> We found that all result records were written as ONE line of input from the 
> "select" pipeline to the Python script, so the script cannot identify each 
> record. In addition, the field separator in Spark SQL is '^A' ('\001'), which 
> is inconsistent/incompatible with the '\t' used in the Hive implementation.
> Key query:
> {code:sql}
> CREATE VIEW temp1 AS
> SELECT *
> FROM
> (
>   FROM
>   (
> SELECT
>   c.wcs_user_sk,
>   w.wp_type,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams c, web_page w
> WHERE c.wcs_web_page_sk = w.wp_web_page_sk
> AND   c.wcs_web_page_sk IS NOT NULL
> AND   c.wcs_user_sk IS NOT NULL
> AND   c.wcs_sales_skIS NULL --abandoned implies: no sale
> DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wp_type
>   USING 'python sessionize.py 3600'
>   AS (
> wp_type STRING,
> tstamp BIGINT, 
> sessionid STRING)
> ) sessionized
> {code}
> Key Python script:
> {noformat}
> for line in sys.stdin:
>  user_sk,  tstamp_str, value  = line.strip().split("\t")
> {noformat}
> Sample SELECT result:
> {noformat}
> 

[jira] [Commented] (SPARK-10446) Support to specify join type when calling join with usingColumns

2015-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730600#comment-14730600
 ] 

Apache Spark commented on SPARK-10446:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/8600

> Support to specify join type when calling join with usingColumns
> 
>
> Key: SPARK-10446
> URL: https://issues.apache.org/jira/browse/SPARK-10446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently the method join(right: DataFrame, usingColumns: Seq[String]) only 
> supports inner join. It is more convenient to have it support other join 
> types.






[jira] [Assigned] (SPARK-10446) Support to specify join type when calling join with usingColumns

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10446:


Assignee: (was: Apache Spark)

> Support to specify join type when calling join with usingColumns
> 
>
> Key: SPARK-10446
> URL: https://issues.apache.org/jira/browse/SPARK-10446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently the method join(right: DataFrame, usingColumns: Seq[String]) only 
> supports inner join. It is more convenient to have it support other join 
> types.






[jira] [Commented] (SPARK-9235) PYSPARK_DRIVER_PYTHON env variable is not set on the YARN Node manager acting as driver in yarn-cluster mode

2015-09-04 Thread Aaron Glahe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730680#comment-14730680
 ] 

Aaron Glahe commented on SPARK-9235:


You set it in spark-env.sh, e.g., since we use conda as our "python env":

SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/srv/software/anaconda/bin/python"


> PYSPARK_DRIVER_PYTHON env variable is not set on the YARN Node manager acting 
> as driver in yarn-cluster mode
> 
>
> Key: SPARK-9235
> URL: https://issues.apache.org/jira/browse/SPARK-9235
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.0
> Environment: CentOS 6.6, python 2.7, Spark 1.4.1 tagged version, YARN 
> Cluster Manager, CDH 5.4.1 (Hadoop 2.6.0++), Java 1.7
>Reporter: Aaron Glahe
>Priority: Minor
>
> Relates to SPARK-9229
> Env:  Spark on YARN, Java 1.7, Centos 6.6, CDH 5.4.1 (Hadoop 2.6.0++), 
> Anaconda Python 2.7.10 "installed" in /srv/software directory
> On a client/submitting machine, we set the PYSPARK_DRIVER_PYTHON env var in 
> spark-env.sh, pointing it at the Anaconda python executable, which was on every 
> YARN node: 
> export PYSPARK_DRIVER_PYTHON='/srv/software/anaconda/bin/python'
> As a side note, export PYSPARK_PYTHON='/srv/software/anaconda/bin/python' was 
> set in spark-env.sh as well.
> Run the command:
> spark-submit test.py --master yarn --deploy-mode cluster
> It appears as though the Node Manager hosting the driver does not use the 
> PYSPARK_DRIVER_PYTHON python, but instead uses the CentOS system default 
> (which in this case is python 2.6).
> The workaround appears to be setting the python path in SPARK_YARN_USER_ENV.






[jira] [Assigned] (SPARK-10437) Support aggregation expressions in Order By

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10437:


Assignee: (was: Apache Spark)

> Support aggregation expressions in Order By
> ---
>
> Key: SPARK-10437
> URL: https://issues.apache.org/jira/browse/SPARK-10437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Harish Butani
>
> Followup on SPARK-6583
> The following still fails. 
> {code}
> val df = sqlContext.read.json("examples/src/main/resources/people.json")
> df.registerTempTable("t")
> val df2 = sqlContext.sql("select age, count(*) from t group by age order by 
> count(*)")
> df2.show()
> {code}
> {code:title=StackTrace}
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No 
> function to evaluate expression. type: Count, tree: COUNT(1)
>   at 
> org.apache.spark.sql.catalyst.expressions.AggregateExpression.eval(aggregates.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.RowOrdering.compare(rows.scala:219)
> {code}
> In 1.4 the issue seemed to be that BindReferences.bindReference didn't handle 
> this case.
> I haven't looked at the 1.5 code, but I don't see a change to bindReference in 
> this patch.






[jira] [Assigned] (SPARK-10437) Support aggregation expressions in Order By

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10437:


Assignee: Apache Spark

> Support aggregation expressions in Order By
> ---
>
> Key: SPARK-10437
> URL: https://issues.apache.org/jira/browse/SPARK-10437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Harish Butani
>Assignee: Apache Spark
>
> Followup on SPARK-6583
> The following still fails. 
> {code}
> val df = sqlContext.read.json("examples/src/main/resources/people.json")
> df.registerTempTable("t")
> val df2 = sqlContext.sql("select age, count(*) from t group by age order by 
> count(*)")
> df2.show()
> {code}
> {code:title=StackTrace}
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No 
> function to evaluate expression. type: Count, tree: COUNT(1)
>   at 
> org.apache.spark.sql.catalyst.expressions.AggregateExpression.eval(aggregates.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.RowOrdering.compare(rows.scala:219)
> {code}
> In 1.4 the issue seemed to be that BindReferences.bindReference didn't handle 
> this case.
> I haven't looked at the 1.5 code, but I don't see a change to bindReference in 
> this patch.






[jira] [Updated] (SPARK-10298) PySpark can't JSON serialize a DataFrame with DecimalType columns.

2015-09-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10298:
--
Assignee: Michael Armbrust

> PySpark can't JSON serialize a DataFrame with DecimalType columns.
> --
>
> Key: SPARK-10298
> URL: https://issues.apache.org/jira/browse/SPARK-10298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kevin Cox
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> {code}
> In [8]: sc.sql.createDataFrame([[Decimal(123)]], 
> types.StructType([types.StructField("a", types.DecimalType())]))
> Out[8]: DataFrame[a: decimal(10,0)]
> In [9]: _.write.json("foo")
> 15/08/26 14:26:21 ERROR DefaultWriterContainer: Aborting task.
> scala.MatchError: (DecimalType(10,0),123) (of class scala.Tuple2)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:191)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:224)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/08/26 14:26:21 ERROR DefaultWriterContainer: Task attempt 
> attempt_201508261426__m_00_0 aborted.
> 15/08/26 14:26:21 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:232)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: (DecimalType(10,0),123) (of class scala.Tuple2)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> 

[jira] [Updated] (SPARK-10159) Hive 1.3.x GenericUDFDate NPE issue

2015-09-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10159:
--
Assignee: Michael Armbrust

> Hive 1.3.x GenericUDFDate NPE issue
> ---
>
> Key: SPARK-10159
> URL: https://issues.apache.org/jira/browse/SPARK-10159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Alex Liu
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> When running a SQL query with HiveContext, Hive 1.3.x GenericUDFDate hits an 
> NPE. The following is the query and log.
> {code}
> SELECT a.stationid AS stationid,
> a.month AS month,
> a.year AS year,
> AVG(a.mean) AS mean,
> MIN(a.min) AS min,
> MAX(a.max) AS max
> FROM 
>   (SELECT *,
>  YEAR(date) AS year,
>  MONTH(date) AS month,
>  FROM_UNIXTIME(UNIX_TIMESTAMP(TO_DATE(date), '-MM-dd'), 'E') AS 
> weekday
>FROM weathercql.daily) a
> WHERE ((a.weekday = 'Mon'))
>   AND (a.metric = 'temperature')
> GROUP BY a.stationid, a.month, a.year
> ORDER BY stationid, year, month
> LIMIT 100
> {code}
> log {code}
> Filter 
> ((HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDate(date#81),-MM-dd),E)
>  = Mon) && (metric#80 = temperature))
> ERROR 2015-08-20 15:39:06 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation: Error 
> executing query:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 
> (TID 208, 127.0.0.1): java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFDate.evaluate(GenericUDFDate.java:119)
>   at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:188)
>   at 
> org.apache.spark.sql.hive.HiveGenericUdf$$anonfun$eval$2.apply(hiveUdfs.scala:184)
>   at 
> org.apache.spark.sql.hive.DeferredObjectAdapter.get(hiveUdfs.scala:138)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFToUnixTimeStamp.evaluate(GenericUDFToUnixTimeStamp.java:121)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp.evaluate(GenericUDFUnixTimeStamp.java:52)
>   at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:188)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUdf$$anonfun$eval$1.apply(hiveUdfs.scala:121)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUdf$$anonfun$eval$1.apply(hiveUdfs.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.sql.hive.HiveSimpleUdf.eval(hiveUdfs.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:191)
>   at 
> org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:130)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$1.apply(predicates.scala:30)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$1.apply(predicates.scala:30)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:154)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at 

[jira] [Created] (SPARK-10447) Upgrade pyspark to use py4j 0.9

2015-09-04 Thread Justin Uang (JIRA)
Justin Uang created SPARK-10447:
---

 Summary: Upgrade pyspark to use py4j 0.9
 Key: SPARK-10447
 URL: https://issues.apache.org/jira/browse/SPARK-10447
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.4.1
Reporter: Justin Uang


This was recently released, and it has many improvements, especially the 
following:

{quote}
Python side: IDEs and interactive interpreters such as IPython can now get help 
text/autocompletion for Java classes, objects, and members. This makes Py4J an 
ideal tool to explore complex Java APIs (e.g., the Eclipse API). Thanks to 
@jonahkichwacoders
{quote}

Normally we wrap all the APIs in Spark, but for the ones that aren't wrapped, this 
would make it easier to go off-road by using the Java proxy objects.






[jira] [Commented] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-09-04 Thread Martin Tapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730911#comment-14730911
 ] 

Martin Tapp commented on SPARK-4940:


My principal use case is to cram as much as possible onto the same cluster. Some 
of our apps would benefit from these different strategies. For instance, we are 
using a library which starts lots of threads, so the round-robin strategy is 
really a good match for that type of workload, preventing too many tasks from 
landing on the same executor. Another example is a pure Spark pipeline where it's 
OK to fill up each slave first because not many `outside` resources are being 
used. This would let us maximize our cluster's resource utilization.

> Support more evenly distributing cores for Mesos mode
> -
>
> Key: SPARK-4940
> URL: https://issues.apache.org/jira/browse/SPARK-4940
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
> Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
>
> Currently in Coarse grain mode the spark scheduler simply takes all the 
> resources it can on each node, but can cause uneven distribution based on 
> resources available on each slave.






[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9

2015-09-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730990#comment-14730990
 ] 

Sean Owen commented on SPARK-10447:
---

I bet there are some upsides to updating, but the question is: do we know whether 
anything breaks or changes? It's worth at least running the tests with this 
change, and also skimming the release notes to understand any breaking changes. 

> Upgrade pyspark to use py4j 0.9
> ---
>
> Key: SPARK-10447
> URL: https://issues.apache.org/jira/browse/SPARK-10447
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: Justin Uang
>
> This was recently released, and it has many improvements, especially the 
> following:
> {quote}
> Python side: IDEs and interactive interpreters such as IPython can now get 
> help text/autocompletion for Java classes, objects, and members. This makes 
> Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). 
> Thanks to @jonahkichwacoders
> {quote}
> Normally we wrap all the APIs in Spark, but for the ones that aren't wrapped, 
> this would make it easier to go off-road by using the Java proxy objects.






[jira] [Created] (SPARK-10448) Parquet schema merging should NOT merge UDT

2015-09-04 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10448:
--

 Summary: Parquet schema merging should NOT merge UDT
 Key: SPARK-10448
 URL: https://issues.apache.org/jira/browse/SPARK-10448
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1, 1.3.1, 1.5.0
Reporter: Cheng Lian


For example, we may have a UDT {{U}} that maps to a Catalyst {{StructType}} 
with two fields {{a}} and {{b}}. Later on, we updated {{U}} to {{U'}} by 
removing {{a}} and adding {{c}}. In this case, Parquet schema merging will give 
a {{StructType}} with all three fields. But such a {{StructType}} can be mapped 
to neither {{U}} nor {{U'}}.

We probably shouldn't allow schema merging over UDT types.
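A small illustration of the mismatch in terms of the Catalyst schemas involved (field names follow the description above; the concrete field types are assumptions made for the example):

{code}
import org.apache.spark.sql.types._

// Catalyst schema the original UDT U maps to: fields a and b.
val u = StructType(Seq(StructField("a", IntegerType), StructField("b", StringType)))

// Catalyst schema for the updated UDT U': a removed, c added.
val uPrime = StructType(Seq(StructField("b", StringType), StructField("c", DoubleType)))

// Parquet schema merging would union the fields into StructType(a, b, c),
// which corresponds to neither U nor U', so merging should not be applied to UDTs.
{code}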






[jira] [Assigned] (SPARK-10450) Minor SQL style, format, typo, readability fixes

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10450:


Assignee: Andrew Or  (was: Apache Spark)

> Minor SQL style, format, typo, readability fixes
> 
>
> Key: SPARK-10450
> URL: https://issues.apache.org/jira/browse/SPARK-10450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's 
> more of a continuous process.






[jira] [Commented] (SPARK-10450) Minor SQL style, format, typo, readability fixes

2015-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731211#comment-14731211
 ] 

Apache Spark commented on SPARK-10450:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/8603

> Minor SQL style, format, typo, readability fixes
> 
>
> Key: SPARK-10450
> URL: https://issues.apache.org/jira/browse/SPARK-10450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's 
> more of a continuous process.






[jira] [Assigned] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10451:


Assignee: Apache Spark

> Prevent unnecessary serializations in InMemoryColumnarTableScan
> ---
>
> Key: SPARK-10451
> URL: https://issues.apache.org/jira/browse/SPARK-10451
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yash Datta
>Assignee: Apache Spark
>
> In InMemoryColumnarTableScan, serialization of certain fields like 
> buildFilter, InMemoryRelation, etc. can be avoided during task execution by 
> carefully managing the closure of mapPartitions.






[jira] [Commented] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan

2015-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731283#comment-14731283
 ] 

Apache Spark commented on SPARK-10451:
--

User 'saucam' has created a pull request for this issue:
https://github.com/apache/spark/pull/8604

> Prevent unnecessary serializations in InMemoryColumnarTableScan
> ---
>
> Key: SPARK-10451
> URL: https://issues.apache.org/jira/browse/SPARK-10451
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yash Datta
>
> In InMemoryColumnarTableScan, serialization of certain fields like 
> buildFilter, InMemoryRelation, etc. can be avoided during task execution by 
> carefully managing the closure of mapPartitions.






[jira] [Assigned] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10451:


Assignee: (was: Apache Spark)

> Prevent unnecessary serializations in InMemoryColumnarTableScan
> ---
>
> Key: SPARK-10451
> URL: https://issues.apache.org/jira/browse/SPARK-10451
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yash Datta
>
> In InMemoryColumnarTableScan, serialization of certain fields like 
> buildFilter, InMemoryRelation, etc. can be avoided during task execution by 
> carefully managing the closure of mapPartitions.






[jira] [Commented] (SPARK-8951) support CJK characters in collect()

2015-09-04 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731151#comment-14731151
 ] 

Shivaram Venkataraman commented on SPARK-8951:
--

Ah I should have retested this before merging - I'll send a PR to fix this now

> support CJK characters in collect()
> ---
>
> Key: SPARK-8951
> URL: https://issues.apache.org/jira/browse/SPARK-8951
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Jaehong Choi
>Assignee: Jaehong Choi
>Priority: Minor
> Fix For: 1.6.0
>
> Attachments: SerDe.scala.diff
>
>
> Spark gives an error message and does not show the output when a field of the 
> result DataFrame contains CJK characters.
> I found out that the SerDe in the R API only supports ASCII strings right now, 
> as noted in a comment in the source code.
> So, I fixed SerDe.scala a little to support CJK, as in the attached file. 
> I did not care about efficiency; I just wanted to see if it works.
> {noformat}
> people.json
> {"name":"가나"}
> {"name":"테스트123", "age":30}
> {"name":"Justin", "age":19}
> df <- read.df(sqlContext, "./people.json", "json")
> head(df)
> Error in rawtochar(string) : embedded nul in string : '\0 \x98'
> {noformat}
> {code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala}
>   // NOTE: Only works for ASCII right now
>   def writeString(out: DataOutputStream, value: String): Unit = {
> val len = value.length
> out.writeInt(len + 1) // For the \0
> out.writeBytes(value)
> out.writeByte(0)
> {code}
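The root cause visible in the snippet above is that the length written is the character count, not the UTF-8 byte count. A minimal sketch of the kind of fix the attached diff presumably makes (not necessarily the exact patch):

{code}
import java.io.DataOutputStream
import java.nio.charset.StandardCharsets

// Write the UTF-8 byte length (plus the trailing \0) and then the UTF-8 bytes,
// so multi-byte CJK characters are counted and transmitted correctly.
def writeString(out: DataOutputStream, value: String): Unit = {
  val utf8 = value.getBytes(StandardCharsets.UTF_8)
  out.writeInt(utf8.length + 1) // +1 for the \0
  out.write(utf8, 0, utf8.length)
  out.writeByte(0)
}
{code}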






[jira] [Created] (SPARK-10449) StructType.merge shouldn't merge DecimalTypes with different precisions and/or scales

2015-09-04 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10449:
--

 Summary: StructType.merge shouldn't merge DecimalTypes with 
different precisions and/or scales
 Key: SPARK-10449
 URL: https://issues.apache.org/jira/browse/SPARK-10449
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1, 1.3.1, 1.5.0
Reporter: Cheng Lian


Schema merging should only handle struct fields. But currently we also 
reconcile decimal precision and scale information.
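For concreteness, a sketch of the situation (the precisions and scales are made up for the example):

{code}
import org.apache.spark.sql.types._

// Two schemas that differ only in the decimal precision/scale of one field.
val s1 = StructType(Seq(StructField("price", DecimalType(10, 0))))
val s2 = StructType(Seq(StructField("price", DecimalType(12, 2))))

// Per this issue, merging s1 and s2 should not silently reconcile the two decimal
// types; precision/scale reconciliation is out of scope for struct-field merging.
{code}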






[jira] [Commented] (SPARK-9666) ML 1.5 QA: model save/load audit

2015-09-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731299#comment-14731299
 ] 

Joseph K. Bradley commented on SPARK-9666:
--

Thanks for checking.  Shall I mark this complete?

> ML 1.5 QA: model save/load audit
> 
>
> Key: SPARK-9666
> URL: https://issues.apache.org/jira/browse/SPARK-9666
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> We should check to make sure no changes broke model import/export in 
> spark.mllib.
> * If a model's name, data members, or constructors have changed _at all_, 
> then we likely need to support a new save/load format version.  Different 
> versions must be tested in unit tests to ensure backwards compatibility 
> (i.e., verify we can load old model formats).
> * Examples in the programming guide should include save/load when available.  
> It's important to try running each example in the guide whenever it is 
> modified (since there are no automated tests).






[jira] [Commented] (SPARK-8951) support CJK characters in collect()

2015-09-04 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731155#comment-14731155
 ] 

Shivaram Venkataraman commented on SPARK-8951:
--

Sent https://github.com/apache/spark/pull/8601 to fix this

> support CJK characters in collect()
> ---
>
> Key: SPARK-8951
> URL: https://issues.apache.org/jira/browse/SPARK-8951
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Jaehong Choi
>Assignee: Jaehong Choi
>Priority: Minor
> Fix For: 1.6.0
>
> Attachments: SerDe.scala.diff
>
>
> Spark gives an error message and does not show the output when a field of the 
> result DataFrame contains CJK characters.
> I found out that the SerDe in the R API only supports ASCII strings right now, 
> as noted in a comment in the source code.
> So, I fixed SerDe.scala a little to support CJK, as in the attached file. 
> I did not care about efficiency; I just wanted to see if it works.
> {noformat}
> people.json
> {"name":"가나"}
> {"name":"테스트123", "age":30}
> {"name":"Justin", "age":19}
> df <- read.df(sqlContext, "./people.json", "json")
> head(df)
> Error in rawtochar(string) : embedded nul in string : '\0 \x98'
> {noformat}
> {code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala}
>   // NOTE: Only works for ASCII right now
>   def writeString(out: DataOutputStream, value: String): Unit = {
> val len = value.length
> out.writeInt(len + 1) // For the \0
> out.writeBytes(value)
> out.writeByte(0)
> {code}






[jira] [Created] (SPARK-10450) Minor SQL style, format, typo, readability fixes

2015-09-04 Thread Andrew Or (JIRA)
Andrew Or created SPARK-10450:
-

 Summary: Minor SQL style, format, typo, readability fixes
 Key: SPARK-10450
 URL: https://issues.apache.org/jira/browse/SPARK-10450
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Minor


This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's 
more of a continuous process.






[jira] [Created] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan

2015-09-04 Thread Yash Datta (JIRA)
Yash Datta created SPARK-10451:
--

 Summary: Prevent unnecessary serializations in 
InMemoryColumnarTableScan
 Key: SPARK-10451
 URL: https://issues.apache.org/jira/browse/SPARK-10451
 Project: Spark
  Issue Type: Improvement
Reporter: Yash Datta


In InMemoryColumnarTableScan, serialization of certain fields like buildFilter, 
InMemoryRelation, etc. can be avoided during task execution by carefully managing 
the closure of mapPartitions.
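A generic sketch of the pattern being described (the class and field names are hypothetical, not the actual InMemoryColumnarTableScan code): copy just the values a task needs into local vals so the enclosing object is not captured by the mapPartitions closure.

{code}
import org.apache.spark.rdd.RDD

class Scan(val relationName: String, val heavyState: Array[Byte]) extends Serializable {
  def run(rdd: RDD[Int]): RDD[String] = {
    // Capture only what the closure needs; referencing relationName directly inside
    // the closure would pull in `this`, shipping heavyState with every task.
    val name = relationName
    rdd.mapPartitions { iter =>
      iter.map(i => s"$name-$i")
    }
  }
}
{code}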






[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9

2015-09-04 Thread Justin Uang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731164#comment-14731164
 ] 

Justin Uang commented on SPARK-10447:
-

Agreed, I'm pretty sure that this will break some APIs and we'll have to fix 
those as we do the upgrade =).

> Upgrade pyspark to use py4j 0.9
> ---
>
> Key: SPARK-10447
> URL: https://issues.apache.org/jira/browse/SPARK-10447
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: Justin Uang
>
> This was recently released, and it has many improvements, especially the 
> following:
> {quote}
> Python side: IDEs and interactive interpreters such as IPython can now get 
> help text/autocompletion for Java classes, objects, and members. This makes 
> Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). 
> Thanks to @jonahkichwacoders
> {quote}
> Normally we wrap all the APIs in Spark, but for the ones that aren't wrapped, 
> this would make it easier to go off-road by using the Java proxy objects.






[jira] [Assigned] (SPARK-10450) Minor SQL style, format, typo, readability fixes

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10450:


Assignee: Apache Spark  (was: Andrew Or)

> Minor SQL style, format, typo, readability fixes
> 
>
> Key: SPARK-10450
> URL: https://issues.apache.org/jira/browse/SPARK-10450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Minor
>
> This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's 
> more of a continuous process.






[jira] [Created] (SPARK-10452) Pyspark worker security issue

2015-09-04 Thread Michael Procopio (JIRA)
Michael Procopio created SPARK-10452:


 Summary: Pyspark worker security issue
 Key: SPARK-10452
 URL: https://issues.apache.org/jira/browse/SPARK-10452
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: Spark 1.4.0 running on hadoop 2.5.2.
Reporter: Michael Procopio
Priority: Critical


The python worker launched by the executor is given the credentials used to 
launch yarn. 






[jira] [Created] (SPARK-10453) There's no way to use spark.dynamicAllocation.enabled with pyspark

2015-09-04 Thread Michael Procopio (JIRA)
Michael Procopio created SPARK-10453:


 Summary: There's no way to use spark.dynamicAllocation.enabled 
with pyspark
 Key: SPARK-10453
 URL: https://issues.apache.org/jira/browse/SPARK-10453
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: When using spark.dynamicAllocation.enabled, the 
assumption is that memory/core resources will be mediated by the yarn resource 
manager.  Unfortunately, whatever value is used for spark.executor.memory is 
consumed as JVM heap space by the executor.  There's no way to account for the 
memory requirements of the pyspark worker.  Executor JVM heap space should be 
decoupled from spark.executor.memory.
Reporter: Michael Procopio









[jira] [Resolved] (SPARK-10453) There's no way to use spark.dynamicAllocation.enabled with pyspark

2015-09-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10453.

Resolution: Not A Problem

From http://spark.apache.org/docs/latest/running-on-yarn.html:

{noformat}
spark.yarn.executor.memoryOverhead

executorMemory * 0.10, with minimum of 384

The amount of off heap memory (in megabytes) to be allocated per executor. This 
is memory that accounts for things like VM overheads, interned strings, other 
native overheads, etc. This tends to grow with the executor size (typically 
6-10%).
{noformat}

That also encompasses the python workers.
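For example (a sketch; the values are arbitrary), the overhead can be raised to leave headroom for the Python workers:

{code}
import org.apache.spark.SparkConf

// Reserve extra off-heap room per executor for pyspark workers and other native overhead.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.yarn.executor.memoryOverhead", "1024") // megabytes
{code}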


> There's no way to use spark.dynamicAllocation.enabled with pyspark
> --
>
> Key: SPARK-10453
> URL: https://issues.apache.org/jira/browse/SPARK-10453
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
> Environment: When using spark.dynamicAllocation.enabled, the 
> assumption is that memory/core resources will be mediated by the yarn 
> resource manager.  Unfortunately, whatever value is used for 
> spark.executor.memory is consumed as JVM heap space by the executor.  There's 
> no way to account for the memory requirements of the pyspark worker.  
> Executor JVM heap space should be decoupled from spark.executor.memory.
>Reporter: Michael Procopio
>







[jira] [Commented] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage

2015-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731380#comment-14731380
 ] 

Apache Spark commented on SPARK-10454:
--

User 'robbinspg' has created a pull request for this issue:
https://github.com/apache/spark/pull/8605

> Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause 
> multiple concurrent attempts for the same map stage
> -
>
> Key: SPARK-10454
> URL: https://issues.apache.org/jira/browse/SPARK-10454
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.5.1
>Reporter: Pete Robbins
>Priority: Minor
>
> The test case fails intermittently in Jenkins.
> For example, see the following builds:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/






[jira] [Assigned] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10454:


Assignee: (was: Apache Spark)

> Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause 
> multiple concurrent attempts for the same map stage
> -
>
> Key: SPARK-10454
> URL: https://issues.apache.org/jira/browse/SPARK-10454
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.5.1
>Reporter: Pete Robbins
>Priority: Minor
>
> The test case fails intermittently in Jenkins.
> For example, see the following builds:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/






[jira] [Assigned] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10454:


Assignee: Apache Spark

> Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause 
> multiple concurrent attempts for the same map stage
> -
>
> Key: SPARK-10454
> URL: https://issues.apache.org/jira/browse/SPARK-10454
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.5.1
>Reporter: Pete Robbins
>Assignee: Apache Spark
>Priority: Minor
>
> The test case fails intermittently in Jenkins.
> For example, see the following builds:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/






[jira] [Updated] (SPARK-10456) upgrade java 7 on amplab jenkins workers

2015-09-04 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-10456:

Description: 
our java 7 installation is really old (from last september).  update this to 
the latest java 7 jdk.

please assign this to me.

  was:our java 7 installation is really old (from last september).  update this 
to the latest java 7 jdk


> upgrade java 7 on amplab jenkins workers
> 
>
> Key: SPARK-10456
> URL: https://issues.apache.org/jira/browse/SPARK-10456
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: shane knapp
>  Labels: build
>
> our java 7 installation is really old (from last september).  update this to 
> the latest java 7 jdk.
> please assign this to me.






[jira] [Resolved] (SPARK-10452) Pyspark worker security issue

2015-09-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10452.

Resolution: Not A Problem

If you need your workers to run as your user, you need to configure YARN to use 
Kerberos.

> Pyspark worker security issue
> -
>
> Key: SPARK-10452
> URL: https://issues.apache.org/jira/browse/SPARK-10452
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
> Environment: Spark 1.4.0 running on hadoop 2.5.2.
>Reporter: Michael Procopio
>Priority: Critical
>
> The python worker launched by the executor is given the credentials used to 
> launch yarn. 






[jira] [Commented] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage

2015-09-04 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731377#comment-14731377
 ] 

Pete Robbins commented on SPARK-10454:
--

This is another case of not waiting for events to drain from the listenerBus.
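The usual remedy in these suites (a sketch; the exact call site and timeout are assumptions) is to drain the bus before asserting:

{code}
// Block until all posted events have been delivered to listeners,
// so assertions don't race the asynchronous listener bus.
sc.listenerBus.waitUntilEmpty(10000) // timeout in milliseconds
{code}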

> Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause 
> multiple concurrent attempts for the same map stage
> -
>
> Key: SPARK-10454
> URL: https://issues.apache.org/jira/browse/SPARK-10454
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.5.1
>Reporter: Pete Robbins
>Priority: Minor
>
> The test case fails intermittently in Jenkins.
> For example, see the following builds:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10455) install java 8 on amplab jenkins workers

2015-09-04 Thread shane knapp (JIRA)
shane knapp created SPARK-10455:
---

 Summary: install java 8 on amplab jenkins workers
 Key: SPARK-10455
 URL: https://issues.apache.org/jira/browse/SPARK-10455
 Project: Spark
  Issue Type: Task
  Components: Build
Reporter: shane knapp


install java 8 on all jenkins workers.

and just for clarification:  we want the 64-bit version, yes?

please assign this to me, thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10456) upgrade java 7 on amplab jenkins workers

2015-09-04 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731436#comment-14731436
 ] 

shane knapp commented on SPARK-10456:
-

looks like we'll be installing 7u79 (we're at 7u51 currently).

> upgrade java 7 on amplab jenkins workers
> 
>
> Key: SPARK-10456
> URL: https://issues.apache.org/jira/browse/SPARK-10456
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: shane knapp
>  Labels: build
>
> our java 7 installation is really old (from last september).  update this to 
> the latest java 7 jdk.
> please assign this to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl

2015-09-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731462#comment-14731462
 ] 

Joseph K. Bradley commented on SPARK-9963:
--

Sorry for the slow response! (I've been traveling.)  Option 2 sounds best.  It 
can resemble the current predictImpl, but can use the version of shouldGoLeft 
taking binned feature values.

> ML RandomForest cleanup: replace predictNodeIndex with predictImpl
> --
>
> Key: SPARK-9963
> URL: https://issues.apache.org/jira/browse/SPARK-9963
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl.
> This should be straightforward, but please ping me if anything is unclear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-09-04 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731415#comment-14731415
 ] 

Imran Rashid commented on SPARK-4105:
-

[~mvherweg] Do you know if the error occurred after there was already a stage 
retry?  If so, this might just be a symptom of SPARK-8029.  You would know if, 
earlier in the logs, you see a FetchFailedException which is *not* related 
to snappy exceptions.  I think this is the first report of this bug since 
SPARK-7660, which we were really hoping had fixed this issue, so it would be 
great to capture more information about it.

[~mmitsuto] Can you do the same check, and also tell us which version of Spark 
you are using?

> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Attachments: JavaObjectToSerialize.java, 
> SparkFailedToUncompressGenerator.scala
>
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> 

[jira] [Created] (SPARK-10456) upgrade java 7 on amplab jenkins workers

2015-09-04 Thread shane knapp (JIRA)
shane knapp created SPARK-10456:
---

 Summary: upgrade java 7 on amplab jenkins workers
 Key: SPARK-10456
 URL: https://issues.apache.org/jira/browse/SPARK-10456
 Project: Spark
  Issue Type: Task
  Components: Build
Reporter: shane knapp


our java 7 installation is really old (from last september).  update this to 
the latest java 7 jdk



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10456) upgrade java 7 on amplab jenkins workers

2015-09-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10456:
---
Assignee: shane knapp

> upgrade java 7 on amplab jenkins workers
> 
>
> Key: SPARK-10456
> URL: https://issues.apache.org/jira/browse/SPARK-10456
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: shane knapp
>Assignee: shane knapp
>  Labels: build
>
> our java 7 installation is really old (from last september).  update this to 
> the latest java 7 jdk.
> please assign this to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10455) install java 8 on amplab jenkins workers

2015-09-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10455:
---
Assignee: shane knapp

> install java 8 on amplab jenkins workers
> 
>
> Key: SPARK-10455
> URL: https://issues.apache.org/jira/browse/SPARK-10455
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: shane knapp
>Assignee: shane knapp
>
> install java 8 on all jenkins workers.
> and just for clarification:  we want the 64-bit version, yes?
> please assign this to me, thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10455) install java 8 on amplab jenkins workers

2015-09-04 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731440#comment-14731440
 ] 

Josh Rosen commented on SPARK-10455:


Yep, I think we want the 64-bit version.

> install java 8 on amplab jenkins workers
> 
>
> Key: SPARK-10455
> URL: https://issues.apache.org/jira/browse/SPARK-10455
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: shane knapp
>Assignee: shane knapp
>
> install java 8 on all jenkins workers.
> and just for clarification:  we want the 64-bit version, yes?
> please assign this to me, thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage

2015-09-04 Thread Pete Robbins (JIRA)
Pete Robbins created SPARK-10454:


 Summary: Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch 
failures don't cause multiple concurrent attempts for the same map stage
 Key: SPARK-10454
 URL: https://issues.apache.org/jira/browse/SPARK-10454
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 1.5.1
Reporter: Pete Robbins
Priority: Minor


Test case fails intermittently in Jenkins.

For example, see the following builds:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values

2015-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731392#comment-14731392
 ] 

Apache Spark commented on SPARK-10439:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8606

> Catalyst should check for overflow / underflow of date and timestamp values
> ---
>
> Key: SPARK-10439
> URL: https://issues.apache.org/jira/browse/SPARK-10439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> While testing some code, I noticed that a few methods in {{DateTimeUtils}} 
> are prone to overflow and underflow.
> For example, {{millisToDays}} can overflow the return type ({{Int}}) if a 
> large enough input value is provided.
> Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which 
> can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the 
> negative case).
> There might be others but these were the ones that caught my eye.
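
A minimal sketch in plain Scala (not Spark's actual DateTimeUtils code) of the two 
failure modes described above:

{code}
// millisToDays-style narrowing: the day count fits in a Long but wraps when
// converted to Int for a large enough input.
val millisPerDay = 1000L * 3600 * 24
val hugeMillis   = Long.MaxValue / 2
val daysAsLong   = hugeMillis / millisPerDay
val daysAsInt    = daysAsLong.toInt          // silently wraps around
println(s"$daysAsLong vs $daysAsInt")

// fromJavaTimestamp-style scaling: milliseconds to microseconds overflows once
// the input exceeds Long.MaxValue / 1000.
val millis = Long.MaxValue / 1000 + 1
val micros = millis * 1000L                  // wraps to a negative value
println(micros < 0)                          // prints true
{code}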



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10439:


Assignee: (was: Apache Spark)

> Catalyst should check for overflow / underflow of date and timestamp values
> ---
>
> Key: SPARK-10439
> URL: https://issues.apache.org/jira/browse/SPARK-10439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> While testing some code, I noticed that a few methods in {{DateTimeUtils}} 
> are prone to overflow and underflow.
> For example, {{millisToDays}} can overflow the return type ({{Int}}) if a 
> large enough input value is provided.
> Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which 
> can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the 
> negative case).
> There might be others but these were the ones that caught my eye.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10439:


Assignee: Apache Spark

> Catalyst should check for overflow / underflow of date and timestamp values
> ---
>
> Key: SPARK-10439
> URL: https://issues.apache.org/jira/browse/SPARK-10439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> While testing some code, I noticed that a few methods in {{DateTimeUtils}} 
> are prone to overflow and underflow.
> For example, {{millisToDays}} can overflow the return type ({{Int}}) if a 
> large enough input value is provided.
> Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which 
> can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the 
> negative case).
> There might be others but these were the ones that caught my eye.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10433) Gradient boosted trees

2015-09-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731398#comment-14731398
 ] 

Joseph K. Bradley commented on SPARK-10433:
---

Has this been reported on 1.5?  I've seen reports for 1.4, but was told by 
[~dbtsai] that 1.5 seems to have fixed this issue.  I believe that the caching 
(and optional checkpointing) added in 1.5 fixes this issue, but it would be great 
to get confirmation.
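
A hedged sketch of the user-side knobs related to that caching and checkpointing, 
reusing sc, instances and boostingStrategy from the reproduction quoted below; the 
checkpoint directory is a placeholder, and the checkpointInterval setter is assumed 
to exist on Strategy alongside the other setters shown there:

{code}
// Assumptions: setCheckpointInterval follows the same bean-style setters as
// setUseNodeIdCache etc.; /tmp/gbt-checkpoints is a placeholder path.
sc.setCheckpointDir("/tmp/gbt-checkpoints")
instances.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK)

boostingStrategy.treeStrategy.setUseNodeIdCache(true)
boostingStrategy.treeStrategy.setCheckpointInterval(10)  // checkpoint every 10 iterations

val model = GradientBoostedTrees.train(instances, boostingStrategy)
{code}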

> Gradient boosted trees
> --
>
> Key: SPARK-10433
> URL: https://issues.apache.org/jira/browse/SPARK-10433
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Sean Owen
>
> (Sorry to say I don't have any leads on a fix, but this was reported by three 
> different people and I confirmed it at fairly close range, so I think it's 
> legitimate:)
> This is probably best explained in the words from the mailing list thread at 
> http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E
>  . Matt Forbes says:
> {quote}
> I am training a boosted trees model on a couple million input samples (with 
> around 300 features) and am noticing that the input size of each stage is 
> increasing each iteration. For each new tree, the first step seems to be 
> building the decision tree metadata, which does a .count() on the input data, 
> so this is the step I've been using to track the input size changing. Here is 
> what I'm seeing: 
> {quote}
> {code}
> count at DecisionTreeMetadata.scala:111 
> 1. Input Size / Records: 726.1 MB / 1295620 
> 2. Input Size / Records: 106.9 GB / 64780816 
> 3. Input Size / Records: 160.3 GB / 97171224 
> 4. Input Size / Records: 214.8 GB / 129680959 
> 5. Input Size / Records: 268.5 GB / 162533424 
>  
> Input Size / Records: 1912.6 GB / 1382017686 
>  
> {code}
> {quote}
> This step goes from taking less than 10s up to 5 minutes by the 15th or so 
> iteration. I'm not quite sure what could be causing this. I am passing a 
> memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train 
> {quote}
> Johannes Bauer showed me a very similar problem.
> Peter Rudenko offers this sketch of a reproduction:
> {code}
> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
> boostingStrategy.setNumIterations(30)
> boostingStrategy.setLearningRate(1.0)
> boostingStrategy.treeStrategy.setMaxDepth(3)
> boostingStrategy.treeStrategy.setMaxBins(128)
> boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
> boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
> boostingStrategy.treeStrategy.setUseNodeIdCache(true)
> boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
>   
> mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer,
>  java.lang.Integer]])
> val model = GradientBoostedTrees.train(instances, boostingStrategy)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10455) install java 8 on amplab jenkins workers

2015-09-04 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731428#comment-14731428
 ] 

shane knapp commented on SPARK-10455:
-

looks like i'll be installing java 8u60.

> install java 8 on amplab jenkins workers
> 
>
> Key: SPARK-10455
> URL: https://issues.apache.org/jira/browse/SPARK-10455
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: shane knapp
>
> install java 8 on all jenkins workers.
> and just for clarification:  we want the 64-bit version, yes?
> please assign this to me, thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl

2015-09-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731468#comment-14731468
 ] 

Joseph K. Bradley commented on SPARK-9963:
--

Yep, that first case in the if-else is for the right-most bin with range 
[maxSplitValue, +inf]

> ML RandomForest cleanup: replace predictNodeIndex with predictImpl
> --
>
> Key: SPARK-9963
> URL: https://issues.apache.org/jira/browse/SPARK-9963
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl.
> This should be straightforward, but please ping me if anything is unclear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10414) DenseMatrix gives different hashcode even though equals returns true

2015-09-04 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731757#comment-14731757
 ] 

Vinod KC commented on SPARK-10414:
--

Thanks, got the JIRA id:
https://issues.apache.org/jira/browse/SPARK-9919

> DenseMatrix gives different hashcode even though equals returns true
> 
>
> Key: SPARK-10414
> URL: https://issues.apache.org/jira/browse/SPARK-10414
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Vinod KC
>Priority: Minor
>
> hashcode implementation in DenseMatrix gives different result for same input
> val dm = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0))
> val dm1 = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0))
> assert(dm1 === dm) // passed
> assert(dm1.hashCode === dm.hashCode) // Failed
> This violates the hashCode/equals contract.
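
For illustration only, a small self-contained Scala sketch of a hashCode derived from 
exactly the fields equals inspects, so equal matrices always hash alike; 
SimpleDenseMatrix is a hypothetical stand-in, not MLlib's DenseMatrix (the actual fix 
is tracked under SPARK-9919 as noted above):

{code}
import java.util.Arrays

// Hypothetical stand-in type used only to illustrate the equals/hashCode contract.
final class SimpleDenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) {
  override def equals(other: Any): Boolean = other match {
    case m: SimpleDenseMatrix =>
      numRows == m.numRows && numCols == m.numCols && Arrays.equals(values, m.values)
    case _ => false
  }
  // Hash exactly the fields equals compares, so a == b implies a.hashCode == b.hashCode.
  override def hashCode(): Int =
    31 * (31 * numRows + numCols) + Arrays.hashCode(values)
}

val dm  = new SimpleDenseMatrix(2, 2, Array(0.0, 1.0, 2.0, 3.0))
val dm1 = new SimpleDenseMatrix(2, 2, Array(0.0, 1.0, 2.0, 3.0))
assert(dm1 == dm && dm1.hashCode == dm.hashCode)  // both hold now
{code}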



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields

2015-09-04 Thread George Dittmar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731810#comment-14731810
 ] 

George Dittmar commented on SPARK-9961:
---

Can you expand on what you mean by Evaluator? Just looking for something to 
eval how good predictions are? 

> ML prediction abstractions should have defaultEvaluator fields
> --
>
> Key: SPARK-9961
> URL: https://issues.apache.org/jira/browse/SPARK-9961
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Predictor and PredictionModel should have abstract defaultEvaluator methods 
> which return Evaluators.  Subclasses like Regressor, Classifier, etc. should 
> all provide natural evaluators, set to use the correct input columns and 
> metrics.  Concrete classes may later be modified to 
> The initial implementation should be marked as DeveloperApi since we may need 
> to change the defaults later on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields

2015-09-04 Thread George Dittmar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731810#comment-14731810
 ] 

George Dittmar edited comment on SPARK-9961 at 9/5/15 5:23 AM:
---

Can you expand on what you mean by Evaluator? Just looking for something to 
eval how good predictions are?


was (Author: georgedittmar):
Can you expand on what you mean by Evaluator? Just looking for something to 
eval how good predictions are? 

> ML prediction abstractions should have defaultEvaluator fields
> --
>
> Key: SPARK-9961
> URL: https://issues.apache.org/jira/browse/SPARK-9961
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Predictor and PredictionModel should have abstract defaultEvaluator methods 
> which return Evaluators.  Subclasses like Regressor, Classifier, etc. should 
> all provide natural evaluators, set to use the correct input columns and 
> metrics.  Concrete classes may later be modified to 
> The initial implementation should be marked as DeveloperApi since we may need 
> to change the defaults later on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10459) PythonUDF could process UnsafeRow

2015-09-04 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10459:
--

 Summary: PythonUDF could process UnsafeRow
 Key: SPARK-10459
 URL: https://issues.apache.org/jira/browse/SPARK-10459
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


Currently, a ConvertToSafe is inserted for PythonUDF, but it is not actually needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-8632:
-

Assignee: Davies Liu

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and i wanted to use a python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8951) support CJK characters in collect()

2015-09-04 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731140#comment-14731140
 ] 

Jihong MA commented on SPARK-8951:
--

This commit causes an R style check failure. 


Running R style checks

Loading required package: methods

Attaching package: 'SparkR'

The following objects are masked from 'package:stats':

filter, na.omit

The following objects are masked from 'package:base':

intersect, rbind, sample, subset, summary, table, transform


Attaching package: 'testthat'

The following object is masked from 'package:SparkR':

describe

R/deserialize.R:63:9: style: Trailing whitespace is superfluous.
  string 
^
lintr checks failed.
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; 
received return code 1
Archiving unit tests logs...
> No log files found.
Attempting to post to Github...
 > Post successful.
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
ERROR: Publisher 'Publish JUnit test result report' failed: No test report 
files were found. Configuration error?
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

> support CJK characters in collect()
> ---
>
> Key: SPARK-8951
> URL: https://issues.apache.org/jira/browse/SPARK-8951
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Jaehong Choi
>Assignee: Jaehong Choi
>Priority: Minor
> Fix For: 1.6.0
>
> Attachments: SerDe.scala.diff
>
>
> Spark gives an error message and does not show the output when a field of the 
> result DataFrame contains characters in CJK.
> I found out that SerDe in R API only supports ASCII format for strings right 
> now as commented in source code.  
> So, I fixed SerDe.scala a little to support CJK, as in the attached file. 
> I did not care about efficiency; I just wanted to see if it works.
> {noformat}
> people.json
> {"name":"가나"}
> {"name":"테스트123", "age":30}
> {"name":"Justin", "age":19}
> df <- read.df(sqlContext, "./people.json", "json")
> head(df)
> Error in rawtochar(string) : embedded nul in string : '\0 \x98'
> {noformat}
> {code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala}
>   // NOTE: Only works for ASCII right now
>   def writeString(out: DataOutputStream, value: String): Unit = {
> val len = value.length
> out.writeInt(len + 1) // For the \0
> out.writeBytes(value)
> out.writeByte(0)
> {code}
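
A minimal sketch of a UTF-8 aware variant of the method quoted above, illustrative 
only since the attached SerDe.scala.diff carries the actual proposed change; the 
length prefix is computed from the encoded bytes rather than the UTF-16 character 
count, so multi-byte CJK strings round-trip:

{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}
import java.nio.charset.StandardCharsets

def writeStringUtf8(out: DataOutputStream, value: String): Unit = {
  val bytes = value.getBytes(StandardCharsets.UTF_8)
  out.writeInt(bytes.length + 1)   // +1 for the trailing \0 the R side expects
  out.write(bytes)
  out.writeByte(0)
}

// Quick check: two Korean syllables encode to 6 UTF-8 bytes, so the stream
// holds 4 (length int) + 6 (payload) + 1 (NUL) = 11 bytes.
val buf = new ByteArrayOutputStream()
writeStringUtf8(new DataOutputStream(buf), "가나")
println(buf.size())  // 11
{code}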



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9925) Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests

2015-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731170#comment-14731170
 ] 

Apache Spark commented on SPARK-9925:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8602

> Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests
> --
>
> Key: SPARK-9925
> URL: https://issues.apache.org/jira/browse/SPARK-9925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Right now, in our TestSQLContext/TestHiveContext, we use {{override def 
> numShufflePartitions: Int = this.getConf(SQLConf.SHUFFLE_PARTITIONS, 5)}} to 
> set {{SHUFFLE_PARTITIONS}}. However, we never put it to SQLConf. So, after we 
> use {{withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "number")}}, the number 
> of shuffle partitions will be set back to 200.
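
A minimal Scala sketch of the save/set/restore pattern the description refers to, 
assuming a SQLContext in scope; Spark's real helper lives in the SQL test utilities, 
so this is only an illustration of the idea:

{code}
import org.apache.spark.sql.SQLContext

def withSQLConf[T](sqlContext: SQLContext)(pairs: (String, String)*)(body: => T): T = {
  // Remember the current values so they can be restored after the block runs.
  val previous = pairs.map { case (k, _) => k -> scala.util.Try(sqlContext.getConf(k)).toOption }
  pairs.foreach { case (k, v) => sqlContext.setConf(k, v) }
  try body finally {
    previous.foreach {
      case (k, Some(v)) => sqlContext.setConf(k, v)
      case (_, None)    => // no prior value recorded; left as set in this sketch
    }
  }
}
{code}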



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10452) Pyspark worker security issue

2015-09-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731332#comment-14731332
 ] 

Marcelo Vanzin edited comment on SPARK-10452 at 9/4/15 9:43 PM:


If you need your workers to run as your user, you need to configure YARN to use 
Kerberos.


was (Author: vanzin):
If you need your workers to run as you user, you need to configure YARN to use 
Kerberos.

> Pyspark worker security issue
> -
>
> Key: SPARK-10452
> URL: https://issues.apache.org/jira/browse/SPARK-10452
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
> Environment: Spark 1.4.0 running on hadoop 2.5.2.
>Reporter: Michael Procopio
>Priority: Critical
>
> The python worker launched by the executor is given the credentials used to 
> launch yarn. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10455) install java 8 on amplab jenkins workers

2015-09-04 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp resolved SPARK-10455.
-
Resolution: Done

> install java 8 on amplab jenkins workers
> 
>
> Key: SPARK-10455
> URL: https://issues.apache.org/jira/browse/SPARK-10455
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: shane knapp
>Assignee: shane knapp
>
> install java 8 on all jenkins workers.
> and just for clarification:  we want the 64-bit version, yes?
> please assign this to me, thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10455) install java 8 on amplab jenkins workers

2015-09-04 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp closed SPARK-10455.
---

FIN!

> install java 8 on amplab jenkins workers
> 
>
> Key: SPARK-10455
> URL: https://issues.apache.org/jira/browse/SPARK-10455
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: shane knapp
>Assignee: shane knapp
>
> install java 8 on all jenkins workers.
> and just for clarification:  we want the 64-bit version, yes?
> please assign this to me, thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10457) Unable to connect to MySQL with the DataFrame API

2015-09-04 Thread Mariano Simone (JIRA)
Mariano Simone created SPARK-10457:
--

 Summary: Unable to connect to MySQL with the DataFrame API
 Key: SPARK-10457
 URL: https://issues.apache.org/jira/browse/SPARK-10457
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
 Environment: Linux singularity 3.13.0-63-generic #103-Ubuntu SMP Fri 
Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)

 "org.apache.spark" %% "spark-core"% "1.4.1" % "provided",
  "org.apache.spark" %  "spark-sql_2.10"% "1.4.1" % "provided",
  "org.apache.spark" %  "spark-streaming_2.10"  % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1",
  "mysql"%  "mysql-connector-java"  % "5.1.36"

Reporter: Mariano Simone


I'm getting this error every time I try to create a DataFrame using JDBC:

java.sql.SQLException: No suitable driver found for 
jdbc:mysql://localhost:3306/test

What I have so far:

Standard sbt project.

Added the dep. on mysql-connector to build.sbt like this:
"mysql"%  "mysql-connector-java"  % "5.1.36"

The code that creates the df:
val url   = "jdbc:mysql://localhost:3306/test"
val table = "test_table"

val properties = new Properties
properties.put("user", "123")
properties.put("password", "123")
properties.put("driver", "com.mysql.jdbc.Driver")

val tiers  = sqlContext.read.jdbc(url, table, properties)

I also loaded the jar like this:
streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar")

This is the back trace of the exception being thrown:

15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job 
144140266 ms.0
java.sql.SQLException: No suitable driver found for 
jdbc:mysql://localhost:3306/test
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130)
at com.playtika.etl.Application$.processRDD(Application.scala:69)
at 
com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:52)
at 
com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:51)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10457) Unable to connect to MySQL with the DataFrame API

2015-09-04 Thread Mariano Simone (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mariano Simone updated SPARK-10457:
---
Description: 
I'm getting this error every time I try to create a DataFrame using JDBC:

java.sql.SQLException: No suitable driver found for 
jdbc:mysql://localhost:3306/test

What I have so far:

Standard sbt project.

Added the dep. on mysql-connector to build.sbt like this:
"mysql"%  "mysql-connector-java"  % "5.1.36"

The code that creates the df:
val url   = "jdbc:mysql://localhost:3306/test"
val table = "test_table"

val properties = new Properties
properties.put("user", "123")
properties.put("password", "123")
properties.put("driver", "com.mysql.jdbc.Driver")

val tiers  = sqlContext.read.jdbc(url, table, properties)

I also loaded the jar like this:
streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar")

This is the back trace of the exception being thrown:

15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job 
144140266 ms.0
java.sql.SQLException: No suitable driver found for 
jdbc:mysql://localhost:3306/test
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130)
at com.playtika.etl.Application$.processRDD(Application.scala:69)
at 
com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:52)
at 
com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:51)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Let me know if more data is needed.


  was:
I'm getting this error everytime I try to create a dataframe using jdbc:

java.sql.SQLException: No suitable driver found for 
jdbc:mysql://localhost:3306/test

What I have so far:

standart sbt project.

Added the dep. on mysql-connector to build.sbt like this:
"mysql"%  "mysql-connector-java"  % "5.1.36"

The code that creates the df:
val url   = "jdbc:mysql://localhost:3306/test"
val table = "test_table"

val properties = new Properties
properties.put("user", "123")
properties.put("password", "123")
properties.put("driver", "com.mysql.jdbc.Driver")

val tiers  = sqlContext.read.jdbc(url, table, properties)

I also loaded the jar like this:
streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar")

This is the back trace of the exception being thrown:

15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job 
144140266 ms.0
java.sql.SQLException: No suitable driver found for 
jdbc:mysql://localhost:3306/test
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128)
at 

[jira] [Closed] (SPARK-10457) Unable to connect to MySQL with the DataFrame API

2015-09-04 Thread Mariano Simone (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mariano Simone closed SPARK-10457.
--
Resolution: Fixed

Found the solution.

spark.executor.extraClassPath needed to be configured.
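
A hedged sketch of that configuration with placeholder paths; the JDBC driver jar has 
to be on the executor (and usually the driver) classpath when the JVM starts, which is 
why sparkContext.addJar alone was not enough for java.sql.DriverManager here:

{code}
import org.apache.spark.SparkConf

// Placeholder path; point it at the real location of the connector jar on each node.
val jar = "/path/to/mysql-connector-java-5.1.36.jar"

val conf = new SparkConf()
  .setAppName("jdbc-example")
  .set("spark.executor.extraClassPath", jar)
  .set("spark.driver.extraClassPath", jar)  // or pass --driver-class-path / --jars to spark-submit
{code}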

> Unable to connect to MySQL with the DataFrame API
> -
>
> Key: SPARK-10457
> URL: https://issues.apache.org/jira/browse/SPARK-10457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Linux singularity 3.13.0-63-generic #103-Ubuntu SMP Fri 
> Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
>  "org.apache.spark" %% "spark-core"% "1.4.1" % "provided",
>   "org.apache.spark" %  "spark-sql_2.10"% "1.4.1" % "provided",
>   "org.apache.spark" %  "spark-streaming_2.10"  % "1.4.1" % "provided",
>   "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1",
>   "mysql"%  "mysql-connector-java"  % "5.1.36"
>Reporter: Mariano Simone
>
> I'm getting this error every time I try to create a DataFrame using JDBC:
> java.sql.SQLException: No suitable driver found for 
> jdbc:mysql://localhost:3306/test
> What I have so far:
> Standard sbt project.
> Added the dep. on mysql-connector to build.sbt like this:
> "mysql"%  "mysql-connector-java"  % "5.1.36"
> The code that creates the df:
> val url   = "jdbc:mysql://localhost:3306/test"
> val table = "test_table"
> val properties = new Properties
> properties.put("user", "123")
> properties.put("password", "123")
> properties.put("driver", "com.mysql.jdbc.Driver")
> val tiers  = sqlContext.read.jdbc(url, table, properties)
> I also loaded the jar like this:
> streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar")
> This is the back trace of the exception being thrown:
> 15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job 
> 144140266 ms.0
> java.sql.SQLException: No suitable driver found for 
> jdbc:mysql://localhost:3306/test
>   at java.sql.DriverManager.getConnection(DriverManager.java:689)
>   at java.sql.DriverManager.getConnection(DriverManager.java:208)
>   at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118)
>   at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128)
>   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200)
>   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130)
>   at com.playtika.etl.Application$.processRDD(Application.scala:69)
>   at 
> com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:52)
>   at 
> com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:51)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>   at scala.util.Try$.apply(Try.scala:161)
>   at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Let me know if more data is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10311) In cluster mode, AppId and AttemptId should be update when ApplicationMaster is new

2015-09-04 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10311:
--
Affects Version/s: 1.5.0
   1.4.1

> In cluster mode, AppId and AttemptId should be update when ApplicationMaster 
> is new
> ---
>
> Key: SPARK-10311
> URL: https://issues.apache.org/jira/browse/SPARK-10311
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0
>Reporter: meiyoula
>
> When I start a streaming app with checkpoint data in yarn-cluster mode, the 
> appId and attemptId are stale (they come from the app that first created the 
> checkpoint data), and the event log is written to the old file name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10311) In cluster mode, AppId and AttemptId should be update when ApplicationMaster is new

2015-09-04 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10311:
--
Target Version/s: 1.6.0, 1.5.1

> In cluster mode, AppId and AttemptId should be update when ApplicationMaster 
> is new
> ---
>
> Key: SPARK-10311
> URL: https://issues.apache.org/jira/browse/SPARK-10311
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0
>Reporter: meiyoula
>
> When I start a streaming app with checkpoint data in yarn-cluster mode, the 
> appId and attemptId are stale (they come from the app that first created the 
> checkpoint data), and the event log is written to the old file name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10433) Gradient boosted trees

2015-09-04 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731533#comment-14731533
 ] 

DB Tsai commented on SPARK-10433:
-

[~sowen] I can confirm that this should be fixed in 1.5

> Gradient boosted trees
> --
>
> Key: SPARK-10433
> URL: https://issues.apache.org/jira/browse/SPARK-10433
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Sean Owen
>
> (Sorry to say I don't have any leads on a fix, but this was reported by three 
> different people and I confirmed it at fairly close range, so I think it's 
> legitimate:)
> This is probably best explained in the words from the mailing list thread at 
> http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E
>  . Matt Forbes says:
> {quote}
> I am training a boosted trees model on a couple million input samples (with 
> around 300 features) and am noticing that the input size of each stage is 
> increasing each iteration. For each new tree, the first step seems to be 
> building the decision tree metadata, which does a .count() on the input data, 
> so this is the step I've been using to track the input size changing. Here is 
> what I'm seeing: 
> {quote}
> {code}
> count at DecisionTreeMetadata.scala:111 
> 1. Input Size / Records: 726.1 MB / 1295620 
> 2. Input Size / Records: 106.9 GB / 64780816 
> 3. Input Size / Records: 160.3 GB / 97171224 
> 4. Input Size / Records: 214.8 GB / 129680959 
> 5. Input Size / Records: 268.5 GB / 162533424 
>  
> Input Size / Records: 1912.6 GB / 1382017686 
>  
> {code}
> {quote}
> This step goes from taking less than 10s up to 5 minutes by the 15th or so 
> iteration. I'm not quite sure what could be causing this. I am passing a 
> memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train 
> {quote}
> Johannes Bauer showed me a very similar problem.
> Peter Rudenko offers this sketch of a reproduction:
> {code}
> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
> boostingStrategy.setNumIterations(30)
> boostingStrategy.setLearningRate(1.0)
> boostingStrategy.treeStrategy.setMaxDepth(3)
> boostingStrategy.treeStrategy.setMaxBins(128)
> boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
> boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
> boostingStrategy.treeStrategy.setUseNodeIdCache(true)
> boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
>   
> mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer,
>  java.lang.Integer]])
> val model = GradientBoostedTrees.train(instances, boostingStrategy)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10420) Implementing Reactive Streams based Spark Streaming Receiver

2015-09-04 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10420:
--
Target Version/s: 1.6.0  (was: )

> Implementing Reactive Streams based Spark Streaming Receiver
> 
>
> Key: SPARK-10420
> URL: https://issues.apache.org/jira/browse/SPARK-10420
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Nilanjan Raychaudhuri
>Priority: Minor
>
> Hello TD,
> This is probably the last bit of the back-pressure story, implementing 
> ReactiveStreams based Spark streaming receivers. After discussing about this 
> with my Typesafe team we came up with the following design document
> https://docs.google.com/document/d/1lGQKXfNznd5SPuQigvCdLsudl-gcvWKuHWr0Bpn3y30/edit?usp=sharing
> Could you please take a look at this when you get a chance?
> Thanks
> Nilanjan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9

2015-09-04 Thread Justin Uang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731592#comment-14731592
 ] 

Justin Uang commented on SPARK-10447:
-

Sure, I wouldn't mind doing the code review. Can you add me?



> Upgrade pyspark to use py4j 0.9
> ---
>
> Key: SPARK-10447
> URL: https://issues.apache.org/jira/browse/SPARK-10447
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: Justin Uang
>
> This was recently released, and it has many improvements, especially the 
> following:
> {quote}
> Python side: IDEs and interactive interpreters such as IPython can now get 
> help text/autocompletion for Java classes, objects, and members. This makes 
> Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). 
> Thanks to @jonahkichwacoders
> {quote}
> Normally we wrap all the APIs in spark, but for the ones that aren't, this 
> would make it easier to offroad by using the java proxy objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9

2015-09-04 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731591#comment-14731591
 ] 

holdenk commented on SPARK-10447:
-

I can give this a shot if no one else is interested in doing this (I've been 
wrangling some py4j bits with Sparkling Pandas).

> Upgrade pyspark to use py4j 0.9
> ---
>
> Key: SPARK-10447
> URL: https://issues.apache.org/jira/browse/SPARK-10447
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: Justin Uang
>
> This was recently released, and it has many improvements, especially the 
> following:
> {quote}
> Python side: IDEs and interactive interpreters such as IPython can now get 
> help text/autocompletion for Java classes, objects, and members. This makes 
> Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). 
> Thanks to @jonahkichwacoders
> {quote}
> Normally we wrap all the APIs in spark, but for the ones that aren't, this 
> would make it easier to offroad by using the java proxy objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9

2015-09-04 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731597#comment-14731597
 ] 

holdenk commented on SPARK-10447:
-

Sure, I'll ping you when I've got the PR ready (probably sometime this long 
weekend) if that's good for you?

> Upgrade pyspark to use py4j 0.9
> ---
>
> Key: SPARK-10447
> URL: https://issues.apache.org/jira/browse/SPARK-10447
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: Justin Uang
>
> This was recently released, and it has many improvements, especially the 
> following:
> {quote}
> Python side: IDEs and interactive interpreters such as IPython can now get 
> help text/autocompletion for Java classes, objects, and members. This makes 
> Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). 
> Thanks to @jonahkichwacoders
> {quote}
> Normally we wrap all the APIs in Spark, but for the ones that aren't wrapped,
> this would make it easier to go off-road by using the Java proxy objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9

2015-09-04 Thread Justin Uang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731598#comment-14731598
 ] 

Justin Uang commented on SPARK-10447:
-

Sounds good.



> Upgrade pyspark to use py4j 0.9
> ---
>
> Key: SPARK-10447
> URL: https://issues.apache.org/jira/browse/SPARK-10447
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: Justin Uang
>
> This was recently released, and it has many improvements, especially the 
> following:
> {quote}
> Python side: IDEs and interactive interpreters such as IPython can now get 
> help text/autocompletion for Java classes, objects, and members. This makes 
> Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). 
> Thanks to @jonahkichwacoders
> {quote}
> Normally we wrap all the APIs in Spark, but for the ones that aren't wrapped,
> this would make it easier to go off-road by using the Java proxy objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"

2015-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731618#comment-14731618
 ] 

Apache Spark commented on SPARK-10397:
--

User 'alexrovner' has created a pull request for this issue:
https://github.com/apache/spark/pull/8608

> Make Python's SparkContext self-descriptive on "print sc"
> -
>
> Key: SPARK-10397
> URL: https://issues.apache.org/jira/browse/SPARK-10397
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Sergey Tryuber
>Priority: Trivial
>
> When I execute in Python shell:
> {code}
> print sc
> {code}
> I receive something like:
> {noformat}
> 
> {noformat}
> But this is very inconvenient, especially if a user wants to create a 
> good-looking and self-descriptive IPython Notebook. He would like to see some 
> information about his Spark cluster.
> In contrast, H2O context does have this feature and it is very helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10397:


Assignee: (was: Apache Spark)

> Make Python's SparkContext self-descriptive on "print sc"
> -
>
> Key: SPARK-10397
> URL: https://issues.apache.org/jira/browse/SPARK-10397
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Sergey Tryuber
>Priority: Trivial
>
> When I execute in Python shell:
> {code}
> print sc
> {code}
> I receive something like:
> {noformat}
> 
> {noformat}
> But this is very inconvenient, especially if a user wants to create a 
> good-looking and self-descriptive IPython Notebook. He would like to see some 
> information about his Spark cluster.
> In contrast, H2O context does have this feature and it is very helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"

2015-09-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10397:


Assignee: Apache Spark

> Make Python's SparkContext self-descriptive on "print sc"
> -
>
> Key: SPARK-10397
> URL: https://issues.apache.org/jira/browse/SPARK-10397
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Sergey Tryuber
>Assignee: Apache Spark
>Priority: Trivial
>
> When I execute in Python shell:
> {code}
> print sc
> {code}
> I receive something like:
> {noformat}
> 
> {noformat}
> But this is very inconvenient, especially if a user wants to create a 
> good-looking and self-descriptive IPython Notebook. He would like to see some 
> information about his Spark cluster.
> In contrast, H2O context does have this feature and it is very helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"

2015-09-04 Thread Alex Rovner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731619#comment-14731619
 ] 

Alex Rovner commented on SPARK-10397:
-

Pull: https://github.com/apache/spark/pull/8608

{noformat}
>>> sc
{'_accumulatorServer': ,
 '_batchSize': 0,
 '_callsite': CallSite(function='', 
file='/Users/alex.rovner/git/spark/python/pyspark/shell.py', linenum=43),
 '_conf': {'_jconf': JavaObject id=o0},
 '_javaAccumulator': JavaObject id=o11,
 '_jsc': JavaObject id=o8,
 '_pickled_broadcast_vars': set([]),
 '_python_includes': [],
 '_temp_dir': 
u'/private/var/folders/hj/v4zb0_f159q8mt4w3j8m2_mrgp/T/spark-a9cc47a9-db90-49a3-a82e-263f0b56268c/pyspark-773c7490-2b2d-4418-a030-256a5b9c1fe1',
 '_unbatched_serializer': PickleSerializer(),
 'appName': u'PySparkShell',
 'environment': {},
 'master': u'local[*]',
 'profiler_collector': None,
 'pythonExec': 'python2.7',
 'pythonVer': '2.7',
 'serializer': AutoBatchedSerializer(PickleSerializer()),
 'sparkHome': None}
>>> print sc
{'_accumulatorServer': ,
 '_batchSize': 0,
 '_callsite': CallSite(function='', 
file='/Users/alex.rovner/git/spark/python/pyspark/shell.py', linenum=43),
 '_conf': {'_jconf': JavaObject id=o0},
 '_javaAccumulator': JavaObject id=o11,
 '_jsc': JavaObject id=o8,
 '_pickled_broadcast_vars': set([]),
 '_python_includes': [],
 '_temp_dir': 
u'/private/var/folders/hj/v4zb0_f159q8mt4w3j8m2_mrgp/T/spark-a9cc47a9-db90-49a3-a82e-263f0b56268c/pyspark-773c7490-2b2d-4418-a030-256a5b9c1fe1',
 '_unbatched_serializer': PickleSerializer(),
 'appName': u'PySparkShell',
 'environment': {},
 'master': u'local[*]',
 'profiler_collector': None,
 'pythonExec': 'python2.7',
 'pythonVer': '2.7',
 'serializer': AutoBatchedSerializer(PickleSerializer()),
 'sparkHome': None}
>>> 

{noformat}
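
The dict dump above is one option; a lighter-weight alternative would be a
{{__repr__}} that surfaces the key fields. The sketch below is only an
illustration of that idea (not the patch in the PR); the attributes it reads
(master, appName, pythonVer, sparkHome) are existing SparkContext fields, but
the formatting is invented:
{code}
# Hypothetical sketch of a self-descriptive repr for pyspark's SparkContext.
def _sc_repr(self):
    return "<SparkContext master=%s appName=%s pythonVer=%s sparkHome=%s>" % (
        self.master, self.appName, self.pythonVer, self.sparkHome)

# To preview the behaviour in a shell, one could monkey-patch it:
#   from pyspark import SparkContext
#   SparkContext.__repr__ = _sc_repr
#   print sc   # -> <SparkContext master=local[*] appName=PySparkShell ...>
{code}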

> Make Python's SparkContext self-descriptive on "print sc"
> -
>
> Key: SPARK-10397
> URL: https://issues.apache.org/jira/browse/SPARK-10397
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Sergey Tryuber
>Priority: Trivial
>
> When I execute in Python shell:
> {code}
> print sc
> {code}
> I receive something like:
> {noformat}
> 
> {noformat}
> But this is very inconvenient, especially if a user wants to create a 
> good-looking and self-descriptive IPython Notebook. He would like to see some 
> information about his Spark cluster.
> In contrast, H2O context does have this feature and it is very helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10458) Would like to know if a given Spark Context is stopped or currently stopping

2015-09-04 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-10458:
--

 Summary: Would like to know if a given Spark Context is stopped or 
currently stopping
 Key: SPARK-10458
 URL: https://issues.apache.org/jira/browse/SPARK-10458
 Project: Spark
  Issue Type: Improvement
Reporter: Matt Cheah
Priority: Minor


I ran into a case where a thread stopped a Spark Context, specifically when I
hit the "kill" link from the Spark standalone UI. There was no real way for
another thread to know that the context had stopped, and thus to handle that
accordingly.

Checking that the SparkEnv is null is one way, but that doesn't handle the case 
where the context is in the midst of stopping, and stopping the context may 
actually not be instantaneous - in my case for some reason the DAGScheduler was 
taking a non-trivial amount of time to stop.

Implementation-wise, I'm more or less requesting that the boolean value returned
from SparkContext.stopped.get() be made visible in some way. As long as we return
the value and not the AtomicBoolean itself (we wouldn't want anyone to be setting
this, after all!), it would help client applications check the context's
liveness.
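
Until something along those lines is exposed, a minimal application-side
workaround (a sketch only, written in Python for consistency with the other
examples here, and not the API being requested) is to route stop() through a
wrapper and keep the flag yourself:
{code}
# Hypothetical wrapper that tracks the stopped state on the application side.
# It only sees stops that go through the wrapper, which is exactly the gap the
# request above asks Spark to close by exposing its internal stopped flag.
import threading

class TrackedSparkContext(object):
    def __init__(self, sc):
        self._sc = sc
        self._lock = threading.Lock()
        self._stopped = False

    def stop(self):
        with self._lock:
            if not self._stopped:
                self._sc.stop()
                self._stopped = True

    @property
    def is_stopped(self):
        with self._lock:
            return self._stopped

    def __getattr__(self, name):
        # Delegate everything else to the wrapped SparkContext.
        return getattr(self._sc, name)
{code}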



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10402) Add scaladoc for default values of params in ML

2015-09-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10402.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8591
[https://github.com/apache/spark/pull/8591]

> Add scaladoc for default values of params in ML
> ---
>
> Key: SPARK-10402
> URL: https://issues.apache.org/jira/browse/SPARK-10402
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 1.6.0, 1.5.1
>
>
> We should make sure the scaladoc for params includes their default values 
> through the models in ml/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9925) Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests

2015-09-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-9925.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests
> --
>
> Key: SPARK-9925
> URL: https://issues.apache.org/jira/browse/SPARK-9925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>
> Right now, in our TestSQLContext/TestHiveContext, we use {{override def 
> numShufflePartitions: Int = this.getConf(SQLConf.SHUFFLE_PARTITIONS, 5)}} to 
> set {{SHUFFLE_PARTITIONS}}. However, we never put it into SQLConf. So, after we 
> use {{withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "number")}}, the number 
> of shuffle partitions will be set back to 200.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10414) DenseMatrix gives different hashcode even though equals returns true

2015-09-04 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731745#comment-14731745
 ] 

Vinod KC commented on SPARK-10414:
--

[~josephkb]
Could you please share the existing JIRA id so I can review the PR?
Thanks


> DenseMatrix gives different hashcode even though equals returns true
> 
>
> Key: SPARK-10414
> URL: https://issues.apache.org/jira/browse/SPARK-10414
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Vinod KC
>Priority: Minor
>
> The hashCode implementation in DenseMatrix gives different results for the same input:
> val dm = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0))
> val dm1 = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0))
> assert(dm1 === dm) // passes
> assert(dm1.hashCode === dm.hashCode) // fails
> This violates the hashCode/equals contract.
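
The contract being violated is the usual one: objects that compare equal must
hash equally, so the hash should be derived from exactly the fields equality
compares. A plain-Python analogue of the fix (a hypothetical class, not Spark's
DenseMatrix):
{code}
# Illustration of the hashCode/equals contract: __hash__ uses the same fields as __eq__.
class DenseMatrixLike(object):
    def __init__(self, num_rows, num_cols, values):
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.values = tuple(values)  # tuple so the values are hashable

    def __eq__(self, other):
        return (isinstance(other, DenseMatrixLike)
                and self.num_rows == other.num_rows
                and self.num_cols == other.num_cols
                and self.values == other.values)

    def __hash__(self):
        return hash((self.num_rows, self.num_cols, self.values))

dm = DenseMatrixLike(2, 2, [0.0, 1.0, 2.0, 3.0])
dm1 = DenseMatrixLike(2, 2, [0.0, 1.0, 2.0, 3.0])
assert dm1 == dm
assert hash(dm1) == hash(dm)  # holds, because both derive from the same fields
{code}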



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7257) Find nearest neighbor satisfying predicate

2015-09-04 Thread Luvsandondov Lkhamsuren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731744#comment-14731744
 ] 

Luvsandondov Lkhamsuren commented on SPARK-7257:


This sounds very interesting! If I understood correctly, there are multiple
vertices satisfying the predicate (let's call that set P, a subset of V), and we
want to find the vertices in P that are closest to the starting vertices.
Is it guaranteed that |P| << |V|? What use case did you have in mind,
[~josephkb]?

> Find nearest neighbor satisfying predicate
> --
>
> Key: SPARK-7257
> URL: https://issues.apache.org/jira/browse/SPARK-7257
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It would be useful to be able to find nearest neighbors satisfying 
> predicates.  E.g.:
> * Given one or more starting vertices, plus a predicate.
> * Find the closest vertex or vertices satisfying the predicate.
> This is different from ShortestPaths in that ShortestPaths searches for a 
> fixed (small) set of vertices, rather than all vertices satisfying a 
> predicate (which could be a large set).
> It could be implemented using BFS from the initial vertex/vertices, though 
> faster implementations might also search from vertices satisfying the 
> predicate.
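
As a rough illustration of the BFS idea described above (plain Python over an
adjacency dict, not GraphX; the toy graph and predicate are invented):
{code}
# Hypothetical sketch: nearest vertices satisfying a predicate, via BFS from sources.
from collections import deque

def nearest_satisfying(adj, sources, predicate):
    """Return (distance, [vertices]) for the closest vertices matching predicate."""
    visited = set(sources)
    frontier = deque((v, 0) for v in sources)
    hits, hit_dist = [], None
    while frontier:
        v, d = frontier.popleft()
        if hit_dist is not None and d > hit_dist:
            break  # all nearest matches already found
        if predicate(v):
            hits.append(v)
            hit_dist = d
            continue
        for w in adj.get(v, ()):
            if w not in visited:
                visited.add(w)
                frontier.append((w, d + 1))
    return hit_dist, hits

# Toy graph: closest even-numbered vertex reachable from vertex 1.
adj = {1: [2, 3], 2: [4], 3: [5], 5: [6]}
print(nearest_satisfying(adj, [1], lambda v: v % 2 == 0))  # -> (1, [2])
{code}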



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-09-04 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731753#comment-14731753
 ] 

Vinod KC commented on SPARK-10199:
--

[~mengxr]
Thanks for the suggestion. 
Shall I close the PR?

> Avoid using reflections for parquet model save
> --
>
> Key: SPARK-10199
> URL: https://issues.apache.org/jira/browse/SPARK-10199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Feynman Liang
>Priority: Minor
>
> These items are not high priority since the overhead of writing to Parquet is 
> much greater than that of runtime reflection.
> Multiple model save/load in MLlib use case classes to infer a schema for the 
> data frame saved to Parquet. However, inferring a schema from case classes or 
> tuples uses [runtime 
> reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361]
>  which is unnecessary since the types are already known at the time `save` is 
> called.
> It would be better to just specify the schema for the data frame directly 
> using {{sqlContext.createDataFrame(dataRDD, schema)}}
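
A hedged sketch of the suggested direct-schema approach, using the PySpark form
of the same API; it assumes an existing {{sc}} and {{sqlContext}}, and the column
names and output path are invented for the example:
{code}
# Hypothetical sketch: build the DataFrame for a model save with an explicit schema
# instead of inferring one from case classes / tuples via runtime reflection.
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

schema = StructType([
    StructField("term", IntegerType(), nullable=False),
    StructField("weight", DoubleType(), nullable=False),
])

data_rdd = sc.parallelize([(0, 0.5), (1, -1.25), (2, 3.0)])

# The schema is stated up front, so no reflection over the row type is needed.
df = sqlContext.createDataFrame(data_rdd, schema)
df.write.parquet("/tmp/example-model-data")  # placeholder path
{code}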



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10402) Add scaladoc for default values of params in ML

2015-09-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10402:
--
Shepherd: Joseph K. Bradley
Assignee: holdenk
Target Version/s: 1.6.0, 1.5.1

> Add scaladoc for default values of params in ML
> ---
>
> Key: SPARK-10402
> URL: https://issues.apache.org/jira/browse/SPARK-10402
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
>
> We should make sure the scaladoc for params includes their default values 
> through the models in ml/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10456) upgrade java 7 on amplab jenkins workers

2015-09-04 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731480#comment-14731480
 ] 

shane knapp commented on SPARK-10456:
-

ok, 79 is installed but i will wait until downtime to switch the symlinks over. 
 here's the command i will be running when that time comes:

pssh -h jenkins_workers.txt "cd /usr/java; rm -f latest; rm -f default; ln -s 
jdk1.7.0_79 latest; ln -s latest default"

> upgrade java 7 on amplab jenkins workers
> 
>
> Key: SPARK-10456
> URL: https://issues.apache.org/jira/browse/SPARK-10456
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: shane knapp
>Assignee: shane knapp
>  Labels: build
>
> our java 7 installation is really old (from last september).  update this to 
> the latest java 7 jdk.
> please assign this to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10455) install java 8 on amplab jenkins workers

2015-09-04 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731478#comment-14731478
 ] 

shane knapp commented on SPARK-10455:
-

it's installed in:  /usr/java/jdk1.8.0_60

i'll email the dev@ list and let everyone know.

> install java 8 on amplab jenkins workers
> 
>
> Key: SPARK-10455
> URL: https://issues.apache.org/jira/browse/SPARK-10455
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: shane knapp
>Assignee: shane knapp
>
> install java 8 on all jenkins workers.
> and just for clarification:  we want the 64-bit version, yes?
> please assign this to me, thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10450) Minor SQL style, format, typo, readability fixes

2015-09-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10450.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Minor SQL style, format, typo, readability fixes
> 
>
> Key: SPARK-10450
> URL: https://issues.apache.org/jira/browse/SPARK-10450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
> Fix For: 1.6.0
>
>
> This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's 
> more of a continuous process.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-09-04 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10304:
--
Target Version/s: 1.6.0, 1.5.1

> Partition discovery does not throw an exception if the dir structure is 
> invalid
> ---
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>Priority: Critical
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, I get the following NPE. But, if it is Parquet, we can 
> even return rows. We should complain to users about the dir structure because 
> {{table1}} does not meet our expected format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-09-04 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10304:
--
Target Version/s: 1.6.0, 1.5.1  (was: 1.5.1,1.6.0)

> Partition discovery does not throw an exception if the dir structure is 
> invalid
> ---
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>Priority: Critical
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, I get the following NPE. But, if it is Parquet, we can 
> even return rows. We should complain to users about the dir structure because 
> {{table1}} does not meet our expected format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10013) Remove Java assert from Java unit tests

2015-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731585#comment-14731585
 ] 

Apache Spark commented on SPARK-10013:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/8607

> Remove Java assert from Java unit tests
> ---
>
> Key: SPARK-10013
> URL: https://issues.apache.org/jira/browse/SPARK-10013
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>
> We should use assertTrue, etc. instead to make sure the asserts are not 
> ignored in tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10456) upgrade java 7 on amplab jenkins workers

2015-09-04 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731436#comment-14731436
 ] 

shane knapp edited comment on SPARK-10456 at 9/4/15 9:46 PM:
-

looks like we'll be installing 7u79 (we're at 7u71 currently).


was (Author: shaneknapp):
looks like we'll be installing 7u79 (we're at 7u51 currently).

> upgrade java 7 on amplab jenkins workers
> 
>
> Key: SPARK-10456
> URL: https://issues.apache.org/jira/browse/SPARK-10456
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: shane knapp
>Assignee: shane knapp
>  Labels: build
>
> our java 7 installation is really old (from last september).  update this to 
> the latest java 7 jdk.
> please assign this to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10176) Show partially analyzed plan when checkAnswer df fails to resolve

2015-09-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10176:
--
Target Version/s: 1.6.0  (was: 1.5.0)

> Show partially analyzed plan when checkAnswer df fails to resolve
> -
>
> Key: SPARK-10176
> URL: https://issues.apache.org/jira/browse/SPARK-10176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.6.0
>
>
> It would be much easier to debug test failures if we could see the failed 
> plan instead of just the user-friendly error message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10176) Show partially analyzed plan when checkAnswer df fails to resolve

2015-09-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10176:
--
Fix Version/s: (was: 1.5.0)
   1.6.0

> Show partially analyzed plan when checkAnswer df fails to resolve
> -
>
> Key: SPARK-10176
> URL: https://issues.apache.org/jira/browse/SPARK-10176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.6.0
>
>
> It would be much easier to debug test failures if we could see the failed 
> plan instead of just the user-friendly error message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10176) Show partially analyzed plan when checkAnswer df fails to resolve

2015-09-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10176.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

> Show partially analyzed plan when checkAnswer df fails to resolve
> -
>
> Key: SPARK-10176
> URL: https://issues.apache.org/jira/browse/SPARK-10176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> It would be much easier to debug test failures if we could see the failed 
> plan instead of just the user-friendly error message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-09-04 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731529#comment-14731529
 ] 

Xiangrui Meng commented on SPARK-10199:
---

The improvement numbers also depend on the model size. In unit tests, the
model sizes are usually very small, so the overhead of reflection becomes
significant. With real models, either the model itself is small, or the model is
large and the overhead of reflection becomes insignificant. Keeping the code
simple and easy to understand is also quite important. +[~josephkb]

> Avoid using reflections for parquet model save
> --
>
> Key: SPARK-10199
> URL: https://issues.apache.org/jira/browse/SPARK-10199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Feynman Liang
>Priority: Minor
>
> These items are not high priority since the overhead of writing to Parquet is 
> much greater than that of runtime reflection.
> Multiple model save/load in MLlib use case classes to infer a schema for the 
> data frame saved to Parquet. However, inferring a schema from case classes or 
> tuples uses [runtime 
> reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361]
>  which is unnecessary since the types are already known at the time `save` is 
> called.
> It would be better to just specify the schema for the data frame directly 
> using {{sqlContext.createDataFrame(dataRDD, schema)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9669) Support PySpark with Mesos Cluster mode

2015-09-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-9669.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> Support PySpark with Mesos Cluster mode
> ---
>
> Key: SPARK-9669
> URL: https://issues.apache.org/jira/browse/SPARK-9669
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos, PySpark
>Affects Versions: 1.5.0
>Reporter: Timothy Chen
>Assignee: Timothy Chen
> Fix For: 1.6.0
>
>
> PySpark in cluster mode on Mesos is not yet supported.
> We need to enable it and make sure it can launch PySpark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage

2015-09-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10454.
---
  Resolution: Fixed
   Fix Version/s: 1.5.1
  1.6.0
Target Version/s: 1.6.0, 1.5.1

> Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause 
> multiple concurrent attempts for the same map stage
> -
>
> Key: SPARK-10454
> URL: https://issues.apache.org/jira/browse/SPARK-10454
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.5.1
>Reporter: Pete Robbins
>Assignee: Pete Robbins
>Priority: Critical
>  Labels: flaky-test
> Fix For: 1.6.0, 1.5.1
>
>
> The test case fails intermittently in Jenkins.
> For example, see the following builds:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


