[jira] [Commented] (SPARK-11125) Unreadable exception when running spark-sql without building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958441#comment-14958441
 ] 

Apache Spark commented on SPARK-11125:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9134
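
For context, a minimal sketch (not necessarily what the PR does) of how the missing Hive CLI class could be probed up front so the user gets a readable hint instead of a bare linkage error; the object and method names below are hypothetical:
{code}
// Hypothetical check: probe for the Hive CLI class before handing off to the
// SparkSQLCLIDriver main class.
object SparkSqlLauncherCheck {
  def requireHiveCli(): Unit = {
    try {
      Class.forName("org.apache.hadoop.hive.cli.CliDriver")
    } catch {
      case _: ClassNotFoundException | _: NoClassDefFoundError =>
        System.err.println(
          "Failed to load Hive CLI classes. You need to build Spark with " +
            "-Phive -Phive-thriftserver to run spark-sql.")
        System.exit(1)
    }
  }
}
{code}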

> Unreadable exception when running spark-sql without building with 
> -Phive-thriftserver and SPARK_PREPEND_CLASSES is set
> --
>
> Key: SPARK-11125
> URL: https://issues.apache.org/jira/browse/SPARK-11125
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> In a development environment, when Spark is built without -Phive-thriftserver and 
> SPARK_PREPEND_CLASSES is set, the following exception is thrown:
> SparkSQLCliDriver itself can be loaded, but the Hive-related classes it depends on cannot.
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/hadoop/hive/cli/CliDriver
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.hive.cli.CliDriver
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   ... 21 more
> {code}






[jira] [Assigned] (SPARK-11125) Unreadable exception when running spark-sql without building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11125:


Assignee: Apache Spark

> Unreadable exception when running spark-sql without building with 
> -Phive-thriftserver and SPARK_PREPEND_CLASSES is set
> --
>
> Key: SPARK-11125
> URL: https://issues.apache.org/jira/browse/SPARK-11125
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> In a development environment, when Spark is built without -Phive-thriftserver and 
> SPARK_PREPEND_CLASSES is set, the following exception is thrown:
> SparkSQLCliDriver itself can be loaded, but the Hive-related classes it depends on cannot.
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/hadoop/hive/cli/CliDriver
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.hive.cli.CliDriver
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   ... 21 more
> {code}






[jira] [Assigned] (SPARK-11125) Unreadable exception when running spark-sql without building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11125:


Assignee: (was: Apache Spark)

> Unreadable exception when running spark-sql without building with 
> -Phive-thriftserver and SPARK_PREPEND_CLASSES is set
> --
>
> Key: SPARK-11125
> URL: https://issues.apache.org/jira/browse/SPARK-11125
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> In a development environment, when Spark is built without -Phive-thriftserver and 
> SPARK_PREPEND_CLASSES is set, the following exception is thrown:
> SparkSQLCliDriver itself can be loaded, but the Hive-related classes it depends on cannot.
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/hadoop/hive/cli/CliDriver
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.hive.cli.CliDriver
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   ... 21 more
> {code}






[jira] [Commented] (SPARK-11097) Add connection established callback to lower level RPC layer so we don't need to check for new connections in NettyRpcHandler.receive

2015-10-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958447#comment-14958447
 ] 

Reynold Xin commented on SPARK-11097:
-

This simplifies the code and improves performance (no need to check for connection 
existence on every message).
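
As a rough illustration of the idea (a hedged sketch, not the actual network module API), moving connection bookkeeping into a channel-active callback could look like:
{code}
// Hypothetical handler interface: connection bookkeeping happens once per channel
// in channelActive, so receive() no longer checks whether the sender is new.
trait RpcHandlerSketch {
  def channelActive(remoteAddress: String): Unit  // called once when a connection is established
  def receive(remoteAddress: String, message: Array[Byte]): Unit
}

class TrackingHandler extends RpcHandlerSketch {
  private val clients = scala.collection.mutable.Set.empty[String]

  override def channelActive(remoteAddress: String): Unit = {
    clients.synchronized { clients += remoteAddress }  // register exactly once per connection
  }

  override def receive(remoteAddress: String, message: Array[Byte]): Unit = {
    // no per-message "is this a new connection?" lookup needed here
  }
}
{code}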


> Add connection established callback to lower level RPC layer so we don't need 
> to check for new connections in NettyRpcHandler.receive
> -
>
> Key: SPARK-11097
> URL: https://issues.apache.org/jira/browse/SPARK-11097
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> I think we can remove the check for new connections in 
> NettyRpcHandler.receive if we just add a channel registered callback to the 
> lower level network module.






[jira] [Created] (SPARK-11128) strange NPE when writing in non-existing S3 bucket

2015-10-15 Thread mathieu despriee (JIRA)
mathieu despriee created SPARK-11128:


 Summary: strange NPE when writing in non-existing S3 bucket
 Key: SPARK-11128
 URL: https://issues.apache.org/jira/browse/SPARK-11128
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.1
Reporter: mathieu despriee
Priority: Minor


For the record, as it's relatively minor and related to s3n (not tested with 
s3a).

By mistake, we tried writing a Parquet DataFrame to a non-existent S3 bucket 
with a simple df.write.parquet(s3path).
We got an NPE (see stack trace below), which is very misleading.


java.lang.NullPointerException
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:73)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
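
For completeness, a minimal reproduction sketch (the bucket name is a placeholder, and sqlContext is assumed to be in scope as in spark-shell):
{code}
// Writing to a bucket that does not exist surfaces as the NullPointerException
// shown above on 1.5.1 with the s3n scheme.
val s3path = "s3n://no-such-bucket-placeholder/output/parquet"  // placeholder bucket
val df = sqlContext.range(10).toDF("id")
df.write.parquet(s3path)
{code}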






[jira] [Commented] (SPARK-10893) Lag Analytic function broken

2015-10-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958520#comment-14958520
 ] 

Herman van Hovell commented on SPARK-10893:
---

A bug was found in the Window implementation. It has been fixed in the current 
master: 
https://github.com/apache/spark/commit/6987c067937a50867b4d5788f5bf496ecdfdb62c

Could you try out the latest master and see if this is resolved?
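
For anyone verifying against a recent master, an equivalent Scala check on the same input (a sketch; assumes a spark-shell with a HiveContext-backed sqlContext, since window functions require it here) would be:
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val df = sqlContext.read.json("file:///home/app/input.json")
val result = df.withColumn("previous", lag(df("VBB"), 1).over(Window.orderBy(df("VAA"))))
result.show()  // "previous" should contain the lagged VBB values, not a constant
{code}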

> Lag Analytic function broken
> 
>
> Key: SPARK-10893
> URL: https://issues.apache.org/jira/browse/SPARK-10893
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0
> Environment: Spark Standalone Cluster on Linux
>Reporter: Jo Desmet
>
> Aggregating with the LAG analytic function gives the wrong result. In 
> my test case it always returned the fixed value '103079215105' when applied 
> to an integer column.
> Note that this only happens on Spark 1.5.0, and only when running in cluster 
> mode.
> It works fine when running on Spark 1.4.1, or when running in local mode. 
> I did not test on a YARN cluster.
> I did not test other analytic aggregates.
> Input JSON:
> {code:borderStyle=solid|title=/home/app/input.json}
> {"VAA":"A", "VBB":1}
> {"VAA":"B", "VBB":-1}
> {"VAA":"C", "VBB":2}
> {"VAA":"d", "VBB":3}
> {"VAA":null, "VBB":null}
> {code}
> Java:
> {code:borderStyle=solid}
> // Assumes: import static org.apache.spark.sql.functions.lag;
> //          import org.apache.spark.sql.expressions.Window;
> SparkContext sc = new SparkContext(conf);
> HiveContext sqlContext = new HiveContext(sc);
> DataFrame df = sqlContext.read().json("file:///home/app/input.json");
> 
> df = df.withColumn(
>   "previous",
>   lag(df.col("VBB"), 1)
>     .over(Window.orderBy(df.col("VAA")))
>   );
> {code}
> Important to understand the conditions under which the job ran, I submitted 
> to a standalone spark cluster in client mode as follows:
> {code:borderStyle=solid}
> spark-submit \
>   --master spark://xx:7077 \
>   --deploy-mode client \
>   --class package.to.DriverClass \
>   --driver-java-options -Dhdp.version=2.2.0.0-2041 \
>   --num-executors 2 \
>   --driver-memory 2g \
>   --executor-memory 2g \
>   --executor-cores 2 \
>   /path/to/sample-program.jar
> {code}
> Expected Result:
> {code:borderStyle=solid}
> {"VAA":null, "VBB":null, "previous":null}
> {"VAA":"A", "VBB":1, "previous":null}
> {"VAA":"B", "VBB":-1, "previous":1}
> {"VAA":"C", "VBB":2, "previous":-1}
> {"VAA":"d", "VBB":3, "previous":2}
> {code}
> Actual Result:
> {code:borderStyle=solid}
> {"VAA":null, "VBB":null, "previous":103079215105}
> {"VAA":"A", "VBB":1, "previous":103079215105}
> {"VAA":"B", "VBB":-1, "previous":103079215105}
> {"VAA":"C", "VBB":2, "previous":103079215105}
> {"VAA":"d", "VBB":3, "previous":103079215105}
> {code}






[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-10-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958524#comment-14958524
 ] 

Cheng Hao commented on SPARK-4226:
--

[~nadenf] Actually I am working on it right now, and the first PR is ready. It 
would be greatly appreciated if you could try 
https://github.com/apache/spark/pull/9055 in your local testing; let me know about 
any problems or bugs you find.
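
Until subquery predicates are supported, the usual workaround in Spark SQL / HiveQL is to rewrite the IN predicate as a LEFT SEMI JOIN, e.g. (a sketch against the sparkbug table and customerid column from the report, using HiveContext.sql):
{code}
hc.sql("""
  SELECT a.customerid
  FROM sparkbug a
  LEFT SEMI JOIN (
    SELECT customerid FROM sparkbug WHERE customerid IN (2, 3)
  ) b
  ON a.customerid = b.customerid
""")
{code}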

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = new HiveContext(sc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near, future release (1.3, maybe?).






[jira] [Commented] (SPARK-6065) Optimize word2vec.findSynonyms speed

2015-10-15 Thread Franck Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958585#comment-14958585
 ] 

Franck Zhang  commented on SPARK-6065:
--

When I used the same dataset (text8, around 100 MB) and the same training 
parameters, Python ran 10x faster than Spark on my notebook (2015 MacBook Pro 15").
I think the word2vec model in Spark still has a long way to go...

> Optimize word2vec.findSynonyms speed
> 
>
> Key: SPARK-6065
> URL: https://issues.apache.org/jira/browse/SPARK-6065
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
> Fix For: 1.4.0
>
>
> word2vec.findSynonyms iterates through the entire vocabulary to find similar 
> words.  This is really slow relative to the [gcode-hosted word2vec 
> implementation | https://code.google.com/p/word2vec/].  It should be 
> optimized by storing words in a datastructure designed for finding nearest 
> neighbors.
> This would require storing a copy of the model (basically an inverted 
> dictionary), which could be a problem if users have a big model (e.g., 100 
> features x 10M words or phrases = big dictionary).  It might be best to 
> provide a function for converting the model into a model optimized for 
> findSynonyms.
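
To make the trade-off concrete, a hedged sketch of the brute-force cosine scan that findSynonyms effectively performs (names and types here are illustrative, not the MLlib internals):
{code}
// Illustrative brute-force scan: score every vocabulary word against the query
// vector by cosine similarity and keep the top k. A nearest-neighbor index would
// replace this O(vocabulary) loop.
def bruteForceSynonyms(
    query: Array[Float],
    vectors: Map[String, Array[Float]],
    k: Int): Seq[(String, Double)] = {
  def norm(v: Array[Float]): Double = math.sqrt(v.map(x => x.toDouble * x).sum)
  def cosine(a: Array[Float], b: Array[Float]): Double =
    a.zip(b).map { case (x, y) => x.toDouble * y }.sum / (norm(a) * norm(b) + 1e-12)

  vectors.toSeq
    .map { case (word, vec) => (word, cosine(query, vec)) }
    .sortBy(-_._2)
    .take(k)
}
{code}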






[jira] [Commented] (SPARK-11128) strange NPE when writing in non-existing S3 bucket

2015-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958756#comment-14958756
 ] 

Sean Owen commented on SPARK-11128:
---

Is this a Spark problem? It sounds like an issue between Hadoop and S3, and it is 
ultimately due to bad input.

> strange NPE when writing in non-existing S3 bucket
> --
>
> Key: SPARK-11128
> URL: https://issues.apache.org/jira/browse/SPARK-11128
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: mathieu despriee
>Priority: Minor
>
> For the record, as it's relatively minor and related to s3n (not tested with 
> s3a).
> By mistake, we tried writing a Parquet DataFrame to a non-existent S3 bucket 
> with a simple df.write.parquet(s3path).
> We got an NPE (see stack trace below), which is very misleading.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:73)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
> at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)






[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns

2015-10-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958758#comment-14958758
 ] 

Herman van Hovell commented on SPARK-9241:
--

We could implement this using GROUPING SETS. That is how they did it in 
Calcite: https://issues.apache.org/jira/browse/CALCITE-732

For example using the following data:
{noformat}
// Create random data similar to the Calcite query.
val df = sqlContext
  .range(1 << 20)
  .select(
$"id".as("employee_id"),
(rand(6321782L) * 4 + 1).cast("int").as("department_id"),
when(rand(981293L) >= 0.5, "M").otherwise("F").as("gender"),
(rand(7123L) * 3 + 1).cast("int").as("education_level")
  )

df.registerTempTable("employee")
{noformat}

We can query multiple distinct counts the regular way: 
{noformat}
sql("""
select   department_id as d,
 count(distinct gender, education_level) as c0,
 count(distinct gender) as c1,
 count(distinct education_level) as c2
from employee
group by department_id
""").show()
{noformat}

This uses the old code path:
{noformat}
== Physical Plan ==
Limit 21
 Aggregate false, [department_id#64556], [department_id#64556 AS 
d#64595,CombineAndCount(partialSets#64599) AS 
c0#64596L,CombineAndCount(partialSets#64600) AS 
c1#64597L,CombineAndCount(partialSets#64601) AS c2#64598L]
  Exchange hashpartitioning(department_id#64556,200)
   Aggregate true, [department_id#64556], 
[department_id#64556,AddToHashSet(gender#64557,education_level#64558) AS 
partialSets#64599,AddToHashSet(gender#64557) AS 
partialSets#64600,AddToHashSet(education_level#64558) AS partialSets#64601]
ConvertToSafe
 TungstenProject [department_id#64556,gender#64557,education_level#64558]
  TungstenProject [id#64554L AS employee_id#64555L,cast(((rand(6321782) * 
4.0) + 1.0) as int) AS department_id#64556,CASE WHEN (rand(981293) >= 0.5) THEN 
M ELSE F AS gender#64557,cast(((rand(7123) * 3.0) + 1.0) as int) AS 
education_level#64558]
   Scan PhysicalRDD[id#64554L]
{noformat}

Or we can do this using grouping sets:
{noformat}
sql("""
select A.d,
   count(case A.i when 3 then 1 else null end) as c0,
   count(case A.i when 5 then 1 else null end) as c1,
   count(case A.i when 7 then 1 else null end) as c2
from (select   department_id as d,
   grouping__id as i
  from employee
  group by department_id,
   gender,
   education_level
  grouping sets (
   (department_id, gender),
   (department_id, education_level),
   (department_id, gender, education_level))) A
group by A.d
""").show
{noformat}

And this uses the new Tungsten-based code path (except for the Expand operator):
{noformat}
== Physical Plan ==
TungstenAggregate(key=[d#64577], functions=[(count(CASE i#64578 WHEN 3 THEN 1 
ELSE null),mode=Final,isDistinct=false),(count(CASE i#64578 WHEN 5 THEN 1 ELSE 
null),mode=Final,isDistinct=false),(count(CASE i#64578 WHEN 7 THEN 1 ELSE 
null),mode=Final,isDistinct=false)], 
output=[d#64577,c0#64579L,c1#64580L,c2#64581L])
 TungstenExchange hashpartitioning(d#64577,200)
  TungstenAggregate(key=[d#64577], functions=[(count(CASE i#64578 WHEN 3 THEN 1 
ELSE null),mode=Partial,isDistinct=false),(count(CASE i#64578 WHEN 5 THEN 1 
ELSE null),mode=Partial,isDistinct=false),(count(CASE i#64578 WHEN 7 THEN 1 
ELSE null),mode=Partial,isDistinct=false)], 
output=[d#64577,currentCount#64587L,currentCount#64589L,currentCount#64591L])
   
TungstenAggregate(key=[department_id#64556,gender#64557,education_level#64558,grouping__id#64582],
 functions=[], output=[d#64577,i#64578])
TungstenExchange 
hashpartitioning(department_id#64556,gender#64557,education_level#64558,grouping__id#64582,200)
 
TungstenAggregate(key=[department_id#64556,gender#64557,education_level#64558,grouping__id#64582],
 functions=[], 
output=[department_id#64556,gender#64557,education_level#64558,grouping__id#64582])
  Expand [ArrayBuffer(department_id#64556, gender#64557, null, 
3),ArrayBuffer(department_id#64556, null, education_level#64558, 
5),ArrayBuffer(department_id#64556, gender#64557, education_level#64558, 7)], 
[department_id#64556,gender#64557,education_level#64558,grouping__id#64582]
   ConvertToSafe
TungstenProject [department_id#64556,gender#64557,education_level#64558]
 TungstenProject [id#64554L AS employee_id#64555L,cast(((rand(6321782) 
* 4.0) + 1.0) as int) AS department_id#64556,CASE WHEN (rand(981293) >= 0.5) 
THEN M ELSE F AS gender#64557,cast(((rand(7123) * 3.0) + 1.0) as int) AS 
education_level#64558]
  Scan PhysicalRDD[id#64554L]
{noformat}

We could implement this using an analysis rule.

[~yhuai] / [~rxin] thoughts?

> Supporting multiple DISTINCT columns
> 
>
> Key: SPARK-9241
> URL: https://issues.apache.org/jira/browse/SPARK-9241
>   

[jira] [Resolved] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10528.
---
Resolution: Not A Problem

This looks like an environment problem.

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-






[jira] [Commented] (SPARK-11120) maxNumExecutorFailures defaults to 3 under dynamic allocation

2015-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958762#comment-14958762
 ] 

Sean Owen commented on SPARK-11120:
---

Is this specific to dynamic allocation, though? You could have the same problem 
without it.

> maxNumExecutorFailures defaults to 3 under dynamic allocation
> -
>
> Key: SPARK-11120
> URL: https://issues.apache.org/jira/browse/SPARK-11120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> With dynamic allocation, the {{spark.executor.instances}} config is 0, 
> meaning [this 
> line|https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L66-L68]
>  ends up with {{maxNumExecutorFailures}} equal to {{3}}, which for me has 
> resulted in large dynamicAllocation jobs with hundreds of executors dying due 
> to one bad node serially failing executors that are allocated on it.
> I think that using {{spark.dynamicAllocation.maxExecutors}} would make the most 
> sense in this case; I frequently run shells that vary between 1 and 1000 
> executors, so using {{s.dA.minExecutors}} or {{s.dA.initialExecutors}} would 
> still leave me with a value that is lower than makes sense.
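
A hedged sketch of the kind of default being suggested (illustrative only, not the actual ApplicationMaster code; sparkConf is assumed to be the application's SparkConf, and the config names are the ones quoted above):
{code}
// Illustrative only: base the failure budget on the dynamic-allocation ceiling
// when spark.executor.instances is 0, instead of ending up with max(0 * 2, 3) = 3.
val dynamicAllocation = sparkConf.getBoolean("spark.dynamicAllocation.enabled", false)
val staticExecutors   = sparkConf.getInt("spark.executor.instances", 0)
val effectiveExecutors =
  if (dynamicAllocation) sparkConf.getInt("spark.dynamicAllocation.maxExecutors", staticExecutors)
  else staticExecutors
val maxNumExecutorFailures = math.max(effectiveExecutors * 2, 3)
{code}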






[jira] [Commented] (SPARK-10935) Avito Context Ad Clicks

2015-10-15 Thread Kristina Plazonic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958823#comment-14958823
 ] 

Kristina Plazonic commented on SPARK-10935:
---

@Xusen, I'm almost done - should be done this weekend - but would love to 
connect with you and get your comments, suggestions and improvements. :)  
Thanks!

> Avito Context Ad Clicks
> ---
>
> Key: SPARK-10935
> URL: https://issues.apache.org/jira/browse/SPARK-10935
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> From [~kpl...@gmail.com]:
> I would love to do Avito Context Ad Clicks - 
> https://www.kaggle.com/c/avito-context-ad-clicks - but it involves a lot of 
> feature engineering and preprocessing. I would love to split this with 
> somebody else if anybody is interested in working on this.






[jira] [Created] (SPARK-11129) Link Spark WebUI in Mesos WebUI

2015-10-15 Thread Philipp Hoffmann (JIRA)
Philipp Hoffmann created SPARK-11129:


 Summary: Link Spark WebUI in Mesos WebUI
 Key: SPARK-11129
 URL: https://issues.apache.org/jira/browse/SPARK-11129
 Project: Spark
  Issue Type: New Feature
  Components: Mesos, Web UI
Affects Versions: 1.5.1
Reporter: Philipp Hoffmann


Mesos can directly link into WebUIs provided by frameworks running on top of 
Mesos. Spark currently doesn't make use of this feature.

This ticket aims to provide the necessary information to Mesos in order to link 
back to the Spark WebUI.
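
For reference, Mesos exposes this through the webui_url field of FrameworkInfo; a hedged sketch of the wiring (the way the UI address is obtained below is a placeholder, not the actual scheduler code):
{code}
import org.apache.mesos.Protos.FrameworkInfo

// Illustrative: register the framework with a webui_url so the Mesos UI can link back.
val webUiAddress = "http://driver-host:4040"  // placeholder for the driver's Spark UI address
val frameworkInfo = FrameworkInfo.newBuilder()
  .setUser("")            // let Mesos pick the current user
  .setName("Spark")
  .setWebuiUrl(webUiAddress)
  .build()
{code}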






[jira] [Commented] (SPARK-11103) Filter applied on merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-15 Thread Dominic Ricard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958841#comment-14958841
 ] 

Dominic Ricard commented on SPARK-11103:


Setting the property {{spark.sql.parquet.filterPushdown}} to {{false}} fixed 
the issue.

Knowing all this, does this indicate a bug in the filter2 implementation of the 
Parquet library? Maybe this issue should be moved to the Parquet project for 
someone to look at...
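
For anyone else hitting this, the workaround above can be applied per session (a sketch; sqlContext as in spark-shell):
{code}
// Disable Parquet filter pushdown so the merged-schema query no longer hands the
// filter on the missing column down to parquet's filter2 validation.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
sqlContext.sql("select col1 from `table3` where col2 = 2")
{code}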



> Filter applied on merged Parquet schema with new column fails with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>
> When evolving a schema in Parquet files, Spark properly exposes all columns 
> found across the different Parquet files, but when trying to query the data it is 
> not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.comp

[jira] [Commented] (SPARK-11129) Link Spark WebUI in Mesos WebUI

2015-10-15 Thread Philipp Hoffmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958856#comment-14958856
 ] 

Philipp Hoffmann commented on SPARK-11129:
--

Submitted a pull request.

> Link Spark WebUI in Mesos WebUI
> ---
>
> Key: SPARK-11129
> URL: https://issues.apache.org/jira/browse/SPARK-11129
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos, Web UI
>Affects Versions: 1.5.1
>Reporter: Philipp Hoffmann
>
> Mesos can directly link into WebUIs provided by frameworks running on top of 
> Mesos. Spark currently doesn't make use of this feature.
> This ticket aims to provide the necessary information to Mesos in order to 
> link back to the Spark WebUI.






[jira] [Assigned] (SPARK-11129) Link Spark WebUI in Mesos WebUI

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11129:


Assignee: (was: Apache Spark)

> Link Spark WebUI in Mesos WebUI
> ---
>
> Key: SPARK-11129
> URL: https://issues.apache.org/jira/browse/SPARK-11129
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos, Web UI
>Affects Versions: 1.5.1
>Reporter: Philipp Hoffmann
>
> Mesos can directly link into WebUIs provided by frameworks running on top of 
> Mesos. Spark currently doesn't make use of this feature.
> This ticket aims to provide the necessary information to Mesos in order to 
> link back to the Spark WebUI.






[jira] [Commented] (SPARK-11129) Link Spark WebUI in Mesos WebUI

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958858#comment-14958858
 ] 

Apache Spark commented on SPARK-11129:
--

User 'philipphoffmann' has created a pull request for this issue:
https://github.com/apache/spark/pull/9135

> Link Spark WebUI in Mesos WebUI
> ---
>
> Key: SPARK-11129
> URL: https://issues.apache.org/jira/browse/SPARK-11129
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos, Web UI
>Affects Versions: 1.5.1
>Reporter: Philipp Hoffmann
>
> Mesos can directly link into WebUIs provided by frameworks running on top of 
> Mesos. Spark currently doesn't make use of this feature.
> This ticket aims to provide the necessary information to Mesos in order to 
> link back to the Spark WebUI.






[jira] [Assigned] (SPARK-11129) Link Spark WebUI in Mesos WebUI

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11129:


Assignee: Apache Spark

> Link Spark WebUI in Mesos WebUI
> ---
>
> Key: SPARK-11129
> URL: https://issues.apache.org/jira/browse/SPARK-11129
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos, Web UI
>Affects Versions: 1.5.1
>Reporter: Philipp Hoffmann
>Assignee: Apache Spark
>
> Mesos can directly link into WebUIs provided by frameworks running on top of 
> Mesos. Spark currently doesn't make use of this feature.
> This ticket aims to provide the necessary information to Mesos in order to 
> link back to the Spark WebUI.






[jira] [Commented] (SPARK-11103) Filter applied on merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958879#comment-14958879
 ] 

Hyukjin Kwon commented on SPARK-11103:
--

In my view, Spark should either set the filters appropriately for each file, which I 
think is pretty tricky, or simply skip filter pushdown for this case. Could 
anybody give us some feedback, please?


> Filter applied on merged Parquet schema with new column fails with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>
> When evolving a schema in Parquet files, Spark properly exposes all columns 
> found across the different Parquet files, but when trying to query the data it is 
> not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheck

[jira] [Commented] (SPARK-10729) word2vec model save for python

2015-10-15 Thread Jian Feng Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959097#comment-14959097
 ] 

Jian Feng Zhang commented on SPARK-10729:
-

I can take this if there are no objections.

> word2vec model save for python
> --
>
> Key: SPARK-10729
> URL: https://issues.apache.org/jira/browse/SPARK-10729
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Joseph A Gartner III
>
> The ability to save a word2vec model has not been ported to python, and would 
> be extremely useful to have given the long training period.






[jira] [Commented] (SPARK-10935) Avito Context Ad Clicks

2015-10-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959140#comment-14959140
 ] 

Xusen Yin commented on SPARK-10935:
---

OK, ping me if you need help.

> Avito Context Ad Clicks
> ---
>
> Key: SPARK-10935
> URL: https://issues.apache.org/jira/browse/SPARK-10935
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> From [~kpl...@gmail.com]:
> I would love to do Avito Context Ad Clicks - 
> https://www.kaggle.com/c/avito-context-ad-clicks - but it involves a lot of 
> feature engineering and preprocessing. I would love to split this with 
> somebody else if anybody is interested in working on this.






[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?

2015-10-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959156#comment-14959156
 ] 

Xusen Yin commented on SPARK-5874:
--

How about adding a warm-start strategy to ML Estimator? I.e., extend its fit 
function to accept an intermediate model, like fit(data, param, model).
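
To make the proposal concrete, a purely hypothetical signature sketch (this overload does not exist in Spark's ML API today):
{code}
import org.apache.spark.ml.Model
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

// Hypothetical warm-start variant of Estimator: fit can resume from an intermediate model.
abstract class WarmStartEstimator[M <: Model[M]] {
  def fit(dataset: DataFrame): M
  def fit(dataset: DataFrame, paramMap: ParamMap, initialModel: M): M  // proposed addition
}
{code}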

> How to improve the current ML pipeline API?
> ---
>
> Key: SPARK-5874
> URL: https://issues.apache.org/jira/browse/SPARK-5874
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> I created this JIRA to collect feedbacks about the ML pipeline API we 
> introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
> with confidence, which requires valuable input from the community. I'll 
> create sub-tasks for each major issue.
> Design doc (WIP): 
> https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#






[jira] [Commented] (SPARK-9941) Try ML pipeline API on Kaggle competitions

2015-10-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959174#comment-14959174
 ] 

Xusen Yin commented on SPARK-9941:
--

I'd love to try the cooking dataset: https://www.kaggle.com/c/whats-cooking

> Try ML pipeline API on Kaggle competitions
> --
>
> Key: SPARK-9941
> URL: https://issues.apache.org/jira/browse/SPARK-9941
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This is an umbrella JIRA to track some fun tasks :)
> We have built many features under the ML pipeline API, and we want to see how 
> it works on real-world datasets, e.g., Kaggle competition datasets 
> (https://www.kaggle.com/competitions). We want to invite community members to 
> help test. The goal is NOT to win the competitions but to provide code 
> examples and to find out missing features and other issues to help shape the 
> roadmap.
> For people who are interested, please do the following:
> 1. Create a subtask (or leave a comment if you cannot create a subtask) to 
> claim a Kaggle dataset.
> 2. Use the ML pipeline API to build and tune an ML pipeline that works for 
> the Kaggle dataset.
> 3. Paste the code to gist (https://gist.github.com/) and provide the link 
> here.
> 4. Report missing features, issues, running times, and accuracy.






[jira] [Commented] (SPARK-10217) Spark SQL cannot handle ordering directive in ORDER BY clauses with expressions

2015-10-15 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959197#comment-14959197
 ] 

Simeon Simeonov commented on SPARK-10217:
-

Well, that would suggest the issue is fixed. :)

> Spark SQL cannot handle ordering directive in ORDER BY clauses with 
> expressions
> ---
>
> Key: SPARK-10217
> URL: https://issues.apache.org/jira/browse/SPARK-10217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: SQL, analyzers
>
> Spark SQL supports expressions in ORDER BY clauses, e.g.,
> {code}
> scala> sqlContext.sql("select cnt from test order by (cnt + cnt)")
> res2: org.apache.spark.sql.DataFrame = [cnt: bigint]
> {code}
> However, the analyzer gets confused when there is an explicit ordering 
> directive (ASC/DESC):
> {code}
> scala> sqlContext.sql("select cnt from test order by (cnt + cnt) asc")
> 15/08/25 04:08:02 INFO ParseDriver: Parsing command: select cnt from test 
> order by (cnt + cnt) asc
> org.apache.spark.sql.AnalysisException: extraneous input 'asc' expecting EOF 
> near ''; line 1 pos 40
>   at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:289)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> ...
> {code}






[jira] [Resolved] (SPARK-11093) ChildFirstURLClassLoader#getResources should return all found resources, not just those in the child classloader

2015-10-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11093.

   Resolution: Fixed
 Assignee: Adam Lewandowski
Fix Version/s: 1.6.0

> ChildFirstURLClassLoader#getResources should return all found resources, not 
> just those in the child classloader
> 
>
> Key: SPARK-11093
> URL: https://issues.apache.org/jira/browse/SPARK-11093
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Adam Lewandowski
>Assignee: Adam Lewandowski
> Fix For: 1.6.0
>
>
> Currently when using a child-first classloader 
> (spark.driver|executor.userClassPathFirst = true), the getResources method 
> does not return any matching resources from the parent classloader if the 
> child classloader contains any. This is not child-first, it's child-only and 
> is inconsistent with how the default parent-first classloaders work in the 
> JDK (all found resources are returned from both classloaders). It is also 
> inconsistent with how child-first classloaders work in other environments 
> (Servlet containers, for example). 
> ChildFirstURLClassLoader#getResources() should return resources found from 
> both the child and the parent classloaders, placing any found from the child 
> classloader first. 
> For reference, the specific use case where I encountered this problem was 
> running Spark on AWS EMR in a child-first arrangement (due to guava version 
> conflicts), where Akka's configuration file (reference.conf) was made 
> available in the parent classloader, but was not visible to the Typesafe 
> config library which uses Classloader.getResources() on the Thread's context 
> classloader to find them. This resulted in a fatal error from the Config 
> library: "com.typesafe.config.ConfigException$Missing: No configuration 
> setting found for key 'akka.version'" .
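
A minimal sketch of the intended getResources behaviour (illustrative only, not Spark's actual ChildFirstURLClassLoader):
{code}
import java.net.{URL, URLClassLoader}
import java.util.Collections
import scala.collection.JavaConverters._

// Illustrative child-first loader: getResources returns matches from the child
// URLs first, followed by the parent's matches, instead of child-only.
class ChildFirstResources(urls: Array[URL], realParent: ClassLoader)
    extends URLClassLoader(urls, null) {
  override def getResources(name: String): java.util.Enumeration[URL] = {
    val fromChild  = super.getResources(name).asScala.toSeq
    val fromParent = realParent.getResources(name).asScala.toSeq
    Collections.enumeration((fromChild ++ fromParent).asJava)
  }
}
{code}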






[jira] [Resolved] (SPARK-11099) Default conf property file is not loaded

2015-10-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11099.

   Resolution: Fixed
 Assignee: Jeff Zhang
Fix Version/s: 1.6.0

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Critical
> Fix For: 1.6.0
>
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root 
> cause is that the default conf property file is not loaded. 
> The bug is caused by this code snippet in AbstractCommandBuilder
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   // return from here if no propertyFile 
> is provided
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();// default propertyFile 
> will load here
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}
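In other words, spark-defaults.conf should be merged in even when no explicit properties file is passed. A minimal sketch of the intended merge semantics (Scala for brevity; the real class is Java, and the names here are illustrative, not the actual fix):

{code}
// Sketch: always consult the loaded property file; explicitly set conf entries win.
def effectiveConfig(
    conf: Map[String, String],
    loadPropertiesFile: () => Map[String, String]): Map[String, String] = {
  loadPropertiesFile() ++ conf  // right-hand operand (explicit conf) takes precedence
}
{code}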



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11130) TestHive fails on machines with few cores

2015-10-15 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-11130:
--

 Summary: TestHive fails on machines with few cores
 Key: SPARK-11130
 URL: https://issues.apache.org/jira/browse/SPARK-11130
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0, 1.6.0
Reporter: Marcelo Vanzin
Priority: Minor


Filing so it doesn't get lost (again).

TestHive.scala has this code:

{core}
new SparkContext(
  System.getProperty("spark.sql.test.master", "local[32]"),
{core}

On machines with fewer cores, that causes many tests to fail with "unable to 
allocate memory" errors, because the default page size calculation seems to be 
based on the machine's core count, and not on the core count specified for the 
SparkContext.
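
A possible local workaround (untested sketch, relying only on the system property shown above) is to pin the test master to the machine's real core count before TestHive initializes, e.g.:

{code}
// Sketch: keep test concurrency in line with the cores the page-size calculation sees.
System.setProperty("spark.sql.test.master", "local[4]")
{code}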



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11130) TestHive fails on machines with few cores

2015-10-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-11130:
---
Description: 
Filing so it doesn't get lost (again).

TestHive.scala has this code:

{code}
new SparkContext(
  System.getProperty("spark.sql.test.master", "local[32]"),
{code}

On machines with fewer cores, that causes many tests to fail with "unable to 
allocate memory" errors, because the default page size calculation seems to be 
based on the machine's core count, and not on the core count specified for the 
SparkContext.

  was:
Filing so it doesn't get lost (again).

TestHive.scala has this code:

{core}
new SparkContext(
  System.getProperty("spark.sql.test.master", "local[32]"),
{core}

On machines with fewer cores, that causes many tests to fail with "unable to 
allocate memory" errors, because the default page size calculation seems to be 
based on the machine's core count, and not on the core count specified for the 
SparkContext.


> TestHive fails on machines with few cores
> -
>
> Key: SPARK-11130
> URL: https://issues.apache.org/jira/browse/SPARK-11130
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Filing so it doesn't get lost (again).
> TestHive.scala has this code:
> {code}
> new SparkContext(
>   System.getProperty("spark.sql.test.master", "local[32]"),
> {code}
> On machines with fewer cores, that causes many tests to fail with "unable to 
> allocate memory" errors, because the default page size calculation seems to 
> be based on the machine's core count, and not on the core count specified for 
> the SparkContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10217) Spark SQL cannot handle ordering directive in ORDER BY clauses with expressions

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10217.
---
Resolution: Cannot Reproduce

> Spark SQL cannot handle ordering directive in ORDER BY clauses with 
> expressions
> ---
>
> Key: SPARK-10217
> URL: https://issues.apache.org/jira/browse/SPARK-10217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: SQL, analyzers
>
> Spark SQL supports expressions in ORDER BY clauses, e.g.,
> {code}
> scala> sqlContext.sql("select cnt from test order by (cnt + cnt)")
> res2: org.apache.spark.sql.DataFrame = [cnt: bigint]
> {code}
> However, the analyzer gets confused when there is an explicit ordering 
> directive (ASC/DESC):
> {code}
> scala> sqlContext.sql("select cnt from test order by (cnt + cnt) asc")
> 15/08/25 04:08:02 INFO ParseDriver: Parsing command: select cnt from test 
> order by (cnt + cnt) asc
> org.apache.spark.sql.AnalysisException: extraneous input 'asc' expecting EOF 
> near ''; line 1 pos 40
>   at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:289)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> ...
> {code}
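
For anyone hitting this on an affected version, a hedged workaround (untested sketch, not a confirmed fix) is to alias the expression and order by the alias, which avoids the parser path that trips on the trailing ASC/DESC:

{code}
scala> sqlContext.sql("select cnt, (cnt + cnt) as cnt2 from test order by cnt2 asc")
{code}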



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11131) Worker registration protocol is racy

2015-10-15 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-11131:
--

 Summary: Worker registration protocol is racy
 Key: SPARK-11131
 URL: https://issues.apache.org/jira/browse/SPARK-11131
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Marcelo Vanzin
Priority: Minor


I ran into this while making changes to the new RPC framework. Because the 
Worker registration protocol is based on sending unrelated messages between 
Master and Worker, it's possible for another message (e.g. one caused by an app 
trying to allocate workers) to arrive at the Worker before it knows the Master 
has registered it. This triggers the following code:

{code}
case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
  if (masterUrl != activeMasterUrl) {
logWarning("Invalid Master (" + masterUrl + ") attempted to launch 
executor.")
{code}

This may or may not be made worse by SPARK-11098.

A simple workaround is to use an {{ask}} instead of a {{send}} for these 
messages. That should at least narrow the race. 
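
A sketch of that shape (the message and method names below are placeholders, not the exact Worker/Master protocol): the Worker blocks on an acknowledgement before it records the active master and starts accepting executor-launch messages.

{code}
// Sketch: request/response registration instead of fire-and-forget.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

case class RegisterWorker(workerId: String)
case class RegisteredWorker(masterUrl: String)

def registerBlocking(
    ask: RegisterWorker => Future[RegisteredWorker],
    workerId: String): String = {
  val ack = Await.result(ask(RegisterWorker(workerId)), 30.seconds)
  ack.masterUrl  // only now set activeMasterUrl and handle LaunchExecutor
}
{code}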

Note this is more of a problem in {{local-cluster}} mode, used a lot by unit 
tests, where Master and Worker instances are coming up as part of the app 
itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11066) Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler occasionally fails due to j.l.UnsupportedOperationException concerning a finished JobWaiter

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11066.
---
   Resolution: Fixed
Fix Version/s: 1.5.2
   1.6.0

Issue resolved by pull request 9076
[https://github.com/apache/spark/pull/9076]

> Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler 
> occasionally fails due to j.l.UnsupportedOperationException concerning a 
> finished JobWaiter
> --
>
> Key: SPARK-11066
> URL: https://issues.apache.org/jira/browse/SPARK-11066
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, Tests
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: Multiple OS and platform types.
> (Also observed by others, e.g. see External URL)
>Reporter: Dr Stephen A Hellberg
>Priority: Minor
> Fix For: 1.6.0, 1.5.2
>
>
> The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an inherent 
> problem: it creates a job for the DAGScheduler comprising multiple (2) tasks. 
> Although the job will fail and a SparkDriverExecutionException will be 
> returned, there is a race over which exception ends up recorded as its cause: 
> the first task's deliberately thrown DAGSchedulerSuiteDummyException (the 
> setup of the misbehaving test), or the DAGScheduler's legitimate 
> UnsupportedOperationException (a subclass of RuntimeException) raised by the 
> second (and subsequent) tasks, which also complete.  This race is likely 
> down to the vagaries of processing quanta and the expense of throwing two 
> exceptions (under interpreter execution) per thread of control; it is 
> usually 'won' by the first task throwing the DAGSchedulerSuiteDummyException, 
> as desired (and expected)... but not always.
> The problem for the testcase is that the first assertion is largely 
> concerning the test setup, and doesn't (can't? Sorry, still not a ScalaTest 
> expert) capture all the causes of SparkDriverExecutionException that can 
> legitimately arise from a correctly working (not crashed) DAGScheduler.  
> Arguably, this assertion might test something of the DAGScheduler... but not 
> all the possible outcomes for a working DAGScheduler.  Nevertheless, this 
> test - when comprising a multiple task job - will report as a failure when in 
> fact the DAGScheduler is working-as-designed (and not crashed ;-).  
> Furthermore, the test is already failed before it actually tries to use the 
> SparkContext a second time (for an arbitrary processing task), which I think 
> is the real subject of the test?
> The solution, I submit, is to ensure that the job is composed of just one 
> task, and that single task will result in the call to the compromised 
> ResultHandler causing the test's deliberate exception to be thrown and 
> exercising the relevant (DAGScheduler) code paths.  Given tasks are scoped by 
> the number of partitions of an RDD, this could be achieved with a single 
> partitioned RDD (indeed, doing so seems to exercise/would test some default 
> parallelism support of the TaskScheduler?); the pull request offered, 
> however, is based on the minimal change of just using a single partition of 
> the 2 (or more) partition parallelized RDD.  This will result in scheduling a 
> job of just one task, one successful task calling the user-supplied 
> compromised ResultHandler function, which results in failing the job and 
> unambiguously wrapping our DAGSchedulerSuiteException inside a 
> SparkDriverExecutionException; there are no other tasks that on running 
> successfully will find the job failed causing the 'undesired' 
> UnsupportedOperationException to be thrown instead.  This, then, satisfies 
> the test's setup assertion.
> I have tested this hypothesis, having parameterised the number of partitions, N, 
> used by the "misbehaved ResultHandler" job and have observed the 1 x 
> DAGSchedulerSuiteException first, followed by the legitimate N-1 x 
> UnsupportedOperationExceptions ... what propagates back from the job seems to 
> simply become the result of the race between task threads and the 
> intermittent failures observed.
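
A sketch of the single-task job shape the description argues for (illustrative only, not the pull request; it assumes SparkContext.runJob's (rdd, func, partitions, resultHandler) overload and a suite-local dummy exception):

{code}
// Sketch: submit only partition 0, so exactly one task reaches the bad handler.
class DummyException extends Exception

val rdd = sc.parallelize(1 to 10, 2)
sc.runJob[Int, Int](
  rdd,
  (iter: Iterator[Int]) => iter.sum,
  Seq(0),                                           // one partition => one task
  (partition: Int, result: Int) => throw new DummyException)
{code}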



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11066) Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler occasionally fails due to j.l.UnsupportedOperationException concerning a finished JobWaiter

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11066:
--
Assignee: Dr Stephen A Hellberg

> Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler 
> occasionally fails due to j.l.UnsupportedOperationException concerning a 
> finished JobWaiter
> --
>
> Key: SPARK-11066
> URL: https://issues.apache.org/jira/browse/SPARK-11066
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, Tests
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: Multiple OS and platform types.
> (Also observed by others, e.g. see External URL)
>Reporter: Dr Stephen A Hellberg
>Assignee: Dr Stephen A Hellberg
>Priority: Minor
> Fix For: 1.5.2, 1.6.0
>
>
> The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an inherent 
> problem: it creates a job for the DAGScheduler comprising multiple (2) tasks. 
> Although the job will fail and a SparkDriverExecutionException will be 
> returned, there is a race over which exception ends up recorded as its cause: 
> the first task's deliberately thrown DAGSchedulerSuiteDummyException (the 
> setup of the misbehaving test), or the DAGScheduler's legitimate 
> UnsupportedOperationException (a subclass of RuntimeException) raised by the 
> second (and subsequent) tasks, which also complete.  This race is likely 
> down to the vagaries of processing quanta and the expense of throwing two 
> exceptions (under interpreter execution) per thread of control; it is 
> usually 'won' by the first task throwing the DAGSchedulerSuiteDummyException, 
> as desired (and expected)... but not always.
> The problem for the testcase is that the first assertion is largely 
> concerning the test setup, and doesn't (can't? Sorry, still not a ScalaTest 
> expert) capture all the causes of SparkDriverExecutionException that can 
> legitimately arise from a correctly working (not crashed) DAGScheduler.  
> Arguably, this assertion might test something of the DAGScheduler... but not 
> all the possible outcomes for a working DAGScheduler.  Nevertheless, this 
> test - when comprising a multiple task job - will report as a failure when in 
> fact the DAGScheduler is working-as-designed (and not crashed ;-).  
> Furthermore, the test is already failed before it actually tries to use the 
> SparkContext a second time (for an arbitrary processing task), which I think 
> is the real subject of the test?
> The solution, I submit, is to ensure that the job is composed of just one 
> task, and that single task will result in the call to the compromised 
> ResultHandler causing the test's deliberate exception to be thrown and 
> exercising the relevant (DAGScheduler) code paths.  Given tasks are scoped by 
> the number of partitions of an RDD, this could be achieved with a single 
> partitioned RDD (indeed, doing so seems to exercise/would test some default 
> parallelism support of the TaskScheduler?); the pull request offered, 
> however, is based on the minimal change of just using a single partition of 
> the 2 (or more) partition parallelized RDD.  This will result in scheduling a 
> job of just one task, one successful task calling the user-supplied 
> compromised ResultHandler function, which results in failing the job and 
> unambiguously wrapping our DAGSchedulerSuiteException inside a 
> SparkDriverExecutionException; there are no other tasks that on running 
> successfully will find the job failed causing the 'undesired' 
> UnsupportedOperationException to be thrown instead.  This, then, satisfies 
> the test's setup assertion.
> I have tested this hypothesis, having parameterised the number of partitions, N, 
> used by the "misbehaved ResultHandler" job and have observed the 1 x 
> DAGSchedulerSuiteException first, followed by the legitimate N-1 x 
> UnsupportedOperationExceptions ... what propagates back from the job seems to 
> simply become the result of the race between task threads and the 
> intermittent failures observed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11132) Mean Shift algorithm integration

2015-10-15 Thread JIRA
Beck Gaël created SPARK-11132:
-

 Summary: Mean Shift algorithm integration
 Key: SPARK-11132
 URL: https://issues.apache.org/jira/browse/SPARK-11132
 Project: Spark
  Issue Type: Brainstorming
  Components: MLlib
Reporter: Beck Gaël
Priority: Minor


I made a version of the clustering algorithm Mean Shift in scala/Spark and 
would like to contribute if you think that it is a good idea.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11132) Mean Shift algorithm integration

2015-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959245#comment-14959245
 ] 

Sean Owen commented on SPARK-11132:
---

[~Kybe] please have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
regarding if and when algos are integrated into MLlib. What's the case for mean 
shift? 

> Mean Shift algorithm integration
> 
>
> Key: SPARK-11132
> URL: https://issues.apache.org/jira/browse/SPARK-11132
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Beck Gaël
>Priority: Minor
>
> I made a version of the clustering algorithm Mean Shift in scala/Spark and 
> would like to contribute if you think that it is a good idea.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11047) Internal accumulators miss the internal flag when replaying events in the history server

2015-10-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11047.
-
   Resolution: Fixed
 Assignee: Carson Wang
Fix Version/s: 1.6.0
   1.5.2

> Internal accumulators miss the internal flag when replaying events in the 
> history server
> 
>
> Key: SPARK-11047
> URL: https://issues.apache.org/jira/browse/SPARK-11047
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Carson Wang
>Assignee: Carson Wang
>Priority: Critical
> Fix For: 1.5.2, 1.6.0
>
>
> Internal accumulators don't write the internal flag to event log. So on the 
> history server Web UI, all accumulators are not internal. This causes 
> incorrect peak execution memory and unwanted accumulator table displayed on 
> the stage page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-10-15 Thread Ragu Ramaswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959273#comment-14959273
 ] 

Ragu Ramaswamy commented on SPARK-4105:
---

I get this error consistently when using spark-shell on 1.5.1 (win 7)

{code}
scala> sc.textFile("README.md", 1).flatMap(x => x.split(" ")).countByValue()
{code}

This happens for any {{countByValue}}/{{groupByKey}}/{{reduceByKey}} operation. The 
affected-versions tag on this issue mentions 1.2.0, 1.2.1, 1.3.0 and 1.4.1, but not 
1.5.1.

Can someone tell me whether I am doing something wrong, or is this a problem in 
1.5.1 as well?

> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Attachments: JavaObjectToSerialize.java, 
> SparkFailedToUncompressGenerator.scala
>
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.sch

[jira] [Assigned] (SPARK-10186) Add support for more postgres column types

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10186:


Assignee: Apache Spark

> Add support for more postgres column types
> --
>
> Key: SPARK-10186
> URL: https://issues.apache.org/jira/browse/SPARK-10186
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>Assignee: Apache Spark
>
> The specific observations below are based on Postgres 9.4 tables accessed via 
> the postgresql-9.4-1201.jdbc41.jar driver. However, based on the behavior, I 
> would expect the problem to exist for all external SQL databases.
> - *json and jsonb columns generate {{java.sql.SQLException: Unsupported type 
> }}*. While it is reasonable to not support dynamic schema discovery of 
> JSON columns automatically (it requires two passes over the data), a better 
> behavior would be to create a String column and return the JSON.
> - *Array columns generate {{java.sql.SQLException: Unsupported type 2003}}*. 
> This is true even for simple types, e.g., {{text[]}}. A better behavior would 
> be to create an Array column. 
> - *Custom type columns are mapped to a String column.* This behavior is 
> harder to understand as the schema of a custom type is fixed and therefore 
> mappable to a Struct column. The automatic conversion to a string is also 
> inconsistent when compared to json and array column handling.
> The exceptions are thrown by 
> {{org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)}}
>  so this definitely looks like a Spark SQL and not a JDBC problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10186) Add support for more postgres column types

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10186:


Assignee: (was: Apache Spark)

> Add support for more postgres column types
> --
>
> Key: SPARK-10186
> URL: https://issues.apache.org/jira/browse/SPARK-10186
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>
> The specific observations below are based on Postgres 9.4 tables accessed via 
> the postgresql-9.4-1201.jdbc41.jar driver. However, based on the behavior, I 
> would expect the problem to exist for all external SQL databases.
> - *json and jsonb columns generate {{java.sql.SQLException: Unsupported type 
> }}*. While it is reasonable to not support dynamic schema discovery of 
> JSON columns automatically (it requires two passes over the data), a better 
> behavior would be to create a String column and return the JSON.
> - *Array columns generate {{java.sql.SQLException: Unsupported type 2003}}*. 
> This is true even for simple types, e.g., {{text[]}}. A better behavior would 
> be to create an Array column. 
> - *Custom type columns are mapped to a String column.* This behavior is 
> harder to understand as the schema of a custom type is fixed and therefore 
> mappable to a Struct column. The automatic conversion to a string is also 
> inconsistent when compared to json and array column handling.
> The exceptions are thrown by 
> {{org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)}}
>  so this definitely looks like a Spark SQL and not a JDBC problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10186) Add support for more postgres column types

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959289#comment-14959289
 ] 

Apache Spark commented on SPARK-10186:
--

User 'mariusvniekerk' has created a pull request for this issue:
https://github.com/apache/spark/pull/9137

> Add support for more postgres column types
> --
>
> Key: SPARK-10186
> URL: https://issues.apache.org/jira/browse/SPARK-10186
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>
> The specific observations below are based on Postgres 9.4 tables accessed via 
> the postgresql-9.4-1201.jdbc41.jar driver. However, based on the behavior, I 
> would expect the problem to exist for all external SQL databases.
> - *json and jsonb columns generate {{java.sql.SQLException: Unsupported type 
> }}*. While it is reasonable to not support dynamic schema discovery of 
> JSON columns automatically (it requires two passes over the data), a better 
> behavior would be to create a String column and return the JSON.
> - *Array columns generate {{java.sql.SQLException: Unsupported type 2003}}*. 
> This is true even for simple types, e.g., {{text[]}}. A better behavior would 
> be to create an Array column. 
> - *Custom type columns are mapped to a String column.* This behavior is 
> harder to understand as the schema of a custom type is fixed and therefore 
> mappable to a Struct column. The automatic conversion to a string is also 
> inconsistent when compared to json and array column handling.
> The exceptions are thrown by 
> {{org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)}}
>  so this definitely looks like a Spark SQL and not a JDBC problem.
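
For context, one way to approximate the String-column behaviour asked for above is a custom JDBC dialect. The sketch below maps json/jsonb to StringType; it only illustrates the extension point and is not necessarily what the linked pull request does.

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

object JsonAsStringDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  // Postgres reports json/jsonb as Types.OTHER; surface them as plain strings.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (sqlType == Types.OTHER && (typeName == "json" || typeName == "jsonb")) Some(StringType)
    else None
  }
}

JdbcDialects.registerDialect(JsonAsStringDialect)
{code}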



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5739) Size exceeds Integer.MAX_VALUE in File Map

2015-10-15 Thread Karl D. Gierach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959298#comment-14959298
 ] 

Karl D. Gierach commented on SPARK-5739:


Is there any way to increase this block limit?  I'm hitting the same issue 
during a UnionRDD operation.

Also, the issue's state above is "resolved", but I'm not sure what the 
resolution is.
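
The 2GB limit itself is not configurable (it comes from Java's array/ByteBuffer indexing); the usual mitigation is to keep each block/partition under 2GB by increasing parallelism. A hedged sketch for the union case (rdd1/rdd2 and the partition count are placeholders):

{code}
// Sketch: more partitions => smaller blocks, keeping each one under the 2GB limit.
val unioned = sc.union(rdd1, rdd2).repartition(2000)
{code}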


> Size exceeds Integer.MAX_VALUE in File Map
> --
>
> Key: SPARK-5739
> URL: https://issues.apache.org/jira/browse/SPARK-5739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.1
> Environment: Spark1.1.1 on a cluster with 12 node. Every node with 
> 128GB RAM, 24 Core. the data is just 40GB, and there is 48 parallel task on a 
> node.
>Reporter: DjvuLee
>Priority: Minor
>
> I just ran the kmeans algorithm on randomly generated data, but this problem 
> occurred after some iterations. I tried several times, and the problem is 
> reproducible. 
> Because the data is randomly generated, I wonder whether this is a bug. Or, if 
> random data can lead to a scenario where the size is bigger than 
> Integer.MAX_VALUE, can we check the size before using the file map?
> 015-02-11 00:39:36,057 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
> org.apache.spark.util.SizeEstimator - Failed to check whether 
> UseCompressedOops is set; assuming yes
> [error] (run-main-0) java.lang.IllegalArgumentException: Size exceeds 
> Integer.MAX_VALUE
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:850)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>   at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:86)
>   at 
> org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:140)
>   at 
> org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:105)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:747)
>   at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:598)
>   at 
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:869)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:68)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:270)
>   at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
>   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
>   at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338)
>   at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348)
>   at KMeansDataGenerator$.main(kmeans.scala:105)
>   at KMeansDataGenerator.main(kmeans.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
>   at java.lang.reflect.Method.invoke(Method.java:619)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10943) NullType Column cannot be written to Parquet

2015-10-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10943:
-
Description: 
{code}
var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null 
as comments")
{code}

//FAIL - Try writing a NullType column (where all the values are NULL)

{code}
data02.write.parquet("/tmp/test/dataset2")

at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in 
stage 179.0 (TID 39924, 10.0.196.208): org.apache.spark.sql.AnalysisException: 
Unsupported data type StructField(comments,NullType,true).dataType;
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at org.apache.spark.sql.types.StructType.map(StructType.scala:92)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
at 
org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRe

[jira] [Commented] (SPARK-10943) NullType Column cannot be written to Parquet

2015-10-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959304#comment-14959304
 ] 

Michael Armbrust commented on SPARK-10943:
--

Yeah, parquet doesn't have a concept of a null type.  I'd probably suggest they 
cast null to a concrete type, e.g. {{CAST(NULL AS INT)}}, if they really want to 
do this, but most likely you should just omit the column.
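
A quick illustration of that suggestion (untested sketch):

{code}
// Sketch: give the all-null column a concrete type (or drop it) before writing.
val data02 = sqlContext.sql(
  "select 1 as id, \"cat in the hat\" as text, cast(null as string) as comments")
data02.write.parquet("/tmp/test/dataset2")
{code}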

> NullType Column cannot be written to Parquet
> 
>
> Key: SPARK-10943
> URL: https://issues.apache.org/jira/browse/SPARK-10943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jason Pohl
>
> {code}
> var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null 
> as comments")
> {code}
> //FAIL - Try writing a NullType column (where all the values are NULL)
> {code}
> data02.write.parquet("/tmp/test/dataset2")
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 179.0 (TID 39924, 10.0.196.208): 
> org.apache.spark.sql.AnalysisException: Unsupported data type 
> StructField(comments,NullType,true).dataType;
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:92)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
>   at 
> org.apache.parquet.hadoop.ParquetOutpu

[jira] [Commented] (SPARK-11058) failed spark job reports on YARN as successful

2015-10-15 Thread Lan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959320#comment-14959320
 ] 

Lan Jiang commented on SPARK-11058:
---

Sean,

I recreated this problem after I triggered some exceptions in my tasks on 
purpose. The resource manager UI reports the final status to be "succeed" but 
the job shows up in the "incomplete list" on the spark history server.  I do 
see the exception thrown by the driver. 

Driver stacktrace:
   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)

As to the second possibility, it might be true.  I cannot test the scenario on 
the existing cluster. I need to launch a new cluster to test it. Will report 
back


Lan

> failed spark job reports on YARN as successful
> --
>
> Key: SPARK-11058
> URL: https://issues.apache.org/jira/browse/SPARK-11058
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
> Environment: CDH 5.4
>Reporter: Lan Jiang
>Priority: Minor
>
> I have a spark batch job running on CDH5.4 + Spark 1.3.0. Job is submitted in 
> “yarn-client” mode. The job itself failed due to YARN kills several executor 
> containers because the containers exceeded the memory limit posed by YARN. 
> However, when I went to the YARN resource manager site, it displayed the job 
> as successful. I found there was an issue reported in JIRA 
> https://issues.apache.org/jira/browse/SPARK-3627, but it says it was fixed in 
> Spark 1.2. On Spark history server, it shows the job as “Incomplete”. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-15 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959342#comment-14959342
 ] 

Zhan Zhang commented on SPARK-11087:


[~patcharee] I tried a simple case with partition and predicate pushdown, and 
didn't hit the problem. The predicate is pushed down correctly. I will try to 
use the same table as yours to see whether it works.


{code}
case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])
val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sc.parallelize(records).toDF().write.format("orc").partitionBy("age").save("peoplePartitioned")
val peoplePartitioned = sqlContext.read.format("orc").load("peoplePartitioned")
peoplePartitioned.registerTempTable("peoplePartitioned")
sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'").count
sqlContext.sql("SELECT * FROM peoplePartitioned WHERE name = 'name_20' and age = 20").count
{code}

2015-10-15 10:40:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(LESS_THAN age 15)
expr = leaf-0

2015-10-15 10:48:20 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS name name_20)
expr = leaf-0

sqlContext.sql("SELECT name FROM people WHERE age == 15 and age < 16").count()

2015-10-15 10:58:35 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS age 15)
leaf-1 = (LESS_THAN age 16)

sqlContext.sql("SELECT name FROM people WHERE age < 15").count()

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query the table with a where clause:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I still 
> expected the ORC pushdown predicate to be generated (because of the where 
> clause).
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_namedata_type   comment 
>
> date  int 
> hhint 
> x int 
> y int 
> heightfloat   
> u float   
> v float   
> w float   
> phfloat   
> phb   float   
> t float   
> p float   
> pbfloat   
> qvaporfloat   
> qgraupfloat   
> qnice float   
> qnrainfloat   
> tke_pbl   float   
> el_pblfloat   
> qcloudfloat   
>
> # Partition Information
> # col_namedata_type   comment 
>
> zone  int 
> z int 
> year  int 
> month int 
>
> # Detailed Table Information   
> Database: default  
> Owner:patcharee 

[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959347#comment-14959347
 ] 

Michael Armbrust commented on SPARK-:
-

Yeah, that Scala code should work.  Regarding the Java version, the only 
difference is that the API I have in mind would be {{Encoder.for(MyClass2.class)}}.  
Passing in an encoder instead of a raw {{Class[_]}} gives us some extra 
indirection in case we want to support custom encoders some day.  

I'll add that we can also play reflection tricks in cases where things are not 
erased for Java, and this is the part of the proposal that is the least thought 
out at the moment.  Any help making this part as powerful/robust as possible 
would be greatly appreciated.

I think it is possible that in the long term we will do as you propose and 
remake the RDD API as a compatibility layer with the option to infer the 
encoder based on the class tag.  The problem with this being the primary 
implementation is erasure.

{code}
scala> import scala.reflect._

scala> classTag[(Int, Int)].erasure.getTypeParameters
res0: Array[java.lang.reflect.TypeVariable[Class[_$1]]] forSome { type _$1 } = 
Array(T1, T2)
{code}

We've lost the type of {{_1}} and {{_2}} and so we are going to have to fall 
back on runtime reflection again, per tuple.  Whereas the encoders that are 
checked into master could extract primitive int without any additional boxing 
and encode them directly into tungsten buffers.
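
To make the erasure point concrete (a small illustration; Scala 2.11 reflection assumed):

{code}
import scala.reflect.runtime.universe._
import scala.reflect.classTag

val ct = classTag[(Int, Int)]     // runtime class only
println(ct.runtimeClass)          // class scala.Tuple2 -- element types are erased

val tt = typeTag[(Int, Int)].tpe  // full type: (Int, Int)
println(tt.typeArgs)              // List(Int, Int) -- what a TypeTag-based encoder keeps
{code}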

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10943) NullType Column cannot be written to Parquet

2015-10-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10943.
--
Resolution: Won't Fix

> NullType Column cannot be written to Parquet
> 
>
> Key: SPARK-10943
> URL: https://issues.apache.org/jira/browse/SPARK-10943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jason Pohl
>
> {code}
> var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null 
> as comments")
> {code}
> //FAIL - Try writing a NullType column (where all the values are NULL)
> {code}
> data02.write.parquet("/tmp/test/dataset2")
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 179.0 (TID 39924, 10.0.196.208): 
> org.apache.spark.sql.AnalysisException: Unsupported data type 
> StructField(comments,NullType,true).dataType;
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:92)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.Parque
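
A minimal workaround sketch (not from this thread; it assumes a spark-shell style 
{{sqlContext}}): cast the all-null column to a concrete type before writing, so 
Parquet sees a supported data type.
{code}
// Sketch of a workaround, assuming a spark-shell style sqlContext: cast the
// all-NULL column to a concrete type so Parquet gets a supported schema.
import org.apache.spark.sql.types.StringType

val data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null as comments")
val fixed = data02.withColumn("comments", data02("comments").cast(StringType))
fixed.write.parquet("/tmp/test/dataset2")
{code}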

[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?

2015-10-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959357#comment-14959357
 ] 

Joseph K. Bradley commented on SPARK-5874:
--

That sounds useful, but we should add that support to individual models first 
before we make it a part of the Estimator abstraction.  Only a few models have 
it currently, so if there are ones you'd prioritize, it'd be great to get your 
help in adding support.

> How to improve the current ML pipeline API?
> ---
>
> Key: SPARK-5874
> URL: https://issues.apache.org/jira/browse/SPARK-5874
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> I created this JIRA to collect feedbacks about the ML pipeline API we 
> introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
> with confidence, which requires valuable input from the community. I'll 
> create sub-tasks for each major issue.
> Design doc (WIP): 
> https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9919) Matrices should respect Java's equals and hashCode contract

2015-10-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9919:
-
Assignee: (was: Manoj Kumar)

> Matrices should respect Java's equals and hashCode contract
> ---
>
> Key: SPARK-9919
> URL: https://issues.apache.org/jira/browse/SPARK-9919
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Feynman Liang
>Priority: Critical
>
> The contract for Java's Object is that a.equals(b) implies a.hashCode == 
> b.hashCode. So usually we need to implement both. The problem with hashCode 
> is that we shouldn't compute it based on all values, which could be very 
> expensive. You can use the implementation of Vector.hashCode as a template, 
> but that requires some changes to avoid hash code collisions.
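
A minimal sketch of the suggested pattern (an illustration only, not MLlib's actual 
code): hash a bounded sample of entries so that equal matrices produce equal hash 
codes without scanning every value.
{code}
// Illustration only (not MLlib code): equals compares dimensions and all values;
// hashCode samples at most ~16 evenly spaced entries, so equal matrices always
// hash equally but large matrices are never fully scanned.
class DenseMatrixLike(val numRows: Int, val numCols: Int, val values: Array[Double]) {
  override def equals(o: Any): Boolean = o match {
    case m: DenseMatrixLike =>
      numRows == m.numRows && numCols == m.numCols &&
        java.util.Arrays.equals(values, m.values)
    case _ => false
  }
  override def hashCode: Int = {
    var result = 31 * numRows + numCols
    val step = math.max(1, values.length / 16)
    var i = 0
    while (i < values.length) {
      val bits = java.lang.Double.doubleToLongBits(values(i))
      result = 31 * result + (bits ^ (bits >>> 32)).toInt
      i += step
    }
    result
  }
}
{code}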



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11131) Worker registration protocol is racy

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959378#comment-14959378
 ] 

Apache Spark commented on SPARK-11131:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9138

> Worker registration protocol is racy
> 
>
> Key: SPARK-11131
> URL: https://issues.apache.org/jira/browse/SPARK-11131
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> I ran into this while making changes to the new RPC framework. Because the 
> Worker registration protocol is based on sending unrelated messages between 
> Master and Worker, it's possible for another message (e.g. one caused by an 
> app trying to allocate workers) to arrive at the Worker before it knows the 
> Master has registered it. This triggers the following code:
> {code}
> case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
>   if (masterUrl != activeMasterUrl) {
> logWarning("Invalid Master (" + masterUrl + ") attempted to launch 
> executor.")
> {code}
> This may or may not be made worse by SPARK-11098.
> A simple workaround is to use an {{ask}} instead of a {{send}} for these 
> messages. That should at least narrow the race. 
> Note this is more of a problem in {{local-cluster}} mode, used a lot by unit 
> tests, where Master and Worker instances are coming up as part of the app 
> itself.
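
A simplified analogy of the suggested workaround (plain Scala futures, not Spark's 
actual RPC classes): a fire-and-forget {{send}} gives the caller no way to know when 
the Worker has processed the registration, while an {{ask}}-style call completes only 
after the Worker has acknowledged it.
{code}
// Simplified analogy (not Spark's RPC code): "ask" lets the Master wait for the
// Worker's acknowledgement before sending anything like LaunchExecutor.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

class Worker {
  @volatile var activeMasterUrl: String = ""

  // "send" style: the caller learns nothing about when (or whether) this ran.
  def handleRegisteredWorker(masterUrl: String): Unit = { activeMasterUrl = masterUrl }

  // "ask" style: the caller gets a Future that completes after the state is updated.
  def askRegisteredWorker(masterUrl: String): Future[Boolean] = Future {
    activeMasterUrl = masterUrl
    true
  }
}

val worker = new Worker
// Waiting on the acknowledgement closes the window in which a LaunchExecutor
// could arrive before activeMasterUrl is set.
Await.result(worker.askRegisteredWorker("spark://master:7077"), 10.seconds)
assert(worker.activeMasterUrl == "spark://master:7077")
{code}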



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11131) Worker registration protocol is racy

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11131:


Assignee: Apache Spark

> Worker registration protocol is racy
> 
>
> Key: SPARK-11131
> URL: https://issues.apache.org/jira/browse/SPARK-11131
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> I ran into this while making changes to the new RPC framework. Because the 
> Worker registration protocol is based on sending unrelated messages between 
> Master and Worker, it's possible for another message (e.g. one caused by an 
> app trying to allocate workers) to arrive at the Worker before it knows the 
> Master has registered it. This triggers the following code:
> {code}
> case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
>   if (masterUrl != activeMasterUrl) {
> logWarning("Invalid Master (" + masterUrl + ") attempted to launch 
> executor.")
> {code}
> This may or may not be made worse by SPARK-11098.
> A simple workaround is to use an {{ask}} instead of a {{send}} for these 
> messages. That should at least narrow the race. 
> Note this is more of a problem in {{local-cluster}} mode, used a lot by unit 
> tests, where Master and Worker instances are coming up as part of the app 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11131) Worker registration protocol is racy

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11131:


Assignee: (was: Apache Spark)

> Worker registration protocol is racy
> 
>
> Key: SPARK-11131
> URL: https://issues.apache.org/jira/browse/SPARK-11131
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> I ran into this while making changes to the new RPC framework. Because the 
> Worker registration protocol is based on sending unrelated messages between 
> Master and Worker, it's possible for another message (e.g. one caused by an 
> app trying to allocate workers) to arrive at the Worker before it knows the 
> Master has registered it. This triggers the following code:
> {code}
> case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
>   if (masterUrl != activeMasterUrl) {
> logWarning("Invalid Master (" + masterUrl + ") attempted to launch 
> executor.")
> {code}
> This may or may not be made worse by SPARK-11098.
> A simple workaround is to use an {{ask}} instead of a {{send}} for these 
> messages. That should at least narrow the race. 
> Note this is more of a problem in {{local-cluster}} mode, used a lot by unit 
> tests, where Master and Worker instances are coming up as part of the app 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5739) Size exceeds Integer.MAX_VALUE in File Map

2015-10-15 Thread Karl D. Gierach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959298#comment-14959298
 ] 

Karl D. Gierach edited comment on SPARK-5739 at 10/15/15 7:06 PM:
--

Is there any way to increase this block limit? I'm hitting the same issue 
during a UnionRDD operation.

Also, this issue's state above is "resolved", but I'm not sure what the 
resolution is. Maybe a state of "closed" with a reference to the duplicate 
ticket would make it clearer.



was (Author: kgierach):
Is there anyway to increase this block limit?  I'm hitting the same issue 
during a UnionRDD operation.

Also, above this issue's state is "resolved" but I'm not sure what the 
resolution is?


> Size exceeds Integer.MAX_VALUE in File Map
> --
>
> Key: SPARK-5739
> URL: https://issues.apache.org/jira/browse/SPARK-5739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.1
> Environment: Spark1.1.1 on a cluster with 12 node. Every node with 
> 128GB RAM, 24 Core. the data is just 40GB, and there is 48 parallel task on a 
> node.
>Reporter: DjvuLee
>Priority: Minor
>
> I just ran the kmeans algorithm on randomly generated data, but this problem 
> occurred after some iterations. I tried several times, and the problem is 
> reproducible.
> Because the data is randomly generated, I guess there may be a bug. Or, if 
> random data can lead to a scenario where the size is bigger than 
> Integer.MAX_VALUE, can we check the size before using the file map?
> 015-02-11 00:39:36,057 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
> org.apache.spark.util.SizeEstimator - Failed to check whether 
> UseCompressedOops is set; assuming yes
> [error] (run-main-0) java.lang.IllegalArgumentException: Size exceeds 
> Integer.MAX_VALUE
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:850)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>   at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:86)
>   at 
> org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:140)
>   at 
> org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:105)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:747)
>   at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:598)
>   at 
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:869)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:68)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:270)
>   at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
>   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
>   at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338)
>   at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348)
>   at KMeansDataGenerator$.main(kmeans.scala:105)
>   at KMeansDataGenerator.main(kmeans.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
>   at java.lang.reflect.Method.invoke(Method.java:619)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns

2015-10-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959440#comment-14959440
 ] 

Reynold Xin commented on SPARK-9241:


Do we have any idea on performance characteristics of this rewrite? IIUC, 
grouping set's complexity grows exponentially with the number of items in the 
set?

> Supporting multiple DISTINCT columns
> 
>
> Key: SPARK-9241
> URL: https://issues.apache.org/jira/browse/SPARK-9241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now the new aggregation code path only support a single distinct column 
> (you can use it in multiple aggregate functions in the query). We need to 
> support multiple distinct columns by generating a different plan for handling 
> multiple distinct columns (without change aggregate functions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8658) AttributeReference equals method only compare name, exprId and dataType

2015-10-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959451#comment-14959451
 ] 

Michael Armbrust commented on SPARK-8658:
-

There is no query that exposes the problem, as it's an internal quirk. The 
{{equals}} method should check all of the specified fields for equality; today 
it is missing some.

> AttributeReference equals method only compare name, exprId and dataType
> ---
>
> Key: SPARK-8658
> URL: https://issues.apache.org/jira/browse/SPARK-8658
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0
>Reporter: Antonio Jesus Navarro
>
> The AttributeReference "equals" method only treats objects as different if 
> they have a different name, expression id, or dataType. With this behavior, 
> when I tried a "transformExpressionsDown" to transform the qualifiers inside 
> "AttributeReferences", the objects were not replaced, because the 
> transformer considers them equal.
> I propose that the "equals" method take these variables into account:
> name, dataType, nullable, metadata, exprId, qualifiers
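
A small stand-in sketch (not the actual Catalyst class) showing an {{equals}} that 
takes all of the proposed fields into account, with a consistent {{hashCode}}:
{code}
// Stand-in for illustration only (field types are simplified assumptions).
class AttrRef(
    val name: String,
    val dataType: String,
    val nullable: Boolean,
    val metadata: Map[String, String],
    val exprId: Long,
    val qualifiers: Seq[String]) {
  override def equals(o: Any): Boolean = o match {
    case a: AttrRef =>
      name == a.name && dataType == a.dataType && nullable == a.nullable &&
        metadata == a.metadata && exprId == a.exprId && qualifiers == a.qualifiers
    case _ => false
  }
  // Keep hashCode consistent with equals.
  override def hashCode: Int =
    (name, dataType, nullable, metadata, exprId, qualifiers).hashCode
}
{code}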



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11039) Document all UI "retained*" configurations

2015-10-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-11039.

   Resolution: Fixed
Fix Version/s: 1.5.2
   1.6.0

Issue resolved by pull request 9052
[https://github.com/apache/spark/pull/9052]

> Document all UI "retained*" configurations
> --
>
> Key: SPARK-11039
> URL: https://issues.apache.org/jira/browse/SPARK-11039
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Web UI
>Affects Versions: 1.5.1
>Reporter: Nick Pritchard
>Priority: Trivial
> Fix For: 1.6.0, 1.5.2
>
>
> Most are documented except these:
> - spark.sql.ui.retainedExecutions
> - spark.streaming.ui.retainedBatches
> They are really helpful for managing the memory usage of the driver 
> application.
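
A minimal sketch of how these settings are applied (the values are illustrative, 
not recommendations):
{code}
// Illustrative values only: capping retained UI entries bounds driver-side memory
// used by the SQL and streaming tabs.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ui-retention-example")               // hypothetical app name
  .set("spark.sql.ui.retainedExecutions", "200")
  .set("spark.streaming.ui.retainedBatches", "200")
{code}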



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11039) Document all UI "retained*" configurations

2015-10-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11039:
---
Assignee: Nick Pritchard

> Document all UI "retained*" configurations
> --
>
> Key: SPARK-11039
> URL: https://issues.apache.org/jira/browse/SPARK-11039
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Web UI
>Affects Versions: 1.5.1
>Reporter: Nick Pritchard
>Assignee: Nick Pritchard
>Priority: Trivial
> Fix For: 1.5.2, 1.6.0
>
>
> Most are documented except these:
> - spark.sql.ui.retainedExecutions
> - spark.streaming.ui.retainedBatches
> They are really helpful for managing the memory usage of the driver 
> application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5657) Add PySpark Avro Output Format example

2015-10-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5657.
---
Resolution: Won't Fix

> Add PySpark Avro Output Format example
> --
>
> Key: SPARK-5657
> URL: https://issues.apache.org/jira/browse/SPARK-5657
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, PySpark
>Affects Versions: 1.2.0
>Reporter: Stanislav Los
>
> There is an Avro Input Format example that shows how to read Avro data in 
> PySpark, but nothing shows how to write from PySpark to Avro. The main 
> challenge is that a Converter needs an Avro schema to build a record, but the 
> current Spark API doesn't provide a way to supply extra parameters to custom 
> converters. A workaround is possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6488:
---

Assignee: Apache Spark  (was: Mike Dusenberry)

> Support addition/multiplication in PySpark's BlockMatrix
> 
>
> Key: SPARK-6488
> URL: https://issues.apache.org/jira/browse/SPARK-6488
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We 
> should reuse the Scala implementation instead of having a separate 
> implementation in Python.
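
For reference, a minimal sketch of the Scala-side API that the Python wrappers 
would delegate to (spark-shell style, with {{sc}} in scope):
{code}
// Sketch of the existing Scala BlockMatrix operations that PySpark would reuse.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, IndexedRow, IndexedRowMatrix}

val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 4.0))
))
val a: BlockMatrix = new IndexedRowMatrix(rows).toBlockMatrix()

val sum = a.add(a)           // element-wise addition
val product = a.multiply(a)  // distributed matrix multiplication
product.toLocalMatrix()      // small enough here to inspect locally
{code}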



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959534#comment-14959534
 ] 

Apache Spark commented on SPARK-6488:
-

User 'dusenberrymw' has created a pull request for this issue:
https://github.com/apache/spark/pull/9139

> Support addition/multiplication in PySpark's BlockMatrix
> 
>
> Key: SPARK-6488
> URL: https://issues.apache.org/jira/browse/SPARK-6488
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Mike Dusenberry
>
> This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We 
> should reuse the Scala implementation instead of having a separate 
> implementation in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6488:
---

Assignee: Mike Dusenberry  (was: Apache Spark)

> Support addition/multiplication in PySpark's BlockMatrix
> 
>
> Key: SPARK-6488
> URL: https://issues.apache.org/jira/browse/SPARK-6488
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Mike Dusenberry
>
> This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We 
> should reuse the Scala implementation instead of having a separate 
> implementation in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?

2015-10-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959564#comment-14959564
 ] 

Xusen Yin commented on SPARK-5874:
--

I'd love to add support to individual models first. But since there are many 
estimators in the ML package now, I think we'd better add an umbrella JIRA to 
track the process. Can I create new JIRA subtasks under this one?

> How to improve the current ML pipeline API?
> ---
>
> Key: SPARK-5874
> URL: https://issues.apache.org/jira/browse/SPARK-5874
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> I created this JIRA to collect feedbacks about the ML pipeline API we 
> introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
> with confidence, which requires valuable input from the community. I'll 
> create sub-tasks for each major issue.
> Design doc (WIP): 
> https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2015-10-15 Thread Pratik Khadloya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959571#comment-14959571
 ] 

Pratik Khadloya commented on SPARK-2984:


I am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as 
a table (saveAsTable) using SaveMode.Overwrite.

{code}
15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for 
[flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: 
[BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp}
15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet
 (inode 2376521862): File does not exist. Holder 
DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any 
open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on a executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.
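
A minimal sketch of the mitigation discussed in the comments (turning speculation 
off; this narrows the race but is not a root-cause fix):
{code}
// Sketch only: with speculation off, two attempts of the same task cannot race
// to move files out of the shared _temporary directory.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("no-speculation-example")   // hypothetical app name
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)
{code}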

[jira] [Comment Edited] (SPARK-2984) FileNotFoundException on _temporary directory

2015-10-15 Thread Pratik Khadloya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959571#comment-14959571
 ] 

Pratik Khadloya edited comment on SPARK-2984 at 10/15/15 8:40 PM:
--

I am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as 
a table (saveAsTable) using SaveMode.Overwrite.

{code}
15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for 
[flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: 
[BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp}
15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet
 (inode 2376521862): File does not exist. Holder 
DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any 
open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}


Also, I am not running in speculative mode:
{code}
.set("spark.speculation", "false")
{code}


was (Author: tispratik):
Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as 
table ( saveAsTable ) using SaveMode.Overwrite.

{code}
15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for 
[flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: 
[BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp}
15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet
 (inode 2376521862): File does not exist. Holder 
DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any 
open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}

Also, i am not running in speculative mode.
.set("spark.speculation", "false")

> FileNotFoundException on _temporary directory
> ---

[jira] [Comment Edited] (SPARK-2984) FileNotFoundException on _temporary directory

2015-10-15 Thread Pratik Khadloya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959571#comment-14959571
 ] 

Pratik Khadloya edited comment on SPARK-2984 at 10/15/15 8:39 PM:
--

I am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as 
a table (saveAsTable) using SaveMode.Overwrite.

{code}
15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for 
[flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: 
[BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp}
15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet
 (inode 2376521862): File does not exist. Holder 
DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any 
open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}

Also, I am not running in speculative mode:
.set("spark.speculation", "false")


was (Author: tispratik):
Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as 
table ( saveAsTable ) using SaveMode.Overwrite.

{code}
15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for 
[flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: 
[BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp}
15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet
 (inode 2376521862): File does not exist. Holder 
DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any 
open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issue

[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?

2015-10-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959592#comment-14959592
 ] 

Xusen Yin commented on SPARK-5874:
--

Sure I'll do it.

> How to improve the current ML pipeline API?
> ---
>
> Key: SPARK-5874
> URL: https://issues.apache.org/jira/browse/SPARK-5874
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> I created this JIRA to collect feedbacks about the ML pipeline API we 
> introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
> with confidence, which requires valuable input from the community. I'll 
> create sub-tasks for each major issue.
> Design doc (WIP): 
> https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-15 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959599#comment-14959599
 ] 

Zhan Zhang commented on SPARK-11087:


[~patcharee] I tried to duplicate your table as closely as possible, but still 
didn't hit the problem. Please refer to the details below.

case class record(date: Int, hh: Int, x: Int, y: Int, height: Float, u: Float, 
w: Float, ph: Float, phb: Float, t: Float, p: Float, pb: Float, tke_pbl: Float, 
el_pbl: Float, qcloud: Float, zone: Int, z: Int, year: Int, month: Int)

val records = (1 to 100).map { i =>
record(i.toInt, i.toInt, i.toInt, i.toInt, i.toFloat, i.toFloat, i.toFloat, 
i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, 
i.toFloat, i.toInt, i.toInt, i.toInt, i.toInt)
}


sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("5D")
sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").partitionBy("zone","z","year","month").save("4D")
val test = sqlContext.read.format("orc").load("4D")
test.registerTempTable("4D")
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, 
z from 4D where x = 320 and y = 117 and zone == 2 and year=2 and z >= 2 and z 
<= 8").show

2015-10-15 13:37:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS x 320)
leaf-1 = (EQUALS y 117)
expr = (and leaf-0 leaf-1)
sqlContext.sql("select date, month, year, hh, u*0.9122461, 
u*-0.40964267, z from 5D where x = 321 and y = 118 and zone == 2 and year=2 and 
z >= 2 and z <= 8").show
2015-10-15 13:40:06 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS x 321)
leaf-1 = (EQUALS y 118)
expr = (and leaf-0 leaf-1)


> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as a partitioned orc file (see the table 
> schema below). I tried to query the table with a where clause:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate to be generated anyway (because of the 
> where clause).
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_name    data_type    comment
>
> date          int
> hh            int
> x             int
> y             int
> height        float
> u             float
> v             float
> w             float
> ph            float
> phb           float
> t             float
> p             float
> pb            float
> qvapor        float
> qgraup        float
> qnice         float
> qnrain        float
> tke_pbl       float
> el_pbl        float
> qcloud        float
>
> # Partition Information
> # col_name    data_type    comment
>
> zone          int
> z             int
> year          int

[jira] [Comment Edited] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-15 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959599#comment-14959599
 ] 

Zhan Zhang edited comment on SPARK-11087 at 10/15/15 8:58 PM:
--

[~patcharee] I tried to duplicate your table as closely as possible, but still 
didn't hit the problem. Note that the query has to include some valid records 
in the partition; otherwise, partition pruning will trim all predicates 
before hitting the ORC scan. Please refer to the details below.

case class record(date: Int, hh: Int, x: Int, y: Int, height: Float, u: Float, 
w: Float, ph: Float, phb: Float, t: Float, p: Float, pb: Float, tke_pbl: Float, 
el_pbl: Float, qcloud: Float, zone: Int, z: Int, year: Int, month: Int)

val records = (1 to 100).map { i =>
record(i.toInt, i.toInt, i.toInt, i.toInt, i.toFloat, i.toFloat, i.toFloat, 
i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, 
i.toFloat, i.toInt, i.toInt, i.toInt, i.toInt)
}


sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("5D")
sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").partitionBy("zone","z","year","month").save("4D")
val test = sqlContext.read.format("orc").load("4D")
test.registerTempTable("4D")
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z 
from 4D where x = 320 and y = 117 and zone == 2 and year=2 and z >= 2 and z <= 
8").show

2015-10-15 13:37:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS x 320)
leaf-1 = (EQUALS y 117)
expr = (and leaf-0 leaf-1)
sqlContext.sql("select date, month, year, hh, u*0.9122461, 
u*-0.40964267, z from 5D where x = 321 and y = 118 and zone == 2 and year=2 and 
z >= 2 and z <= 8").show
2015-10-15 13:40:06 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS x 321)
leaf-1 = (EQUALS y 118)
expr = (and leaf-0 leaf-1)



was (Author: zzhan):
[~patcharee] I try to duplicate your table as much as possible, but still 
didn't hit the problem. Please refer to the below for the details.

case class record(date: Int, hh: Int, x: Int, y: Int, height: Float, u: Float, 
w: Float, ph: Float, phb: Float, t: Float, p: Float, pb: Float, tke_pbl: Float, 
el_pbl: Float, qcloud: Float, zone: Int, z: Int, year: Int, month: Int)

val records = (1 to 100).map { i =>
record(i.toInt, i.toInt, i.toInt, i.toInt, i.toFloat, i.toFloat, i.toFloat, 
i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, 
i.toFloat, i.toInt, i.toInt, i.toInt, i.toInt)
}


sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("5D")
sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").partitionBy("zone","z","year","month").save("4D")
val test = sqlContext.read.format("orc").load("4D")
2503   test.registerTempTable("4D")
2504   sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
2505  sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, 
z from 4D where x = 320 and y = 117 and zone == 2 and year=2 and z >= 2 and z 
<= 8").show

2015-10-15 13:37:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS x 320)
leaf-1 = (EQUALS y 117)
expr = (and leaf-0 leaf-1)
2507   sqlContext.sql("select date, month, year, hh, u*0.9122461, 
u*-0.40964267, z from 5D where x = 321 and y = 118 and zone == 2 and year=2 and 
z >= 2 and z <= 8").show
2015-10-15 13:40:06 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS x 321)
leaf-1 = (EQUALS y 118)
expr = (and leaf-0 leaf-1)


> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as a partitioned orc file (see the table 
> schema below). I tried to query the table with a where clause:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected 

[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns

2015-10-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959604#comment-14959604
 ] 

Herman van Hovell commented on SPARK-9241:
--

It should grow linearly (or am I missing something?). For example, if we have 3 
grouping sets (like in the example), we would duplicate and project the data 3 
times. It is still bad, but similar to the approach in [~yhuai]'s example 
(saving a join). We could have a problem with the {{GROUPING__ID}} bitmask 
field: only 32/64 fields can fit in a grouping set.
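
A rough sketch of the rewrite being discussed (illustrative SQL through 
{{sqlContext}}, assuming a table {{t}} with integer columns {{a}} and {{b}}; this 
is not the planner's actual output): each input row is expanded once per distinct 
column and tagged with a group id, so the data volume grows linearly with the 
number of distinct columns.
{code}
// Illustration only: rewrite COUNT(DISTINCT a), COUNT(DISTINCT b) without a join.
val expanded = sqlContext.sql("""
  SELECT a, CAST(NULL AS INT) AS b, 1 AS gid FROM t
  UNION ALL
  SELECT CAST(NULL AS INT) AS a, b, 2 AS gid FROM t
""")
expanded.registerTempTable("expanded")

// One row per (gid, value): gid = 1 keeps distinct a values, gid = 2 distinct b values.
sqlContext.sql("SELECT gid, a, b FROM expanded GROUP BY gid, a, b")
  .registerTempTable("deduped")

sqlContext.sql("""
  SELECT
    COUNT(CASE WHEN gid = 1 THEN a END) AS count_distinct_a,
    COUNT(CASE WHEN gid = 2 THEN b END) AS count_distinct_b
  FROM deduped
""").show()
{code}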

> Supporting multiple DISTINCT columns
> 
>
> Key: SPARK-9241
> URL: https://issues.apache.org/jira/browse/SPARK-9241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now the new aggregation code path only support a single distinct column 
> (you can use it in multiple aggregate functions in the query). We need to 
> support multiple distinct columns by generating a different plan for handling 
> multiple distinct columns (without change aggregate functions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11133) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11133:
--
Labels: flaky-test  (was: )

> Flaky test: o.a.s.launcher.LauncherServerSuite
> --
>
> Key: SPARK-11133
> URL: https://issues.apache.org/jira/browse/SPARK-11133
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Andrew Or
>Priority: Critical
>  Labels: flaky-test
>
> {code}
> sbt.ForkMain$ForkError: Expected exception caused by connection timeout.
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.spark.launcher.LauncherServerSuite.testTimeout(LauncherServerSuite.java:140)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3769/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherServerSuite/testTimeout/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11133) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Andrew Or (JIRA)
Andrew Or created SPARK-11133:
-

 Summary: Flaky test: o.a.s.launcher.LauncherServerSuite
 Key: SPARK-11133
 URL: https://issues.apache.org/jira/browse/SPARK-11133
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Andrew Or
Priority: Critical


{code}
sbt.ForkMain$ForkError: Expected exception caused by connection timeout.
at org.junit.Assert.fail(Assert.java:88)
at 
org.apache.spark.launcher.LauncherServerSuite.testTimeout(LauncherServerSuite.java:140)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
{code}

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3769/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherServerSuite/testTimeout/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11134) Flaky test: o.a.s.launcher.LauncherBackendSuite

2015-10-15 Thread Andrew Or (JIRA)
Andrew Or created SPARK-11134:
-

 Summary: Flaky test: o.a.s.launcher.LauncherBackendSuite
 Key: SPARK-11134
 URL: https://issues.apache.org/jira/browse/SPARK-11134
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Andrew Or
Priority: Critical


{code}
sbt.ForkMain$ForkError: The code passed to eventually never returned normally. 
Attempted 110 times over 10.042591494 seconds. Last failure message: The 
reference was null.
at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at 
org.apache.spark.launcher.LauncherBackendSuite.org$apache$spark$launcher$LauncherBackendSuite$$testWithMaster(LauncherBackendSuite.scala:57)
at 
org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply$mcV$sp(LauncherBackendSuite.scala:39)
at 
org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39)
at 
org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39)
{code}

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3768/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherBackendSuite/local__launcher_handle/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11133) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11133.

Resolution: Duplicate

> Flaky test: o.a.s.launcher.LauncherServerSuite
> --
>
> Key: SPARK-11133
> URL: https://issues.apache.org/jira/browse/SPARK-11133
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Andrew Or
>Priority: Critical
>  Labels: flaky-test
>
> {code}
> sbt.ForkMain$ForkError: Expected exception caused by connection timeout.
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.spark.launcher.LauncherServerSuite.testTimeout(LauncherServerSuite.java:140)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3769/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherServerSuite/testTimeout/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?

2015-10-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959589#comment-14959589
 ] 

Joseph K. Bradley commented on SPARK-5874:
--

Sure, that sounds good.  Can you also please search for existing tickets and 
link them to the umbrella?

> How to improve the current ML pipeline API?
> ---
>
> Key: SPARK-5874
> URL: https://issues.apache.org/jira/browse/SPARK-5874
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> I created this JIRA to collect feedbacks about the ML pipeline API we 
> introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
> with confidence, which requires valuable input from the community. I'll 
> create sub-tasks for each major issue.
> Design doc (WIP): 
> https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11134) Flaky test: o.a.s.launcher.LauncherBackendSuite

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11134:
--
Labels: flaky-test  (was: )

> Flaky test: o.a.s.launcher.LauncherBackendSuite
> ---
>
> Key: SPARK-11134
> URL: https://issues.apache.org/jira/browse/SPARK-11134
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Andrew Or
>Priority: Critical
>  Labels: flaky-test
>
> {code}
> sbt.ForkMain$ForkError: The code passed to eventually never returned 
> normally. Attempted 110 times over 10.042591494 seconds. Last failure 
> message: The reference was null.
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.apache.spark.launcher.LauncherBackendSuite.org$apache$spark$launcher$LauncherBackendSuite$$testWithMaster(LauncherBackendSuite.scala:57)
>   at 
> org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply$mcV$sp(LauncherBackendSuite.scala:39)
>   at 
> org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39)
>   at 
> org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39)
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3768/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherBackendSuite/local__launcher_handle/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11071:
--
Component/s: (was: Spark Core)
 Tests

> Flaky test: o.a.s.launcher.LauncherServerSuite
> --
>
> Key: SPARK-11071
> URL: https://issues.apache.org/jira/browse/SPARK-11071
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>  Labels: flaky-test
>
> This test has failed a few times on jenkins, e.g.:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11071:
--
Labels: flaky-test  (was: )

> Flaky test: o.a.s.launcher.LauncherServerSuite
> --
>
> Key: SPARK-11071
> URL: https://issues.apache.org/jira/browse/SPARK-11071
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>  Labels: flaky-test
>
> This test has failed a few times on jenkins, e.g.:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11071:
--
Summary: Flaky test: o.a.s.launcher.LauncherServerSuite  (was: 
LauncherServerSuite::testTimeout is flaky)

> Flaky test: o.a.s.launcher.LauncherServerSuite
> --
>
> Key: SPARK-11071
> URL: https://issues.apache.org/jira/browse/SPARK-11071
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>  Labels: flaky-test
>
> This test has failed a few times on jenkins, e.g.:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11135) Exchange sort-planning logic may incorrect avoid sorts

2015-10-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11135:
---
Description: 
In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
where the data has already been sorted by a superset of the requested sorting 
columns. For instance, let's say that a query calls for an operator's input to 
be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, 
b.asc]`. In this case, we do not need to re-sort the input. The converse, 
however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` 
alone will not satisfy the ordering requirements, requiring an additional sort 
to be planned by Exchange.

However, the current Exchange code gets this wrong and incorrectly skips 
sorting when the existing output ordering is a subset of the required ordering. 
This is simple to fix, however.

This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
affects 1.5.0+.

  was:
In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
where the data has already been sorted by a superset of the requested sorting 
columns. For instance, let's say that a query calls for an operator's input to 
be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, 
b.asc]`. In this case, we do not need to re-sort the input. The converse, 
however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` 
alone will not satisfy the ordering requirements, requiring an additional sort 
to be planned by Exchange.

However, the current Exchange code gets this wrong and incorrectly skips 
sorting when the existing output ordering is a subset of the required ordering. 
This is simple to fix, however.


> Exchange sort-planning logic may incorrect avoid sorts
> --
>
> Key: SPARK-11135
> URL: https://issues.apache.org/jira/browse/SPARK-11135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
> where the data has already been sorted by a superset of the requested sorting 
> columns. For instance, let's say that a query calls for an operator's input 
> to be sorted by `a.asc` and the input happens to already be sorted by 
> `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The 
> converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then 
> `a.asc` alone will not satisfy the ordering requirements, requiring an 
> additional sort to be planned by Exchange.
> However, the current Exchange code gets this wrong and incorrectly skips 
> sorting when the existing output ordering is a subset of the required 
> ordering. This is simple to fix, however.
> This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
> affects 1.5.0+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11135) Exchange sort-planning logic may incorrect avoid sorts

2015-10-15 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-11135:
--

 Summary: Exchange sort-planning logic may incorrect avoid sorts
 Key: SPARK-11135
 URL: https://issues.apache.org/jira/browse/SPARK-11135
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker


In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
where the data has already been sorted by a superset of the requested sorting 
columns. For instance, let's say that a query calls for an operator's input to 
be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, 
b.asc]`. In this case, we do not need to re-sort the input. The converse, 
however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` 
alone will not satisfy the ordering requirements, requiring an additional sort 
to be planned by Exchange.

However, the current Exchange code gets this wrong and incorrectly skips 
sorting when the existing output ordering is a subset of the required ordering. 
This is simple to fix, however.
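
For illustration only, a minimal sketch of the intended check, assuming simplified
SortOrder/needsSort names rather than Spark's actual Exchange code: a required
ordering is satisfied only when it is a non-empty prefix of the existing output
ordering, so an existing ordering that is merely a subset must still trigger a sort.

{code}
// Sketch: decide whether an extra sort is needed before an operator.
// The required ordering is satisfied only if it is a prefix of the child's
// existing output ordering; the converse (existing ordering being a mere
// subset of the requirement) is NOT sufficient.
case class SortOrder(column: String, ascending: Boolean)

def needsSort(required: Seq[SortOrder], existing: Seq[SortOrder]): Boolean = {
  if (required.isEmpty) {
    false                                    // nothing requested, nothing to do
  } else if (existing.length < required.length) {
    true                                     // e.g. existing = [a.asc], required = [a.asc, b.asc]
  } else {
    existing.take(required.length) != required
  }
}

// existing [a.asc, b.asc] satisfies required [a.asc]        -> no sort needed
// existing [a.asc] does NOT satisfy required [a.asc, b.asc] -> plan a sort
{code}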



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is subset of required ordering

2015-10-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11135:
---
Summary: Exchange sort-planning logic incorrectly avoid sorts when existing 
ordering is subset of required ordering  (was: Exchange sort-planning logic may 
incorrect avoid sorts)

> Exchange sort-planning logic incorrectly avoid sorts when existing ordering 
> is subset of required ordering
> --
>
> Key: SPARK-11135
> URL: https://issues.apache.org/jira/browse/SPARK-11135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
> where the data has already been sorted by a superset of the requested sorting 
> columns. For instance, let's say that a query calls for an operator's input 
> to be sorted by `a.asc` and the input happens to already be sorted by 
> `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The 
> converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then 
> `a.asc` alone will not satisfy the ordering requirements, requiring an 
> additional sort to be planned by Exchange.
> However, the current Exchange code gets this wrong and incorrectly skips 
> sorting when the existing output ordering is a subset of the required 
> ordering. This is simple to fix, however.
> This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
> affects 1.5.0+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering

2015-10-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11135:
---
Summary: Exchange sort-planning logic incorrectly avoid sorts when existing 
ordering is non-empty subset of required ordering  (was: Exchange sort-planning 
logic incorrectly avoid sorts when existing ordering is subset of required 
ordering)

> Exchange sort-planning logic incorrectly avoid sorts when existing ordering 
> is non-empty subset of required ordering
> 
>
> Key: SPARK-11135
> URL: https://issues.apache.org/jira/browse/SPARK-11135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
> where the data has already been sorted by a superset of the requested sorting 
> columns. For instance, let's say that a query calls for an operator's input 
> to be sorted by `a.asc` and the input happens to already be sorted by 
> `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The 
> converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then 
> `a.asc` alone will not satisfy the ordering requirements, requiring an 
> additional sort to be planned by Exchange.
> However, the current Exchange code gets this wrong and incorrectly skips 
> sorting when the existing output ordering is a subset of the required 
> ordering. This is simple to fix, however.
> This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
> affects 1.5.0+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-11071.
---
  Resolution: Fixed
   Fix Version/s: 1.6.0
Target Version/s: 1.6.0

> Flaky test: o.a.s.launcher.LauncherServerSuite
> --
>
> Key: SPARK-11071
> URL: https://issues.apache.org/jira/browse/SPARK-11071
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>  Labels: flaky-test
> Fix For: 1.6.0
>
>
> This test has failed a few times on jenkins, e.g.:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10515) When killing executor, the pending replacement executors will be lost

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10515.
---
  Resolution: Fixed
Assignee: KaiXinXIaoLei
   Fix Version/s: 1.6.0
  1.5.2
Target Version/s: 1.5.2, 1.6.0

> When killing executor, the pending replacement executors will be lost
> -
>
> Key: SPARK-10515
> URL: https://issues.apache.org/jira/browse/SPARK-10515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
>Assignee: KaiXinXIaoLei
> Fix For: 1.5.2, 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10412) In SQL tab, show execution memory per physical operator

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10412.
---
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 1.6.0

> In SQL tab, show execution memory per physical operator
> ---
>
> Key: SPARK-10412
> URL: https://issues.apache.org/jira/browse/SPARK-10412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>
> We already display it per task / stage. It's really useful to also display it 
> per operator so the user can know which one caused all the memory to be 
> allocated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11136) Warm-start support for ML estimator

2015-10-15 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-11136:
-

 Summary: Warm-start support for ML estimator
 Key: SPARK-11136
 URL: https://issues.apache.org/jira/browse/SPARK-11136
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xusen Yin
Priority: Minor


The current Estimator implementation does not support warm-start fitting, i.e. 
estimator.fit(data, params, partialModel). This is an umbrella JIRA to add 
warm-start support to all ML estimators.

Possible solutions:

1. Add a warm-start fitting interface, e.g. def fit(dataset: DataFrame, 
initModel: M, paramMap: ParamMap): M

2. Treat the model as a special parameter and pass it through the ParamMap, 
e.g. val partialModel: Param[Option[M]] = new Param(...). If a model is 
supplied, we use it to warm-start; otherwise training starts from scratch. A 
sketch of this option follows below.
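
For illustration, a rough sketch of option 2 with simplified stand-ins (MyModel,
MyEstimator, and this ParamMap are hypothetical, not the actual spark.ml classes):
the initial model travels as an optional parameter and the estimator consults it
before training.

{code}
// Simplified stand-ins for the spark.ml Param/Estimator machinery, only to
// illustrate passing a partial model through a parameter map.
case class MyModel(weights: Array[Double])
case class ParamMap(initModel: Option[MyModel] = None)

class MyEstimator {
  def fit(data: Seq[(Double, Array[Double])], params: ParamMap): MyModel = {
    // Warm start: begin from the supplied model's weights if present,
    // otherwise start from zeros (assumes non-empty data for the sketch).
    val start = params.initModel
      .map(_.weights)
      .getOrElse(Array.fill(data.head._2.length)(0.0))
    train(data, start)
  }

  private def train(data: Seq[(Double, Array[Double])],
                    init: Array[Double]): MyModel = {
    // ... iterative optimization starting from `init` goes here ...
    MyModel(init)
  }
}
{code}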





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11135:


Assignee: Apache Spark  (was: Josh Rosen)

> Exchange sort-planning logic incorrectly avoid sorts when existing ordering 
> is non-empty subset of required ordering
> 
>
> Key: SPARK-11135
> URL: https://issues.apache.org/jira/browse/SPARK-11135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Blocker
>
> In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
> where the data has already been sorted by a superset of the requested sorting 
> columns. For instance, let's say that a query calls for an operator's input 
> to be sorted by `a.asc` and the input happens to already be sorted by 
> `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The 
> converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then 
> `a.asc` alone will not satisfy the ordering requirements, requiring an 
> additional sort to be planned by Exchange.
> However, the current Exchange code gets this wrong and incorrectly skips 
> sorting when the existing output ordering is a subset of the required 
> ordering. This is simple to fix, however.
> This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
> affects 1.5.0+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959735#comment-14959735
 ] 

Apache Spark commented on SPARK-11135:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9140

> Exchange sort-planning logic incorrectly avoid sorts when existing ordering 
> is non-empty subset of required ordering
> 
>
> Key: SPARK-11135
> URL: https://issues.apache.org/jira/browse/SPARK-11135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
> where the data has already been sorted by a superset of the requested sorting 
> columns. For instance, let's say that a query calls for an operator's input 
> to be sorted by `a.asc` and the input happens to already be sorted by 
> `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The 
> converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then 
> `a.asc` alone will not satisfy the ordering requirements, requiring an 
> additional sort to be planned by Exchange.
> However, the current Exchange code gets this wrong and incorrectly skips 
> sorting when the existing output ordering is a subset of the required 
> ordering. This is simple to fix, however.
> This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
> affects 1.5.0+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11135:


Assignee: Josh Rosen  (was: Apache Spark)

> Exchange sort-planning logic incorrectly avoid sorts when existing ordering 
> is non-empty subset of required ordering
> 
>
> Key: SPARK-11135
> URL: https://issues.apache.org/jira/browse/SPARK-11135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
> where the data has already been sorted by a superset of the requested sorting 
> columns. For instance, let's say that a query calls for an operator's input 
> to be sorted by `a.asc` and the input happens to already be sorted by 
> `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The 
> converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then 
> `a.asc` alone will not satisfy the ordering requirements, requiring an 
> additional sort to be planned by Exchange.
> However, the current Exchange code gets this wrong and incorrectly skips 
> sorting when the existing output ordering is a subset of the required 
> ordering. This is simple to fix, however.
> This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
> affects 1.5.0+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10829:
--
Assignee: Cheng Hao

> Scan DataSource with predicate expression combine partition key and 
> attributes doesn't work
> ---
>
> Key: SPARK-10829
> URL: https://issues.apache.org/jira/browse/SPARK-10829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Critical
> Fix For: 1.6.0
>
>
> To reproduce it, run the following:
> {code}
> withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
>   withTempPath { dir =>
>     val path = s"${dir.getCanonicalPath}/part=1"
>     (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
>     // If the "part = 1" filter gets pushed down, this query will throw
>     // an exception since "part" is not a valid column in the actual Parquet file
>     checkAnswer(
>       sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 1)"),
>       (2 to 3).map(i => Row(i, i.toString, 1)))
>   }
> }
> {code}
> We expect the result as:
> {code}
> 2, 1
> 3, 1
> {code}
> But we got:
> {code}
> 1, 1
> 2, 1
> 3, 1
> {code}
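
As an illustration of the kind of split that avoids this (the names below are
hypothetical, not the actual data-source code): only conjuncts that reference no
partition columns are safe to push into the Parquet scan; anything mentioning a
partition key, such as the mixed disjunction above, must be evaluated after the
partition value is re-attached.

{code}
// Illustrative predicate model: a filter is its SQL text plus the set of
// columns it references.
case class Predicate(sql: String, references: Set[String])

// Split the conjuncts of a WHERE clause into those safe to push into the
// file scan (no partition columns referenced) and those that must be
// evaluated on top, after partition values are joined back in.
def splitConjuncts(conjuncts: Seq[Predicate],
                   partitionColumns: Set[String]): (Seq[Predicate], Seq[Predicate]) =
  conjuncts.partition(p => p.references.intersect(partitionColumns).isEmpty)

val conjuncts = Seq(
  Predicate("a > 0", Set("a")),
  Predicate("part = 0 OR a > 1", Set("part", "a")))  // mixed: NOT pushable

val (pushed, remaining) = splitConjuncts(conjuncts, partitionColumns = Set("part"))
// pushed    -> Seq(Predicate("a > 0", ...))             safe to push to Parquet
// remaining -> Seq(Predicate("part = 0 OR a > 1", ...)) evaluate after the scan
{code}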



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5391) SparkSQL fails to create tables with custom JSON SerDe

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5391:
-
Assignee: Davies Liu

> SparkSQL fails to create tables with custom JSON SerDe
> --
>
> Key: SPARK-5391
> URL: https://issues.apache.org/jira/browse/SPARK-5391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: David Ross
>Assignee: Davies Liu
> Fix For: 1.6.0
>
>
> - Using Spark built from trunk on this commit: 
> https://github.com/apache/spark/commit/bc20a52b34e826895d0dcc1d783c021ebd456ebd
> - Build for Hive13
> - Using this JSON serde: https://github.com/rcongiu/Hive-JSON-Serde
> First download jar locally:
> {code}
> $ curl 
> http://www.congiu.net/hive-json-serde/1.3/cdh5/json-serde-1.3-jar-with-dependencies.jar
>  > /tmp/json-serde-1.3-jar-with-dependencies.jar
> {code}
> Then add it in SparkSQL session:
> {code}
> add jar /tmp/json-serde-1.3-jar-with-dependencies.jar
> {code}
> Finally create table:
> {code}
> create table test_json (c1 boolean) ROW FORMAT SERDE 
> 'org.openx.data.jsonserde.JsonSerDe';
> {code}
> Logs for add jar:
> {code}
> 15/01/23 23:48:33 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'add jar /tmp/json-serde-1.3-jar-with-dependencies.jar'
> 15/01/23 23:48:34 INFO session.SessionState: No Tez session required at this 
> point. hive.execution.engine=mr.
> 15/01/23 23:48:34 INFO SessionState: Added 
> /tmp/json-serde-1.3-jar-with-dependencies.jar to class path
> 15/01/23 23:48:34 INFO SessionState: Added resource: 
> /tmp/json-serde-1.3-jar-with-dependencies.jar
> 15/01/23 23:48:34 INFO spark.SparkContext: Added JAR 
> /tmp/json-serde-1.3-jar-with-dependencies.jar at 
> http://192.168.99.9:51312/jars/json-serde-1.3-jar-with-dependencies.jar with 
> timestamp 1422056914776
> 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result 
> Schema: List()
> 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result 
> Schema: List()
> {code}
> Logs (with error) for create table:
> {code}
> 15/01/23 23:49:00 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'create table test_json (c1 boolean) ROW FORMAT SERDE 
> 'org.openx.data.jsonserde.JsonSerDe''
> 15/01/23 23:49:00 INFO parse.ParseDriver: Parsing command: create table 
> test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed
> 15/01/23 23:49:01 INFO session.SessionState: No Tez session required at this 
> point. hive.execution.engine=mr.
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO ql.Driver: Concurrency mode is disabled, not creating 
> a lock manager
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO parse.ParseDriver: Parsing command: create table 
> test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed
> 15/01/23 23:49:01 INFO log.PerfLogger:  start=1422056941103 end=1422056941104 duration=1 
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
> 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Creating table test_json 
> position=13
> 15/01/23 23:49:01 INFO ql.Driver: Semantic Analysis Completed
> 15/01/23 23:49:01 INFO log.PerfLogger:  start=1422056941104 end=1422056941240 duration=136 
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO ql.Driver: Returning Hive schema: 
> Schema(fieldSchemas:null, properties:null)
> 15/01/23 23:49:01 INFO log.PerfLogger:  start=1422056941071 end=1422056941252 duration=181 
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO ql.Driver: Starting command: create table test_json 
> (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> 15/01/23 23:49:01 INFO log.PerfLogger:  start=1422056941067 end=1422056941258 duration=191 
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 WARN security.ShellBasedUnixGroupsMapping: got exception 
> trying to get groups for user anonymous
> org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
>

[jira] [Updated] (SPARK-11032) Failure to resolve having correctly

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11032:
--
Assignee: Wenchen Fan

> Failure to resolve having correctly
> ---
>
> Key: SPARK-11032
> URL: https://issues.apache.org/jira/browse/SPARK-11032
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Michael Armbrust
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.6.0
>
>
> This is a regression from Spark 1.4
> {code}
> Seq(("michael", 30)).toDF("name", "age").registerTempTable("people")
> sql("SELECT MIN(t0.age) FROM (SELECT * FROM PEOPLE WHERE age > 0) t0 
> HAVING(COUNT(1) > 0)").explain(true)
> == Parsed Logical Plan ==
> 'Filter cast(('COUNT(1) > 0) as boolean)
>  'Project [unresolvedalias('MIN('t0.age))]
>   'Subquery t0
>'Project [unresolvedalias(*)]
> 'Filter ('age > 0)
>  'UnresolvedRelation [PEOPLE], None
> == Analyzed Logical Plan ==
> _c0: int
> Filter cast((count(1) > cast(0 as bigint)) as boolean)
>  Aggregate [min(age#6) AS _c0#9]
>   Subquery t0
>Project [name#5,age#6]
> Filter (age#6 > 0)
>  Subquery people
>   Project [_1#3 AS name#5,_2#4 AS age#6]
>LocalRelation [_1#3,_2#4], [[michael,30]]
> == Optimized Logical Plan ==
> Filter (count(1) > 0)
>  Aggregate [min(age#6) AS _c0#9]
>   Project [_2#4 AS age#6]
>Filter (_2#4 > 0)
> LocalRelation [_1#3,_2#4], [[michael,30]]
> == Physical Plan ==
> Filter (count(1) > 0)
>  TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Final,isDistinct=false)], output=[_c0#9])
>   TungstenExchange SinglePartition
>TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Partial,isDistinct=false)], output=[min#12])
> TungstenProject [_2#4 AS age#6]
>  Filter (_2#4 > 0)
>   LocalTableScan [_1#3,_2#4], [[michael,30]]
> Code Generation: true
> {code}
> {code}
> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: count(1)
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:188)
>   at 
> org.apache.spark.sql.catalyst.expressions.Count.eval(aggregates.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:327)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11076) Decimal Support for Ceil/Floor

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11076:
--
Assignee: Cheng Hao

> Decimal Support for Ceil/Floor
> --
>
> Key: SPARK-11076
> URL: https://issues.apache.org/jira/browse/SPARK-11076
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> Currently, Ceil & Floor don't support decimals, but Hive does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11068) Add callback to query execution

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11068:
--
Assignee: Wenchen Fan

> Add callback to query execution
> ---
>
> Key: SPARK-11068
> URL: https://issues.apache.org/jira/browse/SPARK-11068
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11123) Improve HistoryServer with multithreading to replay logs

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11123.
---
Resolution: Duplicate

[~xietingwen] please search JIRAs before opening a new one.

> Improve HistoryServer with multithreading to replay logs
> 
>
> Key: SPARK-11123
> URL: https://issues.apache.org/jira/browse/SPARK-11123
> Project: Spark
>  Issue Type: Improvement
>Reporter: Xie Tingwen
>
> With Spark 1.4, when I restart the HistoryServer it takes over 30 hours to 
> replay more than 40,000 log files. What's more, once it has started, a single 
> log can take half an hour to replay and blocks the other logs from being 
> replayed. How about rewriting the replay logic with multiple threads to speed 
> it up?
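
A minimal sketch of the idea, assuming a hypothetical replay(logPath) function
and plain java.util.concurrent rather than the actual HistoryServer code:

{code}
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical: parse one event-log file and build its application UI data.
def replay(logPath: String): Unit = { /* ... */ }

def replayAll(logPaths: Seq[String], threads: Int = 8): Unit = {
  val pool = Executors.newFixedThreadPool(threads)
  // Submit each log independently so one slow or huge log no longer blocks
  // the others from being replayed.
  logPaths.foreach { p =>
    pool.submit(new Runnable {
      override def run(): Unit = replay(p)
    })
  }
  pool.shutdown()
  pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
}
{code}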



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11124) JsonParser/Generator should be closed for resource recycle

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11124:
--
Component/s: Spark Core

> JsonParser/Generator should be closed for resource recycle
> --
>
> Key: SPARK-11124
> URL: https://issues.apache.org/jira/browse/SPARK-11124
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Navis
>Priority: Trivial
>
> Some JSON parsers are not closed; the parser in JacksonParser#parseJson, for 
> example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11128) Strange NPE when writing to a non-existent S3 bucket

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11128:
--
Component/s: Input/Output

> Strange NPE when writing to a non-existent S3 bucket
> --
>
> Key: SPARK-11128
> URL: https://issues.apache.org/jira/browse/SPARK-11128
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.1
>Reporter: mathieu despriee
>Priority: Minor
>
> For the record, as it's relatively minor and related to s3n (not tested with 
> s3a): by mistake, we tried writing a Parquet DataFrame to a non-existent S3 
> bucket with a simple df.write.parquet(s3path).
> We got an NPE (see the stack trace below), which is very misleading.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:73)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
> at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11102) Uninformative exception when specifying non-existent input for JSON data source

2015-10-15 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11102:
---
Summary: Uninformative exception when specifying non-existent input for JSON 
data source  (was: Unreadable exception when specifying non-existent input for 
JSON data source)

> Uninformative exception when specifying non-existent input for JSON data source
> ---
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the 
> following exception is thrown, and it is not readable.
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> <console>:19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
>   at $iwC$$iwC$$iwC.<init>(<console>:32)
>   at $iwC$$iwC.<init>(<console>:34)
>   at $iwC.<init>(<console>:36)
> {code}
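
One way to make this readable would be an up-front path check; below is a hedged
sketch using the standard Hadoop FileSystem API, with the error type and message
chosen for illustration rather than taken from any actual fix.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Validate user-supplied input paths before handing them to the JSON relation,
// so a typo fails with a clear message instead of a deep
// "No input paths specified in job" from Hadoop.
def checkInputPaths(paths: Seq[String], hadoopConf: Configuration): Unit = {
  val missing = paths.filterNot { p =>
    val hdfsPath = new Path(p)
    val fs = hdfsPath.getFileSystem(hadoopConf)
    fs.exists(hdfsPath)
  }
  if (missing.nonEmpty) {
    throw new IllegalArgumentException(
      s"Input path does not exist: ${missing.mkString(", ")}")
  }
}
{code}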



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11137) Make StreamingContext.stop() exception-safe

2015-10-15 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-11137:


 Summary: Make StreamingContext.stop() exception-safe
 Key: SPARK-11137
 URL: https://issues.apache.org/jira/browse/SPARK-11137
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.5.1
Reporter: Felix Cheung
Priority: Minor


In StreamingContext.stop(), when an exception is thrown, the rest of the 
stop/cleanup actions are aborted.

Discussed in https://github.com/apache/spark/pull/9116, where srowen commented:
Hm, this is getting unwieldy. There are several nested try blocks here. The 
same argument goes for many of these methods -- if one fails should they not 
continue trying? A more tidy solution would be to execute a series of () -> 
Unit code blocks that perform some cleanup and make sure that they each fire in 
succession, regardless of the others. The final one to remove the shutdown hook 
could occur outside synchronization.

I realize we're expanding the scope of the change here, but is it maybe 
worthwhile to go all the way here?

Really, something similar could be done for SparkContext and there's an 
existing JIRA for it somewhere.

At least, I'd prefer to either narrowly fix the deadlock here, or fix all of 
the finally-related issue separately and all at once.
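
A small sketch of the pattern suggested above, assuming a generic helper rather
than the actual StreamingContext code: each cleanup step is a () => Unit executed
in its own try, so one failure is logged and the remaining steps still run.

{code}
import scala.util.control.NonFatal

// Run each named cleanup block in isolation; log and continue on failure so
// one failing step cannot abort the rest of stop().
def runAll(steps: (String, () => Unit)*): Unit = {
  steps.foreach { case (name, body) =>
    try {
      body()
    } catch {
      case NonFatal(e) =>
        println(s"Ignoring exception while stopping ($name): $e")
    }
  }
}

// Hypothetical usage inside a stop() method:
// runAll(
//   "stop scheduler"       -> (() => scheduler.stop()),
//   "stop event loop"      -> (() => eventLoop.stop()),
//   "remove shutdown hook" -> (() => removeShutdownHook()))
{code}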




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11138) Flaky pyspark test: test_add_py_file

2015-10-15 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-11138:
--

 Summary: Flaky pyspark test: test_add_py_file
 Key: SPARK-11138
 URL: https://issues.apache.org/jira/browse/SPARK-11138
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.6.0
Reporter: Marcelo Vanzin


This test fails pretty often when running PR tests. For example:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43800/console

{noformat}
==
ERROR: test_add_py_file (__main__.AddFileTests)
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", 
line 396, in test_add_py_file
res = self.sc.parallelize(range(2)).map(func).first()
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 
1315, in first
rs = self.take(1)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 
1297, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/context.py", 
line 923, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
partitions)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
self.target_id, self.name)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
 line 300, in get_return_value
format(target_id, '.', name), value)
Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
stage 3.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3.0 (TID 
7, localhost): org.apache.spark.api.python.PythonException: Traceback (most 
recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py",
 line 111, in main
process()
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py",
 line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/serializers.py",
 line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 
1293, in takeUpToNumLeft
yield next(iterator)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", 
line 388, in func
from userlibrary import UserClass
ImportError: cannot import name UserClass

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1427)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1415)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1414)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1414)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:793)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:793)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:793)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1639)
at 
org.
