[jira] [Commented] (SPARK-6397) Override QueryPlan.missingInput when necessary and rely on CheckAnalysis

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375477#comment-14375477
 ] 

Apache Spark commented on SPARK-6397:
-

User 'watermen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5132

 Override QueryPlan.missingInput when necessary and rely on CheckAnalysis
 

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yadong Qi
Priority: Minor

 Currently, some LogicalPlans do not override missingInput even though they should. 
 As a result, the lack of proper missingInput implementations leaks into CheckAnalysis.
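 For illustration, a hedged sketch (the operator and attribute names here are hypothetical, 
 not the actual patch) of how a plan that produces its own attributes can override 
 missingInput so that CheckAnalysis does not report those attributes as missing:
 {code}
 import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet}
 import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

 // Hypothetical operator: it emits `generatedOutput` itself, so those attributes
 // are not "missing" even though no child supplies them.
 case class MyGenerate(generatedOutput: Seq[Attribute], child: LogicalPlan) extends UnaryNode {
   override def output: Seq[Attribute] = child.output ++ generatedOutput
   override def missingInput: AttributeSet = super.missingInput -- generatedOutput
 }
 {code}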






[jira] [Commented] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375508#comment-14375508
 ] 

Littlestar commented on SPARK-6461:
---

http://spark.apache.org/docs/latest/running-on-mesos.html
./bin/spark-shell --master mesos://host:5050 runs fine for me,

but run-example SparkPi fails.

Has SparkPi been tested on Mesos with Spark 1.3.0? Thanks.

 spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
 --

 Key: SPARK-6461
 URL: https://issues.apache.org/jira/browse/SPARK-6461
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.3.0
Reporter: Littlestar

 I use Mesos to run Spark 1.3.0 ./run-example SparkPi,
 but it fails.
 spark.executorEnv.PATH in spark-defaults.conf is not passed to Mesos:
 spark.executorEnv.PATH
 spark.executorEnv.HADOOP_HOME
 spark.executorEnv.JAVA_HOME
 E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop 
 fs -copyToLocal 
 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' 
 '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz'
 sh: hadoop: command not found
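 For reference, a hedged sketch of the same executor environment settings set 
 programmatically (the paths are made up; in this report the entries live in 
 spark-defaults.conf instead):
 {code}
 import org.apache.spark.SparkConf

 // spark.executorEnv.<NAME> entries become environment variables of the executor
 // processes launched on the Mesos slaves; values below are hypothetical.
 val conf = new SparkConf()
   .set("spark.executorEnv.PATH", "/usr/local/bin:/usr/bin:/bin")
   .set("spark.executorEnv.HADOOP_HOME", "/opt/hadoop-2.4.0")
   .set("spark.executorEnv.JAVA_HOME", "/opt/jdk1.7.0")
 {code}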






[jira] [Commented] (SPARK-6213) sql.catalyst.expressions.Expression is not friendly to java

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375612#comment-14375612
 ] 

Littlestar commented on SPARK-6213:
---

Maybe changing protected[sql] def selectFilters(filters: Seq[Expression]) into a 
public, statically accessible Java-friendly entry point would be an easy way to do this.

 sql.catalyst.expressions.Expression is not friendly to java
 ---

 Key: SPARK-6213
 URL: https://issues.apache.org/jira/browse/SPARK-6213
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Littlestar
Priority: Minor

 sql.sources.CatalystScan#
 public RDD<Row> buildScan(Seq<Attribute> requiredColumns, Seq<Expression> filters)
 I use Java to extend BaseRelation, but sql.catalyst.expressions.Expression 
 is not friendly to Java: it cannot easily be traversed from Java to get at 
 things such as the node name, node type, function name, and function arguments.
 DataSourceStrategy.scala#selectFilters
 {noformat}
 /**
  * Selects Catalyst predicate [[Expression]]s which are convertible into data source [[Filter]]s,
  * and convert them.
  */
 protected[sql] def selectFilters(filters: Seq[Expression]) = {
   def translate(predicate: Expression): Option[Filter] = predicate match {
     case expressions.EqualTo(a: Attribute, Literal(v, _)) =>
       Some(sources.EqualTo(a.name, v))
     case expressions.EqualTo(Literal(v, _), a: Attribute) =>
       Some(sources.EqualTo(a.name, v))
     case expressions.GreaterThan(a: Attribute, Literal(v, _)) =>
       Some(sources.GreaterThan(a.name, v))
     case expressions.GreaterThan(Literal(v, _), a: Attribute) =>
       Some(sources.LessThan(a.name, v))
     case expressions.LessThan(a: Attribute, Literal(v, _)) =>
       Some(sources.LessThan(a.name, v))
     case expressions.LessThan(Literal(v, _), a: Attribute) =>
       Some(sources.GreaterThan(a.name, v))
     case expressions.GreaterThanOrEqual(a: Attribute, Literal(v, _)) =>
       Some(sources.GreaterThanOrEqual(a.name, v))
     case expressions.GreaterThanOrEqual(Literal(v, _), a: Attribute) =>
       Some(sources.LessThanOrEqual(a.name, v))
     case expressions.LessThanOrEqual(a: Attribute, Literal(v, _)) =>
       Some(sources.LessThanOrEqual(a.name, v))
     case expressions.LessThanOrEqual(Literal(v, _), a: Attribute) =>
       Some(sources.GreaterThanOrEqual(a.name, v))
     case expressions.InSet(a: Attribute, set) =>
       Some(sources.In(a.name, set.toArray))
     case expressions.IsNull(a: Attribute) =>
       Some(sources.IsNull(a.name))
     case expressions.IsNotNull(a: Attribute) =>
       Some(sources.IsNotNull(a.name))
     case expressions.And(left, right) =>
       (translate(left) ++ translate(right)).reduceOption(sources.And)
     case expressions.Or(left, right) =>
       for {
         leftFilter <- translate(left)
         rightFilter <- translate(right)
       } yield sources.Or(leftFilter, rightFilter)
     case expressions.Not(child) =>
       translate(child).map(sources.Not)
     case _ => None
   }
   filters.flatMap(translate)
 }
 {noformat}
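 For comparison, a hedged sketch (relation name and schema are made up) of the 
 PrunedFilteredScan interface, which hands a data source the already-translated 
 sources.Filter case classes instead of raw Catalyst Expressions and is therefore 
 much easier to consume from Java:
 {code}
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Row, SQLContext}
 import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
 import org.apache.spark.sql.types.{StringType, StructField, StructType}

 // Hypothetical relation: the pushed-down filters arrive as simple case classes
 // (EqualTo, GreaterThan, In, ...) that can be inspected without walking Expression trees.
 class NameRelation(override val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {
   override def schema: StructType = StructType(StructField("name", StringType) :: Nil)

   override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
     val wantedNames = filters.collect { case EqualTo("name", value) => value.toString }
     sqlContext.sparkContext.parallelize(wantedNames.map(Row(_)))
   }
 }
 {code}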






[jira] [Issue Comment Deleted] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos

2015-03-23 Thread Littlestar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Littlestar updated SPARK-6461:
--
Comment: was deleted

(was: in spark/bin, some shell scripts use #!/usr/bin/env bash.

I think changing #!/usr/bin/env bash to #!/bin/bash made it work.)

 spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
 --

 Key: SPARK-6461
 URL: https://issues.apache.org/jira/browse/SPARK-6461
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.3.0
Reporter: Littlestar

 I use Mesos to run Spark 1.3.0 ./run-example SparkPi,
 but it fails.
 spark.executorEnv.PATH in spark-defaults.conf is not passed to Mesos:
 spark.executorEnv.PATH
 spark.executorEnv.HADOOP_HOME
 spark.executorEnv.JAVA_HOME
 E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop 
 fs -copyToLocal 
 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' 
 '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz'
 sh: hadoop: command not found






[jira] [Commented] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py

2015-03-23 Thread Andrew Drozdov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376579#comment-14376579
 ] 

Andrew Drozdov commented on SPARK-6474:
---

Great, and thanks. Taking a look now.

 Replace image.run with connection.run_instances in spark_ec2.py
 ---

 Key: SPARK-6474
 URL: https://issues.apache.org/jira/browse/SPARK-6474
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Andrew Drozdov
Priority: Minor

 After looking at an issue in Boto [1], ec2.image.Image.run and 
 ec2.connection.EC2Connection.run_instances are similar calls, but 
 run_instances appears to have more features and is more up to date. For 
 example, run_instances has the capability to launch ebs_optimized instances 
 while run does not. The run call is being used in only a couple places in 
 spark_ec2.py, so let's replace it with run_instances.
 [1] https://github.com/boto/boto/issues/3054






[jira] [Updated] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py

2015-03-23 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-6474:

Issue Type: Improvement  (was: Bug)

 Replace image.run with connection.run_instances in spark_ec2.py
 ---

 Key: SPARK-6474
 URL: https://issues.apache.org/jira/browse/SPARK-6474
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Andrew Drozdov
Priority: Minor

 After looking at an issue in Boto [1], ec2.image.Image.run and 
 ec2.connection.EC2Connection.run_instances are similar calls, but 
 run_instances appears to have more features and is more up to date. For 
 example, run_instances has the capability to launch ebs_optimized instances 
 while run does not. The run call is being used in only a couple places in 
 spark_ec2.py, so let's replace it with run_instances.
 [1] https://github.com/boto/boto/issues/3054






[jira] [Updated] (SPARK-6477) Run MIMA tests before the Spark test suite

2015-03-23 Thread Brennon York (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brennon York updated SPARK-6477:

Issue Type: Improvement  (was: Bug)

 Run MIMA tests before the Spark test suite
 --

 Key: SPARK-6477
 URL: https://issues.apache.org/jira/browse/SPARK-6477
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Brennon York
Priority: Minor

 Right now the MIMA tests are the last thing to run, yet they run very quickly 
 and, if they fail, there was no need for the entire Spark test suite to have 
 completed first. I propose we move the MIMA tests to run before the full Spark 
 suite so that builds that fail the MIMA checks will return much faster.






[jira] [Commented] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py

2015-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376584#comment-14376584
 ] 

Nicholas Chammas commented on SPARK-6474:
-

This change also fits the pattern of 
[{{request_spot_instances()}}|https://github.com/apache/spark/blob/474d1320c9b93c501710ad1cfa836b8284562a2c/ec2/spark_ec2.py#L542],
 which is called on the connection like {{run_instances()}} as opposed to on an 
{{Image}}.

 Replace image.run with connection.run_instances in spark_ec2.py
 ---

 Key: SPARK-6474
 URL: https://issues.apache.org/jira/browse/SPARK-6474
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Andrew Drozdov
Priority: Minor

 After looking at an issue in Boto [1], ec2.image.Image.run and 
 ec2.connection.EC2Connection.run_instances are similar calls, but 
 run_instances appears to have more features and is more up to date. For 
 example, run_instances has the capability to launch ebs_optimized instances 
 while run does not. The run call is being used in only a couple places in 
 spark_ec2.py, so let's replace it with run_instances.
 [1] https://github.com/boto/boto/issues/3054






[jira] [Created] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host

2015-03-23 Thread Rares Vernica (JIRA)
Rares Vernica created SPARK-6476:


 Summary: Spark fileserver not started on same IP as using 
spark.driver.host
 Key: SPARK-6476
 URL: https://issues.apache.org/jira/browse/SPARK-6476
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
Reporter: Rares Vernica


I initially inquired about this here: 
http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E

If the Spark driver host has multiple IPs and spark.driver.host is set to one 
of them, I would expect the fileserver to start on the same IP. I checked 
HttpServer, and the Jetty Server is started on the default IP of the machine: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75

Something like this might work instead:

{code:title=HttpServer.scala#L75}
val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0))
{code}






[jira] [Created] (SPARK-6477) Run MIMA tests before the Spark test suite

2015-03-23 Thread Brennon York (JIRA)
Brennon York created SPARK-6477:
---

 Summary: Run MIMA tests before the Spark test suite
 Key: SPARK-6477
 URL: https://issues.apache.org/jira/browse/SPARK-6477
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Brennon York
Priority: Minor


Right now the MIMA tests are the last thing to run, yet they run very quickly and, 
if they fail, there was no need for the entire Spark test suite to have completed first. 
I propose we move the MIMA tests to run before the full Spark suite so that 
builds that fail the MIMA checks will return much faster.






[jira] [Reopened] (SPARK-6122) Upgrade Tachyon dependency to 0.6.0

2015-03-23 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-6122:


I reverted this because it looks like it was responsible for some testing 
failures due to the dependency changes.

 Upgrade Tachyon dependency to 0.6.0
 ---

 Key: SPARK-6122
 URL: https://issues.apache.org/jira/browse/SPARK-6122
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.1
Reporter: Haoyuan Li
Assignee: Calvin Jia








[jira] [Resolved] (SPARK-6308) VectorUDT is displayed as `vecto` in dtypes

2015-03-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6308.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5118
[https://github.com/apache/spark/pull/5118]

 VectorUDT is displayed as `vecto` in dtypes
 ---

 Key: SPARK-6308
 URL: https://issues.apache.org/jira/browse/SPARK-6308
 Project: Spark
  Issue Type: Bug
  Components: MLlib, SQL
Reporter: Xiangrui Meng
Assignee: Manoj Kumar
 Fix For: 1.4.0


 VectorUDT should override simpleString instead of relying on the default 
 implementation.






[jira] [Updated] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py

2015-03-23 Thread Andrew Drozdov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Drozdov updated SPARK-6474:
--
Summary: Replace image.run with connection.run_instances in spark_ec2.py  
(was: Replace image.run with connection.run_instances)

 Replace image.run with connection.run_instances in spark_ec2.py
 ---

 Key: SPARK-6474
 URL: https://issues.apache.org/jira/browse/SPARK-6474
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Andrew Drozdov

 After looking at an issue in Boto [1], ec2.image.Image.run and 
 ec2.connection.EC2Connection.run_instances are similar calls, but 
 run_instances appears to have more features and is more up to date. For 
 example, run_instances has the capability to launch ebs_optimized instances 
 while run does not. The run call is being used in only a couple places in 
 spark_ec2.py, so let's replace it with run_instances.
 [1] https://github.com/boto/boto/issues/3054






[jira] [Created] (SPARK-6474) Replace image.run with connection.run_instances

2015-03-23 Thread Andrew Drozdov (JIRA)
Andrew Drozdov created SPARK-6474:
-

 Summary: Replace image.run with connection.run_instances
 Key: SPARK-6474
 URL: https://issues.apache.org/jira/browse/SPARK-6474
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Andrew Drozdov


After looking at an issue in Boto [1], ec2.image.Image.run and 
ec2.connection.EC2Connection.run_instances are similar calls, but run_instances 
appears to have more features and is more up to date. For example, 
run_instances has the capability to launch ebs_optimized instances while run 
does not. The run call is being used in only a couple places in spark_ec2.py, 
so let's replace it with run_instances.

[1] https://github.com/boto/boto/issues/3054






[jira] [Updated] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py

2015-03-23 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-6474:

Priority: Minor  (was: Major)

 Replace image.run with connection.run_instances in spark_ec2.py
 ---

 Key: SPARK-6474
 URL: https://issues.apache.org/jira/browse/SPARK-6474
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Andrew Drozdov
Priority: Minor

 After looking at an issue in Boto [1], ec2.image.Image.run and 
 ec2.connection.EC2Connection.run_instances are similar calls, but 
 run_instances appears to have more features and is more up to date. For 
 example, run_instances has the capability to launch ebs_optimized instances 
 while run does not. The run call is being used in only a couple places in 
 spark_ec2.py, so let's replace it with run_instances.
 [1] https://github.com/boto/boto/issues/3054






[jira] [Comment Edited] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py

2015-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376572#comment-14376572
 ] 

Nicholas Chammas edited comment on SPARK-6474 at 3/23/15 8:29 PM:
--

LGTM. Just setting the Priority to Minor since this doesn't cause any major 
problems, though it should be fixed.


was (Author: nchammas):
LGTM.

 Replace image.run with connection.run_instances in spark_ec2.py
 ---

 Key: SPARK-6474
 URL: https://issues.apache.org/jira/browse/SPARK-6474
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Andrew Drozdov
Priority: Minor

 After looking at an issue in Boto [1], ec2.image.Image.run and 
 ec2.connection.EC2Connection.run_instances are similar calls, but 
 run_instances appears to have more features and is more up to date. For 
 example, run_instances has the capability to launch ebs_optimized instances 
 while run does not. The run call is being used in only a couple places in 
 spark_ec2.py, so let's replace it with run_instances.
 [1] https://github.com/boto/boto/issues/3054






[jira] [Commented] (SPARK-6473) Launcher lib shouldn't try to figure out Scala version when not in dev mode

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376569#comment-14376569
 ] 

Apache Spark commented on SPARK-6473:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5143

 Launcher lib shouldn't try to figure out Scala version when not in dev mode
 ---

 Key: SPARK-6473
 URL: https://issues.apache.org/jira/browse/SPARK-6473
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Marcelo Vanzin
 Fix For: 1.4.0


 Thanks to [~nravi] for pointing this out.
 The launcher library currently always tries to figure out the build's Scala 
 version, even when it's not needed. That code is only used when setting 
 some dev options, and relies on the layout of the build directories, so it 
 doesn't work with the directory layout created by make-distribution.sh.
 Right now this works on Linux because bin/load-spark-env.sh sets the Scala 
 version explicitly, but it would break the distribution on Windows, for 
 example.
 The fix is pretty straightforward.






[jira] [Commented] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py

2015-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376572#comment-14376572
 ] 

Nicholas Chammas commented on SPARK-6474:
-

LGTM.

 Replace image.run with connection.run_instances in spark_ec2.py
 ---

 Key: SPARK-6474
 URL: https://issues.apache.org/jira/browse/SPARK-6474
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Andrew Drozdov

 After looking at an issue in Boto [1], ec2.image.Image.run and 
 ec2.connection.EC2Connection.run_instances are similar calls, but 
 run_instances appears to have more features and is more up to date. For 
 example, run_instances has the capability to launch ebs_optimized instances 
 while run does not. The run call is being used in only a couple places in 
 spark_ec2.py, so let's replace it with run_instances.
 [1] https://github.com/boto/boto/issues/3054






[jira] [Created] (SPARK-6475) DataFrame should support array types when creating DFs from JavaBeans.

2015-03-23 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6475:


 Summary: DataFrame should support array types when creating DFs 
from JavaBeans.
 Key: SPARK-6475
 URL: https://issues.apache.org/jira/browse/SPARK-6475
 Project: Spark
  Issue Type: Improvement
  Components: DataFrame, SQL
Reporter: Xiangrui Meng


If we have a JavaBean class with array fields, SQL throws an exception in 
`createDataFrame` because arrays are not matched in `getSchema` from a JavaBean 
class.






[jira] [Assigned] (SPARK-6475) DataFrame should support array types when creating DFs from JavaBeans.

2015-03-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-6475:


Assignee: Xiangrui Meng

 DataFrame should support array types when creating DFs from JavaBeans.
 --

 Key: SPARK-6475
 URL: https://issues.apache.org/jira/browse/SPARK-6475
 Project: Spark
  Issue Type: Improvement
  Components: DataFrame, SQL
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 If we have a JavaBean class with array fields, SQL throws an exception in 
 `createDataFrame` because arrays are not matched in `getSchema` from a 
 JavaBean class.






[jira] [Commented] (SPARK-6475) DataFrame should support array types when creating DFs from JavaBeans.

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376780#comment-14376780
 ] 

Apache Spark commented on SPARK-6475:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/5146

 DataFrame should support array types when creating DFs from JavaBeans.
 --

 Key: SPARK-6475
 URL: https://issues.apache.org/jira/browse/SPARK-6475
 Project: Spark
  Issue Type: Improvement
  Components: DataFrame, SQL
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 If we have a JavaBean class with array fields, SQL throws an exception in 
 `createDataFrame` because arrays are not matched in `getSchema` from a 
 JavaBean class.






[jira] [Commented] (SPARK-6369) InsertIntoHiveTable should use logic from SparkHadoopWriter

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376234#comment-14376234
 ] 

Apache Spark commented on SPARK-6369:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/5139

 InsertIntoHiveTable should use logic from SparkHadoopWriter
 ---

 Key: SPARK-6369
 URL: https://issues.apache.org/jira/browse/SPARK-6369
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker

 Right now it is possible that we will corrupt the output if there is a race 
 between competing speculative tasks.






[jira] [Commented] (SPARK-4848) On a stand-alone cluster, several worker-specific variables are read only on the master

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376239#comment-14376239
 ] 

Apache Spark commented on SPARK-4848:
-

User 'nkronenfeld' has created a pull request for this issue:
https://github.com/apache/spark/pull/5140

 On a stand-alone cluster, several worker-specific variables are read only on 
 the master
 ---

 Key: SPARK-4848
 URL: https://issues.apache.org/jira/browse/SPARK-4848
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
 Environment: stand-alone spark cluster
Reporter: Nathan Kronenfeld
   Original Estimate: 24h
  Remaining Estimate: 24h

 On a stand-alone spark cluster, much of the determination of worker 
 specifics, especially when one has multiple instances per node, is done only on 
 the master.
 The master loops over instances, and starts a worker per instance on each 
 node.
 This means that if your workers have values of SPARK_WORKER_INSTANCES 
 or SPARK_WORKER_WEBUI_PORT that differ from each other (or from the master), all 
 values are ignored except the one on the master.
 SPARK_WORKER_PORT looks like it is unread in scripts, but read in code - I'm 
 not sure how it will behave, since all instances will read the same value 
 from the environment.






[jira] [Commented] (SPARK-5954) Add topByKey to pair RDDs

2015-03-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376215#comment-14376215
 ] 

Xiangrui Meng commented on SPARK-5954:
--

Note: We added topByKey in mllib.rdd.MLPairRDDFunctions instead of in Core.

 Add topByKey to pair RDDs
 -

 Key: SPARK-5954
 URL: https://issues.apache.org/jira/browse/SPARK-5954
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Xiangrui Meng
Assignee: Shuo Xiang
 Fix For: 1.4.0


 `topByKey(num: Int): RDD[(K, V)]` finds the top-k values for each key in a 
 pair RDD. This is used, e.g., in computing top recommendations. We can use 
 the Guava implementation of finding top-k from an iterator. See also 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Utils.scala.
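 As noted in the comment above, the method landed in mllib.rdd.MLPairRDDFunctions; a 
 hedged usage sketch (the data is made up, and the wildcard import pulls in the implicit 
 conversion from the companion object):
 {code}
 import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

 // Top-2 scores per user; values come back grouped (and ordered) per key.
 val scores = sc.parallelize(Seq(("u1", 0.9), ("u1", 0.4), ("u1", 0.7), ("u2", 0.8)))
 val top2 = scores.topByKey(2)   // RDD[(String, Array[Double])]
 top2.collect().foreach { case (user, vs) => println(user + " -> " + vs.mkString(", ")) }
 {code}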






[jira] [Assigned] (SPARK-6470) Allow Spark apps to put YARN node labels in their requests

2015-03-23 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-6470:
-

Assignee: Sandy Ryza

 Allow Spark apps to put YARN node labels in their requests
 --

 Key: SPARK-6470
 URL: https://issues.apache.org/jira/browse/SPARK-6470
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Sandy Ryza
Assignee: Sandy Ryza








[jira] [Commented] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376311#comment-14376311
 ] 

Apache Spark commented on SPARK-6471:
-

User 'saucam' has created a pull request for this issue:
https://github.com/apache/spark/pull/5141

 Metastore schema should only be a subset of parquet schema to support 
 dropping of columns using replace columns
 ---

 Key: SPARK-6471
 URL: https://issues.apache.org/jira/browse/SPARK-6471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
 Fix For: 1.4.0


 Currently, in the parquet relation 2 implementation, an error is thrown in case 
 the merged schema is not exactly the same as the metastore schema. 
 But to support cases like deletion of a column using the replace columns command, 
 we can relax the restriction so that even if the metastore schema is only a subset 
 of the merged parquet schema, the query will work.
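 A hedged example (table and column names are made up) of the kind of workflow this 
 relaxation is meant to support:
 {code}
 // REPLACE COLUMNS only rewrites the metastore schema; the Parquet files on disk
 // still carry the dropped column, so the metastore schema becomes a subset of
 // the merged Parquet schema.
 hiveContext.sql("CREATE TABLE events (id INT, payload STRING) STORED AS PARQUET")
 hiveContext.sql("INSERT INTO TABLE events SELECT * FROM staging_events")
 hiveContext.sql("ALTER TABLE events REPLACE COLUMNS (id INT)")   // drops `payload` from the metastore
 hiveContext.sql("SELECT id FROM events")                         // should still work after the relaxation
 {code}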






[jira] [Commented] (SPARK-3720) support ORC in spark sql

2015-03-23 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376361#comment-14376361
 ] 

Zhan Zhang commented on SPARK-3720:
---

[~iward] Since this JIRA is a duplicate of SPARK-2883, we can move the 
discussion to that one.
In short, SPARK-2883 wants to support saveAsOrcFile and orcFile, similar to the 
Parquet file support in Spark. But due to the underlying API change, the patch 
is delayed.

 support ORC in spark sql
 

 Key: SPARK-3720
 URL: https://issues.apache.org/jira/browse/SPARK-3720
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.1.0
Reporter: Fei Wang
 Attachments: orc.diff


 The Optimized Row Columnar (ORC) file format provides a highly efficient way 
 to store data on HDFS. The ORC file format has many advantages, such as:
 1. a single file as the output of each task, which reduces the NameNode's load
 2. Hive type support including datetime, decimal, and the complex types 
 (struct, list, map, and union)
 3. light-weight indexes stored within the file:
    - skip row groups that don't pass predicate filtering
    - seek to a given row
 4. block-mode compression based on data type:
    - run-length encoding for integer columns
    - dictionary encoding for string columns
 5. concurrent reads of the same file using separate RecordReaders
 6. ability to split files without scanning for markers
 7. bounds on the amount of memory needed for reading or writing
 8. metadata stored using Protocol Buffers, which allows addition and removal 
 of fields
 Spark SQL already supports Parquet; supporting ORC would give people more options.






[jira] [Updated] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]

2015-03-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2331:
-
Priority: Minor  (was: Major)
Target Version/s: 2+

 SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
 --

 Key: SPARK-2331
 URL: https://issues.apache.org/jira/browse/SPARK-2331
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Ian Hummel
Priority: Minor

 The return type for SparkContext.emptyRDD is EmptyRDD[T].
 It should be RDD[T].  That means you have to add extra type annotations on 
 code like the below (which creates a union of RDDs over some subset of paths 
 in a folder)
 val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { 
 (rdd, path) ⇒
   rdd.union(sc.textFile(path))
 }






[jira] [Reopened] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]

2015-03-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-2331:
--

Reopening to catch this for version 2.x

 SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
 --

 Key: SPARK-2331
 URL: https://issues.apache.org/jira/browse/SPARK-2331
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Ian Hummel

 The return type for SparkContext.emptyRDD is EmptyRDD[T].
 It should be RDD[T].  That means you have to add extra type annotations on 
 code like the below (which creates a union of RDDs over some subset of paths 
 in a folder)
 val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { 
 (rdd, path) ⇒
   rdd.union(sc.textFile(path))
 }






[jira] [Commented] (SPARK-4830) Spark Streaming Java Application : java.lang.ClassNotFoundException

2015-03-23 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376335#comment-14376335
 ] 

sam commented on SPARK-4830:


I've been seeing a similar problem, but just for a regular Spark job (no 
streaming).  The executor logs also contain a storm of other exceptions (lots 
of IO and connection exceptions), which all seem unrelated to each other.  My 
suspicion is that this is caused by 
https://issues.apache.org/jira/browse/SPARK-3768 (also read the linked JIRAs) and 
that the storm of nonsense exceptions appears because YARN is killing off the 
executors with heavy kill signals - the JVMs are mid-process and cannot cope 
gracefully.

A workaround appears to be manually setting spark.yarn.executor.memoryOverhead to 
something much larger than the default (~400 MB).  I'm pretty sure I've seen a 
correspondence between the number of partitions I use and the chances of seeing 
this kind of thing.  Anyway, I've had to set my overhead to 12 GB, which seems a 
little large, right?  Anything lower and I see a mess of exceptions in my worker 
logs.
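For example (hedged; the value is only the one mentioned above, in MB):

{code}
import org.apache.spark.SparkConf

// Per-executor off-heap headroom requested from YARN, in MB (12 GB here, matching
// the value above); the same setting can go in spark-defaults.conf or be passed
// to spark-submit via --conf.
val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "12288")
{code}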

15/03/23 18:08:13 ERROR BlockManagerWorker: Exception handling buffer message
java.lang.ClassNotFoundException: com.xxx.Main$XxxXxx$3
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at org.apache.spark.storage.MemoryStore.putBytes(MemoryStore.scala:59)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:437)
at 
org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:359)
at 
org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
at 
org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
at 
org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 

[jira] [Updated] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-23 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-6471:
--
Summary: Metastore schema should only be a subset of parquet schema to 
support dropping of columns using replace columns  (was: Metastoreschema should 
only be a subset of parquetSchema to support dropping of columns using replace 
columns)

 Metastore schema should only be a subset of parquet schema to support 
 dropping of columns using replace columns
 ---

 Key: SPARK-6471
 URL: https://issues.apache.org/jira/browse/SPARK-6471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
 Fix For: 1.4.0


 Currently, in the parquet relation 2 implementation, an error is thrown in case 
 the merged schema is not exactly the same as the metastore schema. 
 But to support cases like deletion of a column using the replace columns command, 
 we can relax the restriction so that even if the metastore schema is only a subset 
 of the merged parquet schema, the query will work.






[jira] [Created] (SPARK-6471) Metastoreschema should only be a subset of parquetSchema to support dropping of columns using replace columns

2015-03-23 Thread Yash Datta (JIRA)
Yash Datta created SPARK-6471:
-

 Summary: Metastoreschema should only be a subset of parquetSchema 
to support dropping of columns using replace columns
 Key: SPARK-6471
 URL: https://issues.apache.org/jira/browse/SPARK-6471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
 Fix For: 1.4.0


Currently, in the parquet relation 2 implementation, an error is thrown in case 
the merged schema is not exactly the same as the metastore schema. 
But to support cases like deletion of a column using the replace columns command, 
we can relax the restriction so that even if the metastore schema is only a subset 
of the merged parquet schema, the query will work.






[jira] [Commented] (SPARK-4086) Fold-style aggregation for VertexRDD

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376341#comment-14376341
 ] 

Apache Spark commented on SPARK-4086:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/5142

 Fold-style aggregation for VertexRDD
 

 Key: SPARK-4086
 URL: https://issues.apache.org/jira/browse/SPARK-4086
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave

 VertexRDD currently supports creations and joins only through a reduce-style 
 interface where the user specifies how to merge two conflicting values. We 
 should also support a fold-style interface that takes a default value 
 (possibly of a different type than the input collection) and a fold function 
 specifying how to accumulate values.






[jira] [Commented] (SPARK-6469) The YARN driver in yarn-client mode will not use the local directories configured for YARN

2015-03-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376409#comment-14376409
 ] 

Sean Owen commented on SPARK-6469:
--

[~preaudc] would you like to open a small PR to add a note about this?

 The YARN driver in yarn-client mode will not use the local directories 
 configured for YARN 
 ---

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Documentation
  Components: YARN
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 {{spark.local.dir}}, which should be ignored.
 However, it should be noted that in yarn-client mode, though the executors 
 will indeed use the local directories configured for YARN, the driver will 
 not, because it is not running on the YARN cluster; the driver in yarn-client 
 mode will use the local directories defined in {{spark.local.dir}}.
 Can this please be clarified in the Spark YARN documentation above?
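 For illustration, a hedged sketch (the path is hypothetical) of the driver-side 
 setting this note is about:
 {code}
 import org.apache.spark.SparkConf

 // In yarn-client mode the driver runs outside the YARN cluster, so it still honors
 // spark.local.dir; the executors use YARN's configured local dirs instead.
 val conf = new SparkConf().set("spark.local.dir", "/data/tmp/spark")
 {code}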






[jira] [Commented] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-23 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376286#comment-14376286
 ] 

Yash Datta commented on SPARK-6471:
---

https://github.com/apache/spark/pull/5141

 Metastore schema should only be a subset of parquet schema to support 
 dropping of columns using replace columns
 ---

 Key: SPARK-6471
 URL: https://issues.apache.org/jira/browse/SPARK-6471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
 Fix For: 1.4.0


 Currently, in the parquet relation 2 implementation, an error is thrown in case 
 the merged schema is not exactly the same as the metastore schema. 
 But to support cases like deletion of a column using the replace columns command, 
 we can relax the restriction so that even if the metastore schema is only a subset 
 of the merged parquet schema, the query will work.






[jira] [Issue Comment Deleted] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-23 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-6471:
--
Comment: was deleted

(was: https://github.com/apache/spark/pull/5141)

 Metastore schema should only be a subset of parquet schema to support 
 dropping of columns using replace columns
 ---

 Key: SPARK-6471
 URL: https://issues.apache.org/jira/browse/SPARK-6471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
 Fix For: 1.4.0


 Currently, in the parquet relation 2 implementation, an error is thrown in case 
 the merged schema is not exactly the same as the metastore schema. 
 But to support cases like deletion of a column using the replace columns command, 
 we can relax the restriction so that even if the metastore schema is only a subset 
 of the merged parquet schema, the query will work.






[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)

2015-03-23 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376329#comment-14376329
 ] 

Manoj Kumar commented on SPARK-6192:


[~mengxr] Sorry for spamming, but do you have anything else to add? (Asking 
because the deadline is only a few days away.)

 Enhance MLlib's Python API (GSoC 2015)
 --

 Key: SPARK-6192
 URL: https://issues.apache.org/jira/browse/SPARK-6192
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Manoj Kumar
  Labels: gsoc, gsoc2015, mentor

 This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme 
 is to enhance MLlib's Python API, to make it on par with the Scala/Java API. 
 The main tasks are:
 1. For all models in MLlib, provide save/load method. This also
 includes save/load in Scala.
 2. Python API for evaluation metrics.
 3. Python API for streaming ML algorithms.
 4. Python API for distributed linear algebra.
 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use
 customized serialization, making MLLibPythonAPI hard to maintain. It
 would be nice to use the DataFrames for serialization.
 I'll link the JIRAs for each of the tasks.
 Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. 
 The TODO list will be dynamic based on the backlog.






[jira] [Created] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases

2015-03-23 Thread Sean Owen (JIRA)
Sean Owen created SPARK-6480:


 Summary: histogram() bucket function is wrong in some simple edge 
cases
 Key: SPARK-6480
 URL: https://issues.apache.org/jira/browse/SPARK-6480
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Sean Owen
Assignee: Sean Owen


(Credit to a customer report here) This test would fail now: 

{code}
val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3))
assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2)
{code}

Because it returns 3, 1, 0. The problem ultimately traces to the 'fast' bucket 
function that judges buckets based on a multiple of the gap between first and 
second elements. Errors multiply and the end of the final bucket fails to 
include the max.

Fairly plausible use case actually.

This can be tightened up easily with a slightly better expression. It will also 
fix this test, which is actually expecting the wrong answer:

{code}
val rdd = sc.parallelize(6 to 99)
val (histogramBuckets, histogramResults) = rdd.histogram(9)
val expectedHistogramResults =
  Array(11, 10, 11, 10, 10, 11, 10, 10, 11)
{code}

(Should be {{Array(11, 10, 10, 11, 10, 10, 11, 10, 11)}})
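A hedged sketch of the kind of tighter expression meant here: derive each boundary 
directly from the bucket index instead of repeatedly adding an increment, and pin 
the last boundary to the max so the maximum value always lands in the final bucket.

{code}
// Boundaries for `steps` equal-width buckets over [min, max]; the final boundary
// is exactly `max`, avoiding accumulated floating-point error.
def customRange(min: Double, max: Double, steps: Int): IndexedSeq[Double] = {
  val span = max - min
  Range.Int(0, steps, 1).map(s => min + (s * span) / steps) :+ max
}
{code}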






[jira] [Updated] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5508:

Priority: Critical  (was: Major)

 Arrays and Maps stored with Hive Parquet Serde may not be able to read by the 
 Parquet support in the Data Souce API
 ---

 Key: SPARK-5508
 URL: https://issues.apache.org/jira/browse/SPARK-5508
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
 Environment: mesos, cdh
Reporter: Ayoub Benali
Assignee: Cheng Lian
Priority: Critical
  Labels: hivecontext, parquet

 *The root cause of this bug is explained below ([see 
 here|https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12771559commentId=14368505]).*
  
 *The workaround of this issue is to set the following confs*
 {code}
 sql("set spark.sql.parquet.useDataSourceApi=false")
 sql("set spark.sql.hive.convertMetastoreParquet=false")
 {code}
 *Below is the original description.*
 When the table is saved as parquet, we cannot query a field which is an array 
 of struct after an INSERT statement, as shown below:  
 {noformat}
 scala> val data1="""{
      | "timestamp": 1422435598,
      | "data_array": [
      | {
      | "field1": 1,
      | "field2": 2
      | }
      | ]
      | }"""
 scala> val data2="""{
      | "timestamp": 1422435598,
      | "data_array": [
      | {
      | "field1": 3,
      | "field2": 4
      | }
      | ]
      | }"""
 scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
 scala> val rdd = hiveContext.jsonRDD(jsonRDD)
 scala> rdd.printSchema
 root
  |-- data_array: array (nullable = true)
  |    |-- element: struct (containsNull = false)
  |    |    |-- field1: integer (nullable = true)
  |    |    |-- field2: integer (nullable = true)
  |-- timestamp: integer (nullable = true)
 scala> rdd.registerTempTable("tmp_table")
 scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW 
 explode(data_array) nestedStuff AS data").collect
 res3: Array[org.apache.spark.sql.Row] = Array([1], [3])
 scala> hiveContext.sql("SET hive.exec.dynamic.partition = true")
 scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
 scala> hiveContext.sql("set parquet.compression=GZIP")
 scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
 scala> hiveContext.sql("create external table if not exists 
 persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, 
 timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'")
 scala> hiveContext.sql("insert into table persisted_table select * from 
 tmp_table").collect
 scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW 
 explode(data_array) nestedStuff AS data").collect
 parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
 file hdfs://*/test_table/part-1
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
   at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at 

[jira] [Updated] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5508:

Assignee: Cheng Lian

 Arrays and Maps stored with Hive Parquet Serde may not be able to read by the 
 Parquet support in the Data Souce API
 ---

 Key: SPARK-5508
 URL: https://issues.apache.org/jira/browse/SPARK-5508
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
 Environment: mesos, cdh
Reporter: Ayoub Benali
Assignee: Cheng Lian
  Labels: hivecontext, parquet

 *The root cause of this bug is explained below ([see 
 here|https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12771559commentId=14368505]).*
  
 *The workaround of this issue is to set the following confs*
 {code}
 sql("set spark.sql.parquet.useDataSourceApi=false")
 sql("set spark.sql.hive.convertMetastoreParquet=false")
 {code}
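 For convenience, the same two settings can also be applied programmatically; a minimal sketch, assuming a HiveContext bound to {{hiveContext}} as in the repro below:
 {code}
 // Sketch: equivalent of the SQL SET workaround above, via setConf.
 hiveContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
 hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
 {code}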
 *Below is the original description.*
 When the table is saved as Parquet, we cannot query a field which is an array 
 of structs after an INSERT statement, as shown below:  
 {noformat}
 scala> val data1 = """{
      |   "timestamp": 1422435598,
      |   "data_array": [
      |     {
      |       "field1": 1,
      |       "field2": 2
      |     }
      |   ]
      | }"""
 scala> val data2 = """{
      |   "timestamp": 1422435598,
      |   "data_array": [
      |     {
      |       "field1": 3,
      |       "field2": 4
      |     }
      |   ]
      | }"""
 scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
 scala> val rdd = hiveContext.jsonRDD(jsonRDD)
 scala> rdd.printSchema
 root
  |-- data_array: array (nullable = true)
  |    |-- element: struct (containsNull = false)
  |    |    |-- field1: integer (nullable = true)
  |    |    |-- field2: integer (nullable = true)
  |-- timestamp: integer (nullable = true)
 scala> rdd.registerTempTable("tmp_table")
 scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW 
 explode(data_array) nestedStuff AS data").collect
 res3: Array[org.apache.spark.sql.Row] = Array([1], [3])
 scala> hiveContext.sql("SET hive.exec.dynamic.partition = true")
 scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
 scala> hiveContext.sql("set parquet.compression=GZIP")
 scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
 scala> hiveContext.sql("create external table if not exists 
 persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, 
 timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'")
 scala> hiveContext.sql("insert into table persisted_table select * from 
 tmp_table").collect
 scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW 
 explode(data_array) nestedStuff AS data").collect
 parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
 file hdfs://*/test_table/part-1
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
   at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
  

[jira] [Updated] (SPARK-6437) SQL ExternalSort should use CompletionIterator to clean up temp files

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6437:

Assignee: Michael Armbrust

 SQL ExternalSort should use CompletionIterator to clean up temp files
 -

 Key: SPARK-6437
 URL: https://issues.apache.org/jira/browse/SPARK-6437
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Michael Armbrust
Priority: Critical

 Right now, temp files used by SQL ExternalSort are not cleaned up.
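 A minimal sketch of the completion-iterator idea the title refers to, with illustrative names rather than Spark's internal classes: the wrapper runs a cleanup action exactly once when the sorted iterator is exhausted, which is where deleting the spill files would go.
 {code}
 import java.io.File

 // Sketch only: a tiny completion iterator; names here are illustrative.
 class CleanupIterator[A](underlying: Iterator[A], cleanup: () => Unit) extends Iterator[A] {
   private var cleaned = false
   override def hasNext: Boolean = {
     val more = underlying.hasNext
     if (!more && !cleaned) { cleaned = true; cleanup() } // run cleanup exactly once
     more
   }
   override def next(): A = underlying.next()
 }

 // Usage: delete the spill file once the externally sorted rows are consumed.
 def sortedRowsFrom(spillFile: File, rows: Iterator[Int]): Iterator[Int] =
   new CleanupIterator(rows, () => { spillFile.delete(); () })
 {code}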



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6124) Support jdbc connection properties in OPTIONS part of the query

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6124.
-
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Issue resolved by pull request 4859
[https://github.com/apache/spark/pull/4859]

 Support jdbc connection properties in OPTIONS part of the query
 ---

 Key: SPARK-6124
 URL: https://issues.apache.org/jira/browse/SPARK-6124
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Volodymyr Lyubinets
Assignee: Volodymyr Lyubinets
Priority: Minor
 Fix For: 1.3.1, 1.4.0


 We would like to make it possible to specify connection properties in OPTIONS 
 part of the sql query that uses jdbc driver. For example:
 CREATE TEMPORARY TABLE abc
 USING org.apache.spark.sql.jdbc
 OPTIONS (url '$url', dbtable '(SELECT _ROWID_ FROM test.people)', user 
 'testUser', password 'testPass')
 To do this, we will do minor changes in JDBCRelation and JDBCRDD.
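 A rough sketch of the pass-through, assuming every OPTIONS key the data source does not consume itself becomes a connection property (illustrative code, not the actual JDBCRelation/JDBCRDD change):
 {code}
 import java.sql.{Connection, DriverManager}
 import java.util.Properties

 // Sketch: forward unrecognized OPTIONS entries as JDBC connection properties.
 def connect(options: Map[String, String]): Connection = {
   val reserved = Set("url", "dbtable", "driver",
     "partitionColumn", "lowerBound", "upperBound", "numPartitions")
   val props = new Properties()
   options.filter { case (k, _) => !reserved.contains(k) }.foreach { case (k, v) =>
     props.setProperty(k, v) // e.g. user, password, socketTimeout, ...
   }
   DriverManager.getConnection(options("url"), props)
 }
 {code}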



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6112) Provide OffHeap support through HDFS RAM_DISK

2015-03-23 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-6112:
--
Attachment: SparkOffheapsupportbyHDFS.pdf

Design doc for hdfs offheap support

 Provide OffHeap support through HDFS RAM_DISK
 -

 Key: SPARK-6112
 URL: https://issues.apache.org/jira/browse/SPARK-6112
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager
Reporter: Zhan Zhang
 Attachments: SparkOffheapsupportbyHDFS.pdf


 The HDFS Lazy_Persist policy provides the possibility to cache RDDs off-heap in 
 HDFS. We may want to provide a capability similar to Tachyon by leveraging the 
 HDFS RAM_DISK feature, if the user environment does not have Tachyon deployed. 
 With this feature, it is potentially possible to share RDDs in memory 
 across different jobs, and even with jobs other than Spark, and to avoid 
 RDD recomputation if executors crash. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6479) Create off-heap block storage API (internal)

2015-03-23 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-6479:
--
Attachment: SparkOffheapsupportbyHDFS.pdf

The design doc also includes stuff from SPARK-6112

 Create off-heap block storage API (internal)
 

 Key: SPARK-6479
 URL: https://issues.apache.org/jira/browse/SPARK-6479
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Reporter: Reynold Xin
 Attachments: SparkOffheapsupportbyHDFS.pdf


 Would be great to create APIs for off-heap block stores, rather than doing a 
 bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6054) SQL UDF returning object of case class; regression from 1.2.0

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-6054:
---

Assignee: Michael Armbrust

 SQL UDF returning object of case class; regression from 1.2.0
 -

 Key: SPARK-6054
 URL: https://issues.apache.org/jira/browse/SPARK-6054
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Windows 8, Scala 2.11.2, Spark 1.3.0 RC1
Reporter: Spiro Michaylov
Assignee: Michael Armbrust
Priority: Blocker

 The following code fails with a stack trace beginning with:
 {code}
 15/02/26 23:21:32 ERROR Executor: Exception in task 2.0 in stage 7.0 (TID 422)
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
 tree: scalaUDF(sales#2,discounts#3)
   at 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:309)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:237)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:192)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:207)
 {code}
 Here is the 1.3.0 version of the code:
 {code}
  case class SalesDisc(sales: Double, discounts: Double)
  def makeStruct(sales: Double, disc: Double) = SalesDisc(sales, disc)
  sqlContext.udf.register("makeStruct", makeStruct _)
  val withStruct =
    sqlContext.sql("SELECT id, sd.sales FROM (SELECT id, makeStruct(sales, 
  discounts) AS sd FROM customerTable) AS d")
  withStruct.foreach(println)
 {code}
 This used to work in 1.2.0. Interestingly, the following simplified version 
 fails similarly, even though it seems to me to be VERY similar to the last 
 test in the UDFSuite:
 {code}
 SELECT makeStruct(sales, discounts) AS sd FROM customerTable
 {code}
 The data table is defined thus:
 {code}
    val custs = Seq(
      Cust(1, "Widget Co", 12.00, 0.00, "AZ"),
      Cust(2, "Acme Widgets", 410500.00, 500.00, "CA"),
      Cust(3, "Widgetry", 410500.00, 200.00, "CA"),
      Cust(4, "Widgets R Us", 410500.00, 0.0, "CA"),
      Cust(5, "Ye Olde Widgete", 500.00, 0.0, "MA")
    )
  val customerTable = sc.parallelize(custs, 4).toDF()
  customerTable.registerTempTable("customerTable")
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376894#comment-14376894
 ] 

Apache Spark commented on SPARK-6480:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5148

 histogram() bucket function is wrong in some simple edge cases
 --

 Key: SPARK-6480
 URL: https://issues.apache.org/jira/browse/SPARK-6480
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Sean Owen
Assignee: Sean Owen

 (Credit to a customer report here) This test would fail now: 
 {code}
 val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3))
 assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2)
 {code}
 Because it returns 3, 1, 0. The problem ultimately traces to the 'fast' 
 bucket function that judges buckets based on a multiple of the gap between 
 first and second elements. Errors multiply and the end of the final bucket 
 fails to include the max.
 Fairly plausible use case actually.
 This can be tightened up easily with a slightly better expression. It will 
 also fix this test, which is actually expecting the wrong answer:
 {code}
 val rdd = sc.parallelize(6 to 99)
 val (histogramBuckets, histogramResults) = rdd.histogram(9)
 val expectedHistogramResults =
   Array(11, 10, 11, 10, 10, 11, 10, 10, 11)
 {code}
 (Should be {{Array(11, 10, 10, 11, 10, 10, 11, 10, 11)}})
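 For illustration, a sketch of the kind of top-edge clamping described above; {{bucketOf}} is an illustrative helper, not Spark's actual implementation:
 {code}
 // Sketch only: constant-time bucketing for evenly spaced buckets that keeps
 // the maximum value inside the last bucket instead of letting floating-point
 // error push it out of range.
 def bucketOf(value: Double, min: Double, max: Double, bucketCount: Int): Option[Int] = {
   if (value.isNaN || value < min || value > max) {
     None // outside the histogram range
   } else {
     val width = (max - min) / bucketCount
     val i = ((value - min) / width).toInt // can equal bucketCount when value == max
     Some(math.min(i, bucketCount - 1))    // clamp the top edge into the last bucket
   }
 }

 // With min = 1, max = 3 and 3 buckets, bucketOf(3.0, 1, 3, 3) is Some(2),
 // so Seq(1, 1, 1, 2, 3, 3) counts as Array(3, 1, 2), matching the first test.
 {code}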



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4925:

Assignee: Patrick Wendell

 Publish Spark SQL hive-thriftserver maven artifact 
 ---

 Key: SPARK-4925
 URL: https://issues.apache.org/jira/browse/SPARK-4925
 Project: Spark
  Issue Type: Improvement
  Components: Build, SQL
Affects Versions: 1.2.1, 1.3.0
Reporter: Alex Liu
Assignee: Patrick Wendell
Priority: Critical

 The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
 Cassandra.
 Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService

2015-03-23 Thread Jeffrey Turpin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376907#comment-14376907
 ] 

Jeffrey Turpin commented on SPARK-6373:
---

Hey Aaron,

Thanks for the feedback! I definitely agree we should find a common way to 
support both and your proposal sounds good to me. That being said, do you want 
me to take a cut at doing this, integrating the work I have already done (where 
applicable)? Definitely don't want to lose traction on this and would like to 
get it into a release sooner rather than later. Let me know your thoughts... 
Cheers!

Jeff


 Add SSL/TLS for the Netty based BlockTransferService 
 -

 Key: SPARK-6373
 URL: https://issues.apache.org/jira/browse/SPARK-6373
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Shuffle
Affects Versions: 1.2.1
Reporter: Jeffrey Turpin

 Add the ability to allow for secure communications (SSL/TLS) for the Netty 
 based BlockTransferService and the ExternalShuffleClient. This ticket will 
 hopefully start the conversation around potential designs... Below is a 
 reference to a WIP prototype which implements this functionality 
 (prototype)... I have attempted to disrupt as little code as possible and 
 tried to follow the current code structure (for the most part) in the areas I 
 modified. I also studied how Hadoop achieves encrypted shuffle 
 (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html)
 https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6481) Set In Progress when a PR is opened for an issue

2015-03-23 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-6481:
---

 Summary: Set In Progress when a PR is opened for an issue
 Key: SPARK-6481
 URL: https://issues.apache.org/jira/browse/SPARK-6481
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Michael Armbrust
Assignee: Nicholas Chammas


[~pwendell] and I are not sure if this is possible, but it would be really 
helpful if the JIRA status was updated to In Progress when we do the linking 
to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6483) Spark SQL udf(ScalaUdf) is very slow

2015-03-23 Thread zzc (JIRA)
zzc created SPARK-6483:
--

 Summary: Spark SQL udf(ScalaUdf) is very slow
 Key: SPARK-6483
 URL: https://issues.apache.org/jira/browse/SPARK-6483
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
 Environment: 1. Spark version is 1.3.0 
2. 3 node per 80G/20C 
3. read 250G parquet files from hdfs 
Reporter: zzc


Test case: 
1. 
register the floor function with the command: 
sqlContext.udf.register("floor", (ts: Int) => ts - ts % 300), 
then run the sql "select chan, floor(ts) as tt, sum(size) from qlogbase3 group 
by chan, floor(ts)". 

*it takes 17 minutes.*

{quote}
== Physical Plan == 
Aggregate false, [chan#23015,PartialGroup#23500], 
[chan#23015,PartialGroup#23500 AS tt#23494,CombineSum(PartialSum#23499L) AS 
c2#23495L] 
 Exchange (HashPartitioning [chan#23015,PartialGroup#23500], 54) 
  Aggregate true, [chan#23015,scalaUDF(ts#23016)], 
[chan#23015,*scalaUDF*(ts#23016) AS PartialGroup#23500,SUM(size#23023L) AS 
PartialSum#23499L] 
   PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[115] at map 
at newParquet.scala:562 
{quote}
2. 
run with the sql "select chan, (ts - ts % 300) as tt, sum(size) from qlogbase3 
group by chan, (ts - ts % 300)". 

*it takes only 5 minutes.*
{quote}
== Physical Plan == 
Aggregate false, [chan#23015,PartialGroup#23349], 
[chan#23015,PartialGroup#23349 AS tt#23343,CombineSum(PartialSum#23348L) AS 
c2#23344L]   
 Exchange (HashPartitioning [chan#23015,PartialGroup#23349], 54)   
  Aggregate true, [chan#23015,(ts#23016 - (ts#23016 % 300))], 
[chan#23015,*(ts#23016 - (ts#23016 % 300))* AS 
PartialGroup#23349,SUM(size#23023L) AS PartialSum#23348L] 
   PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[83] at map 
at newParquet.scala:562 
{quote}
3. 
use *HiveContext* with the sql "select chan, floor((ts - ts % 300)) as tt, 
sum(size) from qlogbase3 group by chan, floor((ts - ts % 300))". 

*it takes only 5 minutes too. *
{quote}
== Physical Plan == 
Aggregate false, [chan#23015,PartialGroup#23108L], 
[chan#23015,PartialGroup#23108L AS tt#23102L,CombineSum(PartialSum#23107L) AS 
_c2#23103L] 
 Exchange (HashPartitioning [chan#23015,PartialGroup#23108L], 54) 
  Aggregate true, 
[chan#23015,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016
 - (ts#23016 % 300)))], 
[chan#23015,*HiveGenericUdf*#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016
 - (ts#23016 % 300))) AS PartialGroup#23108L,SUM(size#23023L) AS 
PartialSum#23107L] 
   PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[28] at map 
at newParquet.scala:562 
{quote}

*Why is ScalaUdf so slow? How can it be improved?*
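One possible workaround while the ScalaUdf path is investigated, sketched under the assumption that the arithmetic can be expressed with built-in column expressions (table and column names taken from the plans above):

{code}
import org.apache.spark.sql.functions._

// Sketch: same aggregation without a registered ScalaUdf, so the grouping
// key stays a built-in expression like in the fast 5-minute variants.
val df = sqlContext.table("qlogbase3")
val byWindow = df
  .groupBy(col("chan"), (col("ts") - col("ts") % 300).as("tt"))
  .agg(sum("size"))
{code}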



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6112) Provide OffHeap support through HDFS RAM_DISK

2015-03-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376849#comment-14376849
 ] 

Reynold Xin commented on SPARK-6112:


[~zhanzhang] I created https://issues.apache.org/jira/browse/SPARK-6479 as the 
first step of this: to create an API. Do you mind uploading the API part of 
this design doc (or even just the entire design doc) to that ticket? Thanks.


 Provide OffHeap support through HDFS RAM_DISK
 -

 Key: SPARK-6112
 URL: https://issues.apache.org/jira/browse/SPARK-6112
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager
Reporter: Zhan Zhang
 Attachments: SparkOffheapsupportbyHDFS.pdf


 The HDFS Lazy_Persist policy provides the possibility to cache RDDs off-heap in 
 HDFS. We may want to provide a capability similar to Tachyon by leveraging the 
 HDFS RAM_DISK feature, if the user environment does not have Tachyon deployed. 
 With this feature, it is potentially possible to share RDDs in memory 
 across different jobs, and even with jobs other than Spark, and to avoid 
 RDD recomputation if executors crash. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2973) Use LocalRelation for all ExecutedCommands, avoid job for take/collect()

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2973:

Priority: Blocker  (was: Critical)

 Use LocalRelation for all ExecutedCommands, avoid job for take/collect()
 

 Key: SPARK-2973
 URL: https://issues.apache.org/jira/browse/SPARK-2973
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Aaron Davidson
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.2.0


 Right now, sql("show tables").collect() will start a Spark job which shows up 
 in the UI. There should be a way to get these without this step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2973) Use LocalRelation for all ExecutedCommands, avoid job for take/collect()

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2973:

Summary: Use LocalRelation for all ExecutedCommands, avoid job for 
take/collect()  (was: Add a way to show tables without executing a job)

 Use LocalRelation for all ExecutedCommands, avoid job for take/collect()
 

 Key: SPARK-2973
 URL: https://issues.apache.org/jira/browse/SPARK-2973
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Aaron Davidson
Assignee: Cheng Lian
Priority: Critical
 Fix For: 1.2.0


 Right now, sql("show tables").collect() will start a Spark job which shows up 
 in the UI. There should be a way to get these without this step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6482) Remove synchronization of Hive Native commands

2015-03-23 Thread David Ross (JIRA)
David Ross created SPARK-6482:
-

 Summary: Remove synchronization of Hive Native commands
 Key: SPARK-6482
 URL: https://issues.apache.org/jira/browse/SPARK-6482
 Project: Spark
  Issue Type: Improvement
Reporter: David Ross


As discussed in https://issues.apache.org/jira/browse/SPARK-4908, concurrent 
hive native commands run into thread-safety issues with 
{{org.apache.hadoop.hive.ql.Driver}}.

The quick-fix was to synchronize calls to {{runHive}}:

https://github.com/apache/spark/commit/480bd1d2edd1de06af607b0cf3ff3c0b16089add

However, if the hive native command is long-running, this can block subsequent 
queries if they have native dependencies.
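One direction, sketched purely as an illustration rather than a proposed patch: replace the single global lock with a small set of striped locks, so unrelated native commands do not all queue behind one long-running command, while each {{org.apache.hadoop.hive.ql.Driver}} instance is still only used by one thread at a time (one Driver per stripe is an assumption of this sketch).

{code}
import java.util.concurrent.locks.ReentrantLock

// Sketch: striped locking keyed by session (or statement); illustrative only.
class StripedLocks(stripes: Int) {
  private val locks = Array.fill(stripes)(new ReentrantLock())
  def withLock[T](key: AnyRef)(body: => T): T = {
    val lock = locks(java.lang.Math.floorMod(key.hashCode, stripes))
    lock.lock()
    try body finally lock.unlock()
  }
}
{code}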



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6478) new RDD.pipeWithPartition method

2015-03-23 Thread Maxim Ivanov (JIRA)
Maxim Ivanov created SPARK-6478:
---

 Summary: new RDD.pipeWithPartition method
 Key: SPARK-6478
 URL: https://issues.apache.org/jira/browse/SPARK-6478
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Maxim Ivanov
Priority: Minor




This method allows building the command line args and the process environment map
using the partition as an argument.

The use case for this feature is to provide additional information about the 
partition to the spawned application in cases where the partitioner provides it (like in 
the cassandra connector or when a custom partitioner/RDD is used).

It also provides a simpler and more intuitive alternative to the printPipeContext 
function.
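A rough approximation of the idea using existing public APIs, sketched only to make the proposal concrete (the actual PR works at the {{Partition}} level and streams records through stdin like {{RDD.pipe}}; this simplified sketch ignores the partition's records and just captures each command's stdout):

{code}
import org.apache.spark.rdd.RDD
import scala.sys.process._

// Sketch: run one command per partition, built from the partition index.
def pipeWithPartitionIndex[T](
    rdd: RDD[T],
    makeCommand: Int => Seq[String],
    makeEnv: Int => Map[String, String]): RDD[String] = {
  rdd.mapPartitionsWithIndex { (idx, _) =>
    val output = Process(makeCommand(idx), None, makeEnv(idx).toSeq: _*).!!
    output.split("\n").iterator
  }
}

// e.g. pipeWithPartitionIndex(rdd,
//   idx => Seq("/path/to/worker.sh", s"--partition=$idx"),
//   idx => Map("PARTITION_INDEX" -> idx.toString))
{code}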




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6478) new RDD.pipeWithPartition method

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376827#comment-14376827
 ] 

Apache Spark commented on SPARK-6478:
-

User 'redbaron' has created a pull request for this issue:
https://github.com/apache/spark/pull/5147

 new RDD.pipeWithPartition method
 

 Key: SPARK-6478
 URL: https://issues.apache.org/jira/browse/SPARK-6478
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Maxim Ivanov
Priority: Minor
  Labels: pipe

 This method allows building command line args and process environement map
 using partition as an argument.
 Use case for this feature is to provide additional informatin about the 
 partition to spawned application in case where partitioner provides it (like 
 in cassandra connector or when custom partitioner/RDD is used).
 Also it provides simpler and more intuitive alternative for printPipeContext 
 function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-03-23 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376833#comment-14376833
 ] 

Allan Douglas R. de Oliveira edited comment on SPARK-5928 at 3/23/15 10:54 PM:
---

I will answer with the info I have right now; later I'll try to get more 
information:
a) I don't have the exact number now, but it's something like 300G of shuffle 
read/write
b) It was working with 1600 partitions before this started to happen, then 
after the error we increased to 1 but got pretty much the same exception.
c) Yes, it happens some minutes later generally. We've put the job back in 
production changing to lz4. But I feel that the problem will come back.
d) Yes, and I think that issue also happens with other errors (e.g. if the 
BlockManager times out because of too much GC)


was (Author: douglaz):
I will answer with the info I have right know, later I'll try to get more 
information:
a) I don't have the exact number now, but it's something like 300G of shuffle 
read/write
b) It was working with 1600 partitions before this started to happen, then 
after the error we increased to 1 but got pretty much the same exception.
c) Yes, it happens some minutes later generally. We've got the job back in 
production changing to lz4. But I feel that the problem will come back.
d) Yes, and I think that issue also happens with other errors (e.g if the 
BlockManager timeouts because of too much GC)

 Remote Shuffle Blocks cannot be more than 2 GB
 --

 Key: SPARK-5928
 URL: https://issues.apache.org/jira/browse/SPARK-5928
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid

 If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
 exception.  The tasks get retried a few times and then eventually the job 
 fails.
 Here is an example program which can cause the exception:
 {code}
  val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
    val n = 3e3.toInt
    val arr = new Array[Byte](n)
    // need to make sure the array doesn't compress to something small
    scala.util.Random.nextBytes(arr)
    arr
  }
  rdd.map { x => (1, x)}.groupByKey().count()
 {code}
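 A common mitigation until the limit itself is lifted, sketched with illustrative numbers only: raise the partition count so that no single shuffle block approaches 2 GB.
 {code}
 // Sketch: size the partition count from the data volume (numbers illustrative).
 val bytesPerRecord = 3e3.toInt
 val records = 1e6.toInt
 val targetBlockBytes = 256L * 1024 * 1024 // aim for blocks around 256 MB
 val numPartitions =
   math.max(1, (records.toLong * bytesPerRecord / targetBlockBytes).toInt)

 val safer = sc.parallelize(1 to records, numPartitions).map { _ =>
   val arr = new Array[Byte](bytesPerRecord)
   scala.util.Random.nextBytes(arr)
   arr
 }
 safer.map(x => (1, x)).groupByKey(numPartitions).count()
 {code}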
 Note that you can't trigger this exception in local mode, it only happens on 
 remote fetches.   I triggered these exceptions running with 
 {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
 {noformat}
 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
 imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
 imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
 org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
 2147483647: 3021252889 - discarded
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
   at 
 org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
 length exceeds 2147483647: 3021252889 - discarded
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
   at 
 

[jira] [Created] (SPARK-6479) Create off-heap block storage API

2015-03-23 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-6479:
--

 Summary: Create off-heap block storage API
 Key: SPARK-6479
 URL: https://issues.apache.org/jira/browse/SPARK-6479
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Reporter: Reynold Xin


Would be great to create APIs for off-heap block stores, rather than doing a 
bunch of if statements everywhere.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6450) Native Parquet reader does not assign table name as qualifier

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6450:

Summary: Native Parquet reader does not assign table name as qualifier  
(was: Self joining query failure)

 Native Parquet reader does not assign table name as qualifier
 -

 Key: SPARK-6450
 URL: https://issues.apache.org/jira/browse/SPARK-6450
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Anand Mohan Tumuluri
Assignee: Cheng Lian
Priority: Blocker

 The below query was working fine till 1.3 commit 
 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd. (Yes it definitely works at this 
 commit although this commit is completely unrelated)
 It got broken in 1.3.0 release with an AnalysisException: resolved attributes 
 ... missing from  (although this list contains the fields which it 
 reports missing)
 {code}
 at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189)
   at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
   at 
 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
   at 
 org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
   at 
 org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
   at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
   at 
 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
   at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source)
   at 
 org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
   at 
 org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
   at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
   at 
 org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
   at 
 org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
   at 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 {code}
 select Orders.Country, Orders.ProductCategory,count(1) from Orders join 
 (select Orders.Country, count(1) CountryOrderCount from Orders where 
 to_date(Orders.PlacedDate)  '2015-01-01' group by Orders.Country order by 
 CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = 
 Orders.Country where to_date(Orders.PlacedDate)  '2015-01-01' group by 
 Orders.Country,Orders.ProductCategory;
 {code}
 The temporary workaround is to add explicit alias for the table Orders
 {code}
 select o.Country, o.ProductCategory,count(1) from Orders o join (select 
 r.Country, count(1) CountryOrderCount from Orders r where 
 to_date(r.PlacedDate)  '2015-01-01' group by r.Country order by 
 CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = 
 o.Country where to_date(o.PlacedDate)  '2015-01-01' group by 
 o.Country,o.ProductCategory;
 {code}
 However this change not only affects self joins, it also seems to affect 
 union queries as well, like the below query which was again working 
 before(commit 9a151ce) got broken
 {code}
 select Orders.Country,null,count(1) OrderCount from Orders group by 
 Orders.Country,null
 union all
 select null,Orders.ProductCategory,count(1) OrderCount from Orders group by 
 null, Orders.ProductCategory
 {code}
 also fails with a Analysis exception.
 The workaround is to add different aliases for the tables.



--
This message was sent 

[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)

2015-03-23 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376978#comment-14376978
 ] 

Zhan Zhang commented on SPARK-6479:
---

The current API may not be good enough, as it has some redundant interfaces and is 
mainly a POC to make HDFS work. After migrating Tachyon to these new APIs, we 
should have a better version. I will update the doc later. But I do agree that 
unifying the off-heap API should be a separate JIRA, and we can start from here.

 Create off-heap block storage API (internal)
 

 Key: SPARK-6479
 URL: https://issues.apache.org/jira/browse/SPARK-6479
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Reporter: Reynold Xin
 Attachments: SparkOffheapsupportbyHDFS.pdf


 Would be great to create APIs for off-heap block stores, rather than doing a 
 bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue

2015-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377034#comment-14377034
 ] 

Nicholas Chammas commented on SPARK-6481:
-

I'm guessing this will be done via 
[github_jira_sync.py|https://github.com/apache/spark/blob/master/dev/github_jira_sync.py].
 OK, will take a look this week.

 Set In Progress when a PR is opened for an issue
 --

 Key: SPARK-6481
 URL: https://issues.apache.org/jira/browse/SPARK-6481
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Michael Armbrust
Assignee: Nicholas Chammas

 [~pwendell] and I are not sure if this is possible, but it would be really 
 helpful if the JIRA status was updated to In Progress when we do the 
 linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6122) Upgrade Tachyon dependency to 0.6.0

2015-03-23 Thread Calvin Jia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376813#comment-14376813
 ] 

Calvin Jia commented on SPARK-6122:
---

[~pwendell] Are you referring to the issues here: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/1940/ ?

 Upgrade Tachyon dependency to 0.6.0
 ---

 Key: SPARK-6122
 URL: https://issues.apache.org/jira/browse/SPARK-6122
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.1
Reporter: Haoyuan Li
Assignee: Calvin Jia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-03-23 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376833#comment-14376833
 ] 

Allan Douglas R. de Oliveira commented on SPARK-5928:
-

I will answer with the info I have right now; later I'll try to get more 
information:
a) I don't have the exact number now, but it's something like 300G of shuffle 
read/write
b) It was working with 1600 partitions before this started to happen, then 
after the error we increased to 1 but got pretty much the same exception.
c) Yes, it happens some minutes later generally. We've got the job back in 
production changing to lz4. But I feel that the problem will come back.
d) Yes, and I think that issue also happens with other errors (e.g. if the 
BlockManager times out because of too much GC)

 Remote Shuffle Blocks cannot be more than 2 GB
 --

 Key: SPARK-5928
 URL: https://issues.apache.org/jira/browse/SPARK-5928
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid

 If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
 exception.  The tasks get retried a few times and then eventually the job 
 fails.
 Here is an example program which can cause the exception:
 {code}
  val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
    val n = 3e3.toInt
    val arr = new Array[Byte](n)
    // need to make sure the array doesn't compress to something small
    scala.util.Random.nextBytes(arr)
    arr
  }
  rdd.map { x => (1, x)}.groupByKey().count()
 {code}
 Note that you can't trigger this exception in local mode, it only happens on 
 remote fetches.   I triggered these exceptions running with 
 {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
 {noformat}
 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
 imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
 imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
 org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
 2147483647: 3021252889 - discarded
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
   at 
 org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
 length exceeds 2147483647: 3021252889 - discarded
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
   at 
 io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
   at 
 io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
   at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
   at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
   at 
 

[jira] [Updated] (SPARK-6479) Create off-heap block storage API (internal)

2015-03-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6479:
---
Summary: Create off-heap block storage API (internal)  (was: Create 
off-heap block storage API)

 Create off-heap block storage API (internal)
 

 Key: SPARK-6479
 URL: https://issues.apache.org/jira/browse/SPARK-6479
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Reporter: Reynold Xin

 Would be great to create APIs for off-heap block stores, rather than doing a 
 bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6369) InsertIntoHiveTable should use logic from SparkHadoopWriter

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6369:

Assignee: Cheng Lian

 InsertIntoHiveTable should use logic from SparkHadoopWriter
 ---

 Key: SPARK-6369
 URL: https://issues.apache.org/jira/browse/SPARK-6369
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker

 Right now it is possible that we will corrupt the output if there is a race 
 between competing speculative tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1684) Merge script should standardize SPARK-XXX prefix

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377108#comment-14377108
 ] 

Apache Spark commented on SPARK-1684:
-

User 'texasmichelle' has created a pull request for this issue:
https://github.com/apache/spark/pull/5149

 Merge script should standardize SPARK-XXX prefix
 

 Key: SPARK-1684
 URL: https://issues.apache.org/jira/browse/SPARK-1684
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Minor
  Labels: starter

 If users write "[SPARK-XXX] Issue", "SPARK-XXX. Issue", or "SPARK XXX: 
 Issue", we should convert it to "SPARK-XXX: Issue".
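 A sketch of the normalization rule in Scala, only to pin down the intended rewrite (the merge script itself is Python and is not shown here):
 {code}
 // Sketch: normalize common title prefixes to "SPARK-XXX: ".
 val prefix = """(?i)^\[?SPARK[- ](\d+)\]?[:.]?\s*""".r

 def standardizeTitle(title: String): String =
   prefix.replaceFirstIn(title, "SPARK-$1: ")

 // standardizeTitle("[SPARK-1684] Merge script") == "SPARK-1684: Merge script"
 // standardizeTitle("SPARK-1684. Merge script")  == "SPARK-1684: Merge script"
 // standardizeTitle("SPARK 1684: Merge script")  == "SPARK-1684: Merge script"
 {code}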



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6100) Distributed linear algebra in PySpark/MLlib

2015-03-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377208#comment-14377208
 ] 

Xiangrui Meng commented on SPARK-6100:
--

We don't have APIs for distributed matrices in Python. Now that we have MatrixUDT 
merged, it would be nice to create RowMatrix/CoordinateMatrix/BlockMatrix and 
use DataFrames for serialization. Do you want to start with that task?

 Distributed linear algebra in PySpark/MLlib
 ---

 Key: SPARK-6100
 URL: https://issues.apache.org/jira/browse/SPARK-6100
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 This is an umbrella JIRA for the Python API of distributed linear algebra in 
 MLlib. The goal is to make Python API on par with the Scala/Java API. We 
 should try wrapping Scala implementations as much as possible, instead of 
 implementing them in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix

2015-03-23 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6488:


 Summary: Support addition/multiplication in PySpark's BlockMatrix
 Key: SPARK-6488
 URL: https://issues.apache.org/jira/browse/SPARK-6488
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng


This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We 
should reuse the Scala implementation instead of having a separate 
implementation in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS

2015-03-23 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377249#comment-14377249
 ] 

Ryan Williams commented on SPARK-6449:
--

It doesn't look like it; [here is a 
gist|https://gist.github.com/ryan-williams/ff74066c127546910cac] with the 
entire file (8M), and the last 1000 lines, fwiw.



 Driver OOM results in reported application result SUCCESS
 -

 Key: SPARK-6449
 URL: https://issues.apache.org/jira/browse/SPARK-6449
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
Reporter: Ryan Williams

 I ran a job yesterday that according to the History Server and YARN RM 
 finished with status {{SUCCESS}}.
 Clicking around on the history server UI, there were too few stages run, and 
 I couldn't figure out why that would have been.
 Finally, inspecting the end of the driver's logs, I saw:
 {code}
 15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Shutting down remote daemon.
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Remote daemon shut down; proceeding with flushing remote transports.
 15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext
 Exception in thread "Driver" scala.MatchError: java.lang.OutOfMemoryError: GC 
 overhead limit exceeded (of class java.lang.OutOfMemoryError)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485)
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
 exitCode: 0, (reason: Shutdown hook called before final status was reported.)
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering 
 ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before 
 final status was reported.)
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Remoting shut down.
 15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be 
 successfully unregistered.
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory 
 .sparkStaging/application_1426705269584_0055
 {code}
 The driver OOM'd, [the {{catch}} block that presumably should have caught 
 it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484]
  threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and 
 written to the event log.
 This should be logged as a failed job and reported as such to YARN.
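 For illustration only (not the ApplicationMaster code), the shape of failure handling that avoids the {{MatchError}}: give fatal errors such as {{OutOfMemoryError}} an explicit branch so a non-zero status can be reported to YARN.
 {code}
 import scala.util.control.NonFatal

 // Sketch: ensure OutOfMemoryError lands in an explicit branch.
 def runUserClass(body: => Unit): Int =
   try { body; 0 /* success exit code */ }
   catch {
     case e: OutOfMemoryError =>
       System.err.println(s"Driver ran out of memory: ${e.getMessage}")
       1 // illustrative non-zero code so the app is reported as FAILED
     case NonFatal(e) =>
       System.err.println(s"User class threw exception: ${e.getMessage}")
       1
   }
 {code}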



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS

2015-03-23 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377251#comment-14377251
 ] 

Ryan Williams commented on SPARK-6449:
--

Seems like this was fixed as of 
[SPARK-6018|https://issues.apache.org/jira/browse/SPARK-6018], closing

 Driver OOM results in reported application result SUCCESS
 -

 Key: SPARK-6449
 URL: https://issues.apache.org/jira/browse/SPARK-6449
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
Reporter: Ryan Williams

 I ran a job yesterday that according to the History Server and YARN RM 
 finished with status {{SUCCESS}}.
 Clicking around on the history server UI, there were too few stages run, and 
 I couldn't figure out why that would have been.
 Finally, inspecting the end of the driver's logs, I saw:
 {code}
 15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Shutting down remote daemon.
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Remote daemon shut down; proceeding with flushing remote transports.
 15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext
 Exception in thread "Driver" scala.MatchError: java.lang.OutOfMemoryError: GC 
 overhead limit exceeded (of class java.lang.OutOfMemoryError)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485)
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
 exitCode: 0, (reason: Shutdown hook called before final status was reported.)
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering 
 ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before 
 final status was reported.)
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Remoting shut down.
 15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be 
 successfully unregistered.
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory 
 .sparkStaging/application_1426705269584_0055
 {code}
 The driver OOM'd, [the {{catch}} block that presumably should have caught 
 it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484]
  threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and 
 written to the event log.
 This should be logged as a failed job and reported as such to YARN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6475) DataFrame should support array types when creating DFs from JavaBeans.

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6475:

Component/s: (was: DataFrame)

 DataFrame should support array types when creating DFs from JavaBeans.
 --

 Key: SPARK-6475
 URL: https://issues.apache.org/jira/browse/SPARK-6475
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 If we have a JavaBean class with array fields, SQL throws an exception in 
 `createDataFrame` because arrays are not matched in `getSchema` from a 
 JavaBean class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5941) `def table` is not using the unresolved logical plan `UnresolvedRelation`

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5941:

Component/s: (was: DataFrame)
 SQL

 `def table` is not using the unresolved logical plan `UnresolvedRelation`
 -

 Key: SPARK-5941
 URL: https://issues.apache.org/jira/browse/SPARK-5941
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6484) Ganglia metrics xml reporter doesn't escape correctly

2015-03-23 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377080#comment-14377080
 ] 

Josh Rosen commented on SPARK-6484:
---

To provide some extra context for this JIRA, I think the problem here is that 
the Spark driver's executor ID is {{<driver>}}, which gets included in the 
metrics names, breaking things when it's not escaped properly.  Given that this 
{{<driver>}} has caused problems in other places, too, I'd suggest that we 
change it to something else (maybe just {{driver}}).  If we consider this 
identifier to be a stable public API, though, then I guess we'd have to resort 
to fixing the escaping.
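For illustration only (not the Ganglia reporter's actual code), escaping the 
five XML-special characters in a metric name, with & handled first so the 
ampersands introduced by the other entities are not escaped again:

{code}
def escapeXml(s: String): String =
  s.replace("&", "&amp;")
   .replace("\"", "&quot;")
   .replace("'", "&apos;")
   .replace("<", "&lt;")
   .replace(">", "&gt;")
{code}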

 Ganglia metrics xml reporter doesn't escape correctly
 -

 Key: SPARK-6484
 URL: https://issues.apache.org/jira/browse/SPARK-6484
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Michael Armbrust
Assignee: Josh Rosen
Priority: Critical

 The following should be escaped:
 {code}
 "   &quot;
 '   &apos;
 <   &lt;
 >   &gt;
 &   &amp;
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5368) Spark should support NAT (via akka improvements)

2015-03-23 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375163#comment-14375163
 ] 

jay vyas edited comment on SPARK-5368 at 3/24/15 1:59 AM:
--

Okay, to close: *optionally*, as part of the close, someone could PR an update to 
spark {{docs/configuration.md}} to clarify how 
SPARK_LOCAL_HOSTNAME (edit) changes akka's behavior, so that the context is 
clarified...


was (Author: jayunit100):
Okay, to close *optionally* with the close someone could PR an update to 
spark {{docs/configuration.md}} as part of closing, to clarify how the 
{{LOCAL_IP}} changes akka's behavior, so that the context  is clarified...

 Spark should  support NAT (via akka improvements)
 -

 Key: SPARK-5368
 URL: https://issues.apache.org/jira/browse/SPARK-5368
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: jay vyas
 Fix For: 1.2.2


 Spark sets up actors for akka with a set of variables which are defined in 
 the {{AkkaUtils.scala}} class.  
 A snippet:
 {noformat}
  98   |akka.loggers = [akka.event.slf4j.Slf4jLogger]
  99   |akka.stdout-loglevel = ERROR
 100   |akka.jvm-exit-on-fatal-error = off
 101   |akka.remote.require-cookie = $requireCookie
 102   |akka.remote.secure-cookie = $secureCookie
 {noformat}
 We should allow users to pass in custom settings, for example, so that 
 arbitrary akka modifications can be used at runtime for security, 
 performance, logging, and so on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5368) Spark should support NAT (via akka improvements)

2015-03-23 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377096#comment-14377096
 ] 

Matthew Farrellee commented on SPARK-5368:
--

[~jayunit100] the relevant config is {{LOCAL_HOSTNAME}}

 Spark should  support NAT (via akka improvements)
 -

 Key: SPARK-5368
 URL: https://issues.apache.org/jira/browse/SPARK-5368
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: jay vyas
 Fix For: 1.2.2


 Spark sets up actors for akka with a set of variables which are defined in 
 the {{AkkaUtils.scala}} class.  
 A snippet:
 {noformat}
  98   |akka.loggers = [akka.event.slf4j.Slf4jLogger]
  99   |akka.stdout-loglevel = ERROR
 100   |akka.jvm-exit-on-fatal-error = off
 101   |akka.remote.require-cookie = $requireCookie
 102   |akka.remote.secure-cookie = $secureCookie
 {noformat}
 We should allow users to pass in custom settings, for example, so that 
 arbitrary akka modifications can be used at runtime for security, 
 performance, logging, and so on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6361) Support adding a column with metadata in DataFrames

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377187#comment-14377187
 ] 

Apache Spark commented on SPARK-6361:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/5151

 Support adding a column with metadata in DataFrames
 ---

 Key: SPARK-6361
 URL: https://issues.apache.org/jira/browse/SPARK-6361
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 There is no easy way to add a column with metadata in DataFrames. This is 
 required by ML transformers to generate ML attributes.
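 A sketch of the kind of API this asks for (assuming a 
 {{Column.as(alias, metadata)}} overload along the lines of the linked PR; the 
 exact method may differ, and {{df}} with a {{label}} column is assumed):
 {code}
import org.apache.spark.sql.types.MetadataBuilder

val meta = new MetadataBuilder()
  .putString("comment", "generated by an ML transformer")  // illustrative key/value
  .build()
// Attach the metadata while aliasing the column:
val df2 = df.withColumn("labelWithMeta", df("label").as("labelWithMeta", meta))
 {code}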



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6485) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark

2015-03-23 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6485:


 Summary: Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark
 Key: SPARK-6485
 URL: https://issues.apache.org/jira/browse/SPARK-6485
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng


We should add APIs for CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark. 
Internally, we can use DataFrames for serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6486) Add BlockMatrix in PySpark

2015-03-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6486:
-
Description: We should add BlockMatrix to PySpark. Internally, we can use 
DataFrames and MatrixUDT for serialization. This JIRA should contain 
conversions between IndexedRowMatrix/CoordinateMatrix to block matrices. But 
this does NOT cover linear algebra operations of block matrices.  (was: We 
should add BlockMatrix to PySpark. Internally, we can use DataFrames and 
MatrixUDT for serialization. This JIRA does NOT cover linear algebra operations 
of block matrices.)

 Add BlockMatrix in PySpark
 --

 Key: SPARK-6486
 URL: https://issues.apache.org/jira/browse/SPARK-6486
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng

 We should add BlockMatrix to PySpark. Internally, we can use DataFrames and 
 MatrixUDT for serialization. This JIRA should contain conversions between 
 IndexedRowMatrix/CoordinateMatrix to block matrices. But this does NOT cover 
 linear algebra operations of block matrices.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib

2015-03-23 Thread Zhang JiaJin (JIRA)
Zhang JiaJin created SPARK-6487:
---

 Summary: Add sequential pattern mining algorithm to Spark MLlib
 Key: SPARK-6487
 URL: https://issues.apache.org/jira/browse/SPARK-6487
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Zhang JiaJin


Sequential pattern mining is an important branch of pattern mining. In past 
work, we used sequence mining (mainly the PrefixSpan algorithm) to find 
telecommunication signaling sequence patterns and achieved good results. 
However, once the data becomes too large, the running time becomes too long and 
can no longer meet the service requirements. We plan to implement the 
PrefixSpan algorithm in Spark and apply it to our subsequent work. 
The related paper: Distributed PrefixSpan algorithm based on MapReduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-03-23 Thread Konstantin Shaposhnikov (JIRA)
Konstantin Shaposhnikov created SPARK-6489:
--

 Summary: Optimize lateral view with explode to not read 
unnecessary columns
 Key: SPARK-6489
 URL: https://issues.apache.org/jira/browse/SPARK-6489
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Konstantin Shaposhnikov


Currently a query with lateral view explode(...) results in an execution plan 
that reads all columns of the underlying RDD.

E.g. given *ppl* table is DF created from Person case class:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])
{code}
the following SQL:
{code}
select name, sum(d) from ppl lateral view explode(data) d as d group by name
{code}
executes as follows:
{noformat}
== Physical Plan ==
Aggregate false, [name#0], [name#0,SUM(PartialSum#8L) AS _c1#3L]
 Exchange (HashPartitioning [name#0], 200)
  Aggregate true, [name#0], [name#0,SUM(CAST(d#6, LongType)) AS PartialSum#8L]
   Project [name#0,d#6]
Generate explode(data#2), true, false
 PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at 
ExistingRDD.scala:35
{noformat}

Note that *age* column is not needed to produce the output but it is still read 
from the underlying RDD.

A sample program to demonstrate the issue:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])

object ExplodeDemo extends App {
  val ppl = Array(
    Person("A", 20, Array(10, 12, 19)),
    Person("B", 25, Array(7, 8, 4)),
    Person("C", 19, Array(12, 4, 232)))

  val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
  val sc = new SparkContext(conf)
  val sqlCtx = new HiveContext(sc)
  import sqlCtx.implicits._
  val df = sc.makeRDD(ppl).toDF
  df.registerTempTable("ppl")
  val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name")
  s.explain(true)
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib

2015-03-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377234#comment-14377234
 ] 

Xiangrui Meng commented on SPARK-6487:
--

[~Zhang JiaJin] I'm not very familiar with pattern mining, but I don't see many 
citations of the paper you mentioned. So I need more information to understand 
the importance/popularity of sequential pattern mining and whether there exist 
really scalable algorithms. If there are not many requests for this feature or 
there are no scalable algorithms, you can certainly register your 
implementation as a third-party package on spark-packages.org and maintain it 
outside Spark for users.

 Add sequential pattern mining algorithm to Spark MLlib
 --

 Key: SPARK-6487
 URL: https://issues.apache.org/jira/browse/SPARK-6487
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Zhang JiaJin

 [~mengxr] [~zhangyouhua]
 Sequential pattern mining is an important branch of pattern mining. In past 
 work, we used sequence mining (mainly the PrefixSpan algorithm) to find 
 telecommunication signaling sequence patterns and achieved good results. 
 However, once the data becomes too large, the running time becomes too long 
 and can no longer meet the service requirements. We plan to implement the 
 PrefixSpan algorithm in Spark and apply it to our subsequent work. 
 The related paper: Distributed PrefixSpan algorithm based on MapReduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-03-23 Thread Konstantin Shaposhnikov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shaposhnikov updated SPARK-6489:
---
Description: 
Currently a query with lateral view explode(...) results in an execution plan 
that reads all columns of the underlying RDD.

E.g. given *ppl* table is DF created from Person case class:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])
{code}
the following SQL:
{code}
select name, sum(d) from ppl lateral view explode(data) d as d group by name
{code}
executes as follows:
{noformat}
== Physical Plan ==
Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L]
 Exchange (HashPartitioning [name#0], 200)
  Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS PartialSum#38L]
   Project [name#0,d#21]
Generate explode(data#2), true, false
 InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation 
[name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), 
(PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at 
ExistingRDD.scala:35), Some(ppl))
{noformat}

Note that *age* column is not needed to produce the output but it is still read 
from the underlying RDD.

A sample program to demonstrate the issue:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])

object ExplodeDemo extends App {
  val ppl = Array(
    Person("A", 20, Array(10, 12, 19)),
    Person("B", 25, Array(7, 8, 4)),
    Person("C", 19, Array(12, 4, 232)))

  val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
  val sc = new SparkContext(conf)
  val sqlCtx = new HiveContext(sc)
  import sqlCtx.implicits._
  val df = sc.makeRDD(ppl).toDF
  df.registerTempTable("ppl")
  sqlCtx.cacheTable("ppl") // cache the table, otherwise ExistingRDD (which does not support column pruning) will be used
  val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name")
  s.explain(true)
}
{code}

  was:
Currently a query with lateral view explode(...) results in an execution plan 
that reads all columns of the underlying RDD.

E.g. given *ppl* table is DF created from Person case class:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])
{code}
the following SQL:
{code}
select name, sum(d) from ppl lateral view explode(data) d as d group by name
{code}
executes as follows:
{noformat}
== Physical Plan ==
Aggregate false, [name#0], [name#0,SUM(PartialSum#8L) AS _c1#3L]
 Exchange (HashPartitioning [name#0], 200)
  Aggregate true, [name#0], [name#0,SUM(CAST(d#6, LongType)) AS PartialSum#8L]
   Project [name#0,d#6]
Generate explode(data#2), true, false
 PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at 
ExistingRDD.scala:35
{noformat}

Note that *age* column is not needed to produce the output but it is still read 
from the underlying RDD.

A sample program to demonstrate the issue:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])

object ExplodeDemo extends App {
  val ppl = Array(
    Person("A", 20, Array(10, 12, 19)),
    Person("B", 25, Array(7, 8, 4)),
    Person("C", 19, Array(12, 4, 232)))

  val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
  val sc = new SparkContext(conf)
  val sqlCtx = new HiveContext(sc)
  import sqlCtx.implicits._
  val df = sc.makeRDD(ppl).toDF
  df.registerTempTable("ppl")
  val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name")
  s.explain(true)
}
{code}


 Optimize lateral view with explode to not read unnecessary columns
 --

 Key: SPARK-6489
 URL: https://issues.apache.org/jira/browse/SPARK-6489
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Konstantin Shaposhnikov

 Currently a query with lateral view explode(...) results in an execution 
 plan that reads all columns of the underlying RDD.
 E.g. given *ppl* table is DF created from Person case class:
 {code}
 case class Person(val name: String, val age: Int, val data: Array[Int])
 {code}
 the following SQL:
 {code}
 select name, sum(d) from ppl lateral view explode(data) d as d group by name
 {code}
 executes as follows:
 {noformat}
 == Physical Plan ==
 Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L]
  Exchange (HashPartitioning [name#0], 200)
   Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS 
 PartialSum#38L]
Project [name#0,d#21]
 Generate explode(data#2), true, false
  InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation 
 [name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), 
 (PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at 
 ExistingRDD.scala:35), 

[jira] [Commented] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile

2015-03-23 Thread Pei-Lun Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377264#comment-14377264
 ] 

Pei-Lun Lee commented on SPARK-6352:


The above PR adds a new hadoop config value, 
spark.sql.parquet.output.committer.class, to let users select the output 
committer class used when saving Parquet files, similar to SPARK-3595. The base 
class is ParquetOutputCommitter. There is also a DirectParquetOutputCommitter 
added for file systems like S3.
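A sketch of how the new setting might be used once the PR is in (the config key 
and committer class names are taken from the comment above; the exact package of 
DirectParquetOutputCommitter, and the {{sc}}/{{df}} values, are assumptions):

{code}
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

// Subsequent Parquet writes would then go through the direct committer,
// avoiding the rename step that is slow on S3-like file systems:
df.saveAsParquetFile("s3n://my-bucket/output")  // hypothetical path
{code}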

 Supporting non-default OutputCommitter when using saveAsParquetFile
 ---

 Key: SPARK-6352
 URL: https://issues.apache.org/jira/browse/SPARK-6352
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1, 1.2.1, 1.3.0
Reporter: Pei-Lun Lee

 SPARK-3595 only handles custom OutputCommitter for saveAsHadoopFile; it would 
 be nice to have similar behavior in saveAsParquetFile. It may be difficult to 
 have a fully customizable OutputCommitter solution, but at least adding something 
 like a DirectParquetOutputCommitter and letting users choose between this and 
 the default should be enough.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6484) Ganglia metrics xml reporter doesn't escape correctly

2015-03-23 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-6484:
---

 Summary: Ganglia metrics xml reporter doesn't escape correctly
 Key: SPARK-6484
 URL: https://issues.apache.org/jira/browse/SPARK-6484
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Michael Armbrust
Assignee: Josh Rosen
Priority: Critical


The following should be escaped:

{code}
"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5368) Spark should support NAT (via akka improvements)

2015-03-23 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377106#comment-14377106
 ] 

jay vyas commented on SPARK-5368:
-

Looks like this is maybe subsumed by the doc-improvement work going on in 
SPARK-5113?
Either way, I propose to close this JIRA; I think the problem is solved and the 
doc improvement was just optional.

 Spark should  support NAT (via akka improvements)
 -

 Key: SPARK-5368
 URL: https://issues.apache.org/jira/browse/SPARK-5368
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: jay vyas
 Fix For: 1.2.2


 Spark sets up actors for akka with a set of variables which are defined in 
 the {{AkkaUtils.scala}} class.  
 A snippet:
 {noformat}
  98   |akka.loggers = [akka.event.slf4j.Slf4jLogger]
  99   |akka.stdout-loglevel = ERROR
 100   |akka.jvm-exit-on-fatal-error = off
 101   |akka.remote.require-cookie = $requireCookie
 102   |akka.remote.secure-cookie = $secureCookie
 {noformat}
 We should allow users to pass in custom settings, for example, so that 
 arbitrary akka modifications can be used at runtime for security, 
 performance, logging, and so on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6430) Cannot resolve column correctlly when using left semi join

2015-03-23 Thread zzc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377139#comment-14377139
 ] 

zzc commented on SPARK-6430:


what's wrong with this?

 Cannot resolve column correctlly when using left semi join
 --

 Key: SPARK-6430
 URL: https://issues.apache.org/jira/browse/SPARK-6430
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Spark 1.3.0 on yarn mode
Reporter: zzc

 My code:
 {quote}
 case class TestData(key: Int, value: String)
 case class TestData2(a: Int, b: Int)
 import org.apache.spark.sql.execution.joins._
 import sqlContext.implicits._
 val testData = sc.parallelize(
   (1 to 100).map(i => TestData(i, i.toString))).toDF()
 testData.registerTempTable("testData")
 val testData2 = sc.parallelize(
   TestData2(1, 1) ::
   TestData2(1, 2) ::
   TestData2(2, 1) ::
   TestData2(2, 2) ::
   TestData2(3, 1) ::
   TestData2(3, 2) :: Nil, 2).toDF()
 testData2.registerTempTable("testData2")
 //val tmp = sqlContext.sql("SELECT * FROM testData *LEFT SEMI JOIN* testData2 
 ON key = a")
 val tmp = sqlContext.sql("SELECT testData2.b, count(testData2.b) FROM 
 testData *LEFT SEMI JOIN* testData2 ON key = testData2.a group by 
 testData2.b")
 tmp.explain()
 {quote}
 Error log:
 {quote}
 org.apache.spark.sql.AnalysisException: cannot resolve 'testData2.b' given 
 input columns key, value; line 1 pos 108
   at 
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
 {quote}
 {quote}SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a{quote} is 
 correct, 
 {quote}
 SELECT a FROM testData LEFT SEMI JOIN testData2 ON key = a
 SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = a group by b
 SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a 
 group by testData2.b
 SELECT testData2.b, count(testData2.b) FROM testData LEFT SEMI JOIN testData2 
 ON key = testData2.a group by testData2.b
 {quote} are incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3720) support ORC in spark sql

2015-03-23 Thread iward (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377155#comment-14377155
 ] 

iward commented on SPARK-3720:
--

[~zhanzhang], I see. Since the patch is delayed, we can't use the orcFile API in 
Spark currently. But the problem of reading whole files is urgent; is there 
another way to solve this in Spark?

 support ORC in spark sql
 

 Key: SPARK-3720
 URL: https://issues.apache.org/jira/browse/SPARK-3720
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.1.0
Reporter: Fei Wang
 Attachments: orc.diff


 The Optimized Row Columnar (ORC) file format provides a highly efficient way 
 to store data on HDFS. The ORC file format has many advantages, such as:
 1 a single file as the output of each task, which reduces the NameNode's load
 2 Hive type support including datetime, decimal, and the complex types 
 (struct, list, map, and union)
 3 light-weight indexes stored within the file
 skip row groups that don't pass predicate filtering
 seek to a given row
 4 block-mode compression based on data type
 run-length encoding for integer columns
 dictionary encoding for string columns
 5 concurrent reads of the same file using separate RecordReaders
 6 ability to split files without scanning for markers
 7 bound the amount of memory needed for reading or writing
 8 metadata stored using Protocol Buffers, which allows addition and removal 
 of fields
 Spark SQL already supports Parquet; supporting ORC would give people more options.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3735) Sending the factor directly or AtA based on the cost in ALS

2015-03-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377200#comment-14377200
 ] 

Xiangrui Meng commented on SPARK-3735:
--

The proposal is actually something different. For example, suppose in a single 
block we have 1000 items that are rated by all users. We need to send those 1000 
item factors to all user in-blocks, so the shuffle size is 1000*k*numUserBlocks. 
However, if we only send the partial AtA and Atb generated by those items, the 
shuffle size is k*(k+1)*numUserBlocks, which is much smaller if k < 1000 (e.g., 
with k = 10 that is 10,000 values versus 110 values per user block). This is 
orthogonal to explicit/implicit feedback models.

 Sending the factor directly or AtA based on the cost in ALS
 ---

 Key: SPARK-3735
 URL: https://issues.apache.org/jira/browse/SPARK-3735
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 It is common to have some super popular products in the dataset. In this 
 case, sending many user factors to the target product block could be more 
 expensive than sending the normal equation `\sum_i u_i u_i^T` and `\sum_i u_i 
 r_ij` to the product block. The cost of sending a single factor is `k`, while 
 the cost of sending a normal equation is much more expensive, `k * (k + 3) / 
 2`. However, if we use normal equation for all products associated with a 
 user, we don't need to send this user factor.
 Determining the optimal assignment is hard. But we could use a simple 
 heuristic. Inside any rating block,
 1) order the product ids by the number of user ids associated with them in 
 desc order
 2) starting from the most popular product, mark popular products as "use 
 normal eq" and calculate the cost
 Remember the best assignment that comes with the lowest cost and use it for 
 computation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3278) Isotonic regression

2015-03-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377205#comment-14377205
 ] 

Xiangrui Meng commented on SPARK-3278:
--

Did you try truncating the digits of x to reduce the number of possible 
buckets? If the loss of precision is not super important, this could help 
scalability.
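A sketch of the truncation idea (assuming the input {{points}} is an RDD of 
(label, feature, weight) triples as used by mllib's IsotonicRegression, and 
that two decimal places are enough precision for the use case):

{code}
// Bucket the feature values by truncating to 2 decimal places before training:
val truncated = points.map { case (label, feature, weight) =>
  (label, math.floor(feature * 100) / 100, weight)
}
{code}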

 Isotonic regression
 ---

 Key: SPARK-3278
 URL: https://issues.apache.org/jira/browse/SPARK-3278
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Martin Zapletal
 Fix For: 1.3.0


 Add isotonic regression for score calibration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377206#comment-14377206
 ] 

Apache Spark commented on SPARK-6464:
-

User 'SaintBacchus' has created a pull request for this issue:
https://github.com/apache/spark/pull/5152

 Add a new transformation of rdd named processCoalesce which was  particularly 
 to deal with the small and cached rdd
 ---

 Key: SPARK-6464
 URL: https://issues.apache.org/jira/browse/SPARK-6464
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: SaintBacchus
 Attachments: screenshot-1.png


 Nowadays, the transformation *coalesce* is often used to expand or reduce 
 the number of partitions in order to get good performance.
 But *coalesce* can't guarantee that a child partition will be executed on 
 the same executor as its parent partitions, and this can lead to a large 
 amount of network transfer.
 In a scenario such as the one mentioned in the title, +small and cached rdd+, we 
 want to coalesce all the partitions on the same executor into one partition 
 and make sure the child partition is executed on that executor. This avoids 
 network transfer, reduces task scheduling overhead, and frees the CPU cores 
 to do other work. 
 In this scenario, our performance improved by 20%.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib

2015-03-23 Thread Zhang JiaJin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang JiaJin updated SPARK-6487:

Description: 
[~mengxr] [~zhangyouhua]
Sequential pattern mining is an important branch of pattern mining. In past 
work, we used sequence mining (mainly the PrefixSpan algorithm) to find 
telecommunication signaling sequence patterns and achieved good results. 
However, once the data becomes too large, the running time becomes too long and 
can no longer meet the service requirements. We plan to implement the 
PrefixSpan algorithm in Spark and apply it to our subsequent work. 
The related paper: Distributed PrefixSpan algorithm based on MapReduce.

  was:
Sequential pattern mining is an important branch of pattern mining. In past 
work, we used sequence mining (mainly the PrefixSpan algorithm) to find 
telecommunication signaling sequence patterns and achieved good results. 
However, once the data becomes too large, the running time becomes too long and 
can no longer meet the service requirements. We plan to implement the 
PrefixSpan algorithm in Spark and apply it to our subsequent work. 
The related paper: Distributed PrefixSpan algorithm based on MapReduce.


 Add sequential pattern mining algorithm to Spark MLlib
 --

 Key: SPARK-6487
 URL: https://issues.apache.org/jira/browse/SPARK-6487
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Zhang JiaJin

 [~mengxr] [~zhangyouhua]
 Sequential pattern mining is an important branch of pattern mining. In past 
 work, we used sequence mining (mainly the PrefixSpan algorithm) to find 
 telecommunication signaling sequence patterns and achieved good results. 
 However, once the data becomes too large, the running time becomes too long 
 and can no longer meet the service requirements. We plan to implement the 
 PrefixSpan algorithm in Spark and apply it to our subsequent work. 
 The related paper: Distributed PrefixSpan algorithm based on MapReduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6449) Driver OOM results in reported application result SUCCESS

2015-03-23 Thread Ryan Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Williams resolved SPARK-6449.
--
   Resolution: Implemented
Fix Version/s: 1.3.0

 Driver OOM results in reported application result SUCCESS
 -

 Key: SPARK-6449
 URL: https://issues.apache.org/jira/browse/SPARK-6449
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
Reporter: Ryan Williams
 Fix For: 1.3.0


 I ran a job yesterday that according to the History Server and YARN RM 
 finished with status {{SUCCESS}}.
 Clicking around on the history server UI, there were too few stages run, and 
 I couldn't figure out why that would have been.
 Finally, inspecting the end of the driver's logs, I saw:
 {code}
 15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Shutting down remote daemon.
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Remote daemon shut down; proceeding with flushing remote transports.
 15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext
 Exception in thread "Driver" scala.MatchError: java.lang.OutOfMemoryError: GC 
 overhead limit exceeded (of class java.lang.OutOfMemoryError)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485)
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
 exitCode: 0, (reason: Shutdown hook called before final status was reported.)
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering 
 ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before 
 final status was reported.)
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Remoting shut down.
 15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be 
 successfully unregistered.
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory 
 .sparkStaging/application_1426705269584_0055
 {code}
 The driver OOM'd, [the {{catch}} block that presumably should have caught 
 it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484]
  threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and 
 written to the event log.
 This should be logged as a failed job and reported as such to YARN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5692) Model import/export for Word2Vec

2015-03-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5692:
-
Assignee: Manoj Kumar  (was: ANUPAM MEDIRATTA)

 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Manoj Kumar

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-03-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377266#comment-14377266
 ] 

Xiangrui Meng commented on SPARK-5692:
--

[~anupamme] You should get familiar with Scala and Spark development first 
before working on specific JIRAs. I've assigned this ticket to [~MechCoder]. 
There are many open JIRAs for MLlib. Once you are familiar with Scala/Spark, 
feel free to ping me on a JIRA that you are interested in.

 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: ANUPAM MEDIRATTA

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6189:

Component/s: (was: DataFrame)

 Pandas to DataFrame conversion should check field names for periods
 ---

 Key: SPARK-6189
 URL: https://issues.apache.org/jira/browse/SPARK-6189
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Issue I ran into:  I imported an R dataset in CSV format into a Pandas 
 DataFrame and then used toDF() to convert that into a Spark DataFrame.  The R 
 dataset had a column with a period in it (column "GNP.deflator" in the 
 "longley" dataset).  When I tried to select it using the Spark DataFrame DSL, 
 I could not because the DSL thought the period was selecting a field within 
 "GNP".
 Also, since "GNP" is another field's name, it gives an error which could be 
 obscure to users, complaining:
 {code}
 org.apache.spark.sql.AnalysisException: GetField is not valid on fields of 
 type DoubleType;
 {code}
 We should either handle periods in column names or check during loading and 
 warn/fail gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5919) Enable broadcast joins for Parquet files

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5919:

Component/s: (was: DataFrame)
 SQL

 Enable broadcast joins for Parquet files
 

 Key: SPARK-5919
 URL: https://issues.apache.org/jira/browse/SPARK-5919
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.1
Reporter: Dima Zhiyanov

 Unable to perform broadcast join of Schema RDDs created from Parquet files. 
 Computing statistics is only available for real Hive tables, and it is not 
 always convenient to create a Hive table for every Parquet file.
 The issue is discussed here
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-do-broadcast-join-in-SparkSQL-td15298.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6229) Support SASL encryption in network/common module

2015-03-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377102#comment-14377102
 ] 

Marcelo Vanzin commented on SPARK-6229:
---

Hi, me again. So I finally got back to actually playing with the code today, 
and I was trying out what it would take to implement my suggestion. While I 
think it's worthwhile to abstract as much set up as possible behind the library 
code, that change in itself is becoming too large and kinda taking away from 
the focus of the change (adding encryption). So at this point I think it would 
be easier to go with something smaller that, while not optimal in my view, at 
least is more targeted, and we can do cleanup, if desired, separately.

So my current plan is to go with something not entirely unlike what Aaron is 
saying.

* On the client side, TransportClientBootstrap would expose the channel to the 
bootstrap implementation, so that the SASL bootstrap can, after negotiation, 
insert itself into the pipeline to perform encryption.

* On the server side, I'll probably need to build something similar to 
TransportClientBootstrap. I haven't really looked at what the code would look 
like, but this will most probably require changes to all call sites that use 
SaslRpcHandler at the moment.

So hopefully this will be a much smaller change that is also easier to review.

 Support SASL encryption in network/common module
 

 Key: SPARK-6229
 URL: https://issues.apache.org/jira/browse/SPARK-6229
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Marcelo Vanzin

 After SASL support has been added to network/common, supporting encryption 
 should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. 
 Since the latter requires a valid kerberos login to work (and so doesn't 
 really work with executors), encryption would require the use of DIGEST-MD5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1684) Merge script should standardize SPARK-XXX prefix

2015-03-23 Thread Michelle Casbon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michelle Casbon updated SPARK-1684:
---
Attachment: spark_pulls_before_after.txt

Test data (spark_pulls_before_after.txt): titles from existing pull requests in 
their original form and after being modified with the new function. It's not 
perfect, but it cleans up most of the common problems.
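For reference, an illustrative sketch of the kind of normalization involved 
(the real logic lives in the merge script and may differ):

{code}
// Normalize common prefix variants such as "[SPARK-1684] Issue", "SPARK-1684. Issue"
// or "SPARK 1684: Issue" to "SPARK-1684: Issue"; leave other titles untouched.
def standardizeJiraRef(title: String): String = {
  val Prefix = """(?i)^\[?\s*SPARK[- ]*(\d+)\s*\]?[:.\s]*(.*)$""".r
  title match {
    case Prefix(num, rest) => s"SPARK-$num: $rest"
    case _                 => title
  }
}
{code}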

 Merge script should standardize SPARK-XXX prefix
 

 Key: SPARK-1684
 URL: https://issues.apache.org/jira/browse/SPARK-1684
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Minor
  Labels: starter
 Attachments: spark_pulls_before_after.txt


 If users write "[SPARK-XXX] Issue" or "SPARK-XXX. Issue" or "SPARK XXX: 
 Issue" we should convert it to "SPARK-XXX: Issue"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)

2015-03-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377197#comment-14377197
 ] 

Xiangrui Meng commented on SPARK-6192:
--

Thanks for the update! The current version looks good to me. Please keep me 
updated on key events.

 Enhance MLlib's Python API (GSoC 2015)
 --

 Key: SPARK-6192
 URL: https://issues.apache.org/jira/browse/SPARK-6192
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Manoj Kumar
  Labels: gsoc, gsoc2015, mentor

 This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme 
 is to enhance MLlib's Python API, to make it on par with the Scala/Java API. 
 The main tasks are:
 1. For all models in MLlib, provide save/load method. This also
 includes save/load in Scala.
 2. Python API for evaluation metrics.
 3. Python API for streaming ML algorithms.
 4. Python API for distributed linear algebra.
 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use
 customized serialization, making MLLibPythonAPI hard to maintain. It
 would be nice to use the DataFrames for serialization.
 I'll link the JIRAs for each of the tasks.
 Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. 
 The TODO list will be dynamic based on the backlog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6334.
--
Resolution: Duplicate

SPARK-5955 was merged. So if you can use the latest master, you can set 
checkpoint interval to control the shuffle files. I'm closing this issue since 
there exists a workaround and it is fixed in master.
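A sketch of the workaround on the Scala side (assuming a build that includes 
SPARK-5955, where ALS exposes a checkpoint-interval setter; {{sc}} and a 
{{ratings}} RDD of {{Rating}}s are assumed):

{code}
import org.apache.spark.mllib.recommendation.ALS

sc.setCheckpointDir("/tmp/checkpoints")
val model = new ALS()
  .setImplicitPrefs(true)
  .setRank(50)
  .setIterations(15)
  .setLambda(0.1)
  .setAlpha(40)
  .setCheckpointInterval(5)  // checkpoint every 5 iterations so old shuffle files can be cleaned
  .run(ratings)
{code}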

 spark-local dir not getting cleared during ALS
 --

 Key: SPARK-6334
 URL: https://issues.apache.org/jira/browse/SPARK-6334
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Antony Mayi
 Attachments: als-diskusage.png


 When running bigger ALS training, Spark spills loads of temp data into the 
 local-dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running 
 on YARN from cdh 5.3.2), eventually causing all the disks of all nodes to run 
 out of space (in my case I have 12TB of available disk capacity before 
 kicking off the ALS, but it all gets used, and yarn kills the containers when 
 reaching 90%).
 even with all recommended options (configuring checkpointing and forcing GC 
 when possible) it still doesn't get cleared.
 here is my (pseudo)code (pyspark):
 {code}
 sc.setCheckpointDir('/tmp')
 training = 
 sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
 model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
 sc._jvm.System.gc()
 {code}
 the training RDD has about 3.5 billion items (~60GB on disk). After about 
 6 hours the ALS will consume all 12TB of disk space in local-dir data and 
 gets killed. my cluster has 192 cores, 1.5TB RAM and for this task I am using 
 37 executors of 4 cores/28+4GB RAM each.
 this is the graph of disk consumption pattern showing the space being all 
 eaten from 7% to 90% during the ALS (90% is when YARN kills the container):
 !als-diskusage.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


