[jira] [Commented] (SPARK-1087) Separate file for traceback and callsite related functions

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133094#comment-14133094
 ] 

Apache Spark commented on SPARK-1087:
-

User 'staple' has created a pull request for this issue:
https://github.com/apache/spark/pull/2385

> Separate file for traceback and callsite related functions
> --
>
> Key: SPARK-1087
> URL: https://issues.apache.org/jira/browse/SPARK-1087
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jyotiska NK
>
> Right now, _extract_concise_traceback(), which provides the callsite 
> information, is written inside rdd.py. But for 
> [SPARK-972](https://spark-project.atlassian.net/browse/SPARK-972) in PR #581, 
> we used the function from context.py. Some issues were also faced regarding 
> the return string format. 
> It would be a good idea to move the traceback function out of rdd.py and 
> create a separate file for future development. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3463) Show metrics about spilling in Python

2014-09-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3463.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2336
[https://github.com/apache/spark/pull/2336]

> Show metrics about spilling in Python
> -
>
> Key: SPARK-3463
> URL: https://issues.apache.org/jira/browse/SPARK-3463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.2.0
>
>
> It should also show the number of bytes spilled to disk while doing 
> aggregation in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode

2014-09-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3438:
---
Description: 
Access to secured HDFS is currently supported in YARN using YARN's built-in 
security mechanism. In YARN mode, a user application is authenticated when it 
is submitted; it then acquires delegation tokens and ships them (via YARN) 
securely to workers.

In Standalone mode, it would be nice to support a mechanism for accessing 
HDFS where we rely on a single shared secret to authenticate communication in 
the standalone cluster.

1. A company is running a standalone cluster.
2. They are fine if all Spark jobs in the cluster share a global secret, i.e. 
all Spark jobs can trust one another.
3. They are able to provide a Hadoop login on the driver node via a keytab or 
kinit. They want tokens from this login to be distributed to the executors to 
allow access to secure HDFS.
4. They also don't want to trust the network in the cluster, i.e. they don't 
want to allow someone to fetch HDFS tokens over a known protocol without 
authentication.

  was:Secured HDFS is supported in YARN currently, but not in standalone mode. 
The tricky bit is how to disseminate the delegation tokens securely in 
standalone mode.


> Support for accessing secured HDFS in Standalone Mode
> -
>
> Key: SPARK-3438
> URL: https://issues.apache.org/jira/browse/SPARK-3438
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Affects Versions: 1.0.2
>Reporter: Zhanfeng Huo
>
> Access to secured HDFS is currently supported in YARN using YARN's built-in 
> security mechanism. In YARN mode, a user application is authenticated when it 
> is submitted; it then acquires delegation tokens and ships them (via 
> YARN) securely to workers.
> In Standalone mode, it would be nice to support a mechanism for 
> accessing HDFS where we rely on a single shared secret to authenticate 
> communication in the standalone cluster.
> 1. A company is running a standalone cluster.
> 2. They are fine if all Spark jobs in the cluster share a global secret, i.e. 
> all Spark jobs can trust one another.
> 3. They are able to provide a Hadoop login on the driver node via a keytab or 
> kinit. They want tokens from this login to be distributed to the executors to 
> allow access to secure HDFS.
> 4. They also don't want to trust the network in the cluster, i.e. they don't 
> want to allow someone to fetch HDFS tokens over a known protocol without 
> authentication.
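
A minimal sketch of item 3 above, assuming Hadoop 2.x APIs (the object name, the "renewer" placeholder, and how the resulting tokens would be shipped to executors are all illustrative, not Spark's actual implementation):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// Sketch: log in from a keytab on the driver and collect HDFS delegation
// tokens. The tokens would then be shipped to executors over a channel
// authenticated with the cluster's shared secret (item 2 above).
object DriverTokenFetcher {
  def fetchTokens(principal: String, keytabPath: String): Credentials = {
    UserGroupInformation.loginUserFromKeytab(principal, keytabPath)
    val creds = new Credentials()
    val hadoopConf = new Configuration()
    // "renewer" stands for whichever principal is allowed to renew the
    // tokens; it is a hypothetical placeholder here.
    FileSystem.get(hadoopConf).addDelegationTokens("renewer", creds)
    creds
  }
}
{code}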



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3501) Hive SimpleUDF will create duplicated type cast which cause exception in constant folding

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3501:

Assignee: Cheng Hao

> Hive SimpleUDF will create duplicated type cast which cause exception in 
> constant folding
> -
>
> Key: SPARK-3501
> URL: https://issues.apache.org/jira/browse/SPARK-3501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Minor
>
> When running a query like:
> select datediff(cast(value as timestamp), cast('2002-03-21 00:00:00' as 
> timestamp)) from src;
> Spark SQL will raise an exception:
> {panel}
> [info] - Cast Timestamp to Timestamp in UDF *** FAILED ***
> [info]   scala.MatchError: TimestampType (of class 
> org.apache.spark.sql.catalyst.types.TimestampType$)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Cast.castToTimestamp(Cast.scala:77)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:251)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:217)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:210)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180)
> [info]   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> [info]   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> {panel}
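
As a toy illustration of the failure mode (a simplified model, not Catalyst's actual {{Cast}}): when the cast's conversion table has no identity case, a redundant cast-to-timestamp wrapped around an expression that is already a timestamp throws a {{scala.MatchError}} as soon as constant folding tries to evaluate it:

{code}
// Toy model: the conversion matches on the *source* type and has no case
// for "already a timestamp", so a duplicated cast evaluated during constant
// folding fails with a MatchError.
sealed trait DataType
case object StringType extends DataType
case object TimestampType extends DataType

case class SimpleCast(sourceType: DataType, target: DataType) {
  def eval(value: Any): Any = target match {
    case TimestampType => sourceType match {
      case StringType => java.sql.Timestamp.valueOf(value.toString)
      // missing: case TimestampType => value   (identity cast)
    }
  }
}

object RedundantCastDemo extends App {
  val ts = java.sql.Timestamp.valueOf("2002-03-21 00:00:00")
  // Equivalent of cast(cast(x as timestamp) as timestamp):
  SimpleCast(TimestampType, TimestampType).eval(ts) // scala.MatchError: TimestampType
}
{code}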



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3515) ParquetMetastoreSuite fails when executed together with other suites under Maven

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3515.
-
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Cheng Lian

> ParquetMetastoreSuite fails when executed together with other suites under 
> Maven
> 
>
> Key: SPARK-3515
> URL: https://issues.apache.org/jira/browse/SPARK-3515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 1.2.0
>
>
> Reproduction step:
> {code}
> mvn -Phive,hadoop-2.4 
> -DwildcardSuites=org.apache.spark.sql.parquet.ParquetMetastoreSuite,org.apache.spark.sql.hive.StatisticsSuite
>  -pl core,sql/catalyst,sql/core,sql/hive test
> {code}
> Maven instantiates all discovered test suite objects first, and then starts 
> executing all test cases. {{ParquetMetastoreSuite}} sets up several temporary 
> tables in its constructor, but these tables are deleted immediately because 
> {{StatisticsSuite}}'s constructor calls {{TestHiveContext.reset()}}.
> To fix this issue, we shouldn't put this kind of side effect in the 
> constructor, but in {{beforeAll}}.
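
A minimal ScalaTest sketch of that suggestion (the table names and helper methods are placeholders, not the suite's real setup code):

{code}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Sketch: create the temporary tables in beforeAll rather than in the
// constructor, so another suite's constructor-time TestHiveContext.reset()
// cannot delete them before this suite's tests run.
class ParquetMetastoreSuiteSketch extends FunSuite with BeforeAndAfterAll {

  override def beforeAll(): Unit = {
    createTempParquetTable("partitioned_parquet")  // hypothetical helper
    createTempParquetTable("normal_parquet")
  }

  override def afterAll(): Unit = {
    dropTempTables()
  }

  test("reads data through the metastore parquet tables") {
    // ... queries against the tables created in beforeAll ...
  }

  private def createTempParquetTable(name: String): Unit = ()
  private def dropTempTables(): Unit = ()
}
{code}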



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3501) Hive SimpleUDF will create duplicated type cast which cause exception in constant folding

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3501:

Target Version/s: 1.2.0

> Hive SimpleUDF will create duplicated type cast which cause exception in 
> constant folding
> -
>
> Key: SPARK-3501
> URL: https://issues.apache.org/jira/browse/SPARK-3501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Minor
>
> When running a query like:
> select datediff(cast(value as timestamp), cast('2002-03-21 00:00:00' as 
> timestamp)) from src;
> Spark SQL will raise an exception:
> {panel}
> [info] - Cast Timestamp to Timestamp in UDF *** FAILED ***
> [info]   scala.MatchError: TimestampType (of class 
> org.apache.spark.sql.catalyst.types.TimestampType$)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Cast.castToTimestamp(Cast.scala:77)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:251)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:217)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:210)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180)
> [info]   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> [info]   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3485) should check parameter type when find constructors

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3485:

Target Version/s: 1.2.0

> should check parameter type when find constructors
> --
>
> Key: SPARK-3485
> URL: https://issues.apache.org/jira/browse/SPARK-3485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Adrian Wang
>
> In hiveUdfs, we get constructors for primitive types by finding a constructor 
> that takes only one parameter. This is very dangerous when more than one 
> constructor matches. As the sequence of primitiveTypes grows larger, the 
> problem will occur.
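
A hedged sketch of the kind of lookup being suggested, using plain Java reflection (the object and method names are illustrative, not the actual hiveUdfs code): select the one-argument constructor whose parameter type actually accepts the argument, instead of the first one-argument constructor found.

{code}
import java.lang.reflect.Constructor

object ConstructorLookup {
  // Pick the single-argument constructor whose declared parameter type can
  // accept `argClass`, rather than whichever one-arg constructor comes first.
  def findUnaryConstructor[T](cls: Class[T], argClass: Class[_]): Option[Constructor[T]] =
    cls.getConstructors.collectFirst {
      case c if c.getParameterTypes.length == 1 &&
                c.getParameterTypes()(0).isAssignableFrom(argClass) =>
        c.asInstanceOf[Constructor[T]]
    }
}

// e.g. among BigDecimal's many one-argument constructors, this picks the one
// that takes a String and skips BigDecimal(double), BigDecimal(char[]), etc.:
// ConstructorLookup.findUnaryConstructor(classOf[java.math.BigDecimal], classOf[String])
{code}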



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3030) reuse python worker

2014-09-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3030.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2259
[https://github.com/apache/spark/pull/2259]

> reuse python worker
> ---
>
> Key: SPARK-3030
> URL: https://issues.apache.org/jira/browse/SPARK-3030
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.2.0
>
>
> Currently, PySpark forks a Python worker for each task; it would be better if 
> we could reuse the worker for later tasks.
> This would be very useful for large datasets with big broadcasts, since the 
> broadcast would not need to be sent to the worker again and again. It would 
> also reduce the overhead of launching a task.
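
A minimal sketch of the reuse idea, independent of PySpark's actual daemon/socket protocol (the {{PythonWorker}} trait and the pool shape are assumptions): finished workers go back into an idle pool, and a fork only happens on a pool miss.

{code}
import scala.collection.mutable

// Hypothetical stand-in for a forked Python worker process.
trait PythonWorker { def close(): Unit }

// Sketch: tasks borrow a worker from an idle pool and return it when done,
// so later tasks reuse the same process (and its already-received broadcasts)
// instead of paying the fork cost again.
class WorkerPool(fork: () => PythonWorker) {
  private val idle = mutable.Queue.empty[PythonWorker]

  def borrow(): PythonWorker = synchronized {
    if (idle.nonEmpty) idle.dequeue() else fork()
  }

  def release(worker: PythonWorker): Unit = synchronized {
    idle.enqueue(worker)
  }

  def shutdown(): Unit = synchronized {
    idle.foreach(_.close())
    idle.clear()
  }
}
{code}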



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3294) Avoid boxing/unboxing when handling in-memory columnar storage

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3294.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

> Avoid boxing/unboxing when handling in-memory columnar storage
> --
>
> Key: SPARK-3294
> URL: https://issues.apache.org/jira/browse/SPARK-3294
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.2.0
>
>
> When Spark SQL in-memory columnar storage was implemented, we tried to avoid 
> boxing/unboxing costs as much as possible, but {{javap}} shows that there 
> still exists code that involves boxing/unboxing on critical paths due to type 
> erasure, especially in methods of subclasses of {{ColumnType}}. We should 
> eliminate it whenever possible for better performance.
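
A toy sketch of the erasure issue described above (not the real {{ColumnType}} hierarchy): a generic {{append(v: T)}} is erased to take a reference type, so primitive values get boxed at the call site, whereas a monomorphic method such as {{appendInt}} keeps the value unboxed on the hot path.

{code}
import java.nio.ByteBuffer

// Generic version: after erasure, append takes an Object, so every
// append(i: Int) call made through this interface boxes the Int.
abstract class BoxingColumnType[T] {
  def append(v: T, buffer: ByteBuffer): Unit
}

// Hot-path workaround: a primitive-specific method that never sees a box.
object IntColumnTypeSketch extends BoxingColumnType[Int] {
  override def append(v: Int, buffer: ByteBuffer): Unit = appendInt(v, buffer)
  def appendInt(v: Int, buffer: ByteBuffer): Unit = buffer.putInt(v)
}
{code}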



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3519) PySpark RDDs are missing the distinct(n) method

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132963#comment-14132963
 ] 

Apache Spark commented on SPARK-3519:
-

User 'mattf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2383

> PySpark RDDs are missing the distinct(n) method
> ---
>
> Key: SPARK-3519
> URL: https://issues.apache.org/jira/browse/SPARK-3519
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Nicholas Chammas
>Assignee: Matthew Farrellee
>
> {{distinct()}} works but {{distinct(N)}} doesn't.
> {code}
> >>> sc.parallelize([1,1,2]).distinct()
> PythonRDD[15] at RDD at PythonRDD.scala:43
> >>> sc.parallelize([1,1,2]).distinct(2)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: distinct() takes exactly 1 argument (2 given)
> {code}
> The PySpark docs only call out [the {{distinct()}} 
> signature|http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#distinct],
>  but the programming guide [includes the {{distinct(N)}} 
> signature|http://spark.apache.org/docs/1.1.0/programming-guide.html#transformations]
>  as well.
> {quote}
> {noformat}
> distinct([numTasks])) Return a new dataset that contains the distinct 
> elements of the source dataset.
> {noformat}
> {quote}
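
For comparison, the Scala RDD API already accepts a partition count, and {{distinct(n)}} boils down to a {{reduceByKey}} with that count, which is what the PySpark method would mirror. A small standalone sketch of both (the app name and local master are only for illustration):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object DistinctWithPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("distinct-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq(1, 1, 2))

    // The Scala RDD API accepts an explicit partition count:
    val viaApi = rdd.distinct(2)

    // ...and distinct(n) is essentially this reduceByKey pattern:
    val manual = rdd.map(x => (x, null)).reduceByKey((x, _) => x, 2).map(_._1)

    println(viaApi.collect().toSeq)   // elements 1 and 2, in some order
    println(manual.collect().toSeq)
    sc.stop()
  }
}
{code}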



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3481) HiveComparisonTest throws exception of "org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: default"

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3481.
-
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Cheng Hao

> HiveComparisonTest throws exception of 
> "org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: 
> default"
> ---
>
> Key: SPARK-3481
> URL: https://issues.apache.org/jira/browse/SPARK-3481
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Minor
> Fix For: 1.2.0
>
>
> In local tests, lots of exceptions are raised like:
> {panel}
> 11:08:01.746 ERROR hive.ql.exec.DDLTask: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: 
> default
>   at 
> org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480)
>   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272)
>   at 
> org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:88)
>   at 
> org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:348)
>   at 
> org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply$mcV$sp(HiveComparisonTest.scala:255)
>   at 
> org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225)
>   at 
> org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1559)
>   at org.scalatest.Suite$class.run(Suite.scala:1423)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204)
>   at 
> org.apache.spark.sql.hive.execution.HiveComparisonTest.org$scalatest$BeforeAndAfterAll$$super$run(HiveComparisonTest.scala:41)
>   at 
> org.scalatest.BeforeAndAfte

[jira] [Commented] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132893#comment-14132893
 ] 

Apache Spark commented on SPARK-3414:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/2382

> Case insensitivity breaks when unresolved relation contains attributes with 
> uppercase letters in their names
> 
>
> Key: SPARK-3414
> URL: https://issues.apache.org/jira/browse/SPARK-3414
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.2
>Reporter: Cheng Lian
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.2.0
>
>
> Paste the following snippet into {{spark-shell}} (Hive support required) to 
> reproduce this issue:
> {code}
> import org.apache.spark.sql.hive.HiveContext
> val hiveContext = new HiveContext(sc)
> import hiveContext._
> case class LogEntry(filename: String, message: String)
> case class LogFile(name: String)
> sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
> sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")
> val srdd = sql(
>   """
> SELECT name, message
> FROM rawLogs
> JOIN (
>   SELECT name
>   FROM logFiles
> ) files
> ON rawLogs.filename = files.name
>   """)
> srdd.registerTempTable("boom")
> sql("select * from boom")
> {code}
> Exception thrown:
> {code}
> SchemaRDD[7] at RDD at SchemaRDD.scala:103
> == Query Plan ==
> == Physical Plan ==
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
> attributes: *, tree:
> Project [*]
>  LowerCaseSchema
>   Subquery boom
>Project ['name,'message]
> Join Inner, Some(('rawLogs.filename = name#2))
>  LowerCaseSchema
>   Subquery rawlogs
>SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
> MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
>  Subquery files
>   Project [name#2]
>LowerCaseSchema
> Subquery logfiles
>  SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at 
> mapPartitions at basicOperators.scala:208)
> {code}
> Notice that {{rawLogs}} in the join operator is not lowercased.
> The reason is that, during the analysis phase, the 
> {{CaseInsensitiveAttributeReferences}} batch is only executed before the 
> {{Resolution}} batch. And when {{srdd}} is registered as the temporary table 
> {{boom}}, its original (unanalyzed) logical plan is stored in the catalog:
> {code}
> Join Inner, Some(('rawLogs.filename = 'files.name))
>  UnresolvedRelation None, rawLogs, None
>  Subquery files
>   Project ['name]
>UnresolvedRelation None, logFiles, None
> {code}
> Notice that the attributes referenced in the join operator (esp. {{rawLogs}}) 
> are not lowercased yet.
> And then, when {{select * from boom}} is being analyzed, its input logical 
> plan is:
> {code}
> Project [*]
>  UnresolvedRelation None, boom, None
> {code}
> Here the unresolved relation points to the unanalyzed logical plan of 
> {{srdd}} above, which is only discovered later by the {{ResolveRelations}} 
> rule and is thus not touched by {{CaseInsensitiveAttributeReferences}} at 
> all, so {{rawLogs.filename}} is not lowercased:
> {code}
> === Applying Rule 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
>  Project [*]Project [*]
> ! UnresolvedRelation None, boom, NoneLowerCaseSchema
> ! Subquery boom
> !  Project ['name,'message]
> !   Join Inner, 
> Some(('rawLogs.filename = 'files.name))
> !LowerCaseSchema
> ! Subquery rawlogs
> !  SparkLogicalPlan (ExistingRdd 
> [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at 
> basicOperators.scala:208)
> !Subquery files
> ! Project ['name]
> !  LowerCaseSchema
> !   Subquery logfiles
> !SparkLogicalPlan 
> (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at 
> basicOperators.scala:208)
> {code}
> A reasonable fix for this could be to always register the analyzed logical 
> plan in the catalog when registering temporary tables.
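
A toy sketch of that fix (deliberately ignoring Catalyst's real types): have the catalog store the analyzed plan rather than the raw parsed one, so a later lookup can never hand back a plan that skipped the case-normalization batch.

{code}
import scala.collection.mutable

// Toy model: the `analyze` function stands in for running the full batch of
// analysis rules (including case normalization) before registration.
trait LogicalPlanLike

class TempTableCatalog(analyze: LogicalPlanLike => LogicalPlanLike) {
  private val tables = mutable.Map.empty[String, LogicalPlanLike]

  def registerTempTable(name: String, parsedPlan: LogicalPlanLike): Unit =
    // Store the *analyzed* plan instead of the unanalyzed one.
    tables(name.toLowerCase) = analyze(parsedPlan)

  def lookupRelation(name: String): Option[LogicalPlanLike] =
    tables.get(name.toLowerCase)
}
{code}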



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues

[jira] [Reopened] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reopened SPARK-3414:
-
  Assignee: Michael Armbrust  (was: Cheng Lian)

> Case insensitivity breaks when unresolved relation contains attributes with 
> uppercase letters in their names
> 
>
> Key: SPARK-3414
> URL: https://issues.apache.org/jira/browse/SPARK-3414
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.2
>Reporter: Cheng Lian
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.2.0
>
>
> Paste the following snippet into {{spark-shell}} (Hive support required) to 
> reproduce this issue:
> {code}
> import org.apache.spark.sql.hive.HiveContext
> val hiveContext = new HiveContext(sc)
> import hiveContext._
> case class LogEntry(filename: String, message: String)
> case class LogFile(name: String)
> sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
> sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")
> val srdd = sql(
>   """
> SELECT name, message
> FROM rawLogs
> JOIN (
>   SELECT name
>   FROM logFiles
> ) files
> ON rawLogs.filename = files.name
>   """)
> srdd.registerTempTable("boom")
> sql("select * from boom")
> {code}
> Exception thrown:
> {code}
> SchemaRDD[7] at RDD at SchemaRDD.scala:103
> == Query Plan ==
> == Physical Plan ==
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
> attributes: *, tree:
> Project [*]
>  LowerCaseSchema
>   Subquery boom
>Project ['name,'message]
> Join Inner, Some(('rawLogs.filename = name#2))
>  LowerCaseSchema
>   Subquery rawlogs
>SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
> MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
>  Subquery files
>   Project [name#2]
>LowerCaseSchema
> Subquery logfiles
>  SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at 
> mapPartitions at basicOperators.scala:208)
> {code}
> Notice that {{rawLogs}} in the join operator is not lowercased.
> The reason is that, during the analysis phase, the 
> {{CaseInsensitiveAttributeReferences}} batch is only executed before the 
> {{Resolution}} batch. And when {{srdd}} is registered as the temporary table 
> {{boom}}, its original (unanalyzed) logical plan is stored in the catalog:
> {code}
> Join Inner, Some(('rawLogs.filename = 'files.name))
>  UnresolvedRelation None, rawLogs, None
>  Subquery files
>   Project ['name]
>UnresolvedRelation None, logFiles, None
> {code}
> Notice that the attributes referenced in the join operator (esp. {{rawLogs}}) 
> are not lowercased yet.
> And then, when {{select * from boom}} is being analyzed, its input logical 
> plan is:
> {code}
> Project [*]
>  UnresolvedRelation None, boom, None
> {code}
> Here the unresolved relation points to the unanalyzed logical plan of 
> {{srdd}} above, which is only discovered later by the {{ResolveRelations}} 
> rule and is thus not touched by {{CaseInsensitiveAttributeReferences}} at 
> all, so {{rawLogs.filename}} is not lowercased:
> {code}
> === Applying Rule 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
>  Project [*]Project [*]
> ! UnresolvedRelation None, boom, NoneLowerCaseSchema
> ! Subquery boom
> !  Project ['name,'message]
> !   Join Inner, 
> Some(('rawLogs.filename = 'files.name))
> !LowerCaseSchema
> ! Subquery rawlogs
> !  SparkLogicalPlan (ExistingRdd 
> [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at 
> basicOperators.scala:208)
> !Subquery files
> ! Project ['name]
> !  LowerCaseSchema
> !   Subquery logfiles
> !SparkLogicalPlan 
> (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at 
> basicOperators.scala:208)
> {code}
> A reasonable fix for this could be to always register the analyzed logical 
> plan in the catalog when registering temporary tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2594) Add CACHE TABLE AS SELECT ...

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132876#comment-14132876
 ] 

Apache Spark commented on SPARK-2594:
-

User 'ravipesala' has created a pull request for this issue:
https://github.com/apache/spark/pull/2381

> Add CACHE TABLE  AS SELECT ...
> 
>
> Key: SPARK-2594
> URL: https://issues.apache.org/jira/browse/SPARK-2594
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2562) Add Date datatype support to Spark SQL

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2562.
-
Resolution: Duplicate

> Add Date datatype support to Spark SQL
> --
>
> Key: SPARK-2562
> URL: https://issues.apache.org/jira/browse/SPARK-2562
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.1
>Reporter: Zongheng Yang
>Priority: Minor
>
> Spark SQL currently supports Timestamp, but not Date. Hive introduced support 
> for Date in [HIVE-4055|https://issues.apache.org/jira/browse/HIVE-4055], 
> where the underlying representation is {{java.sql.Date}}.
> (Thanks to user Rindra Ramamonjison for reporting this.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3407) Add Date type support

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3407:

Assignee: Adrian Wang

> Add Date type support
> -
>
> Key: SPARK-3407
> URL: https://issues.apache.org/jira/browse/SPARK-3407
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Adrian Wang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3407) Add Date type support

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3407:

Target Version/s: 1.2.0

> Add Date type support
> -
>
> Key: SPARK-3407
> URL: https://issues.apache.org/jira/browse/SPARK-3407
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3519) PySpark RDDs are missing the distinct(n) method

2014-09-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132833#comment-14132833
 ] 

Nicholas Chammas commented on SPARK-3519:
-

[~joshrosen] & [~davies]: Here is a ticket for the missing {{distinct(N)}} 
method. I marked it as a bug since the programming guide says it should exist.

> PySpark RDDs are missing the distinct(n) method
> ---
>
> Key: SPARK-3519
> URL: https://issues.apache.org/jira/browse/SPARK-3519
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Nicholas Chammas
>
> {{distinct()}} works but {{distinct(N)}} doesn't.
> {code}
> >>> sc.parallelize([1,1,2]).distinct()
> PythonRDD[15] at RDD at PythonRDD.scala:43
> >>> sc.parallelize([1,1,2]).distinct(2)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: distinct() takes exactly 1 argument (2 given)
> {code}
> The PySpark docs only call out [the {{distinct()}} 
> signature|http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#distinct],
>  but the programming guide [includes the {{distinct(N)}} 
> signature|http://spark.apache.org/docs/1.1.0/programming-guide.html#transformations]
>  as well.
> {quote}
> {noformat}
> distinct([numTasks])) Return a new dataset that contains the distinct 
> elements of the source dataset.
> {noformat}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3519) PySpark RDDs are missing the distinct(n) method

2014-09-13 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-3519:
---

 Summary: PySpark RDDs are missing the distinct(n) method
 Key: SPARK-3519
 URL: https://issues.apache.org/jira/browse/SPARK-3519
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas


{{distinct()}} works but {{distinct(N)}} doesn't.

{code}
>>> sc.parallelize([1,1,2]).distinct()
PythonRDD[15] at RDD at PythonRDD.scala:43
>>> sc.parallelize([1,1,2]).distinct(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: distinct() takes exactly 1 argument (2 given)
{code}

The PySpark docs only call out [the {{distinct()}} 
signature|http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#distinct],
 but the programming guide [includes the {{distinct(N)}} 
signature|http://spark.apache.org/docs/1.1.0/programming-guide.html#transformations]
 as well.

{quote}
{noformat}
distinct([numTasks]))   Return a new dataset that contains the distinct 
elements of the source dataset.
{noformat}
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-09-13 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132797#comment-14132797
 ] 

Helena Edelson commented on SPARK-2593:
---

Here is a good example of just one of the issues: it is difficult to locate a 
remote Spark actor to publish data to the stream. Here I have to have the 
streaming actor get created and, in its preStart, publish a custom message with 
`self` which my actors in my ActorSystem can receive in order to get the 
ActorRef to send to. This is incredibly clunky.

I will try to carve out some time to do this PR this week.
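
For concreteness, a sketch of the handshake described above in plain Akka (the message type, actor names, and coordinator path are all illustrative, not an existing Spark or user API):

{code}
import akka.actor.{Actor, ActorRef, ActorSelection}

// Hypothetical message used to hand the streaming actor's ActorRef over to
// the application's own actor system.
case class StreamReceiverReady(ref: ActorRef)

// The receiver actor announces itself in preStart so that application
// actors can learn its ActorRef and start publishing data to the stream.
class StreamReceiverActor(coordinatorPath: String) extends Actor {
  private val coordinator: ActorSelection = context.actorSelection(coordinatorPath)

  override def preStart(): Unit = {
    coordinator ! StreamReceiverReady(self)
  }

  def receive: Receive = {
    case data => // feed `data` into the stream (store(data) in a real receiver)
  }
}
{code}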
 

> Add ability to pass an existing Akka ActorSystem into Spark
> ---
>
> Key: SPARK-2593
> URL: https://issues.apache.org/jira/browse/SPARK-2593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Helena Edelson
>
> As a developer I want to pass an existing ActorSystem into StreamingContext 
> at load time so that I do not have 2 actor systems running on a node in an 
> Akka application.
> This would mean having Spark's actor system on its own named dispatchers, as 
> well as exposing the now-private creation of its own actor system.
>   
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-09-13 Thread Helena Edelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Helena Edelson updated SPARK-2593:
--
Description: 
As a developer I want to pass an existing ActorSystem into StreamingContext at 
load time so that I do not have 2 actor systems running on a node in an Akka 
application.

This would mean having Spark's actor system on its own named dispatchers, as 
well as exposing the now-private creation of its own actor system.
  
 

  was:
As a developer I want to pass an existing ActorSystem into StreamingContext in 
load-time so that I do not have 2 actor systems running on a node in an Akka 
application.

This would mean having spark's actor system on its own named-dispatchers as 
well as exposing the new private creation of its own actor system.
 
I would like to create an Akka Extension that wraps around Spark/Spark 
Streaming and Cassandra. So the programmatic creation would simply be this for 
a user

val extension = SparkCassandra(system)
 


> Add ability to pass an existing Akka ActorSystem into Spark
> ---
>
> Key: SPARK-2593
> URL: https://issues.apache.org/jira/browse/SPARK-2593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Helena Edelson
>
> As a developer I want to pass an existing ActorSystem into StreamingContext 
> at load time so that I do not have 2 actor systems running on a node in an 
> Akka application.
> This would mean having Spark's actor system on its own named dispatchers, as 
> well as exposing the now-private creation of its own actor system.
>   
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3518) Remove useless statement in JsonProtocol

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132608#comment-14132608
 ] 

Apache Spark commented on SPARK-3518:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2380

> Remove useless statement in JsonProtocol
> 
>
> Key: SPARK-3518
> URL: https://issues.apache.org/jira/browse/SPARK-3518
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> In org.apache.spark.util.JsonProtocol#taskInfoToJson, a variable named 
> "accumUpdateMap" is created as follows.
> {code}
> val accumUpdateMap = taskInfo.accumulables
> {code}
> But accumUpdateMap is never used, and there is a second invocation of 
> "taskInfo.accumulables", as follows.
> {code}
> ("Accumulables" -> 
> JArray(taskInfo.accumulables.map(accumulableInfoToJson).toList))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3518) Remove useless statement in JsonProtocol

2014-09-13 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3518:
-

 Summary: Remove useless statement in JsonProtocol
 Key: SPARK-3518
 URL: https://issues.apache.org/jira/browse/SPARK-3518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Priority: Minor


In org.apache.spark.util.JsonProtocol#taskInfoToJson, a variable named 
"accumUpdateMap" is created as follows.

{code}
val accumUpdateMap = taskInfo.accumulables
{code}

But accumUpdateMap is never used, and there is a second invocation of 
"taskInfo.accumulables", as follows.

{code}
("Accumulables" -> 
JArray(taskInfo.accumulables.map(accumulableInfoToJson).toList))
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-09-13 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132598#comment-14132598
 ] 

Saisai Shao edited comment on SPARK-2926 at 9/13/14 8:09 AM:
-

Ok, I will take a try and let you know when it is ready. Thanks a lot.


was (Author: jerryshao):
Ok, I will take a try and let you know then it is ready. Thanks a lot.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
> Report(contd).pdf, Spark Shuffle Test Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improves the IO performance and reduces the memory consumption when 
> the number of reducers is very large. But the reducer side still adopts the 
> hash-based shuffle reader implementation, which neglects the ordering 
> attributes of map output data in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted later 
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.
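
The core of a sort-merge read, independent of Spark's shuffle machinery, is a k-way merge of already-sorted map outputs. A small sketch (names are illustrative, and the real reader would also handle aggregation and spilling):

{code}
import scala.collection.mutable

object KWayMerge {
  // Merge k iterators that are each sorted by key into a single key-ordered
  // iterator, using a priority queue keyed by each iterator's current head.
  def merge[K, V](parts: Seq[Iterator[(K, V)]])(implicit ord: Ordering[K]): Iterator[(K, V)] = {
    val nonEmpty = parts.map(_.buffered).filter(_.hasNext)
    val byHeadKey = Ordering.by[BufferedIterator[(K, V)], K](_.head._1)
    // Scala's PriorityQueue dequeues the maximum, so reverse for a min-heap.
    val heap = mutable.PriorityQueue(nonEmpty: _*)(byHeadKey.reverse)

    new Iterator[(K, V)] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): (K, V) = {
        val it = heap.dequeue()
        val kv = it.next()
        if (it.hasNext) heap.enqueue(it)  // re-insert with its new head key
        kv
      }
    }
  }
}

// e.g. KWayMerge.merge(Seq(Iterator(1 -> "a", 3 -> "c"), Iterator(2 -> "b")))
//      yields (1,"a"), (2,"b"), (3,"c")
{code}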



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-09-13 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132598#comment-14132598
 ] 

Saisai Shao commented on SPARK-2926:


Ok, I will take a try and let you know then it is ready. Thanks a lot.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
> Report(contd).pdf, Spark Shuffle Test Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improves the IO performance and reduces the memory consumption when 
> the number of reducers is very large. But the reducer side still adopts the 
> hash-based shuffle reader implementation, which neglects the ordering 
> attributes of map output data in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted later 
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2098) All Spark processes should support spark-defaults.conf, config file

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132593#comment-14132593
 ] 

Apache Spark commented on SPARK-2098:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/2379

> All Spark processes should support spark-defaults.conf, config file
> ---
>
> Key: SPARK-2098
> URL: https://issues.apache.org/jira/browse/SPARK-2098
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Marcelo Vanzin
>Assignee: Guoqiang Li
>
> SparkSubmit supports the idea of a config file to set SparkConf 
> configurations. This is handy because you can easily set a site-wide 
> configuration file, and power users can use their own when needed, or resort 
> to JVM properties or other means of overriding configs.
> It would be nice if all Spark processes (e.g. master / worker / history 
> server) also supported something like this. For daemon processes this is 
> particularly interesting because it makes it easy to decouple starting the 
> daemon (e.g. some /etc/init.d script packaged by some distribution) from 
> configuring that daemon. Right now you have to set environment variables to 
> modify the configuration of those daemons, which is not very friendly to the 
> above scenario.
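
A hedged sketch of what this could look like in a daemon's startup path (the object name, file handling, and the spark.* filter are assumptions, not the existing SparkSubmit code):

{code}
import java.io.{File, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.Properties
import scala.collection.JavaConverters._

object DefaultsLoader {
  // Read spark-defaults.conf style key/value pairs so a daemon (master,
  // worker, history server) can apply them at startup instead of relying
  // solely on environment variables.
  def loadSparkDefaults(path: String): Map[String, String] = {
    val file = new File(path)
    if (!file.isFile) return Map.empty
    val props = new Properties()
    val reader = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)
    try props.load(reader) finally reader.close()
    props.stringPropertyNames().asScala
      .filter(_.startsWith("spark."))
      .map(key => key -> props.getProperty(key).trim)
      .toMap
  }
}
{code}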



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3491) Use pickle to serialize the data in MLlib Python

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132590#comment-14132590
 ] 

Apache Spark commented on SPARK-3491:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2378

> Use pickle to serialize the data in MLlib Python
> 
>
> Key: SPARK-3491
> URL: https://issues.apache.org/jira/browse/SPARK-3491
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Currently, we write the serialization/deserialization code in Python and 
> Scala manually; this cannot scale to the large number of MLlib APIs.
> If the serialization could be done with pickle (using Pyrolite in the JVM) in 
> an extensible way, it would be much easier to add Python APIs for MLlib.
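
For reference, a minimal sketch of the JVM side of that idea using Pyrolite's {{net.razorvine.pickle}} classes (the MLlib-specific picklers and registered constructors are omitted): pickle bytes can be produced and consumed directly in Scala, so the same format works on both sides.

{code}
import net.razorvine.pickle.{Pickler, Unpickler}

object PickleRoundTrip {
  def main(args: Array[String]): Unit = {
    val pickler = new Pickler()
    val unpickler = new Unpickler()

    // Serialize a JVM object graph into pickle bytes that Python can load...
    val bytes: Array[Byte] = pickler.dumps(java.util.Arrays.asList(1.0, 2.0, 3.0))

    // ...and turn pickle bytes (e.g. received from a PySpark worker) back
    // into JVM objects. MLlib types such as vectors and labeled points would
    // need custom constructors registered with the Unpickler.
    val restored = unpickler.loads(bytes)
    println(restored)
  }
}
{code}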



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org