[jira] [Assigned] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5277:
---

Assignee: Apache Spark

> SparkSqlSerializer does not register user specified KryoRegistrators 
> -
>
> Key: SPARK-5277
> URL: https://issues.apache.org/jira/browse/SPARK-5277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Max Seiden
>Assignee: Apache Spark
>
> Although the SparkSqlSerializer class extends the KryoSerializer in core, 
> its overridden newKryo() does not call super.newKryo(). This results in 
> inconsistent serializer behavior depending on whether a KryoSerializer 
> instance or a SparkSqlSerializer instance is used. This may also be related 
> to the TODO in KryoResourcePool, which uses KryoSerializer instead of 
> SparkSqlSerializer due to yet-to-be-investigated test failures.
> An example of the divergence in behavior: the Exchange operator creates a new 
> SparkSqlSerializer instance (with an empty conf; a separate issue) when it is 
> constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the 
> resource pool (see above). The result is that the serialized in-memory 
> columns are created using the user-provided serializers / registrators, while 
> serialization during the exchange is not.
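
As a rough illustration of the direction the report points to (not the patch in the linked pull request), a SQL-side serializer could delegate to KryoSerializer.newKryo() so that user registrators are applied; the SqlKryoSerializer name below is made up for this sketch:

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Hypothetical sketch: delegate to super.newKryo() so user-specified
// KryoRegistrators (spark.kryo.registrator) are registered, then layer any
// SQL-specific registrations on top of that Kryo instance.
class SqlKryoSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    val kryo = super.newKryo()          // applies user registrators from the conf
    kryo.setRegistrationRequired(false) // stay permissive for SQL row/column classes
    // SQL-specific classes would be registered here, after the user's own.
    kryo
  }
}
{code}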






[jira] [Commented] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385172#comment-14385172
 ] 

Apache Spark commented on SPARK-5277:
-

User 'mhseiden' has created a pull request for this issue:
https://github.com/apache/spark/pull/5237

> SparkSqlSerializer does not register user specified KryoRegistrators 
> -
>
> Key: SPARK-5277
> URL: https://issues.apache.org/jira/browse/SPARK-5277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Max Seiden
>
> Although the SparkSqlSerializer class extends the KryoSerializer in core, 
> its overridden newKryo() does not call super.newKryo(). This results in 
> inconsistent serializer behavior depending on whether a KryoSerializer 
> instance or a SparkSqlSerializer instance is used. This may also be related 
> to the TODO in KryoResourcePool, which uses KryoSerializer instead of 
> SparkSqlSerializer due to yet-to-be-investigated test failures.
> An example of the divergence in behavior: the Exchange operator creates a new 
> SparkSqlSerializer instance (with an empty conf; a separate issue) when it is 
> constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the 
> resource pool (see above). The result is that the serialized in-memory 
> columns are created using the user-provided serializers / registrators, while 
> serialization during the exchange is not.






[jira] [Assigned] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5277:
---

Assignee: (was: Apache Spark)

> SparkSqlSerializer does not register user specified KryoRegistrators 
> -
>
> Key: SPARK-5277
> URL: https://issues.apache.org/jira/browse/SPARK-5277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Max Seiden
>
> Although the SparkSqlSerializer class extends the KryoSerializer in core, 
> its overridden newKryo() does not call super.newKryo(). This results in 
> inconsistent serializer behavior depending on whether a KryoSerializer 
> instance or a SparkSqlSerializer instance is used. This may also be related 
> to the TODO in KryoResourcePool, which uses KryoSerializer instead of 
> SparkSqlSerializer due to yet-to-be-investigated test failures.
> An example of the divergence in behavior: the Exchange operator creates a new 
> SparkSqlSerializer instance (with an empty conf; a separate issue) when it is 
> constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the 
> resource pool (see above). The result is that the serialized in-memory 
> columns are created using the user-provided serializers / registrators, while 
> serialization during the exchange is not.






[jira] [Created] (SPARK-6582) Support ssl for this AvroSink in Spark Streaming External

2015-03-28 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-6582:
---

 Summary: Support ssl for this AvroSink in Spark Streaming External
 Key: SPARK-6582
 URL: https://issues.apache.org/jira/browse/SPARK-6582
 Project: Spark
  Issue Type: Improvement
Reporter: SaintBacchus
 Fix For: 1.4.0


AvroSink already supports *ssl*, so it would be better to support *ssl* in the 
Spark Streaming Flume external module as well.






[jira] [Assigned] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6464:
---

Assignee: Apache Spark

> Add a new transformation of rdd named processCoalesce which was  particularly 
> to deal with the small and cached rdd
> ---
>
> Key: SPARK-6464
> URL: https://issues.apache.org/jira/browse/SPARK-6464
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: SaintBacchus
>Assignee: Apache Spark
> Attachments: screenshot-1.png
>
>
> The transformation *coalesce* is commonly used to expand or reduce the number 
> of partitions in order to get good performance.
> But *coalesce* can't guarantee that a child partition will be executed on the 
> same executor as its parent partitions, which can lead to a large amount of 
> network transfer.
> In some scenarios, such as the +small and cached rdd+ mentioned in the title, 
> we want to coalesce all the partitions on the same executor into one partition 
> and make sure that child partition is executed on that executor. This avoids 
> network transfer, reduces task scheduling overhead, and frees CPU cores for 
> other jobs.
> In this scenario, our performance improved by 20% compared to before.
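
For context, a hypothetical usage sketch; the proposed processCoalesce API does not exist and is shown commented out next to the existing coalesce call:

{code}
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-sketch").setMaster("local[4]"))

    // The scenario above: a small RDD that is fully cached on the executors.
    val cached = sc.parallelize(1 to 1000, 100).cache()
    cached.count() // materialize the cache

    // Existing API: reduces the partition count but gives no guarantee that a
    // merged partition runs on the executor that already caches its parents.
    val merged = cached.coalesce(4)
    println(merged.partitions.length)

    // Proposed API (not implemented): merge only the partitions cached on the
    // same executor and keep that executor as the preferred location.
    // val perExecutor = cached.processCoalesce()

    sc.stop()
  }
}
{code}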






[jira] [Assigned] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6464:
---

Assignee: (was: Apache Spark)

> Add a new transformation of rdd named processCoalesce which was  particularly 
> to deal with the small and cached rdd
> ---
>
> Key: SPARK-6464
> URL: https://issues.apache.org/jira/browse/SPARK-6464
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: SaintBacchus
> Attachments: screenshot-1.png
>
>
> The transformation *coalesce* is commonly used to expand or reduce the number 
> of partitions in order to get good performance.
> But *coalesce* can't guarantee that a child partition will be executed on the 
> same executor as its parent partitions, which can lead to a large amount of 
> network transfer.
> In some scenarios, such as the +small and cached rdd+ mentioned in the title, 
> we want to coalesce all the partitions on the same executor into one partition 
> and make sure that child partition is executed on that executor. This avoids 
> network transfer, reduces task scheduling overhead, and frees CPU cores for 
> other jobs.
> In this scenario, our performance improved by 20% compared to before.






[jira] [Commented] (SPARK-6577) SparseMatrix should be supported in PySpark

2015-03-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385183#comment-14385183
 ] 

Joseph K. Bradley commented on SPARK-6577:
--

Good point.  [~mengxr], do we want to require scipy and add UDTs which handle 
numpy and scipy dense and sparse vectors and matrices?  Or do we want to add 
our own SparseMatrix?

> SparseMatrix should be supported in PySpark
> ---
>
> Key: SPARK-6577
> URL: https://issues.apache.org/jira/browse/SPARK-6577
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Manoj Kumar
>







[jira] [Created] (SPARK-6583) Support aggregated function in order by

2015-03-28 Thread Yadong Qi (JIRA)
Yadong Qi created SPARK-6583:


 Summary: Support aggregated function in order by
 Key: SPARK-6583
 URL: https://issues.apache.org/jira/browse/SPARK-6583
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yadong Qi
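
The issue was filed without a description; the sketch below shows the kind of query it presumably refers to, i.e. ordering by an aggregate expression rather than by an output column. The sales table and its data are invented for illustration.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object OrderByAggregateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("orderby-agg").setMaster("local[2]"))
    val sqlCtx = new SQLContext(sc)
    import sqlCtx.implicits._

    sc.parallelize(Seq(("A", 10), ("B", 3), ("A", 5))).toDF("name", "amount")
      .registerTempTable("sales")

    // The kind of query this issue presumably wants Spark SQL to accept:
    // an aggregate expression referenced directly in ORDER BY.
    sqlCtx.sql(
      "SELECT name, SUM(amount) AS total FROM sales GROUP BY name ORDER BY SUM(amount) DESC"
    ).show()

    sc.stop()
  }
}
{code}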









[jira] [Assigned] (SPARK-5338) Support cluster mode with Mesos

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5338:
---

Assignee: (was: Apache Spark)

> Support cluster mode with Mesos
> ---
>
> Key: SPARK-5338
> URL: https://issues.apache.org/jira/browse/SPARK-5338
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Currently using Spark with Mesos, the only supported deployment is client 
> mode.
> It is also useful to have a cluster mode deployment that can be shared and 
> long running.






[jira] [Assigned] (SPARK-5338) Support cluster mode with Mesos

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5338:
---

Assignee: Apache Spark

> Support cluster mode with Mesos
> ---
>
> Key: SPARK-5338
> URL: https://issues.apache.org/jira/browse/SPARK-5338
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Assignee: Apache Spark
>
> Currently using Spark with Mesos, the only supported deployment is client 
> mode.
> It is also useful to have a cluster mode deployment that can be shared and 
> long running.






[jira] [Updated] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd

2015-03-28 Thread SaintBacchus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SaintBacchus updated SPARK-6464:

Affects Version/s: (was: 1.3.0)
   1.4.0

> Add a new transformation of rdd named processCoalesce which was  particularly 
> to deal with the small and cached rdd
> ---
>
> Key: SPARK-6464
> URL: https://issues.apache.org/jira/browse/SPARK-6464
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: SaintBacchus
> Attachments: screenshot-1.png
>
>
> The transformation *coalesce* is commonly used to expand or reduce the number 
> of partitions in order to get good performance.
> But *coalesce* can't guarantee that a child partition will be executed on the 
> same executor as its parent partitions, which can lead to a large amount of 
> network transfer.
> In some scenarios, such as the +small and cached rdd+ mentioned in the title, 
> we want to coalesce all the partitions on the same executor into one partition 
> and make sure that child partition is executed on that executor. This avoids 
> network transfer, reduces task scheduling overhead, and frees CPU cores for 
> other jobs.
> In this scenario, our performance improved by 20% compared to before.






[jira] [Created] (SPARK-6584) Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location.

2015-03-28 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-6584:
---

 Summary: Provide ExecutorPrefixTaskLocation to support the rdd 
which can be aware of partition's executor  location.
 Key: SPARK-6584
 URL: https://issues.apache.org/jira/browse/SPARK-6584
 Project: Spark
  Issue Type: Sub-task
Affects Versions: 1.4.0
Reporter: SaintBacchus


The function *RDD.getPreferredLocations* can only express host-level preferred 
locations.
If an *RDD* wants its partitions scheduled onto a specific executor (such as a 
BlockRDD), Spark currently can do nothing about it.
So I want to provide *ExecutorPrefixTaskLocation* to support RDDs that are 
aware of each partition's executor location. This mechanism can avoid data 
transfer when there are many executors on the same host.
I think it is especially useful for *Spark Streaming*, since the *Receiver* 
saves data into the *BlockManager* and the data then becomes a BlockRDD.
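
A hypothetical sketch of the proposed mechanism; neither the ExecutorAwareRDD class nor the executor_<host>_<executorId> location string below exists in Spark, they only illustrate the idea:

{code}
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class IdPartition(val index: Int) extends Partition

// Hypothetical: an RDD whose preferred locations name an executor, not just a
// host, which is the idea behind the proposed ExecutorPrefixTaskLocation.
class ExecutorAwareRDD(sc: SparkContext, locations: Map[Int, (String, String)])
  extends RDD[Int](sc, Nil) {

  override def getPartitions: Array[Partition] =
    locations.keys.toArray.sorted.map(i => new IdPartition(i): Partition)

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)

  // Today getPreferredLocations can only carry host names; the proposal is to
  // also carry the executor id so the scheduler can pin the task to the
  // executor that already holds the block (e.g. a receiver's BlockManager).
  override def getPreferredLocations(split: Partition): Seq[String] = {
    val (host, executorId) = locations(split.index)
    Seq(s"executor_${host}_$executorId") // assumed location format, for illustration
  }
}
{code}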






[jira] [Created] (SPARK-6585) FileServerSuite.test ("HttpFileServer should not work with SSL when the server is untrusted") failed is some evn.

2015-03-28 Thread June (JIRA)
June created SPARK-6585:
---

 Summary: FileServerSuite.test ("HttpFileServer should not work 
with SSL when the server is untrusted") failed is some evn.
 Key: SPARK-6585
 URL: https://issues.apache.org/jira/browse/SPARK-6585
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.3.0
Reporter: June
Priority: Minor


On my test machine, the FileServerSuite test ("HttpFileServer should not work 
with SSL when the server is untrusted") throws SSLException rather than 
SSLHandshakeException. I suggest changing the test to catch SSLException to add 
robustness to the test case.

[info] - HttpFileServer should not work with SSL when the server is untrusted 
*** FAILED *** (69 milliseconds)
[info]   Expected exception javax.net.ssl.SSLHandshakeException to be thrown, 
but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
[info]   at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
[info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
[info]   at 
org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231)
[info]   at 
org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
[info]   at 
org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
[info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
[info]   at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info]   at 
org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34)
[info]   at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)






[jira] [Updated] (SPARK-6585) FileServerSuite.test ("HttpFileServer should not work with SSL when the server is untrusted") failed is some evn.

2015-03-28 Thread June (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

June updated SPARK-6585:

Description: 
On my test machine, the FileServerSuite test ("HttpFileServer should not work 
with SSL when the server is untrusted") throws SSLException rather than 
SSLHandshakeException. I suggest changing the test to catch SSLException to 
improve the test case's robustness.

[info] - HttpFileServer should not work with SSL when the server is untrusted 
*** FAILED *** (69 milliseconds)
[info]   Expected exception javax.net.ssl.SSLHandshakeException to be thrown, 
but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
[info]   at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
[info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
[info]   at 
org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231)
[info]   at 
org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
[info]   at 
org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
[info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
[info]   at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info]   at 
org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34)
[info]   at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)

  was:
On my test machine, the FileServerSuite test ("HttpFileServer should not work 
with SSL when the server is untrusted") throws SSLException rather than 
SSLHandshakeException. I suggest changing the test to catch SSLException to add 
robustness to the test case.

[info] - HttpFileServer should not work with SSL when the server is untrusted 
*** FAILED *** (69 milliseconds)
[info]   Expected exception javax.net.ssl.SSLHandshakeException to be thrown, 
but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
[info]   at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
[info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
[info]   at 
org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231)
[info]   at 
org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
[info]   at 
org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
[info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
[info]   at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info]   at 
org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34)
[info]   at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)


> FileServerSuite.test ("HttpFileServer should not work with SSL when the 
> server is untrusted") failed is some evn.

[jira] [Assigned] (SPARK-6585) FileServerSuite.test ("HttpFileServer should not work with SSL when the server is untrusted") failed is some evn.

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6585:
---

Assignee: (was: Apache Spark)

> FileServerSuite.test ("HttpFileServer should not work with SSL when the 
> server is untrusted") failed is some evn.
> -
>
> Key: SPARK-6585
> URL: https://issues.apache.org/jira/browse/SPARK-6585
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.3.0
>Reporter: June
>Priority: Minor
>
> On my test machine, the FileServerSuite test ("HttpFileServer should not work 
> with SSL when the server is untrusted") throws SSLException rather than 
> SSLHandshakeException. I suggest changing the test to catch SSLException to 
> improve the test case's robustness.
> [info] - HttpFileServer should not work with SSL when the server is untrusted 
> *** FAILED *** (69 milliseconds)
> [info]   Expected exception javax.net.ssl.SSLHandshakeException to be thrown, 
> but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
> [info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
> [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
> [info]   at 
> org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231)
> [info]   at 
> org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
> [info]   at 
> org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
> [info]   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
> [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info]   at 
> org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)






[jira] [Assigned] (SPARK-6585) FileServerSuite.test ("HttpFileServer should not work with SSL when the server is untrusted") failed is some evn.

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6585:
---

Assignee: Apache Spark

> FileServerSuite.test ("HttpFileServer should not work with SSL when the 
> server is untrusted") failed is some evn.
> -
>
> Key: SPARK-6585
> URL: https://issues.apache.org/jira/browse/SPARK-6585
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.3.0
>Reporter: June
>Assignee: Apache Spark
>Priority: Minor
>
> On my test machine, the FileServerSuite test ("HttpFileServer should not work 
> with SSL when the server is untrusted") throws SSLException rather than 
> SSLHandshakeException. I suggest changing the test to catch SSLException to 
> improve the test case's robustness.
> [info] - HttpFileServer should not work with SSL when the server is untrusted 
> *** FAILED *** (69 milliseconds)
> [info]   Expected exception javax.net.ssl.SSLHandshakeException to be thrown, 
> but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
> [info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
> [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
> [info]   at 
> org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231)
> [info]   at 
> org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
> [info]   at 
> org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
> [info]   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
> [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info]   at 
> org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)






[jira] [Commented] (SPARK-6585) FileServerSuite.test ("HttpFileServer should not work with SSL when the server is untrusted") failed is some evn.

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385216#comment-14385216
 ] 

Apache Spark commented on SPARK-6585:
-

User 'sisihj' has created a pull request for this issue:
https://github.com/apache/spark/pull/5239

> FileServerSuite.test ("HttpFileServer should not work with SSL when the 
> server is untrusted") failed is some evn.
> -
>
> Key: SPARK-6585
> URL: https://issues.apache.org/jira/browse/SPARK-6585
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.3.0
>Reporter: June
>Priority: Minor
>
> On my test machine, the FileServerSuite test ("HttpFileServer should not work 
> with SSL when the server is untrusted") throws SSLException rather than 
> SSLHandshakeException. I suggest changing the test to catch SSLException to 
> improve the test case's robustness.
> [info] - HttpFileServer should not work with SSL when the server is untrusted 
> *** FAILED *** (69 milliseconds)
> [info]   Expected exception javax.net.ssl.SSLHandshakeException to be thrown, 
> but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
> [info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
> [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
> [info]   at 
> org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231)
> [info]   at 
> org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
> [info]   at 
> org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
> [info]   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
> [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info]   at 
> org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)






[jira] [Commented] (SPARK-6584) Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location.

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385221#comment-14385221
 ] 

Apache Spark commented on SPARK-6584:
-

User 'SaintBacchus' has created a pull request for this issue:
https://github.com/apache/spark/pull/5240

> Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of 
> partition's executor  location.
> ---
>
> Key: SPARK-6584
> URL: https://issues.apache.org/jira/browse/SPARK-6584
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: SaintBacchus
>
> The function *RDD.getPreferredLocations* can only express host-level preferred 
> locations.
> If an *RDD* wants its partitions scheduled onto a specific executor (such as a 
> BlockRDD), Spark currently can do nothing about it.
> So I want to provide *ExecutorPrefixTaskLocation* to support RDDs that are 
> aware of each partition's executor location. This mechanism can avoid data 
> transfer when there are many executors on the same host.
> I think it is especially useful for *Spark Streaming*, since the *Receiver* 
> saves data into the *BlockManager* and the data then becomes a BlockRDD.







[jira] [Assigned] (SPARK-6584) Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location.

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6584:
---

Assignee: (was: Apache Spark)

> Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of 
> partition's executor  location.
> ---
>
> Key: SPARK-6584
> URL: https://issues.apache.org/jira/browse/SPARK-6584
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: SaintBacchus
>
> The function *RDD.getPreferredLocations* can only express host-level preferred 
> locations.
> If an *RDD* wants its partitions scheduled onto a specific executor (such as a 
> BlockRDD), Spark currently can do nothing about it.
> So I want to provide *ExecutorPrefixTaskLocation* to support RDDs that are 
> aware of each partition's executor location. This mechanism can avoid data 
> transfer when there are many executors on the same host.
> I think it is especially useful for *Spark Streaming*, since the *Receiver* 
> saves data into the *BlockManager* and the data then becomes a BlockRDD.






[jira] [Assigned] (SPARK-6584) Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location.

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6584:
---

Assignee: Apache Spark

> Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of 
> partition's executor  location.
> ---
>
> Key: SPARK-6584
> URL: https://issues.apache.org/jira/browse/SPARK-6584
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: SaintBacchus
>Assignee: Apache Spark
>
> The function *RDD.getPreferredLocations* can only express host-level preferred 
> locations.
> If an *RDD* wants its partitions scheduled onto a specific executor (such as a 
> BlockRDD), Spark currently can do nothing about it.
> So I want to provide *ExecutorPrefixTaskLocation* to support RDDs that are 
> aware of each partition's executor location. This mechanism can avoid data 
> transfer when there are many executors on the same host.
> I think it is especially useful for *Spark Streaming*, since the *Receiver* 
> saves data into the *BlockManager* and the data then becomes a BlockRDD.






[jira] [Commented] (SPARK-6577) SparseMatrix should be supported in PySpark

2015-03-28 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385223#comment-14385223
 ] 

Manoj Kumar commented on SPARK-6577:


Ah, I just noticed that SciPy is an optional dependency. In any case, I believe 
that having an "if _have_scipy" / "else" clause would lead to more lines of 
code to maintain. We could either make SciPy a hard dependency, which would 
mean SparseMatrix would be a wrapper around scipy CSR routines, or we could 
just implement our own methods.
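
For reference, MLlib's Scala API already has a CSC-format SparseMatrix reachable through Matrices.sparse, which is roughly the layout a PySpark counterpart (scipy-backed or hand-rolled) would need to mirror; a minimal sketch:

{code}
import org.apache.spark.mllib.linalg.Matrices

object SparseMatrixSketch {
  def main(args: Array[String]): Unit = {
    // 3x3 matrix with one non-zero per column, stored in CSC form: a PySpark
    // SparseMatrix would need to carry the same three arrays.
    val m = Matrices.sparse(
      3, 3,                 // numRows, numCols
      Array(0, 1, 2, 3),    // column pointers, length numCols + 1
      Array(0, 1, 2),       // row indices of the non-zero entries
      Array(1.0, 2.0, 3.0)) // non-zero values
    println(m)
  }
}
{code}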



> SparseMatrix should be supported in PySpark
> ---
>
> Key: SPARK-6577
> URL: https://issues.apache.org/jira/browse/SPARK-6577
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Manoj Kumar
>







[jira] [Created] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-28 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-6586:
--

 Summary: Add the capability of retrieving original logical plan of 
DataFrame
 Key: SPARK-6586
 URL: https://issues.apache.org/jira/browse/SPARK-6586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor


In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan instead 
of the logical plan. However, by doing that we can no longer get at the logical 
plan of a {{DataFrame}}, and it can still be useful and important to retrieve 
the original logical plan in some use cases.

In this PR, we introduce the capability of retrieving the original logical plan 
of a {{DataFrame}}.

The approach is to add an {{analyzed}} flag to {{LogicalPlan}}. Once 
{{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} to 
{{true}}. In {{QueryExecution}}, we keep the original logical plan in the 
analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
recursively replace the analyzed logical plan with the original logical plan 
and retrieve it.

Besides the capability of retrieving the original logical plan, this 
modification can also avoid re-analyzing a plan that has already been analyzed.
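
A much-simplified sketch of the analyzed-flag idea, using stand-in types rather than Catalyst's real LogicalPlan and Analyzer classes:

{code}
// Stand-in types only; Catalyst's real LogicalPlan/Analyzer are more involved.
abstract class Plan {
  var analyzed: Boolean = false         // set once analysis has run
  var originalPlan: Option[Plan] = None // the pre-analysis plan, kept for retrieval
}

case class Relation(name: String) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan

object SimpleAnalyzer {
  def execute(plan: Plan): Plan = {
    if (plan.analyzed) return plan      // skip re-analysis, as described above
    val resolved = resolve(plan)
    resolved.analyzed = true
    resolved.originalPlan = Some(plan)  // remember what analysis started from
    resolved
  }

  // Placeholder for the real resolution rules.
  private def resolve(plan: Plan): Plan = plan match {
    case Project(cols, child) => Project(cols, resolve(child))
    case other                => other
  }
}
{code}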
 







[jira] [Assigned] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6586:
---

Assignee: (was: Apache Spark)

> Add the capability of retrieving original logical plan of DataFrame
> ---
>
> Key: SPARK-6586
> URL: https://issues.apache.org/jira/browse/SPARK-6586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan 
> instead of the logical plan. However, by doing that we can no longer get at 
> the logical plan of a {{DataFrame}}, and it can still be useful and important 
> to retrieve the original logical plan in some use cases.
> In this PR, we introduce the capability of retrieving the original logical 
> plan of a {{DataFrame}}.
> The approach is to add an {{analyzed}} flag to {{LogicalPlan}}. Once 
> {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} to 
> {{true}}. In {{QueryExecution}}, we keep the original logical plan in the 
> analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
> recursively replace the analyzed logical plan with the original logical plan 
> and retrieve it.
> Besides the capability of retrieving the original logical plan, this 
> modification can also avoid re-analyzing a plan that has already been 
> analyzed.






[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385261#comment-14385261
 ] 

Apache Spark commented on SPARK-6586:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5241

> Add the capability of retrieving original logical plan of DataFrame
> ---
>
> Key: SPARK-6586
> URL: https://issues.apache.org/jira/browse/SPARK-6586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan 
> instead of the logical plan. However, by doing that we can no longer get at 
> the logical plan of a {{DataFrame}}, and it can still be useful and important 
> to retrieve the original logical plan in some use cases.
> In this PR, we introduce the capability of retrieving the original logical 
> plan of a {{DataFrame}}.
> The approach is to add an {{analyzed}} flag to {{LogicalPlan}}. Once 
> {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} to 
> {{true}}. In {{QueryExecution}}, we keep the original logical plan in the 
> analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
> recursively replace the analyzed logical plan with the original logical plan 
> and retrieve it.
> Besides the capability of retrieving the original logical plan, this 
> modification can also avoid re-analyzing a plan that has already been 
> analyzed.






[jira] [Assigned] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6586:
---

Assignee: Apache Spark

> Add the capability of retrieving original logical plan of DataFrame
> ---
>
> Key: SPARK-6586
> URL: https://issues.apache.org/jira/browse/SPARK-6586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan 
> instead of the logical plan. However, by doing that we can no longer get at 
> the logical plan of a {{DataFrame}}, and it can still be useful and important 
> to retrieve the original logical plan in some use cases.
> In this PR, we introduce the capability of retrieving the original logical 
> plan of a {{DataFrame}}.
> The approach is to add an {{analyzed}} flag to {{LogicalPlan}}. Once 
> {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} to 
> {{true}}. In {{QueryExecution}}, we keep the original logical plan in the 
> analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
> recursively replace the analyzed logical plan with the original logical plan 
> and retrieve it.
> Besides the capability of retrieving the original logical plan, this 
> modification can also avoid re-analyzing a plan that has already been 
> analyzed.






[jira] [Resolved] (SPARK-6552) expose start-slave.sh to user and update outdated doc

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6552.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5205
[https://github.com/apache/spark/pull/5205]

> expose start-slave.sh to user and update outdated doc
> -
>
> Key: SPARK-6552
> URL: https://issues.apache.org/jira/browse/SPARK-6552
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Documentation
>Reporter: Tao Wang
>Priority: Minor
> Fix For: 1.4.0
>
>
> It would be better to expose start-slave.sh to users to allow starting a 
> worker on a single node.
> Since the documentation describes starting a worker in the foreground, I also 
> changed it to the background way (using start-slave.sh).






[jira] [Updated] (SPARK-6552) expose start-slave.sh to user and update outdated doc

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6552:
-
Assignee: Tao Wang

> expose start-slave.sh to user and update outdated doc
> -
>
> Key: SPARK-6552
> URL: https://issues.apache.org/jira/browse/SPARK-6552
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Documentation
>Reporter: Tao Wang
>Assignee: Tao Wang
>Priority: Minor
> Fix For: 1.4.0
>
>
> It would be better to expose start-slave.sh to users to allow starting a 
> worker on a single node.
> Since the documentation describes starting a worker in the foreground, I also 
> changed it to the background way (using start-slave.sh).






[jira] [Resolved] (SPARK-4941) Yarn cluster mode does not upload all needed jars to driver node (Spark 1.2.0)

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4941.
--
Resolution: Cannot Reproduce

OK, we can reopen this if typos etc. are ruled out and it is reproducible 
against at least 1.3.0.

> Yarn cluster mode does not upload all needed jars to driver node (Spark 1.2.0)
> --
>
> Key: SPARK-4941
> URL: https://issues.apache.org/jira/browse/SPARK-4941
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Gurpreet Singh
>
> I am specifying additional jars and a config XML file with the --jars and 
> --files options, to be uploaded to the driver in the following spark-submit 
> command. However, they are not getting uploaded, and this results in job 
> failure. It was working with the Spark 1.0.2 build.
> Spark-Build being used (spark-1.2.0.tgz)
> 
> $SPARK_HOME/bin/spark-submit \
> --class com.ebay.inc.scala.testScalaXML \
> --driver-class-path 
> /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1--2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop/share/hadoop/common/lib/hadoop--0.1--2.jar:/apache/hive/lib/mysql-connector-java-5.0.8-bin.jar:/apache/hadoop/share/hadoop/common/lib/guava-11.0.2.jar
>  \
> --master yarn \
> --deploy-mode cluster \
> --num-executors 3 \
> --driver-memory 1G  \
> --executor-memory 1G \
> /export/home/b_incdata_rw/gurpreetsingh/jar/testscalaxml_2.11-1.0.jar 
> /export/home/b_incdata_rw/gurpreetsingh/sqlFramework.xml next_gen_linking \
> --queue hdmi-spark \
> --jars 
> /export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-api-jdo-3.2.1.jar,/export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-core-3.2.2.jar,/export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-rdbms-3.2.1.jar,/apache/hive/lib/mysql-connector-java-5.0.8-bin.jar,/apache/hadoop/share/hadoop/common/lib/hadoop--0.1--2.jar,/apache/hadoop/share/hadoop/common/lib/hadoop-lzo-0.6.0.jar,/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1--2.jar\
> --files 
> /export/home/b_incdata_rw/gurpreetsingh/spark-1.0.2-bin-2.4.1/conf/hive-site.xml
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 14/12/22 23:00:17 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> 14/12/22 23:00:17 INFO yarn.Client: Requesting a new application from cluster 
> with 2026 NodeManagers
> 14/12/22 23:00:17 INFO yarn.Client: Verifying our application has not 
> requested more than the maximum memory capability of the cluster (16384 MB 
> per container)
> 14/12/22 23:00:17 INFO yarn.Client: Will allocate AM container, with 1408 MB 
> memory including 384 MB overhead
> 14/12/22 23:00:17 INFO yarn.Client: Setting up container launch context for 
> our AM
> 14/12/22 23:00:17 INFO yarn.Client: Preparing resources for our AM container
> 14/12/22 23:00:18 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 14/12/22 23:00:18 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 14/12/22 23:00:21 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 
> 6623380 for b_incdata_rw on 10.115.201.75:8020
> 14/12/22 23:00:21 INFO yarn.Client: 
> Uploading resource 
> file:/home/b_incdata_rw/gurpreetsingh/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar
>  -> 
> hdfs://-nn.vip.xxx.com:8020/user/b_incdata_rw/.sparkStaging/application_1419242629195_8432/spark-assembly-1.2.0-hadoop2.4.0.jar
> 14/12/22 23:00:24 INFO yarn.Client: Uploading resource 
> file:/export/home/b_incdata_rw/gurpreetsingh/jar/firstsparkcode_2.11-1.0.jar 
> -> 
> hdfs://-nn.vip.xxx.com:8020:8020/user/b_incdata_rw/.sparkStaging/application_1419242629195_8432/firstsparkcode_2.11-1.0.jar
> 14/12/22 23:00:25 INFO yarn.Client: Setting up the launch environment for our 
> AM container






[jira] [Commented] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-03-28 Thread sdfox (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385317#comment-14385317
 ] 

sdfox commented on SPARK-6489:
--

Yes.

> Optimize lateral view with explode to not read unnecessary columns
> --
>
> Key: SPARK-6489
> URL: https://issues.apache.org/jira/browse/SPARK-6489
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Konstantin Shaposhnikov
>  Labels: starter
>
> Currently a query with "lateral view explode(...)" results in an execution 
> plan that reads all columns of the underlying RDD.
> E.g. given that the *ppl* table is a DataFrame created from the Person case class:
> {code}
> case class Person(val name: String, val age: Int, val data: Array[Int])
> {code}
> the following SQL:
> {code}
> select name, sum(d) from ppl lateral view explode(data) d as d group by name
> {code}
> executes as follows:
> {noformat}
> == Physical Plan ==
> Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L]
>  Exchange (HashPartitioning [name#0], 200)
>   Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS 
> PartialSum#38L]
>Project [name#0,d#21]
> Generate explode(data#2), true, false
>  InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation 
> [name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), 
> (PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at 
> ExistingRDD.scala:35), Some(ppl))
> {noformat}
> Note that the *age* column is not needed to produce the output, but it is 
> still read from the underlying RDD.
> A sample program to demonstrate the issue:
> {code}
> case class Person(val name: String, val age: Int, val data: Array[Int])
> object ExplodeDemo extends App {
>   val ppl = Array(
> Person("A", 20, Array(10, 12, 19)),
> Person("B", 25, Array(7, 8, 4)),
> Person("C", 19, Array(12, 4, 232)))
>   
>   val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
>   val sc = new SparkContext(conf)
>   val sqlCtx = new HiveContext(sc)
>   import sqlCtx.implicits._
>   val df = sc.makeRDD(ppl).toDF
>   df.registerTempTable("ppl")
>   sqlCtx.cacheTable("ppl") // cache table otherwise ExistingRDD will be used 
> that do not support column pruning
>   val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) 
> d as d group by name")
>   s.explain(true)
> }
> {code}






[jira] [Updated] (SPARK-6582) Support ssl for this AvroSink in Spark Streaming External

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6582:
-
Fix Version/s: (was: 1.4.0)

> Support ssl for this AvroSink in Spark Streaming External
> -
>
> Key: SPARK-6582
> URL: https://issues.apache.org/jira/browse/SPARK-6582
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: SaintBacchus
>
> AvroSink already supports *ssl*, so it would be better to support *ssl* in 
> the Spark Streaming Flume external module as well.






[jira] [Updated] (SPARK-6582) Support ssl for this AvroSink in Spark Streaming External

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6582:
-
Component/s: Streaming

(Components please)

> Support ssl for this AvroSink in Spark Streaming External
> -
>
> Key: SPARK-6582
> URL: https://issues.apache.org/jira/browse/SPARK-6582
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: SaintBacchus
>
> AvroSink already supports *ssl*, so it would be better to support *ssl* in 
> the Spark Streaming Flume external module as well.






[jira] [Updated] (SPARK-6548) Adding stddev to DataFrame functions

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6548:
-
Fix Version/s: (was: 1.4.0)

> Adding stddev to DataFrame functions
> 
>
> Key: SPARK-6548
> URL: https://issues.apache.org/jira/browse/SPARK-6548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: DataFrame, starter
>
> Add it to the list of aggregate functions:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
> Also add it to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
> We can either add a Stddev Catalyst expression, or just compute it using 
> existing functions like here: 
> https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776
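> As a rough illustration of the second option (a sketch only, not the proposed patch; {{sampleStddev}}, the DataFrame {{df}} and the numeric column name are assumptions), a sample standard deviation can already be assembled from the existing aggregate functions:
> {code}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions._
>
> // sample stddev of column x: sqrt((sum(x^2) - sum(x)^2 / n) / (n - 1))
> def sampleStddev(df: DataFrame, x: String) = {
>   val n = count(col(x)).cast("double")
>   df.agg(sqrt((sum(col(x) * col(x)) - sum(col(x)) * sum(col(x)) / n) / (n - 1))
>     .as(s"stddev_$x"))
> }
> {code}
> Usage would be e.g. {{sampleStddev(df, "value").show()}}; a dedicated Stddev Catalyst expression could be more numerically stable than this textbook formula.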



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6529) Word2Vec transformer

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6529:
-
Fix Version/s: (was: 1.4.0)

> Word2Vec transformer
> 
>
> Key: SPARK-6529
> URL: https://issues.apache.org/jira/browse/SPARK-6529
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6209:
-
Fix Version/s: (was: 1.3.1)
   (was: 1.4.0)

> ExecutorClassLoader can leak connections after failing to load classes from 
> the REPL class server
> -
>
> Key: SPARK-6209
> URL: https://issues.apache.org/jira/browse/SPARK-6209
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>
> ExecutorClassLoader does not ensure proper cleanup of network connections 
> that it opens.  If it fails to load a class, it may leak partially-consumed 
> InputStreams that are connected to the REPL's HTTP class server, causing that 
> server to exhaust its thread pool, which can cause the entire job to hang.
> Here is a simple reproduction:
> With
> {code}
> ./bin/spark-shell --master local-cluster[8,8,512] 
> {code}
> run the following command:
> {code}
> sc.parallelize(1 to 1000, 1000).map { x =>
>   try {
>   Class.forName("some.class.that.does.not.Exist")
>   } catch {
>   case e: Exception => // do nothing
>   }
>   x
> }.count()
> {code}
> This job will run 253 tasks, then will completely freeze without any errors 
> or failed tasks.
> It looks like the driver has 253 threads blocked in socketRead0() calls:
> {code}
> [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc
>  253 759   14674
> {code}
> e.g.
> {code}
> "qtp1287429402-13" daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable 
> [0x0001159bd000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
> at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
> at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at 
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:745) 
> {code}
> Jstack on the executors shows blocking in loadClass / findClass, where a 
> single thread is RUNNABLE and waiting to hear back from the driver and other 
> executor threads are BLOCKED on object monitor synchronization at 
> Class.forName0().
> Remotely triggering a GC on a hanging executor allows the job to progress and 
> complete more tasks before hanging again.  If I repeatedly trigger GC on all 
> of the executors, then the job runs to completion:
> {code}
> jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run
> {code}
> The culprit is a {{catch}} block that ignores all exceptions and performs no 
> cleanup: 
> https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94
> This bug has been present since Spark 1.0.0, but I suspect that we haven't 
> seen it before because it's pretty hard to reproduce. Triggering this error 
> requires a job with tasks that trigger ClassNotFoundExceptions yet are still 
> able to run to completion.  It also requires that executors are able to leak 
> enough open connections to exhaust the class server's Jetty thread pool 
> limit, which requires that there are a large number of tasks (253+) and 
> either a large number of executors or a very low amount of GC pressure on 
> those executors (since GC will cause the leaked connections to be closed).
> The fix here is pretty simple: add proper resource cleanup to this class.
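> To make the shape of the fix concrete, here is a minimal sketch (illustrative only, not the actual patch; the helper name and the use of commons-io are assumptions): always release the stream from the class server in a {{finally}} block so that a failed load cannot leak the connection.
> {code}
> import java.io.InputStream
> import org.apache.commons.io.IOUtils
>
> // Read the class bytes, closing the stream even if reading fails;
> // otherwise the server-side Jetty thread stays blocked on the connection.
> def readClassBytes(in: InputStream): Array[Byte] = {
>   try {
>     IOUtils.toByteArray(in)
>   } finally {
>     in.close()
>   }
> }
> {code}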



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6350) Make mesosExecutorCores configurable in mesos "fine-grained" mode

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6350:
-
Fix Version/s: (was: 1.3.1)
   (was: 1.4.0)

> Make mesosExecutorCores configurable in mesos "fine-grained" mode
> -
>
> Key: SPARK-6350
> URL: https://issues.apache.org/jira/browse/SPARK-6350
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Minor
>
> When Spark runs in Mesos fine-grained mode, the Mesos slave launches an executor 
> with a given number of cpus and amount of memory. However, the number of executor 
> cores is always CPU_PER_TASKS, the same as spark.task.cpus. If I set that value to 
> 5 for running an intensive task, the Mesos executor always consumes 5 cores even 
> when no task is running. This wastes resources. The number of executor cores 
> should be a configuration variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6530) ChiSqSelector transformer

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6530:
-
Fix Version/s: (was: 1.4.0)

> ChiSqSelector transformer
> -
>
> Key: SPARK-6530
> URL: https://issues.apache.org/jira/browse/SPARK-6530
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6391) Update Tachyon version compatibility documentation

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6391:
-
Fix Version/s: (was: 1.4.0)

> Update Tachyon version compatibility documentation
> --
>
> Key: SPARK-6391
> URL: https://issues.apache.org/jira/browse/SPARK-6391
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Calvin Jia
>
> Tachyon v0.6 has an API change in the client; it would be helpful to document 
> Tachyon-Spark compatibility across versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6528) IDF transformer

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6528:
-
Fix Version/s: (was: 1.4.0)

> IDF transformer
> ---
>
> Key: SPARK-6528
> URL: https://issues.apache.org/jira/browse/SPARK-6528
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6194) collect() in PySpark will cause memory leak in JVM

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6194:
-
Fix Version/s: (was: 1.3.1)
   (was: 1.4.0)
   (was: 1.2.2)

> collect() in PySpark will cause memory leak in JVM
> --
>
> Key: SPARK-6194
> URL: https://issues.apache.org/jira/browse/SPARK-6194
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> It could be reproduced  by:
> {code}
> for i in range(40):
> sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect()
> {code}
> It will fail after 2 or 3 jobs, but runs successfully if I add
> `gc.collect()` after each job.
> We could call _detach() on the JavaList returned by collect()
> in Java; I will send out a PR for this.
> Reported by Michael and commented by Josh:
> On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen  wrote:
> > Based on Py4J's Memory Model page
> > (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model):
> >
> >> Because Java objects on the Python side are involved in a circular
> >> reference (JavaObject and JavaMember reference each other), these objects
> >> are not immediately garbage collected once the last reference to the object
> >> is removed (but they are guaranteed to be eventually collected if the 
> >> Python
> >> garbage collector runs before the Python program exits).
> >
> >
> >>
> >> In doubt, users can always call the detach function on the Python gateway
> >> to explicitly delete a reference on the Java side. A call to gc.collect()
> >> also usually works.
> >
> >
> > Maybe we should be manually calling detach() when the Python-side has
> > finished consuming temporary objects from the JVM.  Do you have a small
> > workload / configuration that reproduces the OOM which we can use to test a
> > fix?  I don't think that I've seen this issue in the past, but this might be
> > because we mistook Java OOMs as being caused by collecting too much data
> > rather than due to memory leaks.
> >
> > On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario 
> > wrote:
> >>
> >> Hi Josh,
> >>
> >> I have a question about how PySpark does memory management in the Py4J
> >> bridge between the Java driver and the Python driver. I was wondering if
> >> there have been any memory problems in this system because the Python
> >> garbage collector does not collect circular references immediately and Py4J
> >> has circular references in each object it receives from Java.
> >>
> >> When I dug through the PySpark code, I seemed to find that most RDD
> >> actions return by calling collect. In collect, you end up calling the Java
> >> RDD collect and getting an iterator from that. Would this be a possible
> >> cause for a Java driver OutOfMemoryException because there are resources in
> >> Java which do not get freed up immediately?
> >>
> >> I have also seen that trying to take a lot of values from a dataset twice
> >> in a row can cause the Java driver to OOM (while just once works). Are 
> >> there
> >> some other memory considerations that are relevant in the driver?
> >>
> >> Thanks,
> >> Michael



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6323:
-
Fix Version/s: (was: 1.4.0)

> Large rank matrix factorization with Nonlinear loss and constraints
> ---
>
> Key: SPARK-6323
> URL: https://issues.apache.org/jira/browse/SPARK-6323
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 1.4.0
>Reporter: Debasish Das
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Currently ml.recommendation.ALS is optimized for gram matrix generation which 
> scales to modest ranks. The problems that we can solve are in the normal 
> equation/quadratic form: 0.5x'Hx + c'x + g(z)
> g(z) can be one of the constraints from Breeze proximal library:
> https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
> In this PR we will re-use the ml.recommendation.ALS design and come up with 
> ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent 
> changes, it's straightforward to do now.
> ALM will be capable of solving the following problems: min f(x) + g(z)
> 1. The loss function f(x) can be LeastSquareLoss or LoglikelihoodLoss. Most 
> likely we will re-use the Gradient interfaces already defined and implement 
> LoglikelihoodLoss.
> 2. The supported constraints g(z) are the same as above, except that we don't 
> support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't 
> need that for ML applications.
> 3. For the solver we will use breeze.optimize.proximal.NonlinearMinimizer, which 
> in turn uses a projection based solver (SPG) or proximal solvers (ADMM) depending 
> on convergence speed.
> https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala
> 4. The factors will be SparseVectors so that we keep the shuffle size in check. 
> For example, we will run with 10K ranks but force the factors to be 100-sparse.
> This is closely related to Sparse LDA 
> https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
> are not using graph representation here.
> As we do scaling experiments, we will understand which flow is more suited as 
> ratings get denser (my understanding is that since we already scaled ALS to 2 
> billion ratings and we will keep sparsity in check, the same 2 billion flow 
> will scale to 10K ranks as well)...
> This JIRA is intended to extend the capabilities of ml recommendation to 
> generalized loss functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6209:
-
Fix Version/s: 1.4.0
   1.3.1

Oops, my bulk change shouldn't have caught this one. I see why it is unresolved 
but has Fix versions

> ExecutorClassLoader can leak connections after failing to load classes from 
> the REPL class server
> -
>
> Key: SPARK-6209
> URL: https://issues.apache.org/jira/browse/SPARK-6209
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.1, 1.4.0
>
>
> ExecutorClassLoader does not ensure proper cleanup of network connections 
> that it opens.  If it fails to load a class, it may leak partially-consumed 
> InputStreams that are connected to the REPL's HTTP class server, causing that 
> server to exhaust its thread pool, which can cause the entire job to hang.
> Here is a simple reproduction:
> With
> {code}
> ./bin/spark-shell --master local-cluster[8,8,512] 
> {code}
> run the following command:
> {code}
> sc.parallelize(1 to 1000, 1000).map { x =>
>   try {
>   Class.forName("some.class.that.does.not.Exist")
>   } catch {
>   case e: Exception => // do nothing
>   }
>   x
> }.count()
> {code}
> This job will run 253 tasks, then will completely freeze without any errors 
> or failed tasks.
> It looks like the driver has 253 threads blocked in socketRead0() calls:
> {code}
> [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc
>  253 759   14674
> {code}
> e.g.
> {code}
> "qtp1287429402-13" daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable 
> [0x0001159bd000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
> at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
> at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at 
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:745) 
> {code}
> Jstack on the executors shows blocking in loadClass / findClass, where a 
> single thread is RUNNABLE and waiting to hear back from the driver and other 
> executor threads are BLOCKED on object monitor synchronization at 
> Class.forName0().
> Remotely triggering a GC on a hanging executor allows the job to progress and 
> complete more tasks before hanging again.  If I repeatedly trigger GC on all 
> of the executors, then the job runs to completion:
> {code}
> jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run
> {code}
> The culprit is a {{catch}} block that ignores all exceptions and performs no 
> cleanup: 
> https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94
> This bug has been present since Spark 1.0.0, but I suspect that we haven't 
> seen it before because it's pretty hard to reproduce. Triggering this error 
> requires a job with tasks that trigger ClassNotFoundExceptions yet are still 
> able to run to completion.  It also requires that executors are able to leak 
> enough open connections to exhaust the class server's Jetty thread pool 
> limit, which requires that there are a large number of tasks (253+) and 
> either a large number of executors or a very low amount of GC pressure on 
> those executors (since GC will cause the leaked connections to be closed).
> The fix here is pretty simple: add proper resource cleanup to this class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6194) collect() in PySpark will cause memory leak in JVM

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6194:
-
Target Version/s: 1.3.0, 1.0.3, 1.1.2, 1.2.2  (was: 1.0.3, 1.1.2, 1.2.2, 
1.3.0)
   Fix Version/s: 1.4.0
  1.3.1
  1.2.2

Same, restored Fix versions. I fixed my query now.

> collect() in PySpark will cause memory leak in JVM
> --
>
> Key: SPARK-6194
> URL: https://issues.apache.org/jira/browse/SPARK-6194
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> It could be reproduced  by:
> {code}
> for i in range(40):
> sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect()
> {code}
> It will fail after 2 or 3 jobs, but runs successfully if I add
> `gc.collect()` after each job.
> We could call _detach() on the JavaList returned by collect()
> in Java; I will send out a PR for this.
> Reported by Michael and commented by Josh:
> On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen  wrote:
> > Based on Py4J's Memory Model page
> > (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model):
> >
> >> Because Java objects on the Python side are involved in a circular
> >> reference (JavaObject and JavaMember reference each other), these objects
> >> are not immediately garbage collected once the last reference to the object
> >> is removed (but they are guaranteed to be eventually collected if the 
> >> Python
> >> garbage collector runs before the Python program exits).
> >
> >
> >>
> >> In doubt, users can always call the detach function on the Python gateway
> >> to explicitly delete a reference on the Java side. A call to gc.collect()
> >> also usually works.
> >
> >
> > Maybe we should be manually calling detach() when the Python-side has
> > finished consuming temporary objects from the JVM.  Do you have a small
> > workload / configuration that reproduces the OOM which we can use to test a
> > fix?  I don't think that I've seen this issue in the past, but this might be
> > because we mistook Java OOMs as being caused by collecting too much data
> > rather than due to memory leaks.
> >
> > On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario 
> > wrote:
> >>
> >> Hi Josh,
> >>
> >> I have a question about how PySpark does memory management in the Py4J
> >> bridge between the Java driver and the Python driver. I was wondering if
> >> there have been any memory problems in this system because the Python
> >> garbage collector does not collect circular references immediately and Py4J
> >> has circular references in each object it receives from Java.
> >>
> >> When I dug through the PySpark code, I seemed to find that most RDD
> >> actions return by calling collect. In collect, you end up calling the Java
> >> RDD collect and getting an iterator from that. Would this be a possible
> >> cause for a Java driver OutOfMemoryException because there are resources in
> >> Java which do not get freed up immediately?
> >>
> >> I have also seen that trying to take a lot of values from a dataset twice
> >> in a row can cause the Java driver to OOM (while just once works). Are 
> >> there
> >> some other memory considerations that are relevant in the driver?
> >>
> >> Thanks,
> >> Michael



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6006) Optimize count distinct in case of high cardinality columns

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6006:
-
Fix Version/s: (was: 1.3.0)

> Optimize count distinct in case of high cardinality columns
> ---
>
> Key: SPARK-6006
> URL: https://issues.apache.org/jira/browse/SPARK-6006
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.1, 1.2.1
>Reporter: Yash Datta
>Priority: Minor
>
> When there are a lot of distinct values, count distinct becomes too slow because 
> it hashes all partial results into one map. It can be improved by creating 
> buckets/partial maps in an intermediate stage, where the same key from multiple 
> partial maps of the first stage hashes to the same bucket. We can then sum the 
> sizes of these buckets to get the total distinct count, as in the sketch below.
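> A plain RDD sketch of the idea (names and the use of String values are assumptions; the real change would live inside the SQL aggregation, not user code):
> {code}
> import org.apache.spark.rdd.RDD
>
> def bucketedCountDistinct(values: RDD[String], numBuckets: Int = 200): Long = {
>   values
>     .map(v => ((v.hashCode % numBuckets + numBuckets) % numBuckets, v)) // same value -> same bucket
>     .distinct()                                      // de-duplicate within each bucket in parallel
>     .map { case (bucket, _) => (bucket, 1L) }
>     .reduceByKey(_ + _)                              // distinct count per bucket
>     .values
>     .fold(0L)(_ + _)                                 // total distinct count = sum of bucket counts
> }
> {code}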



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5720) `Create Table Like` in HiveContext need support `like registered temporary table`

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5720:
-
Fix Version/s: (was: 1.3.0)

> `Create Table Like` in HiveContext need support `like registered temporary 
> table`
> -
>
> Key: SPARK-5720
> URL: https://issues.apache.org/jira/browse/SPARK-5720
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Li Sheng
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6060) List type missing for catalyst's package.scala

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6060:
-
Fix Version/s: (was: 1.3.0)

> List type missing for catalyst's package.scala
> --
>
> Key: SPARK-6060
> URL: https://issues.apache.org/jira/browse/SPARK-6060
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Linux zeno 3.18.5 #1 SMP Sun Feb 1 23:51:17 CET 2015 
> ppc64 GNU/Linux,
> java version "1.7.0_65"
> OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-2)
> OpenJDK Zero VM (build 24.65-b04, interpreted mode),
> sbt launcher version 0.13.7
>Reporter: Stephan Drescher
>Priority: Minor
>  Labels: build, error
>
> Used command line: 
> build/sbt -mem 1024 -Pyarn -Phive -Dhadoop.version=2.4.0 -Pbigtop-dist 
> -DskipTests assembly
> Output:
> [error]  while compiling: 
> /home/spark/Developer/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/package.scala
> [error] during phase: jvm
> [error]  library version: version 2.10.4
> [error] compiler version: version 2.10.4
> [error]   reconstructed args: -bootclasspath 
> /usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/resources.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/rt.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/jsse.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/jce.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/charsets.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/rhino.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/jfr.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/classes:/home/spark/.sbt/boot/scala-2.10.4/lib/scala-library.jar
>  -deprecation -classpath 
> /home/spark/Developer/spark/sql/catalyst/target/scala-2.10/classes:/home/spark/Developer/spark/core/target/scala-2.10/classes:/home/spark/Developer/spark/network/common/target/scala-2.10/classes:/home/spark/Developer/spark/network/shuffle/target/scala-2.10/classes:/home/spark/.sbt/boot/scala-2.10.4/lib/scala-compiler.jar:/home/spark/.sbt/boot/scala-2.10.4/lib/scala-reflect.jar:/home/spark/Developer/spark/lib_managed/jars/netty-all-4.0.23.Final.jar:/home/spark/Developer/spark/lib_managed/jars/unused-1.0.0.jar:/home/spark/Developer/spark/lib_managed/jars/chill_2.10-0.5.0.jar:/home/spark/Developer/spark/lib_managed/jars/chill-java-0.5.0.jar:/home/spark/Developer/spark/lib_managed/bundles/kryo-2.21.jar:/home/spark/Developer/spark/lib_managed/jars/reflectasm-1.07-shaded.jar:/home/spark/Developer/spark/lib_managed/jars/minlog-1.2.jar:/home/spark/Developer/spark/lib_managed/jars/objenesis-1.2.jar:/home/spark/Developer/spark/lib_managed/jars/hadoop-client-2.4.0.jar:/home/spark/Developer/spark/lib_managed/jars/hadoop-common-2.4.0.jar:/home/spark/Developer/spark/lib_managed/jars/hadoop-annotations-2.4.0.jar:/home/spark/Developer/spark/lib_managed/jars/commons-cli-1.2.jar:/home/spark/Developer/spark/lib_managed/jars/commons-math3-3.1.1.jar:/home/spark/Developer/spark/lib_managed/jars/xmlenc-0.52.jar:/home/spark/Developer/spark/lib_managed/jars/commons-httpclient-3.1.jar:/home/spark/Developer/spark/lib_managed/jars/commons-logging-1.1.3.jar:/home/spark/Developer/spark/lib_managed/jars/commons-io-2.4.jar:/home/spark/Developer/spark/lib_managed/jars/commons-net-3.1.jar:/home/spark/Developer/spark/lib_managed/jars/commons-collections-3.2.1.jar:/home/spark/Developer/spark/lib_managed/bundles/log4j-1.2.17.jar:/home/spark/Developer/spark/lib_managed/jars/commons-lang-2.6.jar:/home/spark/Developer/spark/lib_managed/jars/commons-configuration-1.6.jar:/home/spark/Developer/spark/lib_managed/jars/commons-digester-1.8.jar:/home/spark/Developer/spark/lib_managed/jars/commons-beanutils-1.7.0.jar:/home/spark/Developer/spark/lib_managed/jars/commons-beanutils-core-1.8.0.jar:/home/spark/Developer/spark/lib_managed/jars/slf4j-api-1.7.10.jar:/home/spark/Developer/spark/lib_managed/jars/jackson-core-asl-1.8.8.jar:/home/spark/Developer/spark/lib_managed/jars/jackson-mapper-asl-1.8.8.jar:/home/spark/Developer/spark/lib_managed/jars/avro-1.7.4.jar:/home/spark/Developer/spark/lib_managed/jars/commons-compress-1.4.1.jar:/home/spark/Developer/spark/lib_managed/jars/xz-1.0.jar:/home/spark/Developer/spark/lib_managed/bundles/protobuf-java-2.5.0.jar:/home/spark/Developer/spark/lib_managed/jars/hadoop-auth-2.4.0.jar:/home/spark/Developer/spark/lib_managed/jars/httpclient-4.2.5.jar:/home/spark/Developer/spark/lib_managed/jars/httpcore-4.2.4.jar:/home/spark/Developer/spark/lib_managed/jars/commons-codec-1.6.jar:/home/spark/Developer/spark/lib_managed/jars/hadoop-hdfs-2.4.0.jar:/home/spark/Developer/spark/lib_managed/jars/jetty-util-6.1.26.jar:/home/spark/Developer/spark/lib_managed/jars/hadoop-mapreduce-client-app-2.4.0.jar:/home/spark/Developer/spark/lib_

[jira] [Updated] (SPARK-5880) Change log level of batch pruning string in InMemoryColumnarTableScan from Info to Debug

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5880:
-
Fix Version/s: (was: 1.3.0)

> Change log level of batch pruning string in InMemoryColumnarTableScan from 
> Info to Debug
> 
>
> Key: SPARK-5880
> URL: https://issues.apache.org/jira/browse/SPARK-5880
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Nitin Goyal
>Priority: Trivial
>
> In InMemoryColumnarTableScan, we build a string of the statistics of all the 
> columns and log it at INFO level whenever batch pruning happens. This causes a 
> performance hit when there are a large number of batches, a good number of 
> columns, and almost every batch gets pruned.
> We can make the string evaluate lazily and change the log level to DEBUG, as 
> sketched below.
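> A minimal sketch of the change (the class and field names below are assumptions, not the actual diff): because {{Logging.logDebug}} takes its message by name, the statistics string is only built when DEBUG logging is enabled.
> {code}
> import org.apache.spark.Logging
>
> class BatchPruningLogger(columnStats: Seq[String]) extends Logging {
>   def onBatchPruned(batchId: Int): Unit = {
>     // The interpolated string is passed by name to logDebug, so it is never
>     // constructed unless the DEBUG level is actually enabled.
>     logDebug(s"Skipping batch $batchId, column stats: ${columnStats.mkString(", ")}")
>   }
> }
> {code}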



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5684) Key not found exception is thrown in case location of added partition to a parquet table is different than a path containing the partition values

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5684:
-
Fix Version/s: (was: 1.3.0)

> Key not found exception is thrown in case location of added partition to a 
> parquet table is different than a path containing the partition values
> -
>
> Key: SPARK-5684
> URL: https://issues.apache.org/jira/browse/SPARK-5684
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
>Reporter: Yash Datta
>
> Create a partitioned parquet table : 
> create table test_table (dummy string) partitioned by (timestamp bigint) 
> stored as parquet;
> Add a partition to the table and specify a different location:
> alter table test_table add partition (timestamp=9) location 
> '/data/pth/different'
> Run a simple select * query and we get an exception:
> 15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
> db4_mi2mi_binsrc1_default limit 5]
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 
> (TID 21, localhost): java
> .util.NoSuchElementException: key not found: timestamp
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at scala.collection.AbstractMap.apply(Map.scala:58)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:141)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:128)
> at 
> org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:247)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> This happens because the parquet code path assumes that (key=value) patterns 
> are present in the partition location, which is not always the case!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4558) History Server waits ~10s before starting up

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4558.
--
Resolution: Duplicate

> History Server waits ~10s before starting up
> 
>
> Key: SPARK-4558
> URL: https://issues.apache.org/jira/browse/SPARK-4558
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Priority: Minor
>
> After you call `sbin/start-history-server.sh`, it waits about 10s before 
> actually starting up. I suspect this is a subtle bug related to log checking.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5522) Accelerate the History Server start

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5522.
--
Resolution: Fixed

Looks resolved by https://github.com/apache/spark/pull/4525 but just never got 
marked as such.

> Accelerate the History Server start
> ---
>
> Key: SPARK-5522
> URL: https://issues.apache.org/jira/browse/SPARK-5522
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 1.0.0
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Fix For: 1.4.0
>
>
> When starting the history server, all the log files are fetched and parsed in 
> order to get the applications' metadata, e.g. app name, start time, duration, etc. 
> In our production cluster there are 2600 log files (160 GB) in HDFS, and it takes 
> 3 hours to restart the history server, which is a bit too long for us.
> It would be better if the history server could show logs with missing 
> information during start-up and fill in the missing information after fetching 
> and parsing each log file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5192) Parquet fails to parse schema contains '\r'

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5192:
-
Fix Version/s: (was: 1.3.0)

> Parquet fails to parse schema contains '\r'
> ---
>
> Key: SPARK-5192
> URL: https://issues.apache.org/jira/browse/SPARK-5192
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: windows7 + Intellj idea 13.0.2 
>Reporter: cen yuhai
>Priority: Minor
>
> I think this is actually a bug in parquet. When I debugged 'ParquetTestData', 
> I found the exception below. So I downloaded the source of MessageTypeParser; 
> the function 'isWhitespace' does not check for '\r':
> private boolean isWhitespace(String t) {
>   return t.equals(" ") || t.equals("\t") || t.equals("\n");
> }
> So I replaced all '\r' characters to work around this issue:
>   val subTestSchema =
> """
>   message myrecord {
>   optional boolean myboolean;
>   optional int64 mylong;
>   }
> """.replaceAll("\r","")
> at line 0: message myrecord {
>   at 
> parquet.schema.MessageTypeParser.asRepetition(MessageTypeParser.java:203)
>   at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:101)
>   at 
> parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:96)
>   at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:89)
>   at 
> parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:79)
>   at 
> org.apache.spark.sql.parquet.ParquetTestData$.writeFile(ParquetTestData.scala:221)
>   at 
> org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:92)
>   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at 
> org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:85)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at 
> org.apache.spark.sql.parquet.ParquetQuerySuite.run(ParquetQuerySuite.scala:85)
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5264) Support `drop temporary table [if exists]` DDL command

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5264:
-
Fix Version/s: (was: 1.3.0)

> Support `drop temporary table [if exists]` DDL command 
> ---
>
> Key: SPARK-5264
> URL: https://issues.apache.org/jira/browse/SPARK-5264
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Li Sheng
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Support the `drop table` DDL command, 
> i.e. DROP [TEMPORARY] TABLE [IF EXISTS] tbl_name



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4752) Classifier based on artificial neural network

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4752:
-
Fix Version/s: (was: 1.3.0)

> Classifier based on artificial neural network
> -
>
> Key: SPARK-4752
> URL: https://issues.apache.org/jira/browse/SPARK-4752
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Alexander Ulanov
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Implement a classifier based on an artificial neural network (ANN). Requirements:
> 1) Use the existing artificial neural network implementation 
> https://issues.apache.org/jira/browse/SPARK-2352, 
> https://github.com/apache/spark/pull/1290
> 2) Extend MLlib ClassificationModel trait, 
> 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training,
> 4) Be able to return the ANN model



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5362:
-
Fix Version/s: (was: 1.3.0)

> Gradient and Optimizer to support generic output (instead of label) and data 
> batches
> 
>
> Key: SPARK-5362
> URL: https://issues.apache.org/jira/browse/SPARK-5362
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, the Gradient and Optimizer interfaces support data in the form of 
> RDD[(Double, Vector)], which refers to label and features. This limits their 
> application to classification problems. For example, an artificial neural 
> network demands a Vector as output (instead of label: Double). Moreover, the 
> current interface does not support data batches. I propose to replace label: 
> Double with output: Vector. This enables passing a generic output instead of a 
> label, and also passing data and output batches stored in the corresponding 
> vectors.
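> A sketch of what the proposed interface could look like (illustrative only; the trait and parameter names are assumptions, not an agreed API):
> {code}
> import org.apache.spark.mllib.linalg.Vector
>
> // Mirrors Gradient.compute, but the Double label is replaced by a generic output
> // vector; data and output may also hold a whole batch stacked into one vector each.
> trait VectorGradient extends Serializable {
>   /** Returns (gradient, loss) for the given data, expected output and weights. */
>   def compute(data: Vector, output: Vector, weights: Vector): (Vector, Double)
> }
> {code}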



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6128) Update Spark Streaming Guide for Spark 1.3

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6128.
--
Resolution: Fixed

> Update Spark Streaming Guide for Spark 1.3
> --
>
> Key: SPARK-6128
> URL: https://issues.apache.org/jira/browse/SPARK-6128
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 1.3.0
>
>
> Things to update
> - New Kafka Direct API
> - Python Kafka API
> - Add joins to streaming guide



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2253) [Core] Disable partial aggregation automatically when reduction factor is low

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2253:
-
Fix Version/s: (was: 1.3.0)

> [Core] Disable partial aggregation automatically when reduction factor is low
> -
>
> Key: SPARK-2253
> URL: https://issues.apache.org/jira/browse/SPARK-2253
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>
> Once we have seen enough rows in partial aggregation without observing any 
> reduction, the Aggregator should just turn off partial aggregation. This reduces 
> memory usage for high-cardinality aggregations.
> This one is for Spark core. There is another ticket tracking this for SQL. 
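> A rough sketch of such a heuristic (threshold values and names are made up for illustration; this is not the Aggregator code):
> {code}
> def keepPartialAggregation(rowsSeen: Long,
>                            distinctKeysSeen: Long,
>                            sampleSize: Long = 100000L,
>                            minReduction: Double = 0.5): Boolean = {
>   if (rowsSeen < sampleSize) {
>     true // not enough rows observed yet to judge the reduction factor
>   } else {
>     // keep map-side aggregation only if it is actually collapsing rows
>     distinctKeysSeen.toDouble / rowsSeen <= minReduction
>   }
> }
> {code}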



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6470) Allow Spark apps to put YARN node labels in their requests

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6470:
---

Assignee: Apache Spark  (was: Sandy Ryza)

> Allow Spark apps to put YARN node labels in their requests
> --
>
> Key: SPARK-6470
> URL: https://issues.apache.org/jira/browse/SPARK-6470
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Sandy Ryza
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6470) Allow Spark apps to put YARN node labels in their requests

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6470:
---

Assignee: Sandy Ryza  (was: Apache Spark)

> Allow Spark apps to put YARN node labels in their requests
> --
>
> Key: SPARK-6470
> URL: https://issues.apache.org/jira/browse/SPARK-6470
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6470) Allow Spark apps to put YARN node labels in their requests

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385363#comment-14385363
 ] 

Apache Spark commented on SPARK-6470:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/5242

> Allow Spark apps to put YARN node labels in their requests
> --
>
> Key: SPARK-6470
> URL: https://issues.apache.org/jira/browse/SPARK-6470
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message

2015-03-28 Thread Spiro Michaylov (JIRA)
Spiro Michaylov created SPARK-6587:
--

 Summary: Inferring schema for case class hierarchy fails with 
mysterious message
 Key: SPARK-6587
 URL: https://issues.apache.org/jira/browse/SPARK-6587
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: At least Windows 8, Scala 2.11.2.  
Reporter: Spiro Michaylov


(Don't know if this is a functionality bug, error reporting bug or an RFE ...)

I define the following hierarchy:

{code}
private abstract class MyHolder
private case class StringHolder(s: String) extends MyHolder
private case class IntHolder(i: Int) extends MyHolder
private case class BooleanHolder(b: Boolean) extends MyHolder
{code}

and a top level case class:

{code}
private case class Thing(key: Integer, foo: MyHolder)
{code}

When I try to convert it:

{code}
val things = Seq(
  Thing(1, IntHolder(42)),
  Thing(2, StringHolder("hello")),
  Thing(3, BooleanHolder(false))
)
val thingsDF = sc.parallelize(things, 4).toDF()

thingsDF.registerTempTable("things")

val all = sqlContext.sql("SELECT * from things")
{code}

I get the following stack trace:

{quote}
Exception in thread "main" scala.MatchError: 
sql.CaseClassSchemaProblem.MyHolder (of class 
scala.reflect.internal.Types$ClassNoArgsTypeRef)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
at scala.collection.immutable.List.map(List.scala:276)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312)
at 
org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250)
at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35)
at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{quote}

I wrote this to answer [a question on 
StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql]
 which uses a much simpler approach and suffers the same problem.

Looking at what seems to me to be the [relevant unit test 
suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala]
 I see that this case is not covered.  
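One possible workaround (a sketch only, reusing {{things}}, {{sc}} and the implicits from the snippets above; {{FlatThing}} and {{flatten}} are made-up names) is to flatten the hierarchy into a single case class with Option fields, which schema inference does handle:

{code}
case class FlatThing(key: Int, s: Option[String], i: Option[Int], b: Option[Boolean])

def flatten(t: Thing): FlatThing = t.foo match {
  case StringHolder(s)  => FlatThing(t.key, Some(s), None, None)
  case IntHolder(i)     => FlatThing(t.key, None, Some(i), None)
  case BooleanHolder(b) => FlatThing(t.key, None, None, Some(b))
}

val flatThingsDF = sc.parallelize(things.map(flatten), 4).toDF()
{code}

Of course this loses the polymorphism the report is really asking for; it only sidesteps the MatchError.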



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-28 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-2426:

Affects Version/s: (was: 1.3.0)
   1.4.0

> Quadratic Minimization for MLlib ALS
> 
>
> Key: SPARK-2426
> URL: https://issues.apache.org/jira/browse/SPARK-2426
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Debasish Das
>Assignee: Debasish Das
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Current ALS supports least squares and nonnegative least squares.
> I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
> the following ALS problems:
> 1. ALS with bounds
> 2. ALS with L1 regularization
> 3. ALS with Equality constraint and bounds
> Initial runtime comparisons are presented at Spark Summit. 
> http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
> Based on Xiangrui's feedback I am currently comparing the ADMM based 
> Quadratic Minimization solvers with IPM based QpSolvers and the default 
> ALS/NNLS. I will keep updating the runtime comparison results.
> For integration the detailed plan is as follows:
> 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
> 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6571) MatrixFactorizationModel created by load fails on predictAll

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385414#comment-14385414
 ] 

Apache Spark commented on SPARK-6571:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/5243

> MatrixFactorizationModel created by load fails on predictAll
> 
>
> Key: SPARK-6571
> URL: https://issues.apache.org/jira/browse/SPARK-6571
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Charles Hayden
>Assignee: Xiangrui Meng
>
> This code, adapted from the documentation, fails when using a loaded model.
> from pyspark.mllib.recommendation import ALS, Rating, MatrixFactorizationModel
> r1 = (1, 1, 1.0)
> r2 = (1, 2, 2.0)
> r3 = (2, 1, 2.0)
> ratings = sc.parallelize([r1, r2, r3])
> model = ALS.trainImplicit(ratings, 1, seed=10)
> print '(2, 2)', model.predict(2, 2)
> #0.43...
> testset = sc.parallelize([(1, 2), (1, 1)])
> print 'all', model.predictAll(testset).collect()
> #[Rating(user=1, product=1, rating=1.0...), Rating(user=1, product=2, 
> rating=1.9...)]
> import os, tempfile
> path = tempfile.mkdtemp()
> model.save(sc, path)
> sameModel = MatrixFactorizationModel.load(sc, path)
> print '(2, 2)', sameModel.predict(2,2)
> sameModel.predictAll(testset).collect()
> This gives
> (2, 2) 0.443547642944
> all [Rating(user=1, product=1, rating=1.1538351103381217), Rating(user=1, 
> product=2, rating=0.7153473708381739)]
> (2, 2) 0.443547642944
> ---
> Py4JError Traceback (most recent call last)
>  in ()
>  19 sameModel = MatrixFactorizationModel.load(sc, path)
>  20 print '(2, 2)', sameModel.predict(2,2)
> ---> 21 sameModel.predictAll(testset).collect()
>  22 
> /home/ubuntu/spark/python/pyspark/mllib/recommendation.pyc in 
> predictAll(self, user_product)
> 104 assert len(first) == 2, "user_product should be RDD of (user, 
> product)"
> 105 user_product = user_product.map(lambda (u, p): (int(u), 
> int(p)))
> --> 106 return self.call("predict", user_product)
> 107 
> 108 def userFeatures(self):
> /home/ubuntu/spark/python/pyspark/mllib/common.pyc in call(self, name, *a)
> 134 def call(self, name, *a):
> 135 """Call method of java_model"""
> --> 136 return callJavaFunc(self._sc, getattr(self._java_model, 
> name), *a)
> 137 
> 138 
> /home/ubuntu/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, 
> *args)
> 111 """ Call Java Function """
> 112 args = [_py2java(sc, a) for a in args]
> --> 113 return _java2py(sc, func(*args))
> 114 
> 115 
> /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 302 raise Py4JError(
> 303 'An error occurred while calling {0}{1}{2}. 
> Trace:\n{3}\n'.
> --> 304 format(target_id, '.', name, value))
> 305 else:
> 306 raise Py4JError(
> Py4JError: An error occurred while calling o450.predict. Trace:
> py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) 
> does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
>   at py4j.Gateway.invoke(Gateway.java:252)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6581) Metadata is missing when saving parquet file using hadoop 1.0.4

2015-03-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6581:
--
Target Version/s: 1.4.0

> Metadata is missing when saving parquet file using hadoop 1.0.4
> ---
>
> Key: SPARK-6581
> URL: https://issues.apache.org/jira/browse/SPARK-6581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: hadoop 1.0.4
>Reporter: Pei-Lun Lee
>
> When saving a parquet file with {code}df.save("foo", "parquet"){code}, it 
> generates only _common_metadata, while _metadata is missing:
> {noformat}
> -rwxrwxrwx  1 peilunlee  staff0 Mar 27 11:29 _SUCCESS*
> -rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
> -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
> {noformat}
> If saving with {code}df.save("foo", "parquet", SaveMode.Overwrite){code}, both 
> _metadata and _common_metadata are missing:
> {noformat}
> -rwxrwxrwx  1 peilunlee  staff0 Mar 27 11:29 _SUCCESS*
> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
> -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6570) Spark SQL arrays: "explode()" fails and cannot save array type to Parquet

2015-03-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6570:
--
Target Version/s: 1.4.0

> Spark SQL arrays: "explode()" fails and cannot save array type to Parquet
> -
>
> Key: SPARK-6570
> URL: https://issues.apache.org/jira/browse/SPARK-6570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>
> {code}
> @Rule
> public TemporaryFolder tmp = new TemporaryFolder();
> @Test
> public void testPercentileWithExplode() throws Exception {
> StructType schema = DataTypes.createStructType(Lists.newArrayList(
> DataTypes.createStructField("col1", DataTypes.StringType, 
> false),
> DataTypes.createStructField("col2s", 
> DataTypes.createArrayType(DataTypes.IntegerType, true), true)
> ));
> JavaRDD rowRDD = sc.parallelize(Lists.newArrayList(
> RowFactory.create("test", new int[]{1, 2, 3})
> ));
> DataFrame df = sql.createDataFrame(rowRDD, schema);
> df.registerTempTable("df");
> df.printSchema();
> List ints = sql.sql("select col2s from df").javaRDD()
>   .map(row -> (int[]) row.get(0)).collect();
> assertEquals(1, ints.size());
> assertArrayEquals(new int[]{1, 2, 3}, ints.get(0));
> // fails: lateral view explode does not work: 
> java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
> List explodedInts = sql.sql("select col2 from df lateral 
> view explode(col2s) splode as col2").javaRDD()
> .map(row -> row.getInt(0)).collect();
> assertEquals(3, explodedInts.size());
> assertEquals(Lists.newArrayList(1, 2, 3), explodedInts);
> // fails: java.lang.ClassCastException: [I cannot be cast to 
> scala.collection.Seq
> df.saveAsParquetFile(tmp.getRoot().getAbsolutePath() + "/parquet");
> DataFrame loadedDf = sql.load(tmp.getRoot().getAbsolutePath() + 
> "/parquet");
> loadedDf.registerTempTable("loadedDf");
> List moreInts = sql.sql("select col2s from loadedDf").javaRDD()
>   .map(row -> (int[]) row.get(0)).collect();
> assertEquals(1, moreInts.size());
> assertArrayEquals(new int[]{1, 2, 3}, moreInts.get(0));
> }
> {code}
> {code}
> root
>  |-- col1: string (nullable = false)
>  |-- col2s: array (nullable = true)
>  ||-- element: integer (containsNull = true)
> ERROR org.apache.spark.executor.Executor Exception in task 7.0 in stage 1.0 
> (TID 15)
> java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
>   at 
> org.apache.spark.sql.catalyst.expressions.Explode.eval(generators.scala:125) 
> ~[spark-catalyst_2.10-1.3.0.jar:1.3.0]
>   at 
> org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:70)
>  ~[spark-sql_2.10-1.3.0.jar:1.3.0]
>   at 
> org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:69)
>  ~[spark-sql_2.10-1.3.0.jar:1.3.0]
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) 
> ~[scala-library-2.10.4.jar:na]
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
> ~[scala-library-2.10.4.jar:na]
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
> ~[scala-library-2.10.4.jar:na]
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
> ~[scala-library-2.10.4.jar:na]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
> ~[scala-library-2.10.4.jar:na]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
> ~[scala-library-2.10.4.jar:na]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6588) Private VPC's and subnets currently don't work with the Spark ec2 script

2015-03-28 Thread Michelangelo D'Agostino (JIRA)
Michelangelo D'Agostino created SPARK-6588:
--

 Summary: Private VPC's and subnets currently don't work with the 
Spark ec2 script
 Key: SPARK-6588
 URL: https://issues.apache.org/jira/browse/SPARK-6588
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Michelangelo D'Agostino
Priority: Minor


The spark_ec2.py script currently references the ip_address and public_dns_name 
attributes of an instance. On private networks, these fields aren't set, so the 
script fails.

The solution, which I've just finished coding up, is to introduce a 
--private-ips flag that instead refers to the private_ip_address attribute in 
both cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6588) Private VPC's and subnets currently don't work with the Spark ec2 script

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6588.
--
Resolution: Duplicate

Also SPARK-5246, SPARK-6220. Have a look at the existing JIRAs and see if you 
can resolve one of them to this effect.

> Private VPC's and subnets currently don't work with the Spark ec2 script
> 
>
> Key: SPARK-6588
> URL: https://issues.apache.org/jira/browse/SPARK-6588
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 1.3.0
>Reporter: Michelangelo D'Agostino
>Priority: Minor
>
> The spark_ec2.py script currently references the ip_address and 
> public_dns_name attributes of an instance. On private networks, these fields 
> aren't set, so the script fails.
> The solution, which I've just finished coding up, is to introduce a 
> --private-ips flag that instead refers to the private_ip_address attribute in 
> both cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6588) Private VPC's and subnets currently don't work with the Spark ec2 script

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385435#comment-14385435
 ] 

Apache Spark commented on SPARK-6588:
-

User 'mdagost' has created a pull request for this issue:
https://github.com/apache/spark/pull/5244

> Private VPC's and subnets currently don't work with the Spark ec2 script
> 
>
> Key: SPARK-6588
> URL: https://issues.apache.org/jira/browse/SPARK-6588
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 1.3.0
>Reporter: Michelangelo D'Agostino
>Priority: Minor
>
> The spark_ec2.py script currently references the ip_address and 
> public_dns_name attributes of an instance. On private networks, these fields 
> aren't set, so the script fails.
> The solution, which I've just finished coding up, is to introduce a 
> --private-ips flag that instead refers to the private_ip_address attribute in 
> both cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5894) Add PolynomialMapper

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5894:
---

Assignee: Apache Spark

> Add PolynomialMapper
> 
>
> Key: SPARK-5894
> URL: https://issues.apache.org/jira/browse/SPARK-5894
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> `PolynomialMapper` takes a vector column and outputs a vector column with 
> polynomial feature mapping.
> {code}
> val poly = new PolynomialMapper()
>   .setInputCol("features")
>   .setDegree(2)
>   .setOutputCols("polyFeatures")
> {code}
> It should handle the output feature names properly. Maybe we can find a 
> better name for it than `PolynomialMapper`.
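
As a purely illustrative aside, this is what a degree-2 expansion of a two-element 
feature vector would produce; the function below is an assumption about the intended 
semantics, not the PolynomialMapper implementation.
{code}
// illustrative only: degree-2 polynomial expansion of a 2-element feature vector,
// keeping the original terms plus all degree-2 monomials (no bias term)
def expandDegree2(x: Double, y: Double): Array[Double] =
  Array(x, y, x * x, x * y, y * y)

// expandDegree2(2.0, 3.0) == Array(2.0, 3.0, 4.0, 6.0, 9.0)
{code}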



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5894) Add PolynomialMapper

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385446#comment-14385446
 ] 

Apache Spark commented on SPARK-5894:
-

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5245

> Add PolynomialMapper
> 
>
> Key: SPARK-5894
> URL: https://issues.apache.org/jira/browse/SPARK-5894
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> `PolynomialMapper` takes a vector column and outputs a vector column with 
> polynomial feature mapping.
> {code}
> val poly = new PolynomialMapper()
>   .setInputCol("features")
>   .setDegree(2)
>   .setOutputCols("polyFeatures")
> {code}
> It should handle the output feature names properly. Maybe we can find a 
> better name for it than `PolynomialMapper`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5894) Add PolynomialMapper

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5894:
---

Assignee: (was: Apache Spark)

> Add PolynomialMapper
> 
>
> Key: SPARK-5894
> URL: https://issues.apache.org/jira/browse/SPARK-5894
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> `PolynomialMapper` takes a vector column and outputs a vector column with 
> polynomial feature mapping.
> {code}
> val poly = new PolynomialMapper()
>   .setInputCol("features")
>   .setDegree(2)
>   .setOutputCols("polyFeatures")
> {code}
> It should handle the output feature names properly. Maybe we can find a 
> better name for it than `PolynomialMapper`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6589) SQLUserDefinedType failed in spark-shell

2015-03-28 Thread Benyi Wang (JIRA)
Benyi Wang created SPARK-6589:
-

 Summary: SQLUserDefinedType failed in spark-shell
 Key: SPARK-6589
 URL: https://issues.apache.org/jira/browse/SPARK-6589
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
 Environment: CDH 5.3.2
Reporter: Benyi Wang


{{DataType.fromJson}} will fail in spark-shell if the schema includes "udt". It 
works when running in an application. 

As a result, I cannot read a Parquet file that includes a UDT field. 
{{DataType.fromCaseClass}} does not support UDT.

I can load the class which shows that my UDT is in the classpath.
{code}
scala> Class.forName("com.bwang.MyTestUDT")
res6: Class[_] = class com.bwang.MyTestUDT
{code}

But DataType fails:
{code}
scala> DataType.fromJson(json)  

java.lang.ClassNotFoundException: com.bwang.MyTestUDT
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at 
org.apache.spark.sql.catalyst.types.DataType$.parseDataType(dataTypes.scala:77)
{code}

The reason is DataType.fromJson tries to load {{udtClass}} using this code:
{code}
case JSortedObject(
("class", JString(udtClass)),
("pyClass", _),
("sqlType", _),
("type", JString("udt"))) =>
  Class.forName(udtClass).newInstance().asInstanceOf[UserDefinedType[_]]
  }
{code}

Unfortunately, my UDT is loaded by {{SparkIMain$TranslatingClassLoader}}, but 
DataType is loaded by {{Launcher$AppClassLoader}}.

{code}
scala> DataType.getClass.getClassLoader
res2: ClassLoader = sun.misc.Launcher$AppClassLoader@6876fb1b

scala> this.getClass.getClassLoader
res3: ClassLoader = 
org.apache.spark.repl.SparkIMain$TranslatingClassLoader@63d36b29
{code}
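
One possible direction, sketched below purely for illustration and reusing the 
udtClass and UserDefinedType names from the snippet above: resolve the class through 
the thread context class loader (which should be the REPL's TranslatingClassLoader, 
if spark-shell installs it as such), falling back to the current loader. This is not 
the change actually made in Spark.
{code}
// hedged sketch only: prefer the thread context class loader over the loader
// that happened to load DataType itself
val loader = Option(Thread.currentThread().getContextClassLoader)
  .getOrElse(getClass.getClassLoader)
Class.forName(udtClass, true, loader)
  .newInstance().asInstanceOf[UserDefinedType[_]]
{code}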



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6590) Make DataFrame.where accept a string conditionExpr

2015-03-28 Thread Yin Huai (JIRA)
Yin Huai created SPARK-6590:
---

 Summary: Make DataFrame.where accept a string conditionExpr
 Key: SPARK-6590
 URL: https://issues.apache.org/jira/browse/SPARK-6590
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yin Huai
Assignee: Yin Huai


In our docs, we say that where is an alias of filter. However, where does not 
accept a conditionExpr given as a string.
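
A minimal sketch of the requested overload, assuming it simply delegates to the 
existing {{filter(conditionExpr: String)}}; this is a sketch, not the actual patch.
{code}
// inside org.apache.spark.sql.DataFrame -- sketch only
def where(conditionExpr: String): DataFrame = filter(conditionExpr)

// intended usage, mirroring what filter already accepts:
//   df.where("age > 21 AND name IS NOT NULL")
{code}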



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6590) Make DataFrame.where accept a string conditionExpr

2015-03-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6590:

Priority: Minor  (was: Major)

> Make DataFrame.where accept a string conditionExpr
> --
>
> Key: SPARK-6590
> URL: https://issues.apache.org/jira/browse/SPARK-6590
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Minor
>
> In our docs, we say that where is an alias of filter. However, where does not 
> accept a conditionExpr given as a string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6589) SQLUserDefinedType failed in spark-shell

2015-03-28 Thread Benyi Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385499#comment-14385499
 ] 

Benyi Wang commented on SPARK-6589:
---

I found a workaround for this issue, but I still think DataType should use a 
better way to locate the correct class loader.
{code}
# put the UDT jar on SPARK_CLASSPATH so that Launcher$AppClassLoader can find it
export SPARK_CLASSPATH=myUDT.jar

spark-shell --jars myUDT.jar ...
{code}

> SQLUserDefinedType failed in spark-shell
> 
>
> Key: SPARK-6589
> URL: https://issues.apache.org/jira/browse/SPARK-6589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: CDH 5.3.2
>Reporter: Benyi Wang
>
> {{DataType.fromJson}} will fail in spark-shell if the schema includes "udt". 
> It works when running in an application. 
> As a result, I cannot read a Parquet file that includes a UDT field. 
> {{DataType.fromCaseClass}} does not support UDT.
> I can load the class which shows that my UDT is in the classpath.
> {code}
> scala> Class.forName("com.bwang.MyTestUDT")
> res6: Class[_] = class com.bwang.MyTestUDT
> {code}
> But DataType fails:
> {code}
> scala> DataType.fromJson(json)
>   
> java.lang.ClassNotFoundException: com.bwang.MyTestUDT
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:190)
> at 
> org.apache.spark.sql.catalyst.types.DataType$.parseDataType(dataTypes.scala:77)
> {code}
> The reason is DataType.fromJson tries to load {{udtClass}} using this code:
> {code}
> case JSortedObject(
> ("class", JString(udtClass)),
> ("pyClass", _),
> ("sqlType", _),
> ("type", JString("udt"))) =>
>   Class.forName(udtClass).newInstance().asInstanceOf[UserDefinedType[_]]
>   }
> {code}
> Unfortunately, my UDT is loaded by {{SparkIMain$TranslatingClassLoader}}, but 
> DataType is loaded by {{Launcher$AppClassLoader}}.
> {code}
> scala> DataType.getClass.getClassLoader
> res2: ClassLoader = sun.misc.Launcher$AppClassLoader@6876fb1b
> scala> this.getClass.getClassLoader
> res3: ClassLoader = 
> org.apache.spark.repl.SparkIMain$TranslatingClassLoader@63d36b29
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6391) Update Tachyon version compatibility documentation

2015-03-28 Thread Haoyuan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyuan Li updated SPARK-6391:
--
Fix Version/s: 1.4.0

> Update Tachyon version compatibility documentation
> --
>
> Key: SPARK-6391
> URL: https://issues.apache.org/jira/browse/SPARK-6391
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Calvin Jia
> Fix For: 1.4.0
>
>
> Tachyon v0.6 has an API change in the client; it would be helpful to document 
> Tachyon-Spark compatibility across versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6299) ClassNotFoundException in standalone mode when running groupByKey with class defined in REPL.

2015-03-28 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385516#comment-14385516
 ] 

Chip Senkbeil commented on SPARK-6299:
--

FYI, we had the same issue on Mesos for 1.2.1 when the class was defined 
through the REPL. So, it was not just limited to standalone mode.

> ClassNotFoundException in standalone mode when running groupByKey with class 
> defined in REPL.
> -
>
> Key: SPARK-6299
> URL: https://issues.apache.org/jira/browse/SPARK-6299
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Kevin (Sangwoo) Kim
>Assignee: Kevin (Sangwoo) Kim
> Fix For: 1.3.1, 1.4.0
>
>
> Anyone can reproduce this issue by the code below
> (runs well in local mode, got exception with clusters)
> (it runs well in Spark 1.1.1)
> {code}
> case class ClassA(value: String)
> val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2")) ))
> rdd.groupByKey.collect
> {code}
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 
> in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage 
> 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): 
> java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:274)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
> at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.s

[jira] [Updated] (SPARK-6391) Update Tachyon version compatibility documentation

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6391:
-
Target Version/s: 1.4.0
   Fix Version/s: (was: 1.4.0)

[~haoyuan] we set Fix Version when the issue is Resolved. At best, set Target 
Version.

> Update Tachyon version compatibility documentation
> --
>
> Key: SPARK-6391
> URL: https://issues.apache.org/jira/browse/SPARK-6391
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Calvin Jia
>
> Tachyon v0.6 has an API change in the client; it would be helpful to document 
> Tachyon-Spark compatibility across versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5124) Standardize internal RPC interface

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5124:
---

Assignee: Shixiong Zhu  (was: Apache Spark)

> Standardize internal RPC interface
> --
>
> Key: SPARK-5124
> URL: https://issues.apache.org/jira/browse/SPARK-5124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
> Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf
>
>
> In Spark we use Akka as the RPC layer. It would be great if we could 
> standardize the internal RPC interface to facilitate testing. This would also 
> provide the foundation for trying other RPC implementations in the future.
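
Purely for illustration, a minimal sketch of what a transport-agnostic RPC 
abstraction could look like; the trait and method names below are assumptions for 
discussion, not the interface from the attached design documents.
{code}
// illustrative sketch only -- names are made up, not the proposed Spark interface
trait RpcEndpointRef {
  def send(message: Any): Unit                            // fire-and-forget
}

trait RpcEndpoint {
  def receive(sender: RpcEndpointRef, message: Any): Unit // handle one message
}

trait RpcEnv {
  def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef
  def shutdown(): Unit
}
{code}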



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5124) Standardize internal RPC interface

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5124:
---

Assignee: Apache Spark  (was: Shixiong Zhu)

> Standardize internal RPC interface
> --
>
> Key: SPARK-5124
> URL: https://issues.apache.org/jira/browse/SPARK-5124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Apache Spark
> Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf
>
>
> In Spark we use Akka as the RPC layer. It would be great if we could 
> standardize the internal RPC interface to facilitate testing. This would also 
> provide the foundation for trying other RPC implementations in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5494) SparkSqlSerializer Ignores KryoRegistrators

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5494:
---

Assignee: Apache Spark

> SparkSqlSerializer Ignores KryoRegistrators
> ---
>
> Key: SPARK-5494
> URL: https://issues.apache.org/jira/browse/SPARK-5494
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Hamel Ajay Kothari
>Assignee: Apache Spark
>
> We should make SparkSqlSerializer call {{super.newKryo}} before doing any of 
> its custom registration, in order to make sure it picks up custom 
> KryoRegistrators.
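
A minimal sketch of the requested ordering, assuming {{KryoSerializer.newKryo()}} is 
what applies the user-supplied registrators; the class name is illustrative and this 
is not the actual patch.
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// sketch only: call super.newKryo() first so user KryoRegistrators are applied,
// then layer SQL-specific registrations on top of them
class SqlAwareKryoSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    val kryo = super.newKryo()
    // ... register SQL-internal classes here ...
    kryo
  }
}
{code}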



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5494) SparkSqlSerializer Ignores KryoRegistrators

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5494:
---

Assignee: (was: Apache Spark)

> SparkSqlSerializer Ignores KryoRegistrators
> ---
>
> Key: SPARK-5494
> URL: https://issues.apache.org/jira/browse/SPARK-5494
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Hamel Ajay Kothari
>
> We should make SparkSqlSerializer call {{super.newKryo}} before doing any of 
> its custom registration, in order to make sure it picks up custom 
> KryoRegistrators.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6391) Update Tachyon version compatibility documentation

2015-03-28 Thread Haoyuan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385584#comment-14385584
 ] 

Haoyuan Li commented on SPARK-6391:
---

Thanks [~sowen].

> Update Tachyon version compatibility documentation
> --
>
> Key: SPARK-6391
> URL: https://issues.apache.org/jira/browse/SPARK-6391
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Calvin Jia
>
> Tachyon v0.6 has an API change in the client; it would be helpful to document 
> Tachyon-Spark compatibility across versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5946) Add Python API for Kafka direct stream

2015-03-28 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5946:
-
Target Version/s: 1.4.0

> Add Python API for Kafka direct stream
> --
>
> Key: SPARK-5946
> URL: https://issues.apache.org/jira/browse/SPARK-5946
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Streaming
>Affects Versions: 1.3.0
>Reporter: Saisai Shao
>
> Add the Python API for the Kafka direct stream. Currently this only adds the 
> {{createDirectStream}} API, not the {{createRDD}} API, since that needs some 
> Python wrappers around Java objects; it will be improved according to the comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2973) Use LocalRelation for all ExecutedCommands, avoid job for take/collect()

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385606#comment-14385606
 ] 

Apache Spark commented on SPARK-2973:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/5247

> Use LocalRelation for all ExecutedCommands, avoid job for take/collect()
> 
>
> Key: SPARK-2973
> URL: https://issues.apache.org/jira/browse/SPARK-2973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Aaron Davidson
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.2.0
>
>
> Right now, sql("show tables").collect() will start a Spark job, which shows up 
> in the UI. There should be a way to get these results without running a job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6591) Python data source load options should auto convert common types into strings

2015-03-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6591:
---
Labels: DataFrame DataSource  (was: )

> Python data source load options should auto convert common types into strings
> -
>
> Key: SPARK-6591
> URL: https://issues.apache.org/jira/browse/SPARK-6591
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Reynold Xin
>Assignee: Davies Liu
>  Labels: DataFrame, DataSource
>
> See the discussion at : https://github.com/databricks/spark-csv/pull/39
> If the caller invokes
> {code}
> sqlContext.load("com.databricks.spark.csv", path = "cars.csv", header = True)
> {code}
> We should automatically turn header into "true" in string form.
> We should do this for booleans and numeric values.
> cc [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6591) Python data source load options should auto convert common types into strings

2015-03-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-6591:
--

 Summary: Python data source load options should auto convert 
common types into strings
 Key: SPARK-6591
 URL: https://issues.apache.org/jira/browse/SPARK-6591
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Reporter: Reynold Xin
Assignee: Davies Liu


See the discussion at : https://github.com/databricks/spark-csv/pull/39

If the caller invokes
{code}
sqlContext.load("com.databricks.spark.csv", path = "cars.csv", header = True)
{code}

We should automatically turn header into "true" in string form.

We should do this for booleans and numeric values.

cc [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-28 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-6592:
--

 Summary: API of Row trait should be presented in Scala doc
 Key: SPARK-6592
 URL: https://issues.apache.org/jira/browse/SPARK-6592
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Nan Zhu


Currently, the API of the Row trait is not presented in Scaladoc, even though we 
use it frequently. 

The reason is that we ignore all files under catalyst in SparkBuild.scala when 
generating Scaladoc 
(https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369).

What's the best approach to fix this? [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-28 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385616#comment-14385616
 ] 

Nan Zhu commented on SPARK-6592:


also cc: [~lian cheng] [~marmbrus]

> API of Row trait should be presented in Scala doc
> -
>
> Key: SPARK-6592
> URL: https://issues.apache.org/jira/browse/SPARK-6592
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.3.0
>Reporter: Nan Zhu
>
> Currently, the API of the Row trait is not presented in Scaladoc, even though we 
> use it frequently. 
> The reason is that we ignore all files under catalyst in 
> SparkBuild.scala when generating Scaladoc 
> (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369).
> What's the best approach to fix this? [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-28 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385617#comment-14385617
 ] 

Reynold Xin commented on SPARK-6592:


Can you try changing that line to 

spark/sql/catalyst?

Then it should only filter out the catalyst package, but not the whole catalyst 
module.


> API of Row trait should be presented in Scala doc
> -
>
> Key: SPARK-6592
> URL: https://issues.apache.org/jira/browse/SPARK-6592
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.3.0
>Reporter: Nan Zhu
>
> Currently, the API of the Row trait is not presented in Scaladoc, even though we 
> use it frequently. 
> The reason is that we ignore all files under catalyst in 
> SparkBuild.scala when generating Scaladoc 
> (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369).
> What's the best approach to fix this? [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6369) InsertIntoHiveTable should use logic from SparkHadoopWriter

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6369:
---

Assignee: Cheng Lian  (was: Apache Spark)

> InsertIntoHiveTable should use logic from SparkHadoopWriter
> ---
>
> Key: SPARK-6369
> URL: https://issues.apache.org/jira/browse/SPARK-6369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Blocker
>
> Right now it is possible that we will corrupt the output if there is a race 
> between competing speculative tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6369) InsertIntoHiveTable should use logic from SparkHadoopWriter

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6369:
---

Assignee: Apache Spark  (was: Cheng Lian)

> InsertIntoHiveTable should use logic from SparkHadoopWriter
> ---
>
> Key: SPARK-6369
> URL: https://issues.apache.org/jira/browse/SPARK-6369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Blocker
>
> Right now it is possible that we will corrupt the output if there is a race 
> between competing speculative tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6575) Add configuration to disable schema merging while converting metastore Parquet tables

2015-03-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6575:
--
Description: 
Consider a metastore Parquet table that
# doesn't have schema evolution issue
# has lots of data files and/or partitions

In this case, driver-side schema merging can be both slow and unnecessary. It would 
be good to have a configuration that lets the user disable schema merging when 
converting such a metastore Parquet table.

  was:
Consider a metastore Parquet table that
# doesn't have schema evolution issue
# has lots of data files and/or partitions

In this case, driver schema merging can be both slow and unnecessary. Would be 
good to have a configuration to let the use disable schema merging when 
coverting such a metastore Parquet table.


> Add configuration to disable schema merging while converting metastore 
> Parquet tables
> -
>
> Key: SPARK-6575
> URL: https://issues.apache.org/jira/browse/SPARK-6575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Consider a metastore Parquet table that
> # doesn't have schema evolution issue
> # has lots of data files and/or partitions
> In this case, driver-side schema merging can be both slow and unnecessary. It 
> would be good to have a configuration that lets the user disable schema merging 
> when converting such a metastore Parquet table.
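
A hypothetical usage sketch of such a switch; the configuration key below is made up 
for illustration and may not match whatever name is actually adopted.
{code}
// in spark-shell or an application with a HiveContext named sqlContext;
// the key name is hypothetical
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")
{code}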



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6119) DataFrame.dropna support

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385642#comment-14385642
 ] 

Apache Spark commented on SPARK-6119:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5248

> DataFrame.dropna support
> 
>
> Key: SPARK-6119
> URL: https://issues.apache.org/jira/browse/SPARK-6119
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: DataFrame
>
> Support dropping rows with null values (dropna). Similar to Pandas' dropna
> http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html
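
For reference, the requested behaviour can already be approximated with existing 1.3 
APIs; the helper below is a hedged sketch of the semantics being asked for, not a 
proposed implementation.
{code}
import org.apache.spark.sql.DataFrame

// sketch only: keep rows in which every column is non-null
def dropRowsWithAnyNull(df: DataFrame): DataFrame =
  df.filter(df.columns.map(c => df(c).isNotNull).reduce(_ && _))
{code}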



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-03-28 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385648#comment-14385648
 ] 

Kannan Rajah commented on SPARK-1529:
-

I have pushed the first round of commits to my repo. I would like to get some 
early feedback on the overall design.
https://github.com/rkannan82/spark/commits/dfs_shuffle

Commits:
https://github.com/rkannan82/spark/commit/ce8b430512b31e932ffdab6e0a2c1a6a1768ffbf
https://github.com/rkannan82/spark/commit/8f5415c248c0a9ca5ad3ec9f48f839b24c259813
https://github.com/rkannan82/spark/commit/d9d179ba6c685cc8eb181f442e9bd6ad91cc4290

> Support setting spark.local.dirs to a hadoop FileSystem 
> 
>
> Key: SPARK-1529
> URL: https://issues.apache.org/jira/browse/SPARK-1529
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Kannan Rajah
> Attachments: Spark Shuffle using HDFS.pdf
>
>
> In some environments, like with MapR, local volumes are accessed through the 
> Hadoop filesystem interface. We should allow setting spark.local.dir to a 
> Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org