[jira] [Updated] (SPARK-4445) Don't display storage level in toDebugString unless RDD is persisted

2014-11-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4445:
---
Issue Type: Bug  (was: Improvement)

> Don't display storage level in toDebugString unless RDD is persisted
> 
>
> Key: SPARK-4445
> URL: https://issues.apache.org/jira/browse/SPARK-4445
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Prashant Sharma
>Priority: Blocker
>
> The current approach lists the storage level all the time, even if the RDD is 
> not persisted. The storage level should only be listed if the RDD is 
> persisted. We just need to guard it with a check.
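
A minimal sketch of the kind of guard being proposed (illustrative only, not the
actual RDD.toDebugString code; the storageLevel field and StorageLevel.description
are the pieces it assumes):

{code}
import org.apache.spark.storage.StorageLevel

// Only mention the storage level in the debug string when the RDD is persisted.
def storageSuffix(storageLevel: StorageLevel): String =
  if (storageLevel != StorageLevel.NONE) s" [${storageLevel.description}]" else ""
{code}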



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4445) Don't display storage level in toDebugString unless RDD is persisted

2014-11-16 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-4445:
--

 Summary: Don't display storage level in toDebugString unless RDD 
is persisted
 Key: SPARK-4445
 URL: https://issues.apache.org/jira/browse/SPARK-4445
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Blocker


The current approach lists the storage level all the time, even if the RDD is 
not persisted. The storage level should only be listed if the RDD is persisted. 
We just need to guard it with a check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4435) Add setThreshold in Python LogisticRegressionModel and SVMModel

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214382#comment-14214382
 ] 

Apache Spark commented on SPARK-4435:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3305

> Add setThreshold in Python LogisticRegressionModel and SVMModel
> ---
>
> Key: SPARK-4435
> URL: https://issues.apache.org/jira/browse/SPARK-4435
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Matei Zaharia
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4444) Drop VD type parameter from EdgeRDD

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214369#comment-14214369
 ] 

Apache Spark commented on SPARK-4444:
-

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/3303

> Drop VD type parameter from EdgeRDD
> ---
>
> Key: SPARK-4444
> URL: https://issues.apache.org/jira/browse/SPARK-4444
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Blocker
>
> Due to vertex attribute caching, EdgeRDD previously took two type parameters: 
> ED and VD. However, this is an implementation detail that should not be 
> exposed in the interface, so this PR drops the VD type parameter.
> This requires removing the filter method from the EdgeRDD interface, because 
> it depends on vertex attribute caching.
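
An illustrative sketch of the signature change (hypothetical stand-in types, not the
actual GraphX source): the vertex attribute type leaves the public interface and
becomes a detail of the concrete implementation.

{code}
// Before: VD appears in the public type, exposing the caching implementation detail.
abstract class OldEdgeRDD[ED, VD] { def count(): Long }

// After: only the edge attribute type is part of the interface...
abstract class NewEdgeRDD[ED] { def count(): Long }

// ...while a concrete subclass may still track cached vertex attributes internally.
class CachingEdgeRDD[ED, VD](edges: Seq[ED]) extends NewEdgeRDD[ED] {
  override def count(): Long = edges.size.toLong
}
{code}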



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4444) Drop VD type parameter from EdgeRDD

2014-11-16 Thread Ankur Dave (JIRA)
Ankur Dave created SPARK-4444:
-

 Summary: Drop VD type parameter from EdgeRDD
 Key: SPARK-4444
 URL: https://issues.apache.org/jira/browse/SPARK-4444
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave
Priority: Blocker


Due to vertex attribute caching, EdgeRDD previously took two type parameters: 
ED and VD. However, this is an implementation detail that should not be exposed 
in the interface, so this PR drops the VD type parameter.

This requires removing the filter method from the EdgeRDD interface, because it 
depends on vertex attribute caching.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4443) Statistics bug for external table in spark sql hive

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214351#comment-14214351
 ] 

Apache Spark commented on SPARK-4443:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3304

> Statistics bug for external table in spark sql hive
> ---
>
> Key: SPARK-4443
> URL: https://issues.apache.org/jira/browse/SPARK-4443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> When table is external, the `totalSize` is always zero, which will influence 
> join strategy (always use broadcast join for external table)
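
A simplified sketch of why a zero totalSize matters (assumed behaviour, not the
actual Spark SQL planner code): a relation whose estimated size is at or below the
broadcast threshold is chosen for broadcast, so a table that always reports size 0
is always broadcast, no matter how large it really is.

{code}
// Hypothetical stand-ins for the planner's statistics check.
case class TableStats(sizeInBytes: BigInt)

def shouldBroadcast(stats: TableStats, autoBroadcastJoinThreshold: BigInt): Boolean =
  stats.sizeInBytes <= autoBroadcastJoinThreshold

// With the bug, an external table reports TableStats(0) and always passes the check.
assert(shouldBroadcast(TableStats(0), autoBroadcastJoinThreshold = BigInt(10L * 1024 * 1024)))
{code}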



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4443) Statistics bug for external table in spark sql hive

2014-11-16 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei updated SPARK-4443:
---
  Description: When table is external, `totalSize` is always zero, 
which will influence join strategy (always use broadcast join for external table)
 Target Version/s: 1.2.0
Affects Version/s: 1.1.0
Fix Version/s: 1.2.0

> Statistics bug for external table in spark sql hive
> ---
>
> Key: SPARK-4443
> URL: https://issues.apache.org/jira/browse/SPARK-4443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> When table is external, `totalSize` is always zero, which will influence join 
> strategy (always use broadcast join for external table)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4443) Statistics bug for external table in spark sql hive

2014-11-16 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei updated SPARK-4443:
---
Description: When table is external, the `totalSize` is always zero, which 
will influence join strategy (always use broadcast join for external table)  
(was: When table is external, `totalSize` is always zero, which will influence 
join strategy (always use broadcast join for external table))

> Statistics bug for external table in spark sql hive
> ---
>
> Key: SPARK-4443
> URL: https://issues.apache.org/jira/browse/SPARK-4443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> When table is external, the `totalSize` is always zero, which will influence 
> join strategy (always use broadcast join for external table)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4443) Statistics bug for external table in spark sql hive

2014-11-16 Thread wangfei (JIRA)
wangfei created SPARK-4443:
--

 Summary: Statistics bug for external table in spark sql hive
 Key: SPARK-4443
 URL: https://issues.apache.org/jira/browse/SPARK-4443
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: wangfei






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4437) Docs for difference between WholeTextFileRecordReader and WholeCombineFileRecordReader

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214335#comment-14214335
 ] 

Apache Spark commented on SPARK-4437:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3301

> Docs for difference between WholeTextFileRecordReader and 
> WholeCombineFileRecordReader
> --
>
> Key: SPARK-4437
> URL: https://issues.apache.org/jira/browse/SPARK-4437
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Andrew Ash
>Assignee: Davies Liu
>
> Tracking per this dev@ thread:
> {quote}
> On Sun, Nov 16, 2014 at 4:49 PM, Reynold Xin  wrote:
> I don't think the code is immediately obvious.
> Davies - I think you added the code, and Josh reviewed it. Can you guys
> explain and maybe submit a patch to add more documentation on the whole
> thing?
> Thanks.
> On Sun, Nov 16, 2014 at 3:22 AM, Vibhanshu Prasad 
> wrote:
> > Hello Everyone,
> >
> > I am going through the source code of rdd and Record readers
> > There are found 2 classes
> >
> > 1. WholeTextFileRecordReader
> > 2. WholeCombineFileRecordReader  ( extends CombineFileRecordReader )
> >
> > The description of both the classes is perfectly similar.
> >
> > I am not able to understand why we have 2 classes. Is
> > CombineFileRecordReader providing some extra advantage?
> >
> > Regards
> > Vibhanshu
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception

2014-11-16 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214320#comment-14214320
 ] 

Vijay commented on SPARK-4402:
--

Yes, the output path is being validated in PairRDDFunctions.saveAsHadoopDataset. 
Please find the exception details below.
So the output path is validated only when saveAsHadoopDataset executes, after all 
of the preceding statements have completed.

My question is whether it is possible to perform this validation up front, when 
the program execution starts.

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: 
Output directory file:/home/HadoopUser/eclipse-scala/test/output1 already exists
at 
org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:968)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:878)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:792)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1159)
at test.OutputTest$.main(OutputTest.scala:19)
at test.OutputTest.main(OutputTest.scala)

> Output path validation of an action statement resulting in runtime exception
> 
>
> Key: SPARK-4402
> URL: https://issues.apache.org/jira/browse/SPARK-4402
> Project: Spark
>  Issue Type: Wish
>Reporter: Vijay
>Priority: Minor
>
> Output path validation happens at the time of statement execution, as part of 
> the lazy evaluation of the action statement. But if the path already exists, 
> a runtime exception is thrown, so all the processing completed up to that 
> point is lost, which wastes resources (processing time and CPU 
> usage).
> If this I/O-related validation were done before the RDD action operations, 
> the runtime exception could be avoided.
> I believe a similar validation/feature is implemented in Hadoop as well.
> Example:
> SchemaRDD.saveAsTextFile() evaluates the path only at runtime.
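
A hedged sketch of the kind of up-front check being requested (method and variable
names are made up for illustration): verify that the output directory is free before
any expensive actions run, rather than failing inside saveAsTextFile after the work
is already done.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.mapred.FileAlreadyExistsException

def ensureOutputPathIsFree(output: String, conf: Configuration = new Configuration()): Unit = {
  val path = new Path(output)
  val fs: FileSystem = path.getFileSystem(conf)
  if (fs.exists(path)) {
    // Fail fast, before the job's transformations and actions are executed.
    throw new FileAlreadyExistsException(s"Output directory $output already exists")
  }
}
{code}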



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.

2014-11-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4422:
-
Affects Version/s: 1.2.0

> In some cases, Vectors.fromBreeze get wrong results.
> 
>
> Key: SPARK-4422
> URL: https://issues.apache.org/jira/browse/SPARK-4422
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Minor
> Fix For: 1.2.0
>
>
> {noformat}
> import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} 
> var x = BDM.zeros[Double](10, 10)
> val v = Vectors.fromBreeze(x(::, 0))
> assert(v.size == x.rows)
> {noformat}
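
A hedged sketch of the underlying problem and one possible guard (simplified, and not
necessarily what the pull request does): a column slice of a breeze DenseMatrix shares
the matrix's backing array through an offset and stride, so wrapping its data array
directly yields a vector of the wrong size; copying via toArray when the slice is not
contiguous avoids that.

{code}
import breeze.linalg.{DenseVector => BDV}

def denseValues(v: BDV[Double]): Array[Double] =
  if (v.offset == 0 && v.stride == 1 && v.length == v.data.length) v.data
  else v.toArray  // materialize the slice into its own array of the right length
{code}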



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.

2014-11-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-4422:
--

Reopened for branch-1.0 and branch-1.1. Changed the priority to minor because 
`fromBreeze` is private and we don't use breeze matrix slicing in MLlib.

> In some cases, Vectors.fromBreeze get wrong results.
> 
>
> Key: SPARK-4422
> URL: https://issues.apache.org/jira/browse/SPARK-4422
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2, 1.1.1
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Minor
> Fix For: 1.2.0
>
>
> {noformat}
> import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} 
> var x = BDM.zeros[Double](10, 10)
> val v = Vectors.fromBreeze(x(::, 0))
> assert(v.size == x.rows)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.

2014-11-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4422:
-
Affects Version/s: (was: 1.3.0)
   (was: 1.2.0)
   (was: 1.1.0)
   1.1.1
   1.0.2

> In some cases, Vectors.fromBreeze get wrong results.
> 
>
> Key: SPARK-4422
> URL: https://issues.apache.org/jira/browse/SPARK-4422
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2, 1.1.1
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Critical
> Fix For: 1.2.0
>
>
> {noformat}
> import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} 
> var x = BDM.zeros[Double](10, 10)
> val v = Vectors.fromBreeze(x(::, 0))
> assert(v.size == x.rows)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.

2014-11-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4422:
-
Target Version/s: 1.2.0, 1.0.3, 1.1.2  (was: 1.1.0, 1.2.0, 1.3.0)

> In some cases, Vectors.fromBreeze get wrong results.
> 
>
> Key: SPARK-4422
> URL: https://issues.apache.org/jira/browse/SPARK-4422
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2, 1.1.1
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Critical
> Fix For: 1.2.0
>
>
> {noformat}
> import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} 
> var x = BDM.zeros[Double](10, 10)
> val v = Vectors.fromBreeze(x(::, 0))
> assert(v.size == x.rows)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.

2014-11-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4422:
-
Priority: Minor  (was: Critical)

> In some cases, Vectors.fromBreeze get wrong results.
> 
>
> Key: SPARK-4422
> URL: https://issues.apache.org/jira/browse/SPARK-4422
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2, 1.1.1
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Minor
> Fix For: 1.2.0
>
>
> {noformat}
> import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} 
> var x = BDM.zeros[Double](10, 10)
> val v = Vectors.fromBreeze(x(::, 0))
> assert(v.size == x.rows)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.

2014-11-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4422.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3281
[https://github.com/apache/spark/pull/3281]

> In some cases, Vectors.fromBreeze get wrong results.
> 
>
> Key: SPARK-4422
> URL: https://issues.apache.org/jira/browse/SPARK-4422
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0, 1.2.0, 1.3.0
>Reporter: Guoqiang Li
>Priority: Critical
> Fix For: 1.2.0
>
>
> {noformat}
> import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} 
> var x = BDM.zeros[Double](10, 10)
> val v = Vectors.fromBreeze(x(::, 0))
> assert(v.size == x.rows)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.

2014-11-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4422:
-
Assignee: Guoqiang Li

> In some cases, Vectors.fromBreeze get wrong results.
> 
>
> Key: SPARK-4422
> URL: https://issues.apache.org/jira/browse/SPARK-4422
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0, 1.2.0, 1.3.0
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Critical
> Fix For: 1.2.0
>
>
> {noformat}
> import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} 
> var x = BDM.zeros[Double](10, 10)
> val v = Vectors.fromBreeze(x(::, 0))
> assert(v.size == x.rows)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4056) Upgrade snappy-java to 1.1.1.5

2014-11-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4056:
--
Fix Version/s: (was: 1.2.0)
   (was: 1.1.1)

> Upgrade snappy-java to 1.1.1.5
> --
>
> Key: SPARK-4056
> URL: https://issues.apache.org/jira/browse/SPARK-4056
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> We should upgrade snappy-java to 1.1.1.5 across all of our maintenance 
> branches.  This release improves error messages when attempting to 
> deserialize empty inputs using SnappyInputStream (this operation is always an 
> error, but the old error messages made it hard to distinguish failures due to 
> empty streams from ones due to reading invalid / corrupted streams); see 
> https://github.com/xerial/snappy-java/issues/89 for more context.
> This should be a major help in the Snappy debugging work that I've been doing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4056) Upgrade snappy-java to 1.1.1.5

2014-11-16 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214278#comment-14214278
 ] 

Josh Rosen commented on SPARK-4056:
---

I've removed the "Fixed Versions" here, since we reverted this particular 
commit.

> Upgrade snappy-java to 1.1.1.5
> --
>
> Key: SPARK-4056
> URL: https://issues.apache.org/jira/browse/SPARK-4056
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> We should upgrade snappy-java to 1.1.1.5 across all of our maintenance 
> branches.  This release improves error messages when attempting to 
> deserialize empty inputs using SnappyInputStream (this operation is always an 
> error, but the old error messages made it hard to distinguish failures due to 
> empty streams from ones due to reading invalid / corrupted streams); see 
> https://github.com/xerial/snappy-java/issues/89 for more context.
> This should be a major help in the Snappy debugging work that I've been doing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4441) Close Tachyon client when TachyonBlockManager is shut down

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214274#comment-14214274
 ] 

Apache Spark commented on SPARK-4441:
-

User 'shimingfei' has created a pull request for this issue:
https://github.com/apache/spark/pull/3299

> Close Tachyon client when TachyonBlockManager is shut down
> --
>
> Key: SPARK-4441
> URL: https://issues.apache.org/jira/browse/SPARK-4441
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.1.0
>Reporter: shimingfei
>
> Currently the Tachyon client is not shut down when TachyonBlockManager is shut 
> down, which causes some resources in Tachyon not to be reclaimed.
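
A minimal sketch of the proposed cleanup (hypothetical names, not the actual
TachyonBlockManager code): keep a handle to the Tachyon client and close it from the
same shutdown path that removes the block manager's files.

{code}
// Stand-in for the real Tachyon client type (e.g. tachyon.client.TachyonFS).
trait TachyonClientLike { def close(): Unit }

class TachyonStoreSketch(client: TachyonClientLike) {
  def shutdown(): Unit = {
    try {
      // ... remove the temporary files and directories owned by this block manager ...
    } finally {
      client.close() // release the Tachyon-side resources held by this client
    }
  }
}
{code}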



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4441) Close Tachyon client when TachyonBlockManager is shut down

2014-11-16 Thread shimingfei (JIRA)
shimingfei created SPARK-4441:
-

 Summary: Close Tachyon client when TachyonBlockManager is shut down
 Key: SPARK-4441
 URL: https://issues.apache.org/jira/browse/SPARK-4441
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 1.1.0
Reporter: shimingfei


Currently the Tachyon client is not shut down when TachyonBlockManager is shut 
down, which causes some resources in Tachyon not to be reclaimed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4442) Move common unit test utilities into their own package / module

2014-11-16 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-4442:
-

 Summary: Move common unit test utilities into their own package / 
module
 Key: SPARK-4442
 URL: https://issues.apache.org/jira/browse/SPARK-4442
 Project: Spark
  Issue Type: Improvement
Reporter: Josh Rosen
Priority: Minor


We should move generally-useful unit test fixtures / utility methods into their 
own test utilities package / module to make them easier to find and use.

See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for one 
example of this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4147) Remove log4j dependency

2014-11-16 Thread Nathan M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214261#comment-14214261
 ] 

Nathan M commented on SPARK-4147:
-

This code just forces log4j on the end user, which is less than ideal. SLF4J 
should make this avoidable; it seems like something is being done wrong in trying 
to set the log level like this... 

> Remove log4j dependency
> ---
>
> Key: SPARK-4147
> URL: https://issues.apache.org/jira/browse/SPARK-4147
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Tobias Pfeiffer
>
> spark-core has a hard dependency on log4j, which shouldn't be necessary since 
> slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my 
> sbt file.
> Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. 
> However, removing the log4j dependency fails because in 
> https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121
>  a static method of org.apache.log4j.LogManager is accessed *even if* log4j 
> is not in use.
> I guess removing all dependencies on log4j may be a bigger task, but it would 
> be a great help if LogManager were accessed only after log4j use had been 
> detected. (This is a 2-line change.)
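
A hedged sketch of the guard the reporter describes (assumed, not the actual
Logging.scala change): detect whether slf4j is actually bound to log4j before touching
any org.apache.log4j class, so users who ship logback never trigger log4j class loading.

{code}
import org.slf4j.LoggerFactory

def log4jIsActiveBinding: Boolean =
  LoggerFactory.getILoggerFactory.getClass.getName.contains("Log4jLoggerFactory")

def maybeConfigureLog4jDefaults(): Unit =
  if (log4jIsActiveBinding) {
    // Only now is it safe to reference log4j classes.
    org.apache.log4j.LogManager.getRootLogger.setLevel(org.apache.log4j.Level.INFO)
  }
{code}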



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4325) Improve spark-ec2 cluster launch times

2014-11-16 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214239#comment-14214239
 ] 

Nicholas Chammas commented on SPARK-4325:
-

{quote}
Replace instances of download; rsync to rest of cluster with parallel downloads 
on all nodes of the cluster.
{quote}

Actually, the current way may be better. If you are launching a 100+ node 
cluster, for example, it probably isn't a good idea to have all of them hit a 
resource (e.g. a file at an Apache mirror) at once without some thought.

I'd bet it's safer and more reliable for now to have a single node download and 
then broadcast to the rest of the cluster. This is the current behavior.

> Improve spark-ec2 cluster launch times
> --
>
> Key: SPARK-4325
> URL: https://issues.apache.org/jira/browse/SPARK-4325
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There are several optimizations we know we can make to [{{setup.sh}} | 
> https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches 
> faster.
> There are also some improvements to the AMIs that will help a lot.
> Potential improvements:
> * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This 
> will reduce or eliminate SSH wait time and Ganglia init time.
> * Replace instances of {{download; rsync to rest of cluster}} with parallel 
> downloads on all nodes of the cluster.
> * Replace instances of 
>  {code}
> for node in $NODES; do
>   command
>   sleep 0.3
> done
> wait{code}
>  with simpler calls to {{pssh}}.
> * Remove the [linear backoff | 
> https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665]
>  when we wait for SSH availability now that we are already waiting for EC2 
> status checks to clear before testing SSH.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API

2014-11-16 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214223#comment-14214223
 ] 

Rui Li commented on SPARK-2321:
---

Hey [~joshrosen],

Thanks a lot for the update! I created SPARK-4440 for the enhancement.

> Design a proper progress reporting & event listener API
> ---
>
> Key: SPARK-2321
> URL: https://issues.apache.org/jira/browse/SPARK-2321
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.2.0
>
>
> This is a ticket to track progress on redesigning the SparkListener and 
> JobProgressListener API.
> There are multiple problems with the current design, including:
> 0. I'm not sure if the API is usable in Java (there are at least some enums 
> we used in Scala and a bunch of case classes that might complicate things).
> 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of 
> attention to it yet. Something as important as progress reporting deserves a 
> more stable API.
> 2. There is no easy way to connect jobs with stages. Similarly, there is no 
> easy way to connect job groups with jobs / stages.
> 3. JobProgressListener itself has no encapsulation at all. States can be 
> arbitrarily mutated by external programs. Variable names are sort of randomly 
> decided and inconsistent. 
> We should just revisit these and propose a new, concrete design. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4440) Enhance the job progress API to expose more information

2014-11-16 Thread Rui Li (JIRA)
Rui Li created SPARK-4440:
-

 Summary: Enhance the job progress API to expose more information
 Key: SPARK-4440
 URL: https://issues.apache.org/jira/browse/SPARK-4440
 Project: Spark
  Issue Type: Improvement
Reporter: Rui Li


The progress API introduced in SPARK-2321 provides a new way for users to 
monitor job progress. However, the information exposed in the API is relatively 
limited. It would be much more useful if we could enhance the API to expose more 
data.
Some improvements may include, but are not limited to:
1. Stage submission and completion time.
2. Task metrics.
The requirement was initially identified for the Hive on Spark 
project (HIVE-7292), but other applications should benefit as well.
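
A hedged sketch of how an application can already collect some of this today through
the public SparkListener API (field names are as understood for Spark 1.x and should
be double-checked against the version in use):

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerStageSubmitted}
import scala.collection.mutable

class StageTimingListener extends SparkListener {
  private val submittedAt = mutable.Map.empty[Int, Long]

  override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit =
    submittedAt(event.stageInfo.stageId) = System.currentTimeMillis()

  override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
    val id = event.stageInfo.stageId
    val elapsedMs = System.currentTimeMillis() - submittedAt.getOrElse(id, 0L)
    println(s"Stage $id took roughly $elapsedMs ms")
  }
}
// Register it with: sc.addSparkListener(new StageTimingListener)
{code}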



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4309) Date type support missing in HiveThriftServer2

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214202#comment-14214202
 ] 

Apache Spark commented on SPARK-4309:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/3298

> Date type support missing in HiveThriftServer2
> --
>
> Key: SPARK-4309
> URL: https://issues.apache.org/jira/browse/SPARK-4309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.1
>Reporter: Cheng Lian
> Fix For: 1.2.0
>
>
> Date type is not supported while retrieving result set in HiveThriftServer2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4407) Thrift server for 0.13.1 doesn't deserialize complex types properly

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214203#comment-14214203
 ] 

Apache Spark commented on SPARK-4407:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/3298

> Thrift server for 0.13.1 doesn't deserialize complex types properly
> ---
>
> Key: SPARK-4407
> URL: https://issues.apache.org/jira/browse/SPARK-4407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.2
>Reporter: Cheng Lian
>Priority: Blocker
> Fix For: 1.2.0
>
>
> The following snippet can reproduce this issue:
> {code}
> CREATE TABLE t0(m MAP);
> INSERT OVERWRITE TABLE t0 SELECT MAP(key, value) FROM src LIMIT 10;
> SELECT * FROM t0;
> {code}
> Exception throw:
> {code}
> java.lang.RuntimeException: java.lang.ClassCastException: 
> scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84)
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
> at com.sun.proxy.$Proxy21.fetchResults(Unknown Source)
> at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405)
> at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
> at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 
> cannot be cast to java.lang.String
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(Shim13.scala:142)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:165)
> at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192)
> at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
> ... 19 more
> {code}
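
A hedged sketch of the kind of conversion the failing addNonNullColumnValue path needs
(assumed and simplified, not the actual Shim13 fix): complex values such as maps and
sequences cannot be cast to java.lang.String and have to be rendered to a string
representation first.

{code}
def toColumnString(value: Any): String = value match {
  case m: Map[_, _] => m.map { case (k, v) => s"$k:$v" }.mkString("{", ",", "}")
  case s: Seq[_]    => s.mkString("[", ",", "]")
  case other        => String.valueOf(other)
}
{code}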



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4439) Expose RandomForest in Python

2014-11-16 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4439:
-
Summary: Expose RandomForest in Python  (was: Export RandomForest in Python)

> Expose RandomForest in Python
> -
>
> Key: SPARK-4439
> URL: https://issues.apache.org/jira/browse/SPARK-4439
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Matei Zaharia
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4439) Export RandomForest in Python

2014-11-16 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4439:


 Summary: Export RandomForest in Python
 Key: SPARK-4439
 URL: https://issues.apache.org/jira/browse/SPARK-4439
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Matei Zaharia






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4038) Outlier Detection Algorithm for MLlib

2014-11-16 Thread Ashutosh Trivedi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207925#comment-14207925
 ] 

Ashutosh Trivedi edited comment on SPARK-4038 at 11/17/14 2:18 AM:
---

I think I am following the procedure. I opened a discussion on the dev mailing list 
and [~mengxr] asked me to open this JIRA. If you read the description, this 
JIRA is to discuss various outlier/anomaly detection algorithms. I don't 
just 'care to code' in Spark. Since I am using Spark for my projects, I found 
that there are no algorithms for outliers, and I think it should have them and I 
can contribute. I am aware of one algorithm, AVF (link attached).

The questions raised are valid and we want the community to discuss them.

This algorithm deals with categorical data. It uses the simplest approach, 
calculating the frequency of each attribute in the data set. Some people in the 
community are already doing the review and I am working on it.

I did not find any other algorithm that works on categorical data to find 
outliers. If you are aware of any other well-known algorithm, please 
share it with us.

  


was (Author: rusty):
I think I am following the procedure. I opened a discussion on dev mailing list 
and Xiangrui asked me to open this JIRA.  If you read the description this JIRA 
is to discuss about various Outlier/anomaly detection algorithms. I don't just 
'care to code' in Spark. Since I am using spark for my projects, I found that 
there are no algorithms on Outliers and I think  it should have algorithms for 
it. I am aware of one algorithm AVF (link attached).  

The questions raised are valid and we want community to discuss it. 

This algorithm deals with categorical data, It uses the simplest approach by 
calculating frequency of each attribute in the data set. Some of the people in 
community are already doing the review and I am working on it.

I did not find any other algorithm which work on categorical data to find 
outliers. If you are aware of any other algorithm which is well known please 
share with us.

  

> Outlier Detection Algorithm for MLlib
> -
>
> Key: SPARK-4038
> URL: https://issues.apache.org/jira/browse/SPARK-4038
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ashutosh Trivedi
>Priority: Minor
>
> The aim of this JIRA is to discuss which parallel outlier detection 
> algorithms can be included in MLlib. 
> The one which I am familiar with is Attribute Value Frequency (AVF). It 
> scales linearly with the number of data points and attributes, and relies on 
> a single data scan. It is not distance based and is well suited for categorical 
> data. In the original paper a parallel version is also given, which is not 
> complicated to implement. I am working on the implementation and will soon submit 
> the initial code for review.
> Here is the Link for the paper
> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382
> As pointed out by Xiangrui in discussion 
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html
> There are other algorithms as well. Let's discuss which will be more 
> general and more easily parallelized.
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214199#comment-14214199
 ] 

Josh Rosen commented on SPARK-4434:
---

As a regression test, we should probably add a triple-slash test case to 
ClientSuite.

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -
>
> Key: SPARK-4434
> URL: https://issues.apache.org/jira/browse/SPARK-4434
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Andrew Or
>Priority: Blocker
>
> When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 
> allowed you to omit the {{file://}} or {{hdfs://}} prefix from the 
> application JAR URL, e.g.
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
> {code}
> In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> {code}
> I tried changing my URL to conform to the new format, but this either 
> resulted in an error or a job that failed:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> {code}
> If I omit the extra slash:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Sending launch command to spark://joshs-mbp.att.net:7077
> Driver successfully submitted as driver-20141116143235-0002
> ... waiting before polling master for driver state
> ... polling master for driver state
> State of driver-20141116143235-0002 is ERROR
> Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
>   at 
> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
>   at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
> {code}
> This bug effectively prevents users from using {{spark-submit}} in cluster 
> mode to run drivers whose JARs are stored on shared cluster filesystems.
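
A hedged sketch of the kind of normalization that would accept both bare local paths
and scheme-qualified URLs (illustrative only, not the DriverClient patch): resolve
whatever the user passed into a fully qualified URI before validating its format.

{code}
import java.net.URI
import java.nio.file.Paths

def normalizeJarUrl(jar: String): String = {
  val uri = new URI(jar)
  if (uri.getScheme == null) Paths.get(jar).toAbsolutePath.toUri.toString // bare path -> file:///...
  else jar // already hdfs://..., file://..., etc.
}
{code}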



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4038) Outlier Detection Algorithm for MLlib

2014-11-16 Thread Ashutosh Trivedi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207925#comment-14207925
 ] 

Ashutosh Trivedi edited comment on SPARK-4038 at 11/17/14 2:15 AM:
---

I think I am following the procedure. I opened a discussion on the dev mailing list 
and Xiangrui asked me to open this JIRA. If you read the description, this JIRA 
is to discuss various outlier/anomaly detection algorithms. I don't just 
'care to code' in Spark. Since I am using Spark for my projects, I found that 
there are no algorithms for outliers, and I think it should have algorithms for 
them. I am aware of one algorithm, AVF (link attached).

The questions raised are valid and we want the community to discuss them.

This algorithm deals with categorical data. It uses the simplest approach, 
calculating the frequency of each attribute in the data set. Some people in the 
community are already doing the review and I am working on it.

I did not find any other algorithm that works on categorical data to find 
outliers. If you are aware of any other well-known algorithm, please 
share it with us.

  


was (Author: rusty):
The questions raised are valid and we want community to discuss it. 

This algorithm deals with categorical data, It uses the simplest approach by 
calculating frequency of each attribute in the data set. Some of the people in 
community are already doing the review and I am working on it.

I did not find any other algorithm which work on categorical data to find 
outliers. If you are aware of any other algorithm which is well known please 
share with us.

  

> Outlier Detection Algorithm for MLlib
> -
>
> Key: SPARK-4038
> URL: https://issues.apache.org/jira/browse/SPARK-4038
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ashutosh Trivedi
>Priority: Minor
>
> The aim of this JIRA is to discuss which parallel outlier detection 
> algorithms can be included in MLlib. 
> The one which I am familiar with is Attribute Value Frequency (AVF). It 
> scales linearly with the number of data points and attributes, and relies on 
> a single data scan. It is not distance based and is well suited for categorical 
> data. In the original paper a parallel version is also given, which is not 
> complicated to implement. I am working on the implementation and will soon submit 
> the initial code for review.
> Here is the Link for the paper
> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382
> As pointed out by Xiangrui in discussion 
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html
> There are other algorithms as well. Let's discuss which will be more 
> general and more easily parallelized.
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214196#comment-14214196
 ] 

Josh Rosen commented on SPARK-4434:
---

Fellow Databricks folks: I've added a regression test for this in 
https://github.com/databricks/spark-integration-tests/commit/f121f45aecbeafcec21d3bb670737fc9f7d6da0b
 (I'm sharing this link here so that it's easy to find this test once we 
open-source that repository).  The test is essentially a scripted / automated 
version of the commands that I've listed in this JIRA.  These tests confirm 
that reverting that earlier PR fixes this issue.

[~adav], do you want to open a separate JIRA to fix the "file://" error message?

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -
>
> Key: SPARK-4434
> URL: https://issues.apache.org/jira/browse/SPARK-4434
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Andrew Or
>Priority: Blocker
>
> When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 
> allowed you to omit the {{file://}} or {{hdfs://}} prefix from the 
> application JAR URL, e.g.
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
> {code}
> In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> {code}
> I tried changing my URL to conform to the new format, but this either 
> resulted in an error or a job that failed:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> {code}
> If I omit the extra slash:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Sending launch command to spark://joshs-mbp.att.net:7077
> Driver successfully submitted as driver-20141116143235-0002
> ... waiting before polling master for driver state
> ... polling master for driver state
> State of driver-20141116143235-0002 is ERROR
> Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
>   at 
> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
>   at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
> {code}
> This bug effectively prevents users from using {{spark-submit}} in cluster 
> mode to run drivers whose JARs are stored on shared cluster filesystems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4438) Add HistoryServer RESTful API

2014-11-16 Thread Gankun Luo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gankun Luo updated SPARK-4438:
--
Attachment: HistoryServer RESTful API Design Doc.pdf

> Add HistoryServer RESTful API
> -
>
> Key: SPARK-4438
> URL: https://issues.apache.org/jira/browse/SPARK-4438
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Gankun Luo
> Attachments: HistoryServer RESTful API Design Doc.pdf
>
>
> Spark HistoryServer currently only supports keeping track of all completed 
> applications through the web UI; it does not provide a RESTful API for external 
> systems to query completed application information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4436) Debian packaging misses datanucleus jars

2014-11-16 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-4436:

Description: 
If Spark is built with Hive support (i.e. -Phive), then the necessary 
datanucleus jars end up in lib_managed, not as part of the uber-jar.  The 
debian packaging isn't including anything from lib_managed.  As a consequence, 
HiveContext et al. will fail with the packaged Spark even though it was built 
with -Phive.

see comment in bin/compute-classpath.sh

Packaging everything from lib_managed/jars into /lib is an adequate 
solution.

  was:
If Spark is built with HIve support (i.e. -Phive), then the necessary 
datanucleus jars end up in lib_managed, not as part of the uber-jar.  The 
debian packaging isn't including anything from lib_managed.  As a consequence, 
HiveContext et al. will fail with the packaged Spark even though it was built 
with -Phive.

see comment in bin/compute-classpath.sh

Packaging everything from lib_managed/jars into /lib is an adequate 
solution.


> Debian packaging misses datanucleus jars
> 
>
> Key: SPARK-4436
> URL: https://issues.apache.org/jira/browse/SPARK-4436
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0, 1.0.1, 1.1.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Minor
>
> If Spark is built with Hive support (i.e. -Phive), then the necessary 
> datanucleus jars end up in lib_managed, not as part of the uber-jar.  The 
> debian packaging isn't including anything from lib_managed.  As a 
> consequence, HiveContext et al. will fail with the packaged Spark even though 
> it was built with -Phive.
> see comment in bin/compute-classpath.sh
> Packaging everything from lib_managed/jars into /lib is an adequate 
> solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4438) Add HistoryServer RESTful API

2014-11-16 Thread Gankun Luo (JIRA)
Gankun Luo created SPARK-4438:
-

 Summary: Add HistoryServer RESTful API
 Key: SPARK-4438
 URL: https://issues.apache.org/jira/browse/SPARK-4438
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: Gankun Luo


Spark HistoryServer currently only supports keeping track of completed 
applications through the web UI; it does not provide a RESTful API for external 
systems to query information about completed applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4436) Debian packaging misses datanucleus jars

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214182#comment-14214182
 ] 

Apache Spark commented on SPARK-4436:
-

User 'markhamstra' has created a pull request for this issue:
https://github.com/apache/spark/pull/3297

> Debian packaging misses datanucleus jars
> 
>
> Key: SPARK-4436
> URL: https://issues.apache.org/jira/browse/SPARK-4436
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0, 1.0.1, 1.1.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Minor
>
> If Spark is built with Hive support (i.e. -Phive), then the necessary 
> datanucleus jars end up in lib_managed, not as part of the uber-jar.  The 
> debian packaging isn't including anything from lib_managed.  As a 
> consequence, HiveContext et al. will fail with the packaged Spark even though 
> it was built with -Phive.
> see comment in bin/compute-classpath.sh
> Packaging everything from lib_managed/jars into /lib is an adequate 
> solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3624) "Failed to find Spark assembly in /usr/share/spark/lib" for RELEASED debian packages

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214183#comment-14214183
 ] 

Apache Spark commented on SPARK-3624:
-

User 'markhamstra' has created a pull request for this issue:
https://github.com/apache/spark/pull/3297

> "Failed to find Spark assembly in /usr/share/spark/lib" for RELEASED debian 
> packages
> 
>
> Key: SPARK-3624
> URL: https://issues.apache.org/jira/browse/SPARK-3624
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Deploy
>Affects Versions: 1.1.0
>Reporter: Christian Tzolov
>Priority: Minor
>
> The compute-classpath.sh script requires that, for a 'RELEASED' package, the 
> Spark assembly jar is accessible from a /lib folder.
> Currently the jdeb packaging (assembly module) bundles the assembly jar into 
> a folder called 'jars'. 
> The result is :
> /usr/share/spark/bin/spark-submit   --num-executors 10--master 
> yarn-cluster   --class org.apache.spark.examples.SparkPi   
> /usr/share/spark/jars/spark-examples-1.1.0-hadoop2.2.0-gphd-3.0.1.0.jar 10
> ls: cannot access /usr/share/spark/lib: No such file or directory
> Failed to find Spark assembly in /usr/share/spark/lib
> You need to build Spark before running this program.
> A trivial solution is to rename '${deb.install.path}/jars' inside 
> assembly/pom.xml to '${deb.install.path}/lib'.
> Another solution, with less impact on backward compatibility, is to define a 
> lib -> jars symlink in assembly/pom.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4399) Support multiple cloud providers

2014-11-16 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214167#comment-14214167
 ] 

Andrew Ash commented on SPARK-4399:
---

Agreed that it does seem out of scope for what an open source project would 
normally focus on.  In my mind the Apache Spark team's responsibility ends at 
producing the source and binary tarball releases.  If distributors or others 
want to make deploying those releases on particular cloud providers easier they 
are free to do it, but that is not the Spark team's responsibility.

Have you observed much demand for non-EC2 cloud providers from the users list?

> Support multiple cloud providers
> 
>
> Key: SPARK-4399
> URL: https://issues.apache.org/jira/browse/SPARK-4399
> Project: Spark
>  Issue Type: New Feature
>  Components: EC2
>Affects Versions: 1.2.0
>Reporter: Andrew Ash
>
> We currently have Spark startup scripts for Amazon EC2 but not for various 
> other cloud providers.  This ticket is an umbrella to support multiple cloud 
> providers in the bundled scripts, not just Amazon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214166#comment-14214166
 ] 

Aaron Davidson commented on SPARK-4434:
---

Side note: the error message about "file://", which was not introduced in the 
patch you reverted, is incorrect. A "file://XX.jar" URI is never valid: one or 
three slashes must be used, since two slashes indicate that a hostname follows.
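
For illustration only, a minimal Scala sketch (using java.net.URI; the object 
name and jar file name are made up) showing how the slash count changes what 
the URI parser treats as the host versus the path:

{code}
import java.net.URI

object JarUrlSlashes {
  def main(args: Array[String]): Unit = {
    // Two slashes: "Users" is parsed as the host, so part of the local path is lost.
    val two = new URI("file://Users/joshrosen/app.jar")
    println(s"host=${two.getHost}, path=${two.getPath}")     // host=Users, path=/joshrosen/app.jar

    // Three slashes (empty authority): no host, the whole local path is preserved.
    val three = new URI("file:///Users/joshrosen/app.jar")
    println(s"host=${three.getHost}, path=${three.getPath}") // host=null, path=/Users/joshrosen/app.jar

    // One slash (no authority at all): also preserves the whole local path.
    val one = new URI("file:/Users/joshrosen/app.jar")
    println(s"host=${one.getHost}, path=${one.getPath}")     // host=null, path=/Users/joshrosen/app.jar
  }
}
{code}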

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -
>
> Key: SPARK-4434
> URL: https://issues.apache.org/jira/browse/SPARK-4434
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Andrew Or
>Priority: Blocker
>
> When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 
> allowed you to omit the {{file://}} or {{hdfs://}} prefix from the 
> application JAR URL, e.g.
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
> {code}
> In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> {code}
> I tried changing my URL to conform to the new format, but this either 
> resulted in an error or a job that failed:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> {code}
> If I omit the extra slash:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Sending launch command to spark://joshs-mbp.att.net:7077
> Driver successfully submitted as driver-20141116143235-0002
> ... waiting before polling master for driver state
> ... polling master for driver state
> State of driver-20141116143235-0002 is ERROR
> Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
>   at 
> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
>   at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
> {code}
> This bug effectively prevents users from using {{spark-submit}} in cluster 
> mode to run drivers whose JARs are stored on shared cluster filesystems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214165#comment-14214165
 ] 

Apache Spark commented on SPARK-4434:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/3296

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -
>
> Key: SPARK-4434
> URL: https://issues.apache.org/jira/browse/SPARK-4434
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Andrew Or
>Priority: Blocker
>
> When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 
> allowed you to omit the {{file://}} or {{hdfs://}} prefix from the 
> application JAR URL, e.g.
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
> {code}
> In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> {code}
> I tried changing my URL to conform to the new format, but this either 
> resulted in an error or a job that failed:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> {code}
> If I omit the extra slash:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Sending launch command to spark://joshs-mbp.att.net:7077
> Driver successfully submitted as driver-20141116143235-0002
> ... waiting before polling master for driver state
> ... polling master for driver state
> State of driver-20141116143235-0002 is ERROR
> Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
>   at 
> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
>   at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
> {code}
> This bug effectively prevents users from using {{spark-submit}} in cluster 
> mode to run drivers whose JARs are stored on shared cluster filesystems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4437) Docs for difference between WholeTextFileRecordReader and WholeCombineFileRecordReader

2014-11-16 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-4437:
-

 Summary: Docs for difference between WholeTextFileRecordReader and 
WholeCombineFileRecordReader
 Key: SPARK-4437
 URL: https://issues.apache.org/jira/browse/SPARK-4437
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Andrew Ash
Assignee: Davies Liu


Tracking per this dev@ thread:

{quote}
On Sun, Nov 16, 2014 at 4:49 PM, Reynold Xin  wrote:
I don't think the code is immediately obvious.

Davies - I think you added the code, and Josh reviewed it. Can you guys
explain and maybe submit a patch to add more documentation on the whole
thing?

Thanks.


On Sun, Nov 16, 2014 at 3:22 AM, Vibhanshu Prasad 
wrote:

> Hello Everyone,
>
> I am going through the source code of the RDD and record readers, and I 
> found 2 classes:
>
> 1. WholeTextFileRecordReader
> 2. WholeCombineFileRecordReader  ( extends CombineFileRecordReader )
>
> The descriptions of the two classes are practically identical.
>
> I am not able to understand why we have 2 classes. Does
> CombineFileRecordReader provide some extra advantage?
>
> Regards
> Vibhanshu
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214164#comment-14214164
 ] 

Apache Spark commented on SPARK-4434:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/3295

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -
>
> Key: SPARK-4434
> URL: https://issues.apache.org/jira/browse/SPARK-4434
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Andrew Or
>Priority: Blocker
>
> When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 
> allowed you to omit the {{file://}} or {{hdfs://}} prefix from the 
> application JAR URL, e.g.
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
> {code}
> In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> {code}
> I tried changing my URL to conform to the new format, but this either 
> resulted in an error or a job that failed:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> {code}
> If I omit the extra slash:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Sending launch command to spark://joshs-mbp.att.net:7077
> Driver successfully submitted as driver-20141116143235-0002
> ... waiting before polling master for driver state
> ... polling master for driver state
> State of driver-20141116143235-0002 is ERROR
> Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
>   at 
> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
>   at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
> {code}
> This bug effectively prevents users from using {{spark-submit}} in cluster 
> mode to run drivers whose JARs are stored on shared cluster filesystems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4436) Debian packaging misses datanucleus jars

2014-11-16 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-4436:
---

 Summary: Debian packaging misses datanucleus jars
 Key: SPARK-4436
 URL: https://issues.apache.org/jira/browse/SPARK-4436
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0, 1.0.1, 1.0.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra
Priority: Minor


If Spark is built with Hive support (i.e. -Phive), then the necessary 
datanucleus jars end up in lib_managed, not as part of the uber-jar.  The 
debian packaging isn't including anything from lib_managed.  As a consequence, 
HiveContext et al. will fail with the packaged Spark even though it was built 
with -Phive.

see comment in bin/compute-classpath.sh

Packaging everything from lib_managed/jars into /lib is an adequate 
solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214155#comment-14214155
 ] 

Matei Zaharia commented on SPARK-4434:
--

[~joshrosen] make sure to revert this on 1.2 and master as well.

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -
>
> Key: SPARK-4434
> URL: https://issues.apache.org/jira/browse/SPARK-4434
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Andrew Or
>Priority: Blocker
>
> When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 
> allowed you to omit the {{file://}} or {{hdfs://}} prefix from the 
> application JAR URL, e.g.
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
> {code}
> In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> {code}
> I tried changing my URL to conform to the new format, but this either 
> resulted in an error or a job that failed:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> {code}
> If I omit the extra slash:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Sending launch command to spark://joshs-mbp.att.net:7077
> Driver successfully submitted as driver-20141116143235-0002
> ... waiting before polling master for driver state
> ... polling master for driver state
> State of driver-20141116143235-0002 is ERROR
> Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
>   at 
> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
>   at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
> {code}
> This bug effectively prevents users from using {{spark-submit}} in cluster 
> mode to run drivers whose JARs are stored on shared cluster filesystems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214152#comment-14214152
 ] 

Josh Rosen commented on SPARK-4434:
---

It looks like this was caused by https://github.com/apache/spark/pull/2925 
(SPARK-4075), since reverting that fixes this issue.  I'll work on committing 
my test code to our internal tests repository and open a PR to investigate / 
revert that commit.

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -
>
> Key: SPARK-4434
> URL: https://issues.apache.org/jira/browse/SPARK-4434
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Andrew Or
>Priority: Blocker
>
> When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 
> allowed you to omit the {{file://}} or {{hdfs://}} prefix from the 
> application JAR URL, e.g.
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
> {code}
> In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> {code}
> I tried changing my URL to conform to the new format, but this either 
> resulted in an error or a job that failed:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> {code}
> If I omit the extra slash:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Sending launch command to spark://joshs-mbp.att.net:7077
> Driver successfully submitted as driver-20141116143235-0002
> ... waiting before polling master for driver state
> ... polling master for driver state
> State of driver-20141116143235-0002 is ERROR
> Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
>   at 
> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
>   at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
> {code}
> This bug effectively prevents users from using {{spark-submit}} in cluster 
> mode to run drivers whose JARs are stored on shared cluster filesystems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4399) Support multiple cloud providers

2014-11-16 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214153#comment-14214153
 ] 

Patrick Wendell commented on SPARK-4399:


I think this might actually be "out of scope" and something for which it would 
be nice to have a community library or project outside of Spark. The Spark ec2 
scripts were designed to be a way for someone to play around with a Spark 
cluster quickly, and there is definitely user interest in something richer for 
launching production Spark clusters across different environments, etc.

> Support multiple cloud providers
> 
>
> Key: SPARK-4399
> URL: https://issues.apache.org/jira/browse/SPARK-4399
> Project: Spark
>  Issue Type: New Feature
>  Components: EC2
>Affects Versions: 1.2.0
>Reporter: Andrew Ash
>
> We currently have Spark startup scripts for Amazon EC2 but not for various 
> other cloud providers.  This ticket is an umbrella to support multiple cloud 
> providers in the bundled scripts, not just Amazon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2208) local metrics tests can fail on fast machines

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214146#comment-14214146
 ] 

Apache Spark commented on SPARK-2208:
-

User 'XuefengWu' has created a pull request for this issue:
https://github.com/apache/spark/pull/3294

> local metrics tests can fail on fast machines
> -
>
> Key: SPARK-2208
> URL: https://issues.apache.org/jira/browse/SPARK-2208
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>  Labels: starter
>
> I'm temporarily disabling this check. I think the issue is that on fast 
> machines the fetch wait time can actually be zero, even across all tasks.
> We should see if we can write this in a different way to make sure there is a 
> delay.
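
Purely illustrative (not the actual test code): a tiny Scala sketch of why an 
assertion of the form "wait time > 0" is flaky when the measured operation can 
finish within one clock tick, which is what happens to the fetch wait time on 
fast machines. The object name and workload are made up.

{code}
object FlakyTimingAssertion {
  def main(args: Array[String]): Unit = {
    // Millisecond-granularity timing around a fast operation is very often 0.
    val start = System.currentTimeMillis()
    val sum = (1 to 1000).sum                    // stands in for a fetch that completes instantly
    val elapsed = System.currentTimeMillis() - start
    println(s"sum=$sum, elapsedMs=$elapsed")     // frequently prints elapsedMs=0
    assert(elapsed >= 0L)                        // robust check; "elapsed > 0" would be flaky
  }
}
{code}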



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-603) add simple Counter API

2014-11-16 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214142#comment-14214142
 ] 

Imran Rashid commented on SPARK-603:


Hey, this was originally reported by me too (I probably messed up when creating 
it on the old JIRA; not sure if there is a way to change the reporter now?)

I think perhaps the original issue was a little unclear, so I'll try to clarify 
a little bit:

I do *not* think we need to support something at the "operation" level -- 
having it work at the stage level (or even job level) is fine.  I'm not even 
sure what it would mean to work at the operation level, since individual 
records are pushed through all the operations of a stage in one go.

But the operation level is still a useful abstraction for the *developer*.  It's 
nice for them to be able to write methods which are, e.g., just a {{filter}}.  
For normal RDD operations this works just fine, of course -- you can have a 
bunch of util methods that take in an RDD and output an RDD, maybe some 
{{filter}}, some {{map}}, etc.; they can get combined however you like, and 
everything remains lazy until there is some action.  All wonderful.

Things get messy as soon as you start to include accumulators, however -- 
you've got to include them in your return values, and then the outside logic 
has to know when they actually contain valid data.  Rather than trying to solve 
this problem in general, I'm proposing that we do something dead simple for 
basic counters, which might even live outside of accumulators completely.

Putting accumulator values in the web UI is not bad for just this purpose, but 
overall I don't think it's the right solution:

1. It limits what we can do with accumulators (see my comments on SPARK-664)
2. The API is more complicated than it needs to be.  If the only point of 
accumulators is counters, then we can get away with something as simple as:

{code}
rdd.map{x =>
  if (isFoobar(x)) {
Counters("foobar") += 1
  }
  ...
}
{code}

(E.g., no need to even declare the counter up front.)

3. Having the value in the UI is nice, but it's not the same as programmatic 
access.  E.g., it can be useful to have the counters in the job logs, and the 
actual values might be used in other computation (e.g., giving the size of a 
data structure for a later step), etc.
Even with the simpler counter API, this is tricky because of lazy evaluation.  
But maybe that is a reason to register a callback up front:

{code}
Counters.addCallback("foobar"){counts => ...}
rdd.map{x =>
  if (isFoobar(x)) {
Counters("foobar") += 1
  }
  ...
}
{code}

4. If you have long-running tasks, it might be nice to get incremental feedback 
from counters *during* the task.  (There was a real need for long-running tasks 
before sort-based shuffle, when you couldn't have too many tasks in a shuffle 
... perhaps that's not the case anymore, I'm not sure.)

We can get a little further with accumulators, e.g. a SparkListener could do 
something with accumulator values when the stages finish.  But I think we're 
stuck on the other points.  I feel like right now accumulators are trapped 
between just being counters and being a more general method of computation, 
and not quite doing either one very well.
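
For illustration, a minimal Scala sketch of the status quo described above (not 
the proposed Counter API): a named accumulator captured directly in the task 
closure, plus a driver-side SparkListener that reports the value when stages 
finish. It assumes the Spark 1.x APIs {{SparkContext.accumulator(initialValue, 
name)}} and {{SparkContext.addSparkListener}} are available; the app name and 
counter name are made up.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

object CounterSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("counter-sketch").setMaster("local[2]"))

    // The "clunky" status quo: a named accumulator declared up front on the driver.
    val filterCount = sc.accumulator(0, "foobar")

    // Driver-side callback: print the running value whenever a stage completes.
    sc.addSparkListener(new SparkListener {
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
        println(s"stage ${stage.stageInfo.stageId} done, foobar = ${filterCount.value}")
      }
    })

    val kept = sc.parallelize(1 to 100).filter { x =>
      val ok = x % 2 == 0
      if (!ok) filterCount += 1   // the accumulator itself is captured in the closure
      ok
    }
    kept.count()   // only after an action has run does filterCount.value mean anything
    sc.stop()
  }
}
{code}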

> add simple Counter API
> --
>
> Key: SPARK-603
> URL: https://issues.apache.org/jira/browse/SPARK-603
> Project: Spark
>  Issue Type: New Feature
>Priority: Minor
>
> Users need a very simple way to create counters in their jobs.  Accumulators 
> provide a way to do this, but are a little clunky, for two reasons:
> 1) the setup is a nuisance
> 2) with delayed evaluation, you don't know when it will actually run, so it's 
> hard to look at the values
> consider this code:
> {code}
> def filterBogus(rdd:RDD[MyCustomClass], sc: SparkContext) = {
>   val filterCount = sc.accumulator(0)
>   val filtered = rdd.filter{r =>
> if (isOK(r)) true else {filterCount += 1; false}
>   }
>   println("removed " + filterCount.value + " records)
>   filtered
> }
> {code}
> The println will always say 0 records were filtered, because it's printed 
> before anything has actually run.  I could print out the value later on, but 
> note that it would destroy the modularity of the method -- it's kind of ugly 
> to return the accumulator just so that it can get printed later on.  (And of 
> course, the caller in turn might not know when the filter is going to get 
> applied, and would have to pass the accumulator up even further ...)
> I'd like to have Counters which just automatically get printed out whenever a 
> stage has been run, along with some API to get them back.  I realize this is 
> tricky because a stage can get re-computed, so maybe you should only increment 
> the counters once.
> Maybe a more general way to do this is to provide some callback for whenever 
> an RDD is computed -- by default, you would just print 

[jira] [Created] (SPARK-4435) Add setThreshold in Python LogisticRegressionModel and SVMModel

2014-11-16 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4435:


 Summary: Add setThreshold in Python LogisticRegressionModel and 
SVMModel
 Key: SPARK-4435
 URL: https://issues.apache.org/jira/browse/SPARK-4435
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Matei Zaharia






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-16 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214134#comment-14214134
 ] 

Matei Zaharia commented on SPARK-4306:
--

[~srinathsmn] I've assigned it to you. When do you think you'll get this done? 
It would be great to include in 1.2 but for that we'd need it quite soon (say 
this week). If you don't have time, I can also assign it to someone else.

> LogisticRegressionWithLBFGS support for PySpark MLlib 
> --
>
> Key: SPARK-4306
> URL: https://issues.apache.org/jira/browse/SPARK-4306
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Varadharajan
>  Labels: newbie
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
> interface. This task is to add support for the LogisticRegressionWithLBFGS 
> algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-16 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4306:
-
Assignee: Varadharajan

> LogisticRegressionWithLBFGS support for PySpark MLlib 
> --
>
> Key: SPARK-4306
> URL: https://issues.apache.org/jira/browse/SPARK-4306
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Varadharajan
>Assignee: Varadharajan
>  Labels: newbie
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
> interface. This task is to add support for the LogisticRegressionWithLBFGS 
> algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-16 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4306:
-
Target Version/s: 1.2.0

> LogisticRegressionWithLBFGS support for PySpark MLlib 
> --
>
> Key: SPARK-4306
> URL: https://issues.apache.org/jira/browse/SPARK-4306
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Varadharajan
>  Labels: newbie
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
> interface. This task is to add support for the LogisticRegressionWithLBFGS 
> algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4407) Thrift server for 0.13.1 doesn't deserialize complex types properly

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214127#comment-14214127
 ] 

Apache Spark commented on SPARK-4407:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/3292

> Thrift server for 0.13.1 doesn't deserialize complex types properly
> ---
>
> Key: SPARK-4407
> URL: https://issues.apache.org/jira/browse/SPARK-4407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.2
>Reporter: Cheng Lian
>Priority: Blocker
> Fix For: 1.2.0
>
>
> The following snippet can reproduce this issue:
> {code}
> CREATE TABLE t0(m MAP);
> INSERT OVERWRITE TABLE t0 SELECT MAP(key, value) FROM src LIMIT 10;
> SELECT * FROM t0;
> {code}
> Exception throw:
> {code}
> java.lang.RuntimeException: java.lang.ClassCastException: 
> scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84)
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
> at com.sun.proxy.$Proxy21.fetchResults(Unknown Source)
> at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405)
> at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
> at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 
> cannot be cast to java.lang.String
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(Shim13.scala:142)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:165)
> at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192)
> at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
> ... 19 more
> {code}
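
A hypothetical illustration of why the cast blows up (this is not the actual 
Shim13.scala code): the fetch path assumes every column value is already a 
String, whereas complex types such as maps and arrays need to be rendered 
explicitly, e.g. along these lines. All names and the output formatting below 
are made up for the sketch.

{code}
// Hypothetical sketch only -- not Spark's implementation of addNonNullColumnValue.
object ComplexTypeRendering {
  // Render a column value for the Thrift result set instead of casting it to String.
  def renderColumn(value: Any): String = value match {
    case null         => "NULL"
    case m: Map[_, _] => m.map { case (k, v) => s"$k:$v" }.mkString("{", ",", "}")
    case s: Seq[_]    => s.mkString("[", ",", "]")
    case other        => other.toString   // simple types already have a usable toString
  }

  def main(args: Array[String]): Unit = {
    // The failing case from the reproduction: a map value arriving where a String was assumed.
    println(renderColumn(Map("key" -> "value")))   // {key:value}
    println(renderColumn(Seq(1, 2, 3)))            // [1,2,3]
  }
}
{code}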



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4309) Date type support missing in HiveThriftServer2

2014-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214126#comment-14214126
 ] 

Apache Spark commented on SPARK-4309:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/3292

> Date type support missing in HiveThriftServer2
> --
>
> Key: SPARK-4309
> URL: https://issues.apache.org/jira/browse/SPARK-4309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.1
>Reporter: Cheng Lian
> Fix For: 1.2.0
>
>
> Date type is not supported when retrieving result sets in HiveThriftServer2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214123#comment-14214123
 ] 

Josh Rosen commented on SPARK-4434:
---

I think that there are only a small number of patches in branch-1.1 that are 
related to this, so I'm going to see if I can narrow it down to a specific 
commit.  https://github.com/apache/spark/pull/2925 is one potential culprit, 
but there may be others.

I'm not sure whether this affects HDFS URLs; I haven't tried it yet since I 
don't have a Docker-ized HDFS set up in my integration tests project.

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -
>
> Key: SPARK-4434
> URL: https://issues.apache.org/jira/browse/SPARK-4434
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Andrew Or
>Priority: Blocker
>
> When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 
> allowed you to omit the {{file://}} or {{hdfs://}} prefix from the 
> application JAR URL, e.g.
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
> {code}
> In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> {code}
> I tried changing my URL to conform to the new format, but this either 
> resulted in an error or a job that failed:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> {code}
> If I omit the extra slash:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Sending launch command to spark://joshs-mbp.att.net:7077
> Driver successfully submitted as driver-20141116143235-0002
> ... waiting before polling master for driver state
> ... polling master for driver state
> State of driver-20141116143235-0002 is ERROR
> Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
>   at 
> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
>   at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
> {code}
> This bug effectively prevents users from using {{spark-submit}} in cluster 
> mode to run drivers whose JARs are stored on shared cluster filesystems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-4434:
-

Assignee: Josh Rosen

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -
>
> Key: SPARK-4434
> URL: https://issues.apache.org/jira/browse/SPARK-4434
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 
> allowed you to omit the {{file://}} or {{hdfs://}} prefix from the 
> application JAR URL, e.g.
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
> {code}
> In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> {code}
> I tried changing my URL to conform to the new format, but this either 
> resulted in an error or a job that failed:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> {code}
> If I omit the extra slash:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Sending launch command to spark://joshs-mbp.att.net:7077
> Driver successfully submitted as driver-20141116143235-0002
> ... waiting before polling master for driver state
> ... polling master for driver state
> State of driver-20141116143235-0002 is ERROR
> Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
>   at 
> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
>   at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
> {code}
> This bug effectively prevents users from using {{spark-submit}} in cluster 
> mode to run drivers whose JARs are stored on shared cluster filesystems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4434:
--
Assignee: Andrew Or  (was: Josh Rosen)

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -
>
> Key: SPARK-4434
> URL: https://issues.apache.org/jira/browse/SPARK-4434
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Andrew Or
>Priority: Blocker
>
> When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 
> allowed you to omit the {{file://}} or {{hdfs://}} prefix from the 
> application JAR URL, e.g.
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
> {code}
> In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> {code}
> I tried changing my URL to conform to the new format, but this either 
> resulted in an error or a job that failed:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> {code}
> If I omit the extra slash:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Sending launch command to spark://joshs-mbp.att.net:7077
> Driver successfully submitted as driver-20141116143235-0002
> ... waiting before polling master for driver state
> ... polling master for driver state
> State of driver-20141116143235-0002 is ERROR
> Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
>   at 
> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
>   at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
> {code}
> This bug effectively prevents users from using {{spark-submit}} in cluster 
> mode to run drivers whose JARs are stored on shared cluster filesystems.
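
The failure above comes down to URI parsing: with only two slashes, the segment after {{file://}} is treated as the URI authority (a host name), so {{file://Users/...}} means host "Users" with path {{/joshrosen/...}}, while Hadoop's local filesystem expects an empty authority, i.e. {{file:///}}, which is why {{FileSystem.checkPath}} reports "Wrong FS". A minimal spark-shell style sketch (the {{describe}} helper is purely illustrative, not Spark code) showing how {{java.net.URI}} splits the two forms:

{code}
// Purely illustrative: how java.net.URI splits the two forms of a local file URL.
// With two slashes, "Users" becomes the authority (host) and the path loses its
// first segment, which is why the local filesystem's checkPath rejects the URL.
def describe(s: String): Unit = {
  val uri = new java.net.URI(s)
  println(s + " -> scheme=" + uri.getScheme + ", authority=" + uri.getAuthority + ", path=" + uri.getPath)
}

describe("file:///Users/joshrosen/app.jar") // authority=null,  path=/Users/joshrosen/app.jar
describe("file://Users/joshrosen/app.jar")  // authority=Users, path=/joshrosen/app.jar
{code}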



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-4434:
-

 Summary: spark-submit cluster deploy mode JAR URLs are broken in 
1.1.1
 Key: SPARK-4434
 URL: https://issues.apache.org/jira/browse/SPARK-4434
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 1.1.1, 1.2.0
Reporter: Josh Rosen
Priority: Blocker


When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 
allowed you to omit the {{file://}} or {{hdfs://}} prefix from the application 
JAR URL, e.g.

{code}
./bin/spark-submit --deploy-mode cluster --master 
spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
/Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
{code}

In Spark 1.1.1 and 1.2.0, this same command now fails with an error:

{code}
./bin/spark-submit --deploy-mode cluster --master 
spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
Jar url 
'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
 is not in valid format.
Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)

Usage: DriverClient [options] launch
[driver options]
Usage: DriverClient kill  
{code}

I tried changing my URL to conform to the new format, but this either resulted 
in an error or a job that failed:

{code}
./bin/spark-submit --deploy-mode cluster --master 
spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
Jar url 
'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
 is not in valid format.
Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
{code}

If I omit the extra slash:

{code}
./bin/spark-submit --deploy-mode cluster --master 
spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
Sending launch command to spark://joshs-mbp.att.net:7077
Driver successfully submitted as driver-20141116143235-0002
... waiting before polling master for driver state
... polling master for driver state
State of driver-20141116143235-0002 is ERROR
Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: 
file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
 expected: file:///
java.lang.IllegalArgumentException: Wrong FS: 
file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
 expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
at 
org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
at 
org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
{code}

This bug effectively prevents users from using {{spark-submit}} in cluster mode 
to run drivers whose JARs are stored on shared cluster filesystems.
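
One way to sidestep the ambiguity on the submitting side is to pass the JAR as a bare absolute path and let a small helper turn it into a well-formed {{file:}} URI before validation. This is only a hedged sketch of that idea; the {{normalizeJarUrl}} helper below is hypothetical and is not the fix that actually went into Spark:

{code}
// Hypothetical helper, not Spark's actual fix: accept either a bare local path
// or a URI string and return a URI with an explicit scheme and empty authority.
import java.net.URI
import java.nio.file.Paths

object JarUrlNormalizer {
  def normalizeJarUrl(jar: String): String = {
    val uri = new URI(jar)
    if (uri.getScheme == null) {
      // Bare path such as /Users/me/app.jar becomes file:///Users/me/app.jar
      Paths.get(jar).toAbsolutePath.toUri.toString
    } else {
      jar
    }
  }
}

// JarUrlNormalizer.normalizeJarUrl("/Users/me/app.jar") == "file:///Users/me/app.jar"
{code}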



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4407) Thrift server for 0.13.1 doesn't deserialize complex types properly

2014-11-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4407.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

> Thrift server for 0.13.1 doesn't deserialize complex types properly
> ---
>
> Key: SPARK-4407
> URL: https://issues.apache.org/jira/browse/SPARK-4407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.2
>Reporter: Cheng Lian
>Priority: Blocker
> Fix For: 1.2.0
>
>
> The following snippet can reproduce this issue:
> {code}
> CREATE TABLE t0(m MAP);
> INSERT OVERWRITE TABLE t0 SELECT MAP(key, value) FROM src LIMIT 10;
> SELECT * FROM t0;
> {code}
> Exception throw:
> {code}
> java.lang.RuntimeException: java.lang.ClassCastException: 
> scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84)
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
> at com.sun.proxy.$Proxy21.fetchResults(Unknown Source)
> at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405)
> at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
> at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 
> cannot be cast to java.lang.String
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(Shim13.scala:142)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:165)
> at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192)
> at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
> ... 19 more
> {code}
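
The cast fails because the row-retrieval code assumes every non-null column value can be cast directly to the Java type of the Thrift column, while a map column arrives from Spark SQL as a Scala {{Map}}. Below is a spark-shell style sketch of the kind of conversion that is needed, rendering complex values as strings before they are added to the row set; it is illustrative only and is not the patch that actually fixed this issue:

{code}
// Illustrative only: render nested Spark SQL values as strings instead of
// casting them to java.lang.String before adding them to a Thrift row set.
def toDisplayString(value: Any): String = value match {
  case null => "NULL"
  case m: scala.collection.Map[_, _] =>
    m.map { case (k, v) => toDisplayString(k) + ":" + toDisplayString(v) }.mkString("{", ",", "}")
  case s: Seq[_] => s.map(toDisplayString).mkString("[", ",", "]")
  case other => other.toString
}

// toDisplayString(Map(238 -> "val_238")) == "{238:val_238}"
{code}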



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4309) Date type support missing in HiveThriftServer2

2014-11-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4309.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3178
[https://github.com/apache/spark/pull/3178]

> Date type support missing in HiveThriftServer2
> --
>
> Key: SPARK-4309
> URL: https://issues.apache.org/jira/browse/SPARK-4309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.1
>Reporter: Cheng Lian
> Fix For: 1.2.0
>
>
> Date type is not supported while retrieving result set in HiveThriftServer2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-664) Accumulator updates should get locally merged before sent to the driver

2014-11-16 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214038#comment-14214038
 ] 

Imran Rashid commented on SPARK-664:


Hi [~aash],

thanks for taking another look at this -- sorry I have been aloof for a little 
while.  I didn't know about SPARK-2380; obviously this issue was created long before 
that.  Honestly, I'm not a big fan of SPARK-2380, since it seems to really limit what 
we can do with accumulators.  We could really use them to expose a completely 
different model of computation.

Let me give an example use case.  Accumulators are in principle general enough 
that they let you compute lots of different things in one pass.  E.g., by using 
accumulators, you could:

* create a bloom filter of records that meet some criteria
* assign records to different buckets, and count how many are in each bucket, 
even up to 100K buckets (e.g., by having an accumulator of {{Array}}; see the sketch below)
* use hyperloglog to count how many distinct ids you have
* filter down to only those records with some parsing error, for a closer look 
(just by using plain old {{rdd.filter()}})

You could do all that in one pass, if the first 3 were done w/ accumulators.  
When I started using spark, I actually wrote a bunch of code to do exactly that 
kind of thing.  But it performed really poorly -- after some profiling & 
investigating how accumulators work, I saw why.  Those big accumulators I was 
creating just put a lot of work on the driver.  Accumulators provide the right 
API to do that kind of thing, but the implementation would have to change.

I definitely agree that if the results get merged on the executor before 
getting sent to the driver, it increases the latency of the *per-task* 
results, but does that matter?  I would prefer that we have something that 
supports the more general computation model; the important thing is only 
the latency of the *overall* result.  It feels like we're moving towards 
accumulators being treated just like counters (but with an awkward API).
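
To make the size of those per-task updates concrete, here is a hedged, spark-shell style sketch of the bucket-counting use case from the list above, written against the Spark 1.x {{AccumulableParam}} API (the {{BucketCounts}} object and the bucket count are illustrative assumptions, not code from any Spark patch). Every finished task ships its entire {{Array[Long]}} back to the driver, which then merges one array per task; local per-executor merging is exactly what this issue asks for.

{code}
import org.apache.spark.{AccumulableParam, SparkContext}

// Accumulable whose value is an Array[Long] of bucket counts; += takes a bucket index.
object BucketCounts extends AccumulableParam[Array[Long], Int] {
  def zero(init: Array[Long]): Array[Long] = new Array[Long](init.length)
  def addAccumulator(acc: Array[Long], bucket: Int): Array[Long] = { acc(bucket) += 1L; acc }
  def addInPlace(a: Array[Long], b: Array[Long]): Array[Long] = {
    var i = 0
    while (i < a.length) { a(i) += b(i); i += 1 }
    a
  }
}

def bucketCounts(sc: SparkContext, records: org.apache.spark.rdd.RDD[String], numBuckets: Int): Array[Long] = {
  val counts = sc.accumulable(new Array[Long](numBuckets))(BucketCounts)
  records.foreach { r => counts += (r.hashCode & Int.MaxValue) % numBuckets }
  // With e.g. numBuckets = 100000, each task's update is a 100K-element array
  // that is currently merged on the driver, one array per finished task.
  counts.value
}
{code}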

> Accumulator updates should get locally merged before sent to the driver
> ---
>
> Key: SPARK-664
> URL: https://issues.apache.org/jira/browse/SPARK-664
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>Priority: Minor
>
> Whenever a task finishes, the accumulator updates from that task are 
> immediately sent back to the driver.  When the accumulator updates are big, 
> this is inefficient because (a) a lot more data has to be sent to the driver 
> and (b) the driver has to do all the work of merging the updates together.
> Probably doesn't matter for small accumulators / low number of tasks, but if 
> both are big, this could be a big bottleneck.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-11-16 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213945#comment-14213945
 ] 

RJ Nowling commented on SPARK-2429:
---

Hi Yu,

I'm having trouble finding the function to cut a dendrogram -- I see the tests 
but not the implementation.

I feel that you should be able to assign points in O(log N) time with the 
hierarchical method vs O(N) with standard kmeans, where N is the number of 
clusters.  So, say you train a model (this may be slower than kmeans) and then 
assign additional points to clusters after training.  If clusters at the same 
level of the hierarchy do not overlap, you should be able to choose the closest 
cluster at each level until you reach a leaf (see the sketch below).  I'm 
assuming that the children of a given cluster are contained within that cluster 
(spatially) -- can you show this or find a reference for it?  If so, then 
assignment should be faster for a larger number of clusters, as Jun was saying 
above.

Do you agree with this?  Or is there something I am misunderstanding?

Thanks!
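
The assignment strategy discussed above, sketched as a spark-shell style recursion over a hypothetical cluster tree ({{ClusterNode}} and {{leafId}} are illustrative names, not the {{ClusterTree}} class from the actual PR): at each level pick the closest child center and recurse, so the number of distance computations grows with the depth of the tree rather than with the total number of leaf clusters.

{code}
import org.apache.spark.mllib.linalg.Vector

// Hypothetical tree node: a center, child nodes, and a leaf cluster id.
case class ClusterNode(center: Vector, children: Seq[ClusterNode], leafId: Int = -1) {
  def isLeaf: Boolean = children.isEmpty
}

def squaredDistance(a: Vector, b: Vector): Double = {
  val (x, y) = (a.toArray, b.toArray)
  var d = 0.0
  var i = 0
  while (i < x.length) { val diff = x(i) - y(i); d += diff * diff; i += 1 }
  d
}

// Descend the tree, picking the closest child center at each level.
@annotation.tailrec
def assign(node: ClusterNode, point: Vector): Int =
  if (node.isLeaf) node.leafId
  else assign(node.children.minBy(c => squaredDistance(c.center, point)), point)
{code}

This greedy descent only finds the truly closest leaf if each child cluster lies inside its parent (the containment property asked about above); otherwise it can stop at a leaf that is not globally closest.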

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-11-16 Thread Jun Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213942#comment-14213942
 ] 

Jun Yang commented on SPARK-2429:
-

Hi Yu Ishikawa 

Thanks for your wonderful hierarchical implementation of KMeans, which just 
meets one of our project requirements :)

In our project, we initially used an MPI-based HAC implementation to do 
agglomerative bottom-up hierarchical clustering, and since we want to migrate 
the entire back-end pipeline to Spark, we are looking for a comparable 
hierarchical clustering implementation on Spark; otherwise we would need to 
write one ourselves.

From a functionality perspective, your implementation looks pretty good (I have 
already read through your code), but I still have several questions regarding 
performance and scalability:
1. In your implementation, in each divisive step there will be a "copy" 
operation to distribute the data points from the parent cluster tree to the 
split children cluster trees; when the document count is large, I think this 
copy cost is non-negligible, right?
A potential optimization is to keep the entire document data cached and, in 
each divisive step, only record the indices of the documents in the ClusterTree 
object, so the cost could be lowered quite a lot (see the sketch at the end of 
this comment).

Does this idea make sense?

2. In your test code, the cluster size is not very large (only about 100). 
Have you ever tested it with a big cluster size and a big document corpus, 
e.g., 1 clusters with 200 documents? What is the performance behavior for 
this kind of use case?
This use case is fairly typical in production environments.

Looking forward to your reply.

Thanks 
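
A spark-shell style sketch of the caching idea from point 1 above ({{IndexedClusterNode}} and {{splitNode}} are hypothetical names, not code from the actual PR): cache the (id, vector) pairs once, and let each tree node carry only document ids, so a divisive step shuffles and copies Longs instead of vectors.

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

// Hypothetical node that stores only the ids of its documents.
case class IndexedClusterNode(docIds: Seq[Long], children: Seq[IndexedClusterNode])

// data is cached once (data.cache()); assignToChild decides which child a vector goes to.
def splitNode(data: RDD[(Long, Vector)],
              node: IndexedClusterNode,
              assignToChild: Vector => Int): Seq[IndexedClusterNode] = {
  val ids = node.docIds.toSet
  data.filter { case (id, _) => ids.contains(id) }        // touch only this node's documents
      .map { case (id, v) => (assignToChild(v), id) }     // ship (childIndex, id) pairs, not vectors
      .groupByKey()
      .collect()                                          // collecting ids to the driver keeps the sketch short
      .map { case (_, childIds) => IndexedClusterNode(childIds.toSeq, Nil) }
      .toSeq
}
{code}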


> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4393) Memory leak in connection manager timeout thread

2014-11-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-4393.

   Resolution: Fixed
Fix Version/s: 1.2.0

> Memory leak in connection manager timeout thread
> 
>
> Key: SPARK-4393
> URL: https://issues.apache.org/jira/browse/SPARK-4393
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.2.0
>
>
> This JIRA tracks a fix for a memory leak in ConnectionManager's TimerTasks, 
> originally reported in [a 
> comment|https://issues.apache.org/jira/browse/SPARK-3633?focusedCommentId=14208318&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14208318]
>  on SPARK-3633.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org