[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application

2015-01-26 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291818#comment-14291818
 ] 

Robert Stupp commented on SPARK-2389:
-

[~srowen] yes, the problem is that drivers cannot share RDDs.
IMHO there are a lot of valid scenarios that can benefit from multiple drivers 
using shared RDDs.

 globally shared SparkContext / shared Spark application
 -

 Key: SPARK-2389
 URL: https://issues.apache.org/jira/browse/SPARK-2389
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Robert Stupp

 The documentation (in Cluster Mode Overview) cites:
 bq. Each application gets its own executor processes, which *stay up for the 
 duration of the whole application* and run tasks in multiple threads. This 
 has the benefit of isolating applications from each other, on both the 
 scheduling side (each driver schedules its own tasks) and executor side 
 (tasks from different applications run in different JVMs). However, it also 
 means that *data cannot be shared* across different Spark applications 
 (instances of SparkContext) without writing it to an external storage system.
 IMO this is a limitation that should be lifted to support any number of 
 --driver-- client processes to share executors and to share (persistent / 
 cached) data.
 This is especially useful if you have a bunch of frontend servers (dumb web 
 app servers) that want to use Spark as a _big computing machine_. Most 
 important is the fact that Spark is quite good at caching/persisting data in 
 memory / on disk, thus removing load from backend data stores.
 In short: it would be really great to let different --driver-- client JVMs 
 operate on the same RDDs and benefit from Spark's caching/persistence.
 It would however introduce some administration mechanisms to
 * start a shared context
 * update the executor configuration (# of worker nodes, # of cpus, etc) on 
 the fly
 * stop a shared context
 Even conventional batch MR applications would benefit if run frequently 
 against the same data set.
 As an implicit requirement, RDD persistence could get a TTL for its 
 materialized state.
 With such a feature the overall performance of today's web applications could 
 then be increased by adding more web app servers, more Spark nodes, more 
 NoSQL nodes, etc.






[jira] [Closed] (SPARK-5303) applySchema returns NullPointerException

2015-01-26 Thread Mauro Pirrone (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mauro Pirrone closed SPARK-5303.

Resolution: Not a Problem

 applySchema returns NullPointerException
 

 Key: SPARK-5303
 URL: https://issues.apache.org/jira/browse/SPARK-5303
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Mauro Pirrone

 The following code snippet returns NullPointerException:
 val result = .
   
 val rows = result.take(10)
 val rowRdd = SparkManager.getContext().parallelize(rows, 1)
 val schemaRdd = SparkManager.getSQLContext().applySchema(rowRdd, 
 result.schema)
 java.lang.NullPointerException
   at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:147)
   at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
   at scala.util.hashing.MurmurHash3.listHash(MurmurHash3.scala:168)
   at scala.util.hashing.MurmurHash3$.seqHash(MurmurHash3.scala:216)
   at scala.collection.LinearSeqLike$class.hashCode(LinearSeqLike.scala:53)
   at scala.collection.immutable.List.hashCode(List.scala:84)
   at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
   at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:63)
   at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:210)
   at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:172)
   at org.apache.spark.sql.execution.LogicalRDD.hashCode(ExistingRDD.scala:58)
   at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
   at scala.collection.mutable.HashTable$HashUtils$class.elemHashCode(HashTable.scala:398)
   at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:39)
   at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:130)
   at scala.collection.mutable.HashMap.findEntry(HashMap.scala:39)
   at scala.collection.mutable.HashMap.get(HashMap.scala:69)
   at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:187)
   at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
   at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:329)
   at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327)
   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327)
   at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105)
   at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:44)
   at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:40)
   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
   at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
   at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
   at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
   at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
   at org.apache.spark.sql.SchemaRDD.schema$lzycompute(SchemaRDD.scala:135)
   at org.apache.spark.sql.SchemaRDD.schema(SchemaRDD.scala:135)
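
For reference, a minimal working sketch of the applySchema path on Spark 1.2, with the StructType built explicitly instead of being pulled from another SchemaRDD. The data, field names and setup below are illustrative, not the reporter's actual code:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._

object ApplySchemaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("applySchema-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical rows standing in for `result.take(10)` from the report.
    val rowRdd = sc.parallelize(Seq(Row("a", 1), Row("b", 2)), 1)

    // Build the schema explicitly, so analysis does not depend on the state
    // of another query plan.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("value", IntegerType, nullable = true)))

    val schemaRdd = sqlContext.applySchema(rowRdd, schema)
    schemaRdd.registerTempTable("sample")
    sqlContext.sql("SELECT name, value FROM sample").collect().foreach(println)

    sc.stop()
  }
}
{code}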






[jira] [Created] (SPARK-5409) Broken link in documentation

2015-01-26 Thread Mauro Pirrone (JIRA)
Mauro Pirrone created SPARK-5409:


 Summary: Broken link in documentation
 Key: SPARK-5409
 URL: https://issues.apache.org/jira/browse/SPARK-5409
 Project: Spark
  Issue Type: Documentation
Reporter: Mauro Pirrone
Priority: Minor


https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html

See the API docs and the example.

Link to example is broken.






[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application

2015-01-26 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291797#comment-14291797
 ] 

Robert Stupp commented on SPARK-2389:
-

bq. That aside, why doesn't it scale?

Simply because it's just a single Spark client. If that machine's at its limit 
for whatever reason (VM memory, OS resources, CPU, network, ...), that's it.

Sure, you can run multiple drivers - but each has its own, private set of data.

IMO separate preloading is nice for some applications. But data is usually not 
immutable. For example:
* Imagine an application that provides offers for flights worldwide. It's a 
huge amount of data and a huge amount of processing. It cannot be simply 
preloaded - prices for tickets vary from minute to minute based on booking 
status etc etc etc
* Overall data set is quite big
* Overall load is too big for a single driver to handle - imagine thousands of 
offer requests per second
* Failure of a single driver is an absolute no-go
* All clients have to access the same set of data
* Preloading is impossible at runtime (it is only possible at initial deployment)

So - a suitable approach would be to have:
* a Spark cluster holding all the RDDs and doing all offer and booking related 
operations
* a set of Spark clients to abstract Spark from the rest of the application
* a huge number of non-uniform frontend clients (could be web app servers, rich 
clients, SOAP / REST frontends)
* everything (except the data) stateless
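
To make that layering concrete, a rough sketch (my own illustration, not an existing Spark API) of the "Spark client" tier: one long-running driver keeps the RDD cached and serves concurrent requests, relying on the fact that a single SparkContext accepts job submissions from multiple threads. The service name, data set and thread-pool size are all assumptions:
{code}
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical long-running "Spark client" service; frontends would call
// lookupOffers over RPC/REST while the RDD stays cached in the executors.
object SharedOfferService {
  private val sc = new SparkContext(
    new SparkConf().setAppName("shared-offer-service").setMaster("local[4]"))

  // Illustrative data set standing in for the flight/offer RDD.
  private val offers = sc.parallelize(1 to 1000000).map(i => (i % 1000, i)).cache()

  // SparkContext supports concurrent job submission from multiple threads,
  // so each incoming request runs its own action against the shared RDD.
  private implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

  def lookupOffers(key: Int): Future[Array[Int]] =
    Future { offers.filter(_._1 == key).map(_._2).collect() }
}
{code}
This still leaves the single-driver SPOF and the data-mutation questions open, which is exactly the gap this ticket is about.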







[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application

2015-01-26 Thread Murat Eken (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291795#comment-14291795
 ] 

Murat Eken commented on SPARK-2389:
---

[~sowen], I think Robert is talking about fault tolerance when he mentions 
scalability. Anyway, as I mentioned in my original comment, Tachyon is not an 
option, at least for us, due to interprocess serialization/deserialization 
costs. We haven't tried HDFS, but I would be surprised if it performed 
differently.







[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application

2015-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291788#comment-14291788
 ] 

Sean Owen commented on SPARK-2389:
--

Yes, the SPOF problem makes sense. It doesn't seem to be what this JIRA was 
about though, which seems to be what the jobserver-style approach addresses.

That aside, why doesn't it scale? Because of work that needs to be done on the 
driver? You can of course still run a bunch of drivers, just not one per client.

The preloading cache issue is what off-heap caching in Tachyon or HDFS is 
supposed to ameliorate.







[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application

2015-01-26 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291798#comment-14291798
 ] 

Robert Stupp commented on SPARK-2389:
-

bq. fault tolerance when he mentions scalability

Both play well together in a stateless application ;)







[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application

2015-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291804#comment-14291804
 ] 

Sean Owen commented on SPARK-2389:
--

Yes, makes sense. Maxing out one driver isn't an issue since you can have many 
drivers (or push work into the cluster). The issue is really that each driver 
then has its own RDDs, and if you need 100s of drivers to keep up, that just 
won't work. (Although then I'd question how so much work is being done on the 
Spark driver?)

In theory, the redundancy of all those RDDs is what HDFS caching and Tachyon 
could help with, although those help share data outside Spark. Whether 
that works for a particular use case right now is a different question, 
although I suspect it makes more sense to make those work than to start yet 
another solution.

What you are describing -- mutating lots of shared in-memory state -- doesn't 
sound like a problem Spark helps solve per se. That is, it doesn't sound like 
work that has to live in a Spark driver program, even if it needs to ask a 
Spark driver-based service for some results. Naturally you know your problem 
better than I, but I am wondering if the answer here isn't just using Spark 
differently, for what it's for.







[jira] [Commented] (SPARK-5409) Broken link in documentation

2015-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291823#comment-14291823
 ] 

Sean Owen commented on SPARK-5409:
--

Should just be

https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala

Open a PR; this probably doesn't even need a JIRA.

 Broken link in documentation
 

 Key: SPARK-5409
 URL: https://issues.apache.org/jira/browse/SPARK-5409
 Project: Spark
  Issue Type: Documentation
Reporter: Mauro Pirrone
Priority: Minor

 https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html
 See the API docs and the example.
 Link to example is broken.






[jira] [Closed] (SPARK-5407) No 1.2 AMI available for ec2

2015-01-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-5407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Håkan Jonsson closed SPARK-5407.

Resolution: Invalid

Error on my side.

 No 1.2 AMI available for ec2
 

 Key: SPARK-5407
 URL: https://issues.apache.org/jira/browse/SPARK-5407
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
Reporter: Håkan Jonsson

 When I try to launch a standalone cluster on EC2 using the scripts in the ec2 
 directory for Spark 1.2, (./spark-ec2 -k spark -i k.pem launch my12), I get 
 the following error: 
 Could not resolve AMI at: 
 https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm
 It seems there is not yet any AMI available on EC2 for Spark 1.2.
 This works well for Spark 1.1.






[jira] [Created] (SPARK-5410) Error parsing scientific notation in a select statement

2015-01-26 Thread Hugo Ferrira (JIRA)
Hugo Ferrira created SPARK-5410:
---

 Summary: Error parsing scientific notation in a select statement
 Key: SPARK-5410
 URL: https://issues.apache.org/jira/browse/SPARK-5410
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Hugo Ferrira


I am using the Cassandra DB and am attempting a select through the Spark SQL 
interface.

SELECT * from key_value WHERE f2  2.2E10

And get the following error (no error if I remove the E10):

[info] - should be able to select a subset of applicable features *** FAILED ***
[info]   java.lang.RuntimeException: [1.39] failure: ``UNION'' expected but identifier E10 found
[info] 
[info] SELECT * from key_value WHERE f2  2.2E10
[info]   ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
[info]   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info]   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info]   at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
[info]   at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info]   at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   ...
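
An untested workaround sketch, assuming the 1.2 parser only chokes on the exponent form, is to spell the literal out. The comparison operator below is illustrative, since it was stripped from the quoted mail:
{code}
// `sqlContext` is an existing SQLContext; operator and table are illustrative.
sqlContext.sql("SELECT * FROM key_value WHERE f2 > 22000000000").collect()
{code}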







[jira] [Resolved] (SPARK-3852) Document spark.driver.extra* configs

2015-01-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3852.
--
  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: Sean Owen
Target Version/s:   (was: 1.2.0)

 Document spark.driver.extra* configs
 

 Key: SPARK-3852
 URL: https://issues.apache.org/jira/browse/SPARK-3852
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Sean Owen
 Fix For: 1.3.0


 They are not documented...
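
For reference until the docs land, a minimal sketch of the three settings in question, with placeholder values; they generally have to be known before the driver JVM launches (spark-defaults.conf or the submit-side conf), since the driver cannot rewrite its own JVM options after startup:
{code}
import org.apache.spark.SparkConf

// Placeholder values for illustration.
val conf = new SparkConf()
  .setAppName("driver-extras-sketch")
  .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC -Dmy.flag=1")
  .set("spark.driver.extraClassPath", "/opt/libs/extra.jar")
  .set("spark.driver.extraLibraryPath", "/opt/native")
{code}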






[jira] [Resolved] (SPARK-4430) Apache RAT Checks fail spuriously on test files

2015-01-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4430.
--
   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Sean Owen

 Apache RAT Checks fail spuriously on test files
 ---

 Key: SPARK-4430
 URL: https://issues.apache.org/jira/browse/SPARK-4430
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Ryan Williams
Assignee: Sean Owen
 Fix For: 1.3.0


 Several of my recent runs of {{./dev/run-tests}} have failed quickly due to 
 Apache RAT checks, e.g.:
 {code}
 $ ./dev/run-tests
 =
 Running Apache RAT checks
 =
 Could not find Apache license headers in the following files:
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/28
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/29
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/30
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/10
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/11
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/12
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/13
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/14
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/15
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/16
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/17
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/18
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/19
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/20
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/21
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/22
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/23
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/24
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/25
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/26
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/27
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/28
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/29
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/30
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/7
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/8
  !? 
 /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/9
 [error] Got a return code of 1 on line 114 of the run-tests script.
 {code}
 I think it's fair to say that these are not useful errors for {{run-tests}} 
 to crash on. Ideally we could tell the linter which files we care about 
 having it lint and which we don't.






[jira] [Commented] (SPARK-595) Document local-cluster mode

2015-01-26 Thread Vladimir Grigor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292052#comment-14292052
 ] 

Vladimir Grigor commented on SPARK-595:
---

+1 for reopen

 Document local-cluster mode
 -

 Key: SPARK-595
 URL: https://issues.apache.org/jira/browse/SPARK-595
 Project: Spark
  Issue Type: New Feature
  Components: Documentation
Affects Versions: 0.6.0
Reporter: Josh Rosen
Priority: Minor

 The 'Spark Standalone Mode' guide describes how to manually launch a 
 standalone cluster, which can be done locally for testing, but it does not 
 mention SparkContext's `local-cluster` option.
 What are the differences between these approaches?  Which one should I prefer 
 for local testing?  Can I still use the standalone web interface if I use 
 'local-cluster' mode?
 It would be useful to document this.
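
For anyone who lands here before the docs exist, a minimal sketch of the master URL, assuming the local-cluster[numWorkers,coresPerWorker,memoryPerWorkerMB] form used by Spark's own test suites:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Unlike "local[N]", "local-cluster" forks real worker processes locally,
// so it exercises serialization and the standalone scheduling path.
val sc = new SparkContext(
  new SparkConf().setAppName("local-cluster-sketch").setMaster("local-cluster[2,1,512]"))

println(sc.parallelize(1 to 100, 4).sum())
sc.stop()
{code}
Note that this needs a built Spark distribution on the machine, which is part of what the documentation should spell out.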






[jira] [Commented] (SPARK-5324) Results of describe can't be queried

2015-01-26 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291987#comment-14291987
 ] 

Yanbo Liang commented on SPARK-5324:


[~marmbrus]
I have opened a pull request for this issue which will implement the DESCRIBE 
[FORMATTED] [db_name.]table_name command for SQLContext.
Meanwhile, it needs to have as little effect as possible on the corresponding 
command output of HiveContext.
I think other metadata commands like SHOW DATABASES/TABLES, ANALYZE, and 
EXPLAIN could also follow the same approach.
Can you assign this to me?

 Results of describe can't be queried
 

 Key: SPARK-5324
 URL: https://issues.apache.org/jira/browse/SPARK-5324
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Michael Armbrust

 {code}
 sql("DESCRIBE TABLE test").registerTempTable("describeTest")
 sql("SELECT * FROM describeTest").collect()
 {code}






[jira] [Commented] (SPARK-5324) Results of describe can't be queried

2015-01-26 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292008#comment-14292008
 ] 

Yanbo Liang commented on SPARK-5324:


https://github.com/apache/spark/pull/4207

 Results of describe can't be queried
 

 Key: SPARK-5324
 URL: https://issues.apache.org/jira/browse/SPARK-5324
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Michael Armbrust

 {code}
 sql("DESCRIBE TABLE test").registerTempTable("describeTest")
 sql("SELECT * FROM describeTest").collect()
 {code}






[jira] [Commented] (SPARK-5355) SparkConf is not thread-safe

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292048#comment-14292048
 ] 

Apache Spark commented on SPARK-5355:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4208

 SparkConf is not thread-safe
 

 Key: SPARK-5355
 URL: https://issues.apache.org/jira/browse/SPARK-5355
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0, 1.3.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker
 Fix For: 1.3.0, 1.2.1


 The SparkConf is not thread-safe, but is accessed by many threads. The 
 getAll() could return only part of the configs if another thread is accessing it.
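
The race is a plain read-during-write on a shared map; a sketch of the general fix direction (not necessarily the actual patch in the linked PR) is to back the settings with a concurrent map, so readers never see partially updated state:
{code}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Minimal illustration only, not SparkConf itself.
class ThreadSafeSettings {
  private val settings = new ConcurrentHashMap[String, String]()

  def set(key: String, value: String): this.type = { settings.put(key, value); this }
  def get(key: String): Option[String] = Option(settings.get(key))
  def getAll: Array[(String, String)] =
    settings.entrySet.asScala.map(e => (e.getKey, e.getValue)).toArray
}
{code}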






[jira] [Commented] (SPARK-794) Remove sleep() in ClusterScheduler.stop

2015-01-26 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292303#comment-14292303
 ] 

Brennon York commented on SPARK-794:


[~joshrosen] How is this PR holding up? I haven't seen any issues on the dev 
board. Think we can close this JIRA ticket? Trying to help prune the JIRA tree 
:)

 Remove sleep() in ClusterScheduler.stop
 ---

 Key: SPARK-794
 URL: https://issues.apache.org/jira/browse/SPARK-794
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Matei Zaharia
  Labels: backport-needed
 Fix For: 1.3.0


 This temporary change made a while back slows down the unit tests quite a bit.






[jira] [Reopened] (SPARK-595) Document local-cluster mode

2015-01-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reopened SPARK-595:
--

I've re-opened this issue.  Folks are using the API in the wild and we're not 
going to break compatibility for it, so we should document it.

 Document local-cluster mode
 -

 Key: SPARK-595
 URL: https://issues.apache.org/jira/browse/SPARK-595
 Project: Spark
  Issue Type: New Feature
  Components: Documentation
Affects Versions: 0.6.0
Reporter: Josh Rosen
Priority: Minor

 The 'Spark Standalone Mode' guide describes how to manually launch a 
 standalone cluster, which can be done locally for testing, but it does not 
 mention SparkContext's `local-cluster` option.
 What are the differences between these approaches?  Which one should I prefer 
 for local testing?  Can I still use the standalone web interface if I use 
 'local-cluster' mode?
 It would be useful to document this.






[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion

2015-01-26 Thread Mark Khaitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292121#comment-14292121
 ] 

Mark Khaitman commented on SPARK-5395:
--

Having the same issue in standalone deployment mode. A single spark-submitted 
job is spawning a ton of pyspark.daemon instances and depleting the cluster 
memory even though the appropriate environment variables have been set.

 Large number of Python workers causing resource depletion
 -

 Key: SPARK-5395
 URL: https://issues.apache.org/jira/browse/SPARK-5395
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: AWS ElasticMapReduce
Reporter: Sven Krasser

 During job execution a large number of Python workers accumulates, eventually 
 causing YARN to kill containers for being over their memory allocation (in 
 the case below that is about 8G for executors plus 6G for overhead per 
 container). 
 In this instance, at the time of killing the container 97 pyspark.daemon 
 processes had accumulated.
 {noformat}
 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler 
 (Logging.scala:logInfo(59)) - Container marked as failed: 
 container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: 
 Container [pid=35211,containerID=container_1421692415636_0052_01_30] is 
 running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB 
 physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing 
 container.
 Dump of the process-tree for container_1421692415636_0052_01_30 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) 
 VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m 
 pyspark.daemon
 |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m 
 pyspark.daemon
 |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m 
 pyspark.daemon
   [...]
 {noformat}
 The configuration used uses 64 containers with 2 cores each.
 Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
 Mailinglist discussion: 
 https://www.mail-archive.com/user@spark.apache.org/msg20102.html
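
For context (not presented as the resolution of this ticket), the knobs usually looked at here, with placeholder values: spark.python.worker.memory bounds how much each Python worker uses before spilling, and executor cores bound how many workers are busy per executor at a time:
{code}
import org.apache.spark.SparkConf

// Placeholder values for illustration only.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.yarn.executor.memoryOverhead", "6144")   // MB
  .set("spark.executor.cores", "2")
  .set("spark.python.worker.memory", "512m")
{code}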






[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture

2015-01-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292176#comment-14292176
 ] 

Joseph K. Bradley commented on SPARK-5400:
--

I agree this could be done either way: Algorithm[Model] or Model[Algorithm].  
For users, exposing the model type may be easiest; a person who is new to ML 
and wants to do some clustering will know the name of a clustering model 
(KMeans, GMM) but may not want to worry about picking an optimization 
algorithm.  So I'd vote for Model[Algorithm].

That said, internally, I agree that Algorithm[Model] would be handy for 
generalizing.  We could do the combination by having an internal LearningState 
class:

{code}
class GaussianMixture {
  def setOptimizer // once we have more than 1 optimization method

  def run = {
val opt = new EM(new GMMLearningState(this))
...
  }
}

private[mllib] class GMMLearningState extends OurModelAbstraction {
  def this(gm: GaussianMixture) = this(...)
}

class EM(model: OurModelAbstraction)
{code}
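
From the user's side the rename reads like this; a sketch assuming the current setK/setMaxIterations setters carry over unchanged, with only the class name changing:
{code}
import org.apache.spark.mllib.clustering.GaussianMixtureEM
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// `data` is assumed to be an RDD[Vector] prepared elsewhere.
// Under the proposal only the name changes: GaussianMixtureEM -> GaussianMixture.
def fitSketch(data: RDD[Vector]) =
  new GaussianMixtureEM()
    .setK(3)
    .setMaxIterations(100)
    .run(data)   // returns a GaussianMixtureModel
{code}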


 Rename GaussianMixtureEM to GaussianMixture
 ---

 Key: SPARK-5400
 URL: https://issues.apache.org/jira/browse/SPARK-5400
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 GaussianMixtureEM is following the old naming convention of including the 
 optimization algorithm name in the class title.  We should probably rename it 
 to GaussianMixture so that it can use other optimization algorithms in the 
 future.






[jira] [Commented] (SPARK-5162) Python yarn-cluster mode

2015-01-26 Thread Vladimir Grigor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292069#comment-14292069
 ] 

Vladimir Grigor commented on SPARK-5162:


I second [~jared.holmb...@orchestro.com]
[~lianhuiwang] thank you! I'm going to try your PR.

Related issue: even with this PR, there will be a problem using Yarn in cluster 
mode on Amazon EMR.

Normally one submits yarn jobs via API or aws command line utility, so paths 
to files are evaluated later at some remote host, hence files are not found. 
Currently Spark does not support non-local files. One idea would be to add 
support for non-local (python) files, eg: if file is not local it will be 
downloaded and made available locally. Something similar to Distributed Cache 
described at 
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-input-distributed-cache.html

So following code would work:
{code}
aws emr add-steps --cluster-id j-XYWIXMD234 \
--steps 
Name=SparkPi,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://mybucketat.amazonaws.com/tasks/main.py,main.py,param1],ActionOnFailure=CONTINUE
{code}

What do you think? What is your way to run batch python spark scripts on Yarn 
in Amazon?
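
A rough sketch of the "download non-local files first" idea, going through the Hadoop FileSystem API so the same path handles s3:// and hdfs:// URIs; the helper name and staging layout are made up for illustration:
{code}
import java.io.File
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// If the URI is already local, return it unchanged; otherwise copy it into a
// local staging directory and return the local path, mirroring the
// "download and make available locally" idea above.
def ensureLocal(uriStr: String, stagingDir: File, hadoopConf: Configuration): String = {
  val uri = new URI(uriStr)
  if (uri.getScheme == null || uri.getScheme == "file") {
    uriStr
  } else {
    val fs = FileSystem.get(uri, hadoopConf)
    val target = new File(stagingDir, new Path(uri).getName)
    fs.copyToLocalFile(new Path(uri), new Path(target.getAbsolutePath))
    target.getAbsolutePath
  }
}
{code}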

 Python yarn-cluster mode
 

 Key: SPARK-5162
 URL: https://issues.apache.org/jira/browse/SPARK-5162
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, YARN
Reporter: Dana Klassen
  Labels: cluster, python, yarn

 Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would 
 be great to be able to submit python applications to the cluster and (just 
 like java classes) have the resource manager setup an AM on any node in the 
 cluster. Does anyone know the issues blocking this feature? I was snooping 
 around with enabling python apps:
 Removing the logic stopping python and yarn-cluster from sparkSubmit.scala
 ...
 // The following modes are not supported or applicable
 (clusterManager, deployMode) match {
   ...
   case (_, CLUSTER) if args.isPython =>
 printErrorAndExit("Cluster deploy mode is currently not supported for 
 python applications.")
   ...
 }
 …
 and submitting application via:
 HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster 
 --num-executors 2  --py-files {{insert location of egg here}} 
 --executor-cores 1  ../tools/canary.py
 Everything looks to run alright, pythonRunner is picked up as main class, 
 resources get setup, yarn client gets launched but falls flat on its face:
 2015-01-08 18:48:03,444 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  DEBUG: FAILED { 
 {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 
 1420742868009, FILE, null }, Resource 
 {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed 
 on src filesystem (expected 1420742868009, was 1420742869284
 and
 2015-01-08 18:48:03,446 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
  Resource 
 {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(-/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py)
  transitioned from DOWNLOADING to FAILED
 Tracked this down to the apache hadoop code(FSDownload.java line 249) related 
 to container localization of files upon downloading. At this point thought it 
 would be best to raise the issue here and get input.






[jira] [Updated] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt

2015-01-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-5236:
--
Description: 
{code}
15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241)
at org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375)
at org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434)
at parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237)
at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
... 27 more
{code}


[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently

2015-01-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292448#comment-14292448
 ] 

Xuefu Zhang commented on SPARK-2688:


Yeah. We don't need syntactic sugar, but a transformation that just does one 
pass of the input RDD. This has performance implications on Hive's multi-insert 
use cases.

 Need a way to run multiple data pipeline concurrently
 -

 Key: SPARK-2688
 URL: https://issues.apache.org/jira/browse/SPARK-2688
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Xuefu Zhang

 Suppose we want to do the following data processing: 
 {code}
 rdd1 -> rdd2 -> rdd3
             | -> rdd4
             | -> rdd5
             \ -> rdd6
 {code}
 where -> represents a transformation. rdd3 to rdd6 are all derived from an 
 intermediate rdd2. We use foreach(fn) with a dummy function to trigger the 
 execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> 
 rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be 
 recomputed. This is very inefficient. Ideally, we should be able to trigger 
 the execution of the whole graph and reuse rdd2, but there doesn't seem to be a 
 way of doing so. Tez already realized the importance of this (TEZ-391), so I 
 think Spark should provide this too.
 This is required for Hive to support multi-insert queries. HIVE-7292.
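
What is possible today, short of a true single-pass transformation, is to cache the shared parent and submit the downstream actions as concurrent jobs so rdd2 is materialized once and reused; a minimal sketch (the transformations are placeholders for rdd3..rdd6):
{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("multi-pipeline-sketch").setMaster("local[4]"))

val rdd1 = sc.parallelize(1 to 1000000)
val rdd2 = rdd1.map(_ * 2).cache()          // shared parent, cached once

// Placeholder transformations standing in for rdd3..rdd6.
val downstream = Seq(
  rdd2.filter(_ % 3 == 0),
  rdd2.filter(_ % 5 == 0),
  rdd2.map(_ + 1),
  rdd2.map(_ - 1))

// Each action is a separate job, but they all read rdd2 from the cache
// instead of recomputing rdd1 -> rdd2 per pipeline.
val counts = Future.sequence(downstream.map(rdd => Future(rdd.count())))
println(Await.result(counts, Duration.Inf))

sc.stop()
{code}
This still pays for multiple scans of rdd2, which is why a genuine single-pass primitive would matter for the multi-insert case.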






[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2015-01-26 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292370#comment-14292370
 ] 

Imran Rashid commented on SPARK-3644:
-

[~joshrosen] Hi Josh, I've got time to implement this now.  You can assign to 
me if you like (or let me know if there is something else in the works ...)

 REST API for Spark application info (jobs / stages / tasks / storage info)
 --

 Key: SPARK-3644
 URL: https://issues.apache.org/jira/browse/SPARK-3644
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Reporter: Josh Rosen

 This JIRA is a forum to draft a design proposal for a REST interface for 
 accessing information about Spark applications, such as job / stage / task / 
 storage status.
 There have been a number of proposals to serve JSON representations of the 
 information displayed in Spark's web UI.  Given that we might redesign the 
 pages of the web UI (and possibly re-implement the UI as a client of a REST 
 API), the API endpoints and their responses should be independent of what we 
 choose to display on particular web UI pages / layouts.
 Let's start a discussion of what a good REST API would look like from 
 first-principles.  We can discuss what urls / endpoints expose access to 
 data, how our JSON responses will be formatted, how fields will be named, how 
 the API will be documented and tested, etc.
 Some links for inspiration:
 https://developer.github.com/v3/
 http://developer.netflix.com/docs/REST_API_Reference
 https://helloreverb.com/developers/swagger
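
Purely as a strawman for that discussion (nothing here is decided in this JIRA), the kind of endpoint and response shape being talked about might look like:
{code}
// Hypothetical endpoints, for discussion only:
//   GET /api/applications/{appId}/jobs        -> Seq[JobSummary]
//   GET /api/applications/{appId}/stages/{id} -> StageDetail
case class JobSummary(
  jobId: Int,
  status: String,              // e.g. RUNNING, SUCCEEDED, FAILED
  numTasks: Int,
  numCompletedTasks: Int,
  submissionTime: Option[Long])

case class StageDetail(
  stageId: Int,
  attemptId: Int,
  numTasks: Int,
  inputBytes: Long,
  shuffleReadBytes: Long,
  shuffleWriteBytes: Long)
{code}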






[jira] [Resolved] (SPARK-5339) build/mvn doesn't work because of invalid URL for maven's tgz.

2015-01-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5339.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Kousuke Saruta

 build/mvn doesn't work because of invalid URL for maven's tgz.
 --

 Key: SPARK-5339
 URL: https://issues.apache.org/jira/browse/SPARK-5339
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Blocker
 Fix For: 1.3.0


 build/mvn will automatically download tarball of maven. But currently, the 
 URL is invalid. 






[jira] [Commented] (SPARK-3789) Python bindings for GraphX

2015-01-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292477#comment-14292477
 ] 

Reynold Xin commented on SPARK-3789:


Unfortunately this is not going to make it into 1.3, given the code freeze 
deadline is in 1 week.

[~kdatta1978] thanks for working on this. Can you write some high level design 
document for this change?

 Python bindings for GraphX
 --

 Key: SPARK-3789
 URL: https://issues.apache.org/jira/browse/SPARK-3789
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, PySpark
Reporter: Ameet Talwalkar
Assignee: Kushal Datta





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5411) Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292494#comment-14292494
 ] 

Apache Spark commented on SPARK-5411:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4111

 Allow SparkListeners to be specified in SparkConf and loaded when creating 
 SparkContext
 ---

 Key: SPARK-5411
 URL: https://issues.apache.org/jira/browse/SPARK-5411
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen

 It would be nice if there was a mechanism to allow SparkListeners to be 
 registered through SparkConf settings.  This would allow monitoring 
 frameworks to be easily injected into Spark programs without having to modify 
 those programs' code.
 I propose to introduce a new configuration option, {{spark.extraListeners}}, 
 that allows SparkListeners to be specified in SparkConf and registered before 
 the SparkContext is created.  Here is the proposed documentation for the new 
 option:
 {quote}
 A comma-separated list of classes that implement SparkListener; when 
 initializing SparkContext, instances of these classes will be created and 
 registered with Spark's listener bus. If a class has a single-argument 
 constructor that accepts a SparkConf, that constructor will be called; 
 otherwise, a zero-argument constructor will be called. If no valid 
 constructor can be found, the SparkContext creation will fail with an 
 exception.
 {quote}
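
To make the proposal concrete, here is a minimal sketch of how such a listener might 
look and be wired up, assuming the proposed {{spark.extraListeners}} option behaves as 
described above. The class name and the setting below are illustrative only; nothing 
here is part of a released API:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// A listener with a single SparkConf-argument constructor, matching the
// constructor-resolution rules in the proposed documentation.
class AppEndLogger(conf: SparkConf) extends SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    println(s"App '${conf.get("spark.app.name", "unknown")}' ended at ${end.time}")
  }
}

object ExtraListenersExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("extra-listeners-demo")
      // Comma-separated, fully qualified class names (hypothetical usage of the
      // proposed option; today this would be done via sc.addSparkListener).
      .set("spark.extraListeners", "AppEndLogger")
    val sc = new SparkContext(conf)  // listeners would be created and registered here
    sc.parallelize(1 to 10).count()
    sc.stop()
  }
}
{code}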



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5411) Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext

2015-01-26 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-5411:
-

 Summary: Allow SparkListeners to be specified in SparkConf and 
loaded when creating SparkContext
 Key: SPARK-5411
 URL: https://issues.apache.org/jira/browse/SPARK-5411
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen


It would be nice if there was a mechanism to allow SparkListeners to be 
registered through SparkConf settings.  This would allow monitoring frameworks 
to be easily injected into Spark programs without having to modify those 
programs' code.

I propose to introduce a new configuration option, {{spark.extraListeners}}, 
that allows SparkListeners to be specified in SparkConf and registered before 
the SparkContext is created.  Here is the proposed documentation for the new 
option:

{quote}
A comma-separated list of classes that implement SparkListener; when 
initializing SparkContext, instances of these classes will be created and 
registered with Spark's listener bus. If a class has a single-argument 
constructor that accepts a SparkConf, that constructor will be called; 
otherwise, a zero-argument constructor will be called. If no valid constructor 
can be found, the SparkContext creation will fail with an exception.
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) Python bindings for GraphX

2015-01-26 Thread Kushal Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292499#comment-14292499
 ] 

Kushal Datta commented on SPARK-3789:
-

Sure, I will write up the design document.
@Ameet, do you think you can work from another branch that is not targeting 1.3?



 Python bindings for GraphX
 --

 Key: SPARK-3789
 URL: https://issues.apache.org/jira/browse/SPARK-3789
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, PySpark
Reporter: Ameet Talwalkar
Assignee: Kushal Datta





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently

2015-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292431#comment-14292431
 ] 

Sean Owen commented on SPARK-2688:
--

As [~irashid] says, #1 is just syntactic sugar on what you can already do in 
Spark. I'm not clear, then, how something can need this functionality badly: 
either it isn't really blocking anything, and let's establish that, or let's 
discuss what beyond #1 is actually needed.

What I think people want is a miniature push-based evaluation mechanism inside 
Spark's pull-based DAG evaluation: force evaluation of N children of one 
parent at once. The outcome of a sidebar I had with Sandy on this was that it is 
probably a) fraught with gotchas, given the push-vs-pull mismatch, but not 
impossible, and b) likely to force the children to be persisted in the general 
case, with possible optimizations in other special cases.

Is that the kind of thing Hive on Spark needs, and if so, can we hear a concrete 
elaboration of an example, so we can compare it with what's possible now? 
I still sense there's a mismatch between the perception and the reality of what's 
possible with the current API. Hence, there may be some really good news here.

 Need a way to run multiple data pipeline concurrently
 -

 Key: SPARK-2688
 URL: https://issues.apache.org/jira/browse/SPARK-2688
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Xuefu Zhang

 Suppose we want to do the following data processing: 
 {code}
 rdd1 -> rdd2 -> rdd3
           | -> rdd4
           | -> rdd5
           \ -> rdd6
 {code}
 where -> represents a transformation. rdd3 to rdd6 are all derived from an 
 intermediate rdd2. We use foreach(fn) with a dummy function to trigger the 
 execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> 
 rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be 
 recomputed. This is very inefficient. Ideally, we should be able to trigger 
 execution of the whole graph and reuse rdd2, but there doesn't seem to be a 
 way to do so. Tez already realized the importance of this (TEZ-391), so I 
 think Spark should provide this too.
 This is required for Hive to support multi-insert queries. HIVE-7292.
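
For reference, a minimal sketch of the existing-API workaround [~srowen] alludes to: 
persist the shared parent and submit the downstream actions from separate threads, so 
the scheduler can overlap them and rdd2 is not recomputed for each child. The RDD names 
follow the diagram above and are purely illustrative; this is not a proposed API:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object SharedParentExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("shared-parent"))

    val rdd1 = sc.parallelize(1 to 1000000)
    val rdd2 = rdd1.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)  // shared intermediate

    // rdd3..rdd6 all read rdd2; launching their actions concurrently lets the
    // scheduler overlap them, and persistence avoids recomputing rdd2 per child.
    val children = Seq(
      rdd2.filter(_ % 3 == 0),
      rdd2.filter(_ % 5 == 0),
      rdd2.map(_ + 1),
      rdd2.map(_ - 1))
    val jobs = children.map(child => Future { child.count() })
    jobs.foreach(job => Await.ready(job, Duration.Inf))

    rdd2.unpersist()
    sc.stop()
  }
}
{code}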



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-26 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292399#comment-14292399
 ] 

Dmitriy Lyubimov commented on SPARK-5226:
-

All the recent attempts to parallelize DBSCAN (or similar DeLiClu-style 
algorithms) that I have read about in the literature partition the task into 
smaller subtasks, solve each one individually, and merge everything back 
together (see the MR.Scan paper, for example). The merging is, of course, the 
new and tricky part.

As far as I understand, they all pretty much reduce the scope to Euclidean 
distances and capitalize on the resulting notions of Euclidean geometry in 
order to solve the partition and merge problems, which substantially reduces 
the attractiveness of a general algorithm. However, a naive, straightforward 
port of the simple DBSCAN algorithm is not terribly practical for big data 
because of the total complexity of the problem (or the impracticality of 
building something like a huge distributed R-tree index on shared-nothing 
programming models).

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Muhammad-Ali A'rabi
Priority: Minor
  Labels: DBSCAN

 MLlib is all k-means now, and I think we should add some new clustering 
 algorithms to it. The first candidate, I think, is DBSCAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently

2015-01-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292415#comment-14292415
 ] 

Xuefu Zhang commented on SPARK-2688:


#1 above is exactly what Hive needs badly.

 Need a way to run multiple data pipeline concurrently
 -

 Key: SPARK-2688
 URL: https://issues.apache.org/jira/browse/SPARK-2688
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Xuefu Zhang

 Suppose we want to do the following data processing: 
 {code}
 rdd1 -> rdd2 -> rdd3
           | -> rdd4
           | -> rdd5
           \ -> rdd6
 {code}
 where -> represents a transformation. rdd3 to rdd6 are all derived from an 
 intermediate rdd2. We use foreach(fn) with a dummy function to trigger the 
 execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> 
 rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be 
 recomputed. This is very inefficient. Ideally, we should be able to trigger 
 execution of the whole graph and reuse rdd2, but there doesn't seem to be a 
 way to do so. Tez already realized the importance of this (TEZ-391), so I 
 think Spark should provide this too.
 This is required for Hive to support multi-insert queries. HIVE-7292.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) Python bindings for GraphX

2015-01-26 Thread Kushal Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292468#comment-14292468
 ] 

Kushal Datta commented on SPARK-3789:
-

Hi Ameet,

Sorry for asking this question again.
What's the release plan for 1.3?

-Kushal.

 Python bindings for GraphX
 --

 Key: SPARK-3789
 URL: https://issues.apache.org/jira/browse/SPARK-3789
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, PySpark
Reporter: Ameet Talwalkar
Assignee: Kushal Datta





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5416) Initialize Executor.threadPool before ExecutorSource

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292784#comment-14292784
 ] 

Apache Spark commented on SPARK-5416:
-

User 'ryan-williams' has created a pull request for this issue:
https://github.com/apache/spark/pull/4212

 Initialize Executor.threadPool before ExecutorSource
 

 Key: SPARK-5416
 URL: https://issues.apache.org/jira/browse/SPARK-5416
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor

 I recently saw some NPEs from 
 [{{ExecutorSource:44}}|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala#L44]
  in the first couple seconds of my executors' being initialized.
 I think that {{ExecutorSource}} was trying to report these metrics before its 
 threadpool was initialized; there are a few LoC between the source being 
 registered 
 ([Executor.scala:82|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L82])
  and the threadpool being initialized 
 ([Executor.scala:106|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L106]).
 We should initialize the threadpool before the ExecutorSource is registered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3562) Periodic cleanup event logs

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292818#comment-14292818
 ] 

Apache Spark commented on SPARK-3562:
-

User 'viper-kun' has created a pull request for this issue:
https://github.com/apache/spark/pull/4214

 Periodic cleanup event logs
 ---

 Key: SPARK-3562
 URL: https://issues.apache.org/jira/browse/SPARK-3562
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: xukun

  If we run Spark applications frequently, many event logs will be written into 
 spark.eventLog.dir. After a long time, the directory will contain many event 
 logs that we no longer care about. Periodic cleanups would ensure that logs 
 older than a configured duration are forgotten, so there is no need to clean 
 up the logs by hand.
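
A minimal sketch of what such a TTL-based cleanup could look like; the directory 
layout, method name, and scheduling are assumptions for illustration, not the actual 
patch:

{code}
import java.io.File

object EventLogCleaner {
  // Delete entries under spark.eventLog.dir whose last modification is older
  // than maxAgeMs. Event logs may be single files or per-application directories.
  def cleanOldLogs(eventLogDir: File, maxAgeMs: Long): Unit = {
    val cutoff = System.currentTimeMillis() - maxAgeMs
    val entries = Option(eventLogDir.listFiles()).getOrElse(Array.empty[File])
    entries.filter(_.lastModified() < cutoff).foreach { entry =>
      if (entry.isDirectory) {
        Option(entry.listFiles()).getOrElse(Array.empty[File]).foreach(_.delete())
      }
      entry.delete()
    }
  }
}
{code}

A long-running process such as the history server could invoke this on a configurable 
interval so that stale logs are removed automatically.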



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL

2015-01-26 Thread Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan updated SPARK-3880:
---
Attachment: SparkSQLOnHBase_v2.0.docx

 HBase as data source to SparkSQL
 

 Key: SPARK-3880
 URL: https://issues.apache.org/jira/browse/SPARK-3880
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yan
Assignee: Yan
 Attachments: HBaseOnSpark.docx, SparkSQLOnHBase_v2.0.docx






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL

2015-01-26 Thread Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan updated SPARK-3880:
---
Attachment: (was: SparkSQLOnHBase_v2.docx)

 HBase as data source to SparkSQL
 

 Key: SPARK-3880
 URL: https://issues.apache.org/jira/browse/SPARK-3880
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yan
Assignee: Yan
 Attachments: HBaseOnSpark.docx






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway

2015-01-26 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292874#comment-14292874
 ] 

Andrew Or commented on SPARK-5388:
--

Hi Dale, thank you for your comments. Yes, in the design doc I used REST 
roughly interchangeably with HTTP/JSON. But the goal is not to provide a 
mechanism for other entities to communicate with the Master as you suggested; 
it is simply to provide a stable mechanism for Spark to work across multiple 
versions. For instance, you might have a long-running Master that outlives 
multiple Spark versions, in which case we want to guarantee that newer versions 
of Spark will still be able to submit to the long-running Master.

I think your proposal to make this more REST-like is potentially a great idea. 
However, I find the alternative of simply putting the action in the JSON itself 
easier to reason about. This also allows us to add other messages in the future 
that are not strictly limited to the semantics of GET, POST, and DELETE. That 
said, my proposal is also not set in stone yet so if there is a reason 
compelling enough to change it then I will do so.

Also, a first-cut implementation of my design is now posted at: 
https://github.com/apache/spark/pull/4216. Please take a look if you feel 
inclined.
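
For illustration only, the "action in the JSON itself" alternative might look roughly 
like the following: a single POST endpoint, with the message type carried as a field 
rather than encoded in the HTTP verb or URL. The field names here are made up for the 
sketch and are not the actual protocol:

{code}
// One POST endpoint; the server dispatches on the "action" field.
case class SubmitDriverRequest(
    action: String,              // e.g. "SubmitDriverRequest" or "KillDriverRequest"
    clientSparkVersion: String,  // lets old and new versions negotiate compatibility
    appResource: String,         // path or URL of the application jar
    mainClass: String,
    appArgs: Seq[String],
    sparkProperties: Map[String, String])
{code}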

 Provide a stable application submission gateway
 ---

 Key: SPARK-5388
 URL: https://issues.apache.org/jira/browse/SPARK-5388
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Attachments: Stable Spark Standalone Submission.pdf


 The existing submission gateway in standalone mode is not compatible across 
 Spark versions. If you have a newer version of Spark submitting to an older 
 version of the standalone Master, it is currently not guaranteed to work. The 
 goal is to provide a stable REST interface to replace this channel.
 The first cut implementation will target standalone cluster mode because 
 there are very few messages exchanged. The design, however, will be general 
 enough to eventually support this for other cluster managers too. Note that 
 this is not necessarily required in YARN because we already use YARN's stable 
 interface to submit applications there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2015-01-26 Thread Joseph Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292853#comment-14292853
 ] 

Joseph Tang edited comment on SPARK-4846 at 1/27/15 2:46 AM:
-

Sorry about the procrastination. I just thought you meant there is no need to 
implement a dynamic strategy. I'm still working on it and I'd like to quickly 
fix this issue.

Regarding your previous comment, should I throw a customized error in Spark, or 
just let the OOM happen, in addition to the hint about minCount and vectorSize? 


was (Author: josephtang):
Sorry about the procrastination. I'm still working on this.

Regarding your previous comment, should I throw a customized error in Spark or 
just an OOM besides the hint about minCount and vectorSize? 

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Assignee: Joseph Tang
Priority: Minor

 Exception in thread "Driver" java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5422) Support sending to Graphite via UDP

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292925#comment-14292925
 ] 

Apache Spark commented on SPARK-5422:
-

User 'ryan-williams' has created a pull request for this issue:
https://github.com/apache/spark/pull/4218

 Support sending to Graphite via UDP
 ---

 Key: SPARK-5422
 URL: https://issues.apache.org/jira/browse/SPARK-5422
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor

 {{io.dropwizard.metrics-graphite}} version {{3.1.0}} can send metrics to 
 Graphite via UDP or TCP.
 After upgrading 
 ([SPARK-5413|https://issues.apache.org/jira/browse/SPARK-5413]), we should 
 support using this facility, presumably specified via a protocol field in 
 the metrics config file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2015-01-26 Thread Joseph Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292926#comment-14292926
 ] 

Joseph Tang commented on SPARK-4846:


I've added some code at 
https://github.com/jinntrance/spark/compare/w2v-fix?diff=splitname=w2v-fix

If it's OK, I would send a new PR to the branch `master`.

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Assignee: Joseph Tang
Priority: Minor

 Exception in thread "Driver" java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
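
For context, a back-of-the-envelope check of why the vocabulary size reported in the 
environment above blows past the JVM's maximum array length once the flattened model is 
serialized into a single byte array. The vector size and 4-byte floats are assumptions 
for illustration:

{code}
object Word2VecSizeCheck {
  def main(args: Array[String]): Unit = {
    val vocabSize  = 10000000L  // ~10 million words, per the report above
    val vectorSize = 100L       // assumed dimensionality
    val bytes      = vocabSize * vectorSize * 4L  // 4 bytes per float

    println(s"flattened model payload ~ $bytes bytes (~${bytes / (1L << 30)} GiB)")
    println(s"exceeds max JVM array length (${Int.MaxValue})? ${bytes > Int.MaxValue}")
  }
}
{code}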



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-26 Thread Luca Morandini (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292998#comment-14292998
 ] 

Luca Morandini commented on SPARK-1405:
---

Indeed, I have a couple of students whose assignments involve Twitter data, and I 
am considering adding LDA to the mix. I would like to test it on our corpus... 
provided this feature is usable by a Spark novice: is it?


 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (already solved), a word segmentation component (imported 
 from Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5417) Remove redundant executor-ID set() call

2015-01-26 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-5417:


 Summary: Remove redundant executor-ID set() call
 Key: SPARK-5417
 URL: https://issues.apache.org/jira/browse/SPARK-5417
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor


{{spark.executor.id}} no longer [needs to be set in 
Executor.scala|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L79],
 as of [#4194|https://github.com/apache/spark/pull/4194]; it is set upstream in 
[SparkEnv|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/SparkEnv.scala#L332].
 Might as well remove the redundant set() in Executor.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5119) java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5119:
-
Assignee: Kai Sasaki

 java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree 
 model
 ---

 Key: SPARK-5119
 URL: https://issues.apache.org/jira/browse/SPARK-5119
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 1.1.0, 1.2.0
 Environment: Linux ubuntu 14.04
Reporter: Vivek Kulkarni
Assignee: Kai Sasaki
 Fix For: 1.3.0


 First I tried to see if a bug with a similar trace had been raised before. I 
 found https://www.mail-archive.com/user@spark.apache.org/msg13708.html, but 
 the suggestion to upgrade to the latest code base (I cloned from the master 
 branch) does not fix this issue.
 Issue: try to train a decision tree classifier on some data. After training, 
 when it begins to collect, it crashes:
 15/01/06 22:28:15 INFO BlockManagerMaster: Updated info of block rdd_52_1
 15/01/06 22:28:15 ERROR Executor: Exception in task 1.0 in stage 31.0 (TID 
 1895)
 java.lang.ArrayIndexOutOfBoundsException: -1
 at 
 org.apache.spark.mllib.tree.impurity.GiniAggregator.update(Gini.scala:93)
 at 
 org.apache.spark.mllib.tree.impl.DTStatsAggregator.update(DTStatsAggregator.scala:100)
 at 
 org.apache.spark.mllib.tree.DecisionTree$.orderedBinSeqOp(DecisionTree.scala:419)
 at 
 org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$nodeBinSeqOp$1(DecisionTree.scala:511)
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:536
 )
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:533
 )
 at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
 at 
 org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:533)
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628)
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 Minimal code:
  data = MLUtils.loadLibSVMFile(sc, 
 '/scratch1/vivek/datasets/private/a1a').cache()
 model = DecisionTree.trainClassifier(data, numClasses=2, 
 categoricalFeaturesInfo={}, maxDepth=5, maxBins=100)
 Just download the data from: 
 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2015-01-26 Thread Joseph Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292853#comment-14292853
 ] 

Joseph Tang commented on SPARK-4846:


Sorry about the procrastination. I'm still working on this.

Regarding your previous comment, should I throw a customized error in Spark, or 
just let the OOM happen, in addition to the hint about minCount and vectorSize? 

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Assignee: Joseph Tang
Priority: Minor

 Exception in thread "Driver" java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2015-01-26 Thread Joseph Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292855#comment-14292855
 ] 

Joseph Tang commented on SPARK-4846:


Sorry about the procrastination. I'm still working on this.

Regarding your previous comment, should I throw a customized error in Spark, or 
just let the OOM happen, in addition to the hint about minCount and vectorSize? 

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Assignee: Joseph Tang
Priority: Minor

 Exception in thread "Driver" java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4979) Add streaming logistic regression

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4979:
-
Assignee: Jeremy Freeman

 Add streaming logistic regression
 -

 Key: SPARK-4979
 URL: https://issues.apache.org/jira/browse/SPARK-4979
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman
Assignee: Jeremy Freeman
Priority: Minor

 We currently support streaming linear regression and k-means clustering. We 
 can add support for streaming logistic regression using a strategy similar to 
 that used in streaming linear regression, applying gradient updates to 
 batches of data from a DStream, and extending the existing mllib methods with 
 minor modifications.
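
For orientation, here is how the existing streaming linear regression is driven today; 
the proposed streaming logistic regression would presumably follow the same pattern. 
Only the linear-regression calls below exist at the time of writing, and the input 
paths and feature count are placeholders:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingRegressionSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("streaming-lr"), Seconds(1))

    // Text streams of LabeledPoint.parse-able lines, e.g. "(1.0,[0.5,0.2,0.1])".
    val training = ssc.textFileStream("/tmp/train").map(LabeledPoint.parse)
    val test     = ssc.textFileStream("/tmp/test").map(LabeledPoint.parse)

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(3))  // assumes 3 features

    model.trainOn(training)  // gradient updates applied per DStream batch
    model.predictOnValues(test.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}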



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5421) SparkSql throw OOM at shuffle

2015-01-26 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-5421:
-
Description: 
ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but 
Spark SQL's ShuffledRDD does not define an aggregator, so Spark SQL won't spill at 
shuffle, and it is very easy to hit an OOM there.
From one executor's log, here is stderr:
15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs for 
shuffle 1, fetching them
15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
actor = 
Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484]
15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations
15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 
non-empty blocks out of 143 blocks
15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote 
fetches in 72 ms
15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 
15: SIGTERM

here is  stdout:
2015-01-27T07:44:43.487+0800: [Full GC 3961343K-3959868K(3961344K), 29.8959290 
secs]
2015-01-27T07:45:13.460+0800: [Full GC 3961343K-3959992K(3961344K), 27.9218150 
secs]
2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs]
2015-01-27T07:45:52.950+0800: [Full GC 3961343K-3960113K(3961344K), 29.3894670 
secs]
2015-01-27T07:46:22.393+0800: [Full GC 3961118K-3960240K(3961344K), 28.9879600 
secs]
2015-01-27T07:46:51.393+0800: [Full GC 3960240K-3960213K(3961344K), 34.1530900 
secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError=kill %p
#   Executing /bin/sh -c kill 9050...
2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs]


  was:ExternalAppendOnlyMap if only for the spark job that aggregator 
isDefined,  but sparkSQL's shuffledRDD haven't define aggregator, so sparkSQL 
won't spill at shuffle, it's very easy to throw OOM at shuffle.


 SparkSql throw OOM at shuffle
 -

 Key: SPARK-5421
 URL: https://issues.apache.org/jira/browse/SPARK-5421
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, 
 but Spark SQL's ShuffledRDD does not define an aggregator, so Spark SQL won't 
 spill at shuffle, and it is very easy to hit an OOM there.
 From one executor's log, here is stderr:
 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
 for shuffle 1, fetching them
 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
 actor = 
 Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484]
 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations
 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 
 non-empty blocks out of 143 blocks
 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote 
 fetches in 72 ms
 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
 SIGNAL 15: SIGTERM
 here is  stdout:
 2015-01-27T07:44:43.487+0800: [Full GC 3961343K-3959868K(3961344K), 
 29.8959290 secs]
 2015-01-27T07:45:13.460+0800: [Full GC 3961343K-3959992K(3961344K), 
 27.9218150 secs]
 2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs]
 2015-01-27T07:45:52.950+0800: [Full GC 3961343K-3960113K(3961344K), 
 29.3894670 secs]
 2015-01-27T07:46:22.393+0800: [Full GC 3961118K-3960240K(3961344K), 
 28.9879600 secs]
 2015-01-27T07:46:51.393+0800: [Full GC 3960240K-3960213K(3961344K), 
 34.1530900 secs]
 #
 # java.lang.OutOfMemoryError: Java heap space
 # -XX:OnOutOfMemoryError=kill %p
 #   Executing /bin/sh -c kill 9050...
 2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5421) SparkSql throw OOM at shuffle

2015-01-26 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-5421:
-
Description: 
ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but 
Spark SQL's ShuffledRDD does not define an aggregator, so Spark SQL won't spill at 
shuffle, and it is very easy to hit an OOM there. I think Spark SQL also needs to 
spill at shuffle.
From one executor's log, here is stderr:
15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs for 
shuffle 1, fetching them
15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
actor = 
Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484]
15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations
15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 
non-empty blocks out of 143 blocks
15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote 
fetches in 72 ms
15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 
15: SIGTERM

here is  stdout:
2015-01-27T07:44:43.487+0800: [Full GC 3961343K-3959868K(3961344K), 29.8959290 
secs]
2015-01-27T07:45:13.460+0800: [Full GC 3961343K-3959992K(3961344K), 27.9218150 
secs]
2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs]
2015-01-27T07:45:52.950+0800: [Full GC 3961343K-3960113K(3961344K), 29.3894670 
secs]
2015-01-27T07:46:22.393+0800: [Full GC 3961118K-3960240K(3961344K), 28.9879600 
secs]
2015-01-27T07:46:51.393+0800: [Full GC 3960240K-3960213K(3961344K), 34.1530900 
secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError=kill %p
#   Executing /bin/sh -c kill 9050...
2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs]


  was:
ExternalAppendOnlyMap if only for the spark job that aggregator isDefined,  but 
sparkSQL's shuffledRDD haven't define aggregator, so sparkSQL won't spill at 
shuffle, it's very easy to throw OOM at shuffle.
One of the executor's log, here is  stderr:
15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs for 
shuffle 1, fetching them
15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
actor = 
Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484]
15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations
15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 
non-empty blocks out of 143 blocks
15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote 
fetches in 72 ms
15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 
15: SIGTERM

here is  stdout:
2015-01-27T07:44:43.487+0800: [Full GC 3961343K-3959868K(3961344K), 29.8959290 
secs]
2015-01-27T07:45:13.460+0800: [Full GC 3961343K-3959992K(3961344K), 27.9218150 
secs]
2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs]
2015-01-27T07:45:52.950+0800: [Full GC 3961343K-3960113K(3961344K), 29.3894670 
secs]
2015-01-27T07:46:22.393+0800: [Full GC 3961118K-3960240K(3961344K), 28.9879600 
secs]
2015-01-27T07:46:51.393+0800: [Full GC 3960240K-3960213K(3961344K), 34.1530900 
secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError=kill %p
#   Executing /bin/sh -c kill 9050...
2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs]



 SparkSql throw OOM at shuffle
 -

 Key: SPARK-5421
 URL: https://issues.apache.org/jira/browse/SPARK-5421
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, 
 but Spark SQL's ShuffledRDD does not define an aggregator, so Spark SQL won't 
 spill at shuffle, and it is very easy to hit an OOM there. I think Spark SQL 
 also needs to spill at shuffle.
 From one executor's log, here is stderr:
 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
 for shuffle 1, fetching them
 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
 actor = 
 Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484]
 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations
 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 
 non-empty blocks out of 143 blocks
 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote 
 fetches in 72 ms
 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
 SIGNAL 15: SIGTERM
 here is  stdout:
 2015-01-27T07:44:43.487+0800: [Full GC 3961343K-3959868K(3961344K), 
 29.8959290 secs]
 2015-01-27T07:45:13.460+0800: [Full GC 3961343K-3959992K(3961344K), 
 27.9218150 secs]
 2015-01-27T07:45:41.407+0800: [GC 

[jira] [Updated] (SPARK-3726) RandomForest: Support for bootstrap options

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3726:
-
Target Version/s: 1.3.0

 RandomForest: Support for bootstrap options
 ---

 Key: SPARK-3726
 URL: https://issues.apache.org/jira/browse/SPARK-3726
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar
Priority: Minor
 Fix For: 1.3.0


 RandomForest uses BaggedPoint to simulate bootstrapped samples of the data.  
 The expected size of each sample is the same as the original data (sampling 
 rate = 1.0), and sampling is done with replacement.  Adding support for other 
 sampling rates and for sampling without replacement would be useful.
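
A hedged sketch of the two sampling modes being proposed, independent of the actual 
BaggedPoint code: with replacement, the per-point count for one bootstrap sample can be 
drawn from a Poisson with mean equal to the sampling rate; without replacement, each 
point is kept with probability equal to the rate. commons-math3 is used here only for 
the Poisson draw:

{code}
import org.apache.commons.math3.distribution.PoissonDistribution
import scala.util.Random

object BootstrapSketch {
  // With replacement: how many times each point appears in one bootstrap sample.
  def withReplacementCounts(numPoints: Int, rate: Double, seed: Long): Array[Int] = {
    val poisson = new PoissonDistribution(rate)
    poisson.reseedRandomGenerator(seed)
    Array.fill(numPoints)(poisson.sample())
  }

  // Without replacement: a 0/1 indicator per point, kept with probability `rate`.
  def withoutReplacementIndicators(numPoints: Int, rate: Double, seed: Long): Array[Int] = {
    val rng = new Random(seed)
    Array.fill(numPoints)(if (rng.nextDouble() < rate) 1 else 0)
  }
}
{code}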



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293001#comment-14293001
 ] 

Joseph K. Bradley commented on SPARK-1405:
--

It has not yet been merged into Spark master, but hopefully will be soon.  The 
initial version should be usable.  We will continue to add improvements to the 
API, especially helper functionality such as prediction and summaries about the 
learned model.

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (already solved), a word segmentation component (imported 
 from Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5416) Initialize Executor.threadPool before ExecutorSource

2015-01-26 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-5416:


 Summary: Initialize Executor.threadPool before ExecutorSource
 Key: SPARK-5416
 URL: https://issues.apache.org/jira/browse/SPARK-5416
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor


I recently saw some NPEs from 
[{{ExecutorSource:44}}|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala#L44]
 in the first couple seconds of my executors' being initialized.

I think that {{ExecutorSource}} was trying to report these metrics before its 
threadpool was initialized; there are a few LoC between the source being 
registered 
([Executor.scala:82|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L82])
 and the threadpool being initialized 
([Executor.scala:106|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L106]).

We should initialize the threadpool before the ExecutorSource is registered.
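
A minimal sketch of the proposed ordering fix; the names and structure are illustrative 
rather than the actual Executor.scala layout. The point is simply to construct the 
thread pool first and only then register the metrics source that reads from it, so its 
gauges never see a null pool:

{code}
import java.util.concurrent.{Executors, ThreadPoolExecutor}

// Stand-in for ExecutorSource: its gauges dereference the pool when metrics
// are reported, which would NPE if the pool had not been created yet.
class ExecutorSourceSketch(pool: ThreadPoolExecutor) {
  def activeTasks: Int = pool.getActiveCount
}

class ExecutorSketch {
  // 1. Initialize the thread pool first...
  val threadPool: ThreadPoolExecutor =
    Executors.newCachedThreadPool().asInstanceOf[ThreadPoolExecutor]

  // 2. ...then register the source that reads from it.
  private val executorSource = new ExecutorSourceSketch(threadPool)
}
{code}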



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5341) Support maven coordinates in spark-shell and spark-submit

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292826#comment-14292826
 ] 

Apache Spark commented on SPARK-5341:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/4215

 Support maven coordinates in spark-shell and spark-submit
 -

 Key: SPARK-5341
 URL: https://issues.apache.org/jira/browse/SPARK-5341
 Project: Spark
  Issue Type: New Feature
  Components: Deploy, Spark Shell
Reporter: Burak Yavuz

 This feature will allow users to provide the Maven coordinates of jars they 
 wish to use in their Spark application. Coordinates can be supplied as a 
 comma-delimited list, like:
 ```spark-submit --maven org.apache.example.a,org.apache.example.b```
 This feature will also be added to spark-shell (where it is even more 
 critical to have).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5119) java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5119:
-
Target Version/s: 1.3.0

 java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree 
 model
 ---

 Key: SPARK-5119
 URL: https://issues.apache.org/jira/browse/SPARK-5119
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 1.1.0, 1.2.0
 Environment: Linux ubuntu 14.04
Reporter: Vivek Kulkarni
Assignee: Kai Sasaki
 Fix For: 1.3.0


 First I tried to see if a bug with a similar trace had been raised before. I 
 found https://www.mail-archive.com/user@spark.apache.org/msg13708.html, but 
 the suggestion to upgrade to the latest code base (I cloned from the master 
 branch) does not fix this issue.
 Issue: try to train a decision tree classifier on some data. After training, 
 when it begins to collect, it crashes:
 15/01/06 22:28:15 INFO BlockManagerMaster: Updated info of block rdd_52_1
 15/01/06 22:28:15 ERROR Executor: Exception in task 1.0 in stage 31.0 (TID 
 1895)
 java.lang.ArrayIndexOutOfBoundsException: -1
 at 
 org.apache.spark.mllib.tree.impurity.GiniAggregator.update(Gini.scala:93)
 at 
 org.apache.spark.mllib.tree.impl.DTStatsAggregator.update(DTStatsAggregator.scala:100)
 at 
 org.apache.spark.mllib.tree.DecisionTree$.orderedBinSeqOp(DecisionTree.scala:419)
 at 
 org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$nodeBinSeqOp$1(DecisionTree.scala:511)
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:536
 )
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:533
 )
 at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
 at 
 org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:533)
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628)
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 Minimal code:
  data = MLUtils.loadLibSVMFile(sc, 
 '/scratch1/vivek/datasets/private/a1a').cache()
 model = DecisionTree.trainClassifier(data, numClasses=2, 
 categoricalFeaturesInfo={}, maxDepth=5, maxBins=100)
 Just download the data from: 
 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5418) Output directory for shuffle should consider left space of each directory set in conf

2015-01-26 Thread ding (JIRA)
ding created SPARK-5418:
---

 Summary: Output directory for shuffle should consider left space 
of each directory set in conf
 Key: SPARK-5418
 URL: https://issues.apache.org/jira/browse/SPARK-5418
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
 Environment: Ubuntu, others should be similar
Reporter: ding
Priority: Minor


I set multiple directories in the conf spark.local.dir as scratch space; one of 
them (e.g. /mnt/disk1) has 30G of space left while the others (e.g. /mnt/disk2) 
have 100G. In the current version, Spark uses a hash to decide which directory 
is used for scratch space, which means every directory has the same chance of 
being picked. After hundreds of iterations of PageRank, a "No space left on 
device" exception is thrown and the driver crashes. This does not make sense, 
since there is still 70G+ of space left in the other directories. We should 
take the remaining space of each directory into account when choosing the map 
output directory. I will send a PR for this.
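
A minimal sketch of what such free-space-weighted selection could look like 
(the helper name pickLocalDir is hypothetical, not the actual Spark code):
{code}
import java.io.File
import scala.util.Random

// Choose a scratch directory with probability proportional to its remaining
// usable space, instead of giving every directory the same chance.
def pickLocalDir(localDirs: Seq[File], rng: Random = new Random()): File = {
  val free  = localDirs.map(_.getUsableSpace.max(1L))  // bytes left per dir
  val total = free.sum
  var draw  = (rng.nextDouble() * total).toLong        // point in [0, total)
  localDirs.zip(free).find { case (_, space) =>
    draw -= space
    draw < 0
  }.map(_._1).getOrElse(localDirs.last)
}
{code}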



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5119) java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5119.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3975
[https://github.com/apache/spark/pull/3975]

 java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree 
 model
 ---

 Key: SPARK-5119
 URL: https://issues.apache.org/jira/browse/SPARK-5119
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 1.1.0, 1.2.0
 Environment: Linux ubuntu 14.04
Reporter: Vivek Kulkarni
 Fix For: 1.3.0


 First I tried to see if a bug with a similar trace had been raised before. I 
 found https://www.mail-archive.com/user@spark.apache.org/msg13708.html but 
 the suggestion to upgrade to the latest code base (I cloned from the master 
 branch) does not fix this issue.
 Issue: try to train a decision tree classifier on some data. After training, 
 when it begins to collect, it crashes:
 15/01/06 22:28:15 INFO BlockManagerMaster: Updated info of block rdd_52_1
 15/01/06 22:28:15 ERROR Executor: Exception in task 1.0 in stage 31.0 (TID 
 1895)
 java.lang.ArrayIndexOutOfBoundsException: -1
 at 
 org.apache.spark.mllib.tree.impurity.GiniAggregator.update(Gini.scala:93)
 at 
 org.apache.spark.mllib.tree.impl.DTStatsAggregator.update(DTStatsAggregator.scala:100)
 at 
 org.apache.spark.mllib.tree.DecisionTree$.orderedBinSeqOp(DecisionTree.scala:419)
 at 
 org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$nodeBinSeqOp$1(DecisionTree.scala:511)
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:536
 )
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:533
 )
 at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
 at 
 org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:533)
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628)
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 Minimal code:
  data = MLUtils.loadLibSVMFile(sc, 
 '/scratch1/vivek/datasets/private/a1a').cache()
 model = DecisionTree.trainClassifier(data, numClasses=2, 
 categoricalFeaturesInfo={}, maxDepth=5, maxBins=100)
 Just download the data from: 
 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292866#comment-14292866
 ] 

Apache Spark commented on SPARK-5388:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/4216

 Provide a stable application submission gateway
 ---

 Key: SPARK-5388
 URL: https://issues.apache.org/jira/browse/SPARK-5388
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Attachments: Stable Spark Standalone Submission.pdf


 The existing submission gateway in standalone mode is not compatible across 
 Spark versions. If you have a newer version of Spark submitting to an older 
 version of the standalone Master, it is currently not guaranteed to work. The 
 goal is to provide a stable REST interface to replace this channel.
 The first cut implementation will target standalone cluster mode because 
 there are very few messages exchanged. The design, however, will be general 
 enough to eventually support this for other cluster managers too. Note that 
 this is not necessarily required in YARN because we already use YARN's stable 
 interface to submit applications there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-01-26 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293019#comment-14293019
 ] 

Aniket Bhatnagar commented on SPARK-2243:
-

I am also interested in having this fixed. Can someone please outline the 
specific things that need to be fixed to make this work, so that interested 
people can contribute?

 Support multiple SparkContexts in the same JVM
 --

 Key: SPARK-2243
 URL: https://issues.apache.org/jira/browse/SPARK-2243
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Affects Versions: 0.7.0, 1.0.0, 1.1.0
Reporter: Miguel Angel Fernandez Diaz

 We're developing a platform where we create several Spark contexts for 
 carrying out different calculations. Is there any restriction when using 
 several Spark contexts? We have two contexts, one for Spark calculations and 
 another one for Spark Streaming jobs. The following error arises when we first 
 execute a Spark calculation and, once the execution is finished, a Spark 
 Streaming job is launched:
 {code}
 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
   at 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
   at 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
 java.io.FileNotFoundException
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at 

[jira] [Updated] (SPARK-5420) Cross-language load/store functions for creating and saving DataFrames

2015-01-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5420:
---
Summary: Cross-language load/store functions for creating and saving 
DataFrames  (was: Create cross-language load/store functions for creating and 
saving DataFrames)

 Cross-language load/store functions for creating and saving DataFrames
 --

 Key: SPARK-5420
 URL: https://issues.apache.org/jira/browse/SPARK-5420
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Patrick Wendell

 We should have standard APIs for loading or saving a table from a data 
 store. One idea:
 {code}
 df = sc.loadTable("path.to.DataSource", {"a": "b", "c": "d"})
 sc.storeTable("path.to.DataSource", {"a": "b", "c": "d"})
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2015-01-26 Thread Joseph Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Tang updated SPARK-4846:
---
Comment: was deleted

(was: Sorry about the procrastination. I'm still working on this.

Regarding your previous comment, should I throw an customized error in Spark or 
just OOM besides the hint about minCount and vectorSize? )

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Assignee: Joseph Tang
Priority: Minor

 Exception in thread Driver java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
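
For context, a rough back-of-the-envelope estimate of why serialization hits 
the VM array-size limit (the 10 million-word vocabulary comes from the 
Environment above; the default vectorSize of 100 and 4 bytes per Float weight 
are assumptions):
{code}
// Rough estimate only, not actual Spark code.
val vocabSize   = 10L * 1000 * 1000   // ~10 million words (Environment above)
val vectorSize  = 100L                // assumed default vector size
val weightBytes = vocabSize * vectorSize * 4L
println(weightBytes)  // 4000000000 bytes, i.e. ~4 GB, which is more than the
                      // ~2 GB a single byte[] can hold, so growing the
                      // ByteArrayOutputStream during serialization fails.
{code}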



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2015-01-26 Thread Joseph Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292853#comment-14292853
 ] 

Joseph Tang edited comment on SPARK-4846 at 1/27/15 2:44 AM:
-

Sorry about the procrastination. I'm still working on this.

Regarding your previous comment, should I throw a customized error in Spark or 
just an OOM besides the hint about minCount and vectorSize? 


was (Author: josephtang):
Sorry about the procrastination. I'm still working on this.

Regarding your previous comment, should I throw an customized error in Spark or 
just OOM besides the hint about minCount and vectorSize? 

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Assignee: Joseph Tang
Priority: Minor

 Exception in thread Driver java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2015-01-26 Thread Joseph Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292886#comment-14292886
 ] 

Joseph Tang commented on SPARK-4846:


Hi Xiangrui, there is a problem.

PR #3693, which added `setMinCount`, was merged to the `master` branch, while 
my PR #3697 was sent to `branch-1.1`.

Would it be better to close PR #3697 and send a new PR based on PR #3693?

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Assignee: Joseph Tang
Priority: Minor

 Exception in thread Driver java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5420) Create cross-language load/store functions for creating and saving DataFrames

2015-01-26 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-5420:
--

 Summary: Create cross-language load/store functions for creating 
and saving DataFrames
 Key: SPARK-5420
 URL: https://issues.apache.org/jira/browse/SPARK-5420
 Project: Spark
  Issue Type: Sub-task
Reporter: Patrick Wendell


We should have standard APIs for loading or saving a table from a data store. 
One idea:

{code}
df = sc.loadTable("path.to.DataSource", {"a": "b", "c": "d"})
sc.storeTable("path.to.DataSource", {"a": "b", "c": "d"})
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion

2015-01-26 Thread Sven Krasser (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292913#comment-14292913
 ] 

Sven Krasser commented on SPARK-5395:
-

Some additional findings from my side: I've managed to trigger the problem 
using a simpler job on production data that basically does a reduceByKey 
followed by a count action. I get 20 workers (2 cores per executor) before any 
tasks in the first stage (reduceByKey) complete (i.e. different from the stage 
transition behavior you noticed). However, this doesn't occur if I run over a 
smaller data set, i.e. fewer production data files.

Before calling reduceByKey I have a coalesce call. Without it the error does 
not occur (at least in this smaller script). At first glance this looked 
potentially spill-related (more data per task), but attempting to force spills 
by setting the worker memory very low did not help my attempts to reproduce 
the issue on test data.

 Large number of Python workers causing resource depletion
 -

 Key: SPARK-5395
 URL: https://issues.apache.org/jira/browse/SPARK-5395
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: AWS ElasticMapReduce
Reporter: Sven Krasser

 During job execution a large number of Python workers accumulates, eventually 
 causing YARN to kill containers for being over their memory allocation (in 
 the case below that is about 8G for executors plus 6G for overhead per 
 container). 
 In this instance, 97 pyspark.daemon processes had accumulated by the time the 
 container was killed.
 {noformat}
 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler 
 (Logging.scala:logInfo(59)) - Container marked as failed: 
 container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: 
 Container [pid=35211,containerID=container_1421692415636_0052_01_30] is 
 running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB 
 physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing 
 container.
 Dump of the process-tree for container_1421692415636_0052_01_30 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) 
 VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m 
 pyspark.daemon
 |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m 
 pyspark.daemon
 |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m 
 pyspark.daemon
   [...]
 {noformat}
 The configuration used uses 64 containers with 2 cores each.
 Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
 Mailinglist discussion: 
 https://www.mail-archive.com/user@spark.apache.org/msg20102.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5052) com.google.common.base.Optional binary has a wrong method signatures

2015-01-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5052.

   Resolution: Fixed
Fix Version/s: 1.3.0

 com.google.common.base.Optional binary has a wrong method signatures
 

 Key: SPARK-5052
 URL: https://issues.apache.org/jira/browse/SPARK-5052
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Elmer Garduno
 Fix For: 1.3.0


 PR https://github.com/apache/spark/pull/1813 shaded Guava jar file and moved 
 Guava classes to package org.spark-project.guava when Spark is built by Maven.
 When a user jar uses the actual com.google.common.base.Optional 
 transform(com.google.common.base.Function) method from Guava, a 
 java.lang.NoSuchMethodError: 
 com.google.common.base.Optional.transform(Lcom/google/common/base/Function;)Lcom/google/common/base/Optional;
  is thrown.
 The reason seems to be that the Optional class included in 
 spark-assembly-1.2.0-hadoop1.0.4.jar has an incorrect method signature that 
 includes the shaded class as an argument:
 Expected:
 javap -classpath 
 target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar 
 com.google.common.base.Optional
   public abstract <V extends java/lang/Object> 
 com.google.common.base.Optional<V> 
 transform(com.google.common.base.Function<? super T, V>);
 Found:
 javap -classpath lib/spark-assembly-1.2.0-hadoop1.0.4.jar 
 com.google.common.base.Optional
   public abstract <V extends java/lang/Object> 
 com.google.common.base.Optional<V> 
 transform(org.spark-project.guava.common.base.Function<? super T, V>);



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5052) com.google.common.base.Optional binary has a wrong method signatures

2015-01-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5052:
---
Assignee: Elmer Garduno

 com.google.common.base.Optional binary has a wrong method signatures
 

 Key: SPARK-5052
 URL: https://issues.apache.org/jira/browse/SPARK-5052
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Elmer Garduno
Assignee: Elmer Garduno
 Fix For: 1.3.0


 PR https://github.com/apache/spark/pull/1813 shaded Guava jar file and moved 
 Guava classes to package org.spark-project.guava when Spark is built by Maven.
 When a user jar uses the actual com.google.common.base.Optional 
 transform(com.google.common.base.Function) method from Guava, a 
 java.lang.NoSuchMethodError: 
 com.google.common.base.Optional.transform(Lcom/google/common/base/Function;)Lcom/google/common/base/Optional;
  is thrown.
 The reason seems to be that the Optional class included in 
 spark-assembly-1.2.0-hadoop1.0.4.jar has an incorrect method signature that 
 includes the shaded class as an argument:
 Expected:
 javap -classpath 
 target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar 
 com.google.common.base.Optional
   public abstract <V extends java/lang/Object> 
 com.google.common.base.Optional<V> 
 transform(com.google.common.base.Function<? super T, V>);
 Found:
 javap -classpath lib/spark-assembly-1.2.0-hadoop1.0.4.jar 
 com.google.common.base.Optional
   public abstract <V extends java/lang/Object> 
 com.google.common.base.Optional<V> 
 transform(org.spark-project.guava.common.base.Function<? super T, V>);



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-01-26 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292823#comment-14292823
 ] 

Guoqiang Li commented on SPARK-5261:


[~lewuathe]
{code}
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" \
    -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" -e 's/"/ " /g' -e 's/\./ \. /g' \
    -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' \
    -e 's/\!/ \! /g' -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' \
    -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
    -e 's/«/ /g' | tr 0-9 " "
}
wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2013.en.shuffled > data.txt
{code}

 In some cases ,The value of word's vector representation is too big
 ---

 Key: SPARK-5261
 URL: https://issues.apache.org/jira/browse/SPARK-5261
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Guoqiang Li

 {code}
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(36)
 {code}
 The average absolute value of the word's vector representation is 60731.8
 {code}
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(1)
 {code}
 The average  absolute value of the word's vector representation is 0.13889



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint

2015-01-26 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292896#comment-14292896
 ] 

Saisai Shao commented on SPARK-5206:


IMHO this is a general problem in Spark Streaming: any variable that has to be 
registered on both the driver and the executor side will lead to an error when 
recovering from a failure if its readObject path does not re-register it on 
the driver.

Objects like broadcast variables will also hit exceptions when recovering from 
a checkpoint, since the actual data is lost on the executor side and, if I 
understand correctly, recovery from the driver side is not possible.
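
A common workaround sketch (not a fix inside Spark itself; the object and 
accumulator names are illustrative) is to create such accumulators lazily 
through a singleton, so that they are registered again on the driver when the 
application is rebuilt from a checkpoint:
{code}
import org.apache.spark.{Accumulator, SparkContext}
import org.apache.spark.SparkContext._  // Long AccumulatorParam in scope

// Lazily-initialized accumulator holder. After recovery from a checkpoint the
// driver calls getInstance again, which registers a fresh accumulator instead
// of relying on the deserialized, unregistered one.
object DroppedRecordsCounter {
  @volatile private var instance: Accumulator[Long] = _

  def getInstance(sc: SparkContext): Accumulator[Long] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.accumulator(0L, "droppedRecords")
        }
      }
    }
    instance
  }
}
{code}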

 Accumulators are not re-registered during recovering from checkpoint
 

 Key: SPARK-5206
 URL: https://issues.apache.org/jira/browse/SPARK-5206
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.1.0
Reporter: vincent ye

 I got the following exception while my streaming application restarted from a 
 crash using a checkpoint:
 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR 
 scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 
 4)
 java.util.NoSuchElementException: key not found: 1
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
   at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 I guess that an Accumulator is registered with the singleton Accumulators in 
 line 58 of org.apache.spark.Accumulable:
 Accumulators.register(this, true)
 This code needs to be executed in the driver once, but when the application is 
 recovered from a checkpoint it won't be executed in the driver. So when the 
 driver processes it at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938),
  it can't find the Accumulator because it was not re-registered during the 
 recovery.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2015-01-26 Thread Joseph Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292926#comment-14292926
 ] 

Joseph Tang edited comment on SPARK-4846 at 1/27/15 3:42 AM:
-

I've added some code at 
https://github.com/jinntrance/spark/compare/w2v-fix?diff=splitname=w2v-fix

If it's OK, I would send a new PR to the branch `master`.

BTW, sorry for the poor readability of the diff; it is caused by the space 
indentation.


was (Author: josephtang):
I've added some code at 
https://github.com/jinntrance/spark/compare/w2v-fix?diff=splitname=w2v-fix

If it's OK, I would send a new PR to the branch `master`.

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Assignee: Joseph Tang
Priority: Minor

 Exception in thread Driver java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5417) Remove redundant executor-ID set() call

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292790#comment-14292790
 ] 

Apache Spark commented on SPARK-5417:
-

User 'ryan-williams' has created a pull request for this issue:
https://github.com/apache/spark/pull/4213

 Remove redundant executor-ID set() call
 ---

 Key: SPARK-5417
 URL: https://issues.apache.org/jira/browse/SPARK-5417
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor

 {{spark.executor.id}} no longer [needs to be set in 
 Executor.scala|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L79],
  as of [#4194|https://github.com/apache/spark/pull/4194]; it is set upstream 
 in 
 [SparkEnv|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/SparkEnv.scala#L332].
  Might as well remove the redundant set() in Executor.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5419) Fix the logic in Vectors.sqdist

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292898#comment-14292898
 ] 

Apache Spark commented on SPARK-5419:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4217

 Fix the logic in Vectors.sqdist
 ---

 Key: SPARK-5419
 URL: https://issues.apache.org/jira/browse/SPARK-5419
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Liang-Chi Hsieh

 The current implementation of sqdist tries to convert sparse vectors to dense 
 if they are close to dense. This is not efficient because we need to allocate 
 temp arrays. We should simply implement sqdist without allocating new memory.
 The current implementation also contains a bug in deciding whether to convert 
 a sparse vector to dense:
 {code}
 v1.indices.length / v1.size < 0.5
 {code}
 which should get removed with the changes described above.
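
To make the bug concrete: both operands are Ints, so the division truncates to 
0 for any sparse vector that is not completely full, and the comparison no 
longer reflects the density at all. A tiny sketch (numbers made up for 
illustration):
{code}
val nnz  = 9     // v1.indices.length: number of stored entries
val size = 10    // v1.size: vector dimension

// Integer division: 9 / 10 == 0, so the check sees 0 < 0.5 even though the
// vector is 90% dense.
println(nnz / size < 0.5)            // true (misleading)

// A floating-point ratio gives the intended density check.
println(nnz.toDouble / size < 0.5)   // false (0.9 >= 0.5)
{code}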



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5422) Support sending to Graphite via UDP

2015-01-26 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-5422:


 Summary: Support sending to Graphite via UDP
 Key: SPARK-5422
 URL: https://issues.apache.org/jira/browse/SPARK-5422
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor


{{io.dropwizard.metrics-graphite}} version {{3.1.0}} can send metrics to 
Graphite via UDP or TCP.

After upgrading 
([SPARK-5413|https://issues.apache.org/jira/browse/SPARK-5413]), we should 
support using this facility, presumably specified via a protocol field in the 
metrics config file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5384) Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5384:
-
Priority: Minor  (was: Critical)

 Vectors.sqdist return inconsistent result for sparse/dense vectors when the 
 vectors have different lengths
 --

 Key: SPARK-5384
 URL: https://issues.apache.org/jira/browse/SPARK-5384
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.1
 Environment: centos, others should be similar
Reporter: yuhao yang
Assignee: yuhao yang
Priority: Minor
 Fix For: 1.3.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 For two vectors of different lengths, Vectors.sqdist would return different 
 results when the vectors are represented as sparse and dense respectively. 
 Sample:
 val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0))
 val s2 = new SparseVector(1, Array(0), Array(9.0))
 val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0))
 val d2 = new DenseVector(Array(9.0))
 println(s1 == d1 && s2 == d2)
 println(Vectors.sqdist(s1, s2))
 println(Vectors.sqdist(d1, d2))
 result:
  true
  93.0
  64.0
 More precisely, Vectors.sqdist includes the extra part for sparse vectors and 
 excludes it for dense vectors. I'll send a PR and we can have more detailed 
 discussion there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3439) Add Canopy Clustering Algorithm

2015-01-26 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292688#comment-14292688
 ] 

Xiangrui Meng commented on SPARK-3439:
--

[~angellandros] The public API and the complexity analysis are more important 
than implementation details. What is the expected input and output of the 
algorithm and how does it scale?

 Add Canopy Clustering Algorithm
 ---

 Key: SPARK-3439
 URL: https://issues.apache.org/jira/browse/SPARK-3439
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Yu Ishikawa
Assignee: Muhammad-Ali A'rabi
Priority: Minor

 The canopy clustering algorithm is an unsupervised pre-clustering algorithm. 
 It is often used as a preprocessing step for the K-means algorithm or the 
 Hierarchical clustering algorithm. It is intended to speed up clustering 
 operations on large data sets, where using another algorithm directly may be 
 impractical due to the size of the data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4587) Model export/import

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4587:
-
Assignee: Joseph K. Bradley

 Model export/import
 ---

 Key: SPARK-4587
 URL: https://issues.apache.org/jira/browse/SPARK-4587
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Joseph K. Bradley
Priority: Critical

 This is an umbrella JIRA for one of the most requested features on the user 
 mailing list. Model export/import can be done via Java serialization. But it 
 doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we 
 should provide save/load methods to every model. PMML is an option but it has 
 its limitations. There are a couple of things we need to discuss: 1) data format, 
 2) how to preserve partitioning, 3) data compatibility between versions and 
 language APIs, etc.
 UPDATE: [Design doc for model import/export | 
 https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
 This document sketches machine learning model import/export plans, including 
 goals, an API, and development plans.
 The design doc proposes:
 * Support our own Spark-specific format.
 ** This is needed to (a) support distributed models and (b) get model 
 import/export support into Spark quickly (while avoiding new dependencies).
 * Also support PMML
 ** This is needed since it is the only thing approaching an industry standard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1856) Standardize MLlib interfaces

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1856:
-
Target Version/s:   (was: 1.3.0)

 Standardize MLlib interfaces
 

 Key: SPARK-1856
 URL: https://issues.apache.org/jira/browse/SPARK-1856
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Blocker

 Instead of expanding MLlib based on the current class naming scheme 
 (ProblemWithAlgorithm), we should standardize MLlib's interfaces so that they 
 clearly separate datasets, formulations, algorithms, parameter sets, and 
 models.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1856) Standardize MLlib interfaces

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1856:
-
Priority: Critical  (was: Blocker)

 Standardize MLlib interfaces
 

 Key: SPARK-1856
 URL: https://issues.apache.org/jira/browse/SPARK-1856
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical

 Instead of expanding MLlib based on the current class naming scheme 
 (ProblemWithAlgorithm), we should standardize MLlib's interfaces so that they 
 clearly separate datasets, formulations, algorithms, parameter sets, and 
 models.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1486:
-
Target Version/s:   (was: 1.3.0)

 Support multi-model training in MLlib
 -

 Key: SPARK-1486
 URL: https://issues.apache.org/jira/browse/SPARK-1486
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Burak Yavuz
Priority: Critical

 It is rare in practice to train just one model with a given set of 
 parameters. Usually, this is done by training multiple models with different 
 sets of parameters and then selecting the best based on their performance on the 
 validation set. MLlib should provide native support for multi-model 
 training/scoring. It requires decoupling of concepts like problem, 
 formulation, algorithm, parameter set, and model, which are missing in MLlib 
 now. MLI implements similar concepts, which we can borrow. There are 
 different approaches for multi-model training:
 0) Keep one copy of the data, and train models one after another (or maybe in 
 parallel, depending on the scheduler).
 1) Keep one copy of the data, and train multiple models at the same time 
 (similar to `runs` in KMeans).
 2) Make multiple copies of the data (still stored distributively), and use 
 more cores to distribute the work.
 3) Collect the data, make the entire dataset available on workers, and train 
 one or more models on each worker.
 Users should be able to choose which execution mode they want to use. Note 
 that 3) could cover many use cases in practice when the training data is not 
 huge, e.g., 1GB.
 This task will be divided into sub-tasks and this JIRA is created to discuss 
 the design and track the overall progress.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3717:
-
Target Version/s:   (was: 1.3.0)

 DecisionTree, RandomForest: Partition by feature
 

 Key: SPARK-3717
 URL: https://issues.apache.org/jira/browse/SPARK-3717
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 h1. Summary
 Currently, data are partitioned by row/instance for DecisionTree and 
 RandomForest.  This JIRA argues for partitioning by feature for training deep 
 trees.  This is especially relevant for random forests, which are often 
 trained to be deeper than single decision trees.
 h1. Details
 Dataset dimensions and the depth of the tree to be trained are the main 
 problem parameters determining whether it is better to partition features or 
 instances.  For random forests (training many deep trees), partitioning 
 features could be much better.
 Notation:
 * P = # workers
 * N = # instances
 * M = # features
 * D = depth of tree
 h2. Partitioning Features
 Algorithm sketch:
 * Each worker stores:
 ** a subset of columns (i.e., a subset of features).  If a worker stores 
 feature j, then the worker stores the feature value for all instances (i.e., 
 the whole column).
 ** all labels
 * Train one level at a time.
 * Invariants:
 ** Each worker stores a mapping: instance → node in current level
 * On each iteration:
 ** Each worker: For each node in level, compute (best feature to split, info 
 gain).
 ** Reduce (P x M) values to M values to find best split for each node.
 ** Workers who have features used in best splits communicate left/right for 
 relevant instances.  Gather total of N bits to master, then broadcast.
 * Total communication:
 ** Depth D iterations
 ** On each iteration, reduce to M values (~8 bytes each), broadcast N values 
 (1 bit each).
 ** Estimate: D * (M * 8 + N)
 h2. Partitioning Instances
 Algorithm sketch:
 * Train one group of nodes at a time.
 * Invariants:
 ** Each worker stores a mapping: instance → node
 * On each iteration:
 ** Each worker: For each instance, add to aggregate statistics.
 ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes)
 *** (“# classes” is for classification.  3 for regression)
 ** Reduce aggregate.
 ** Master chooses best split for each node in group and broadcasts.
 * Local training: Once all instances for a node fit on one machine, it can be 
 best to shuffle data and training subtrees locally.  This can mean shuffling 
 the entire dataset for each tree trained.
 * Summing over all iterations, reduce to total of:
 ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each)
 ** Estimate: 2^D * M * B * C * 8
 h2. Comparing Partitioning Methods
 Partitioning features cost < partitioning instances cost when:
 * D * (M * 8 + N) < 2^D * M * B * C * 8
 * D * N < 2^D * M * B * C * 8  (assuming D * M * 8 is small compared to the 
 right hand side)
 * N < [ 2^D * M * B * C * 8 ] / D
 Example: many instances:
 * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 
 5)
 * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7
 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5321) Add transpose() method to Matrix

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5321:
-
Assignee: Burak Yavuz

 Add transpose() method to Matrix
 

 Key: SPARK-5321
 URL: https://issues.apache.org/jira/browse/SPARK-5321
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Burak Yavuz
Assignee: Burak Yavuz

 While we are working on BlockMatrix, it would be nice to add support for 
 transposing matrices. .transpose() will just flip a private flag in local 
 matrices, and the operations that follow will be performed based on this flag.
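
A minimal sketch of the flag-based idea (class, field, and method names here 
are illustrative, not the actual MLlib API):
{code}
// A dense matrix that flips an isTransposed flag instead of copying its
// values; index(i, j) swaps the roles of row and column when the flag is set.
class FlaggedDenseMatrix(
    val numRows: Int,
    val numCols: Int,
    val values: Array[Double],
    val isTransposed: Boolean = false) {

  // values holds the untransposed matrix in column-major order.
  private def index(i: Int, j: Int): Int =
    if (isTransposed) j + i * numCols else i + j * numRows

  def apply(i: Int, j: Int): Double = values(index(i, j))

  def transpose: FlaggedDenseMatrix =
    new FlaggedDenseMatrix(numCols, numRows, values, !isTransposed)
}
{code}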



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5114) Should Evaluator be a PipelineStage

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5114:
-
Summary: Should Evaluator be a PipelineStage  (was: Should Evaluator by a 
PipelineStage)

 Should Evaluator be a PipelineStage
 ---

 Key: SPARK-5114
 URL: https://issues.apache.org/jira/browse/SPARK-5114
 Project: Spark
  Issue Type: Question
  Components: ML
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 Pipelines can currently contain Estimators and Transformers.
 Question for debate: Should Pipelines be able to contain Evaluators?
 Pros:
 * Evaluators take input datasets with particular schema, which should perhaps 
 be checked before running a Pipeline.
 Cons:
 * Evaluators do not transform datasets.   They produce a scalar (or a few 
 values), which makes it hard to say how they fit into a Pipeline or a 
 PipelineModel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5094) Python API for gradient-boosted trees

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5094:
-
Assignee: Kazuki Taniguchi

 Python API for gradient-boosted trees
 -

 Key: SPARK-5094
 URL: https://issues.apache.org/jira/browse/SPARK-5094
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Kazuki Taniguchi





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5413) Upgrade metrics dependency to 3.1.0

2015-01-26 Thread Ryan Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Williams updated SPARK-5413:
-
Description: 
Spark currently uses Coda Hale's metrics library version {{3.0.0}}.

Version {{3.1.0}} includes some useful improvements, like [batching metrics in 
TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and 
[supporting Graphite's UDP 
interface|https://github.com/dropwizard/metrics/blob/v3.1.0/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteUDP.java].

I'd like to bump Spark's version to take advantage of these; I'm playing with 
GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.

  was:
Spark currently uses Coda Hale's metrics library version {{3.0.0}}.

Version {{3.1.0}} includes some useful improvements, like [batching metrics in 
TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and 
supporting Graphite's UDP interface.

I'd like to bump Spark's version to take advantage of these; I'm playing with 
GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.


 Upgrade metrics dependency to 3.1.0
 -

 Key: SPARK-5413
 URL: https://issues.apache.org/jira/browse/SPARK-5413
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams

 Spark currently uses Coda Hale's metrics library version {{3.0.0}}.
 Version {{3.1.0}} includes some useful improvements, like [batching metrics 
 in 
 TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] 
 and [supporting Graphite's UDP 
 interface|https://github.com/dropwizard/metrics/blob/v3.1.0/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteUDP.java].
 I'd like to bump Spark's version to take advantage of these; I'm playing with 
 GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5413) Upgrade metrics dependency to 3.1.0

2015-01-26 Thread Ryan Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Williams updated SPARK-5413:
-
Description: 
Spark currently uses Coda Hale's metrics library version {{3.0.0}}.

Version {{3.1.0}} includes some useful improvements, like [batching metrics in 
TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and 
supporting Graphite's UDP interface.

I'd like to bump Spark's version to take advantage of these; I'm playing with 
GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.

  was:
Spark currently uses Coda Hale's metrics library version {{3.0.0}}.

Version {{3.1.0}} includes some useful improvements, like batching metrics in 
TCP and supporting Graphite's UDP interface.

I'd like to bump Spark's version to take advantage of these; I'm playing with 
GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.


 Upgrade metrics dependency to 3.1.0
 -

 Key: SPARK-5413
 URL: https://issues.apache.org/jira/browse/SPARK-5413
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams

 Spark currently uses Coda Hale's metrics library version {{3.0.0}}.
 Version {{3.1.0}} includes some useful improvements, like [batching metrics 
 in 
 TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] 
 and supporting Graphite's UDP interface.
 I'd like to bump Spark's version to take advantage of these; I'm playing with 
 GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5413) Upgrade metrics dependency to 3.1.0

2015-01-26 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-5413:


 Summary: Upgrade metrics dependency to 3.1.0
 Key: SPARK-5413
 URL: https://issues.apache.org/jira/browse/SPARK-5413
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams


Spark currently uses Coda Hale's metrics library version {{3.0.0}}.

Version {{3.1.0}} includes some useful improvements, like batching metrics in 
TCP and supporting Graphite's UDP interface.

I'd like to bump Spark's version to take advantage of these; I'm playing with 
GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2015-01-26 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292718#comment-14292718
 ] 

Xiangrui Meng commented on SPARK-4846:
--

[~josephtang] Are you working on this issue? If not, do you mind me sending a 
PR that throws an exception if vectorSize is beyond the limit?
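
For illustration, a minimal sketch of the kind of guard such a PR could add (the 
method name and message below are assumptions, not the actual change):

{code}
// Illustrative guard: the model arrays hold roughly vocabSize * vectorSize floats, so
// that product must stay below the JVM array-length limit (about Int.MaxValue elements).
def checkVocabSize(vocabSize: Int, vectorSize: Int): Unit = {
  val numElements = vocabSize.toLong * vectorSize
  require(numElements < Int.MaxValue,
    s"vocabSize ($vocabSize) * vectorSize ($vectorSize) = $numElements exceeds the " +
      "maximum array size; reduce vectorSize or prune the vocabulary.")
}

checkVocabSize(10000000, 100)  // ok: 1.0e9 elements
// checkVocabSize(10000000, 300) would fail fast instead of dying later with an OOM
{code}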

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
 Environment: Use Word2Vec to process a corpus (sized 3.5 GB) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Assignee: Joseph Tang
Priority: Minor

 Exception in thread "Driver" java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5414) Add SparkListener implementation that allows users to receive all listener events in one method

2015-01-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5414:
--
Component/s: Spark Core

 Add SparkListener implementation that allows users to receive all listener 
 events in one method
 ---

 Key: SPARK-5414
 URL: https://issues.apache.org/jira/browse/SPARK-5414
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen

 Currently, users don't have a very good way to write a SparkListener that 
 receives all SparkListener events and that will be future-compatible (i.e. 
 it will receive events introduced in newer versions of Spark without having 
 to override new methods to process those events).
 Therefore, I think Spark should include a concrete SparkListener 
 implementation that implements all of the message-handling methods and 
 dispatches all of them to a single {{onEvent}} method.  By putting this code 
 in Spark, we isolate users from changes to the listener API.
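
 A rough sketch of the idea (assuming the SparkListener trait and event classes 
 in org.apache.spark.scheduler; only two callbacks are shown here, while a real 
 implementation would forward all of them):

 {code}
 import org.apache.spark.scheduler._

 // Sketch: forward every callback to a single onEvent method so user code only
 // has to handle one entry point. Only two callbacks are shown for brevity.
 abstract class SingleMethodListener extends SparkListener {
   def onEvent(event: SparkListenerEvent): Unit

   override def onJobStart(jobStart: SparkListenerJobStart): Unit = onEvent(jobStart)
   override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = onEvent(jobEnd)
   // ... and likewise for every other SparkListener callback
 }

 // Users implement just onEvent and stay compatible as new event types are added.
 class LoggingListener extends SingleMethodListener {
   override def onEvent(event: SparkListenerEvent): Unit = println(s"Saw event: $event")
 }
 {code}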



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5415) Upgrade sbt to 0.13.7

2015-01-26 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-5415:


 Summary: Upgrade sbt to 0.13.7
 Key: SPARK-5415
 URL: https://issues.apache.org/jira/browse/SPARK-5415
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor


Spark currently uses sbt {{0.13.6}}, which has a regression related to 
processing parent POMs in Maven projects.

{{0.13.7}} does not have this issue (though it's unclear whether it was fixed 
intentionally), so I'd like to bump up one version.

I ran into this while locally building a Spark assembly against a locally-built 
metrics JAR dependency; {{0.13.6}} could not build Spark but {{0.13.7}} 
worked fine. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5261) In some cases, the value of a word's vector representation is too big

2015-01-26 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292750#comment-14292750
 ] 

Kai Sasaki commented on SPARK-5261:
---

[~gq] Can you provide us with the data set? I tried several numbers of 
partitions but could not reproduce it.

 In some cases, the value of a word's vector representation is too big
 ---

 Key: SPARK-5261
 URL: https://issues.apache.org/jira/browse/SPARK-5261
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Guoqiang Li

 {code}
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(36)
 {code}
 The average absolute value of the word's vector representation is 60731.8
 {code}
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(1)
 {code}
 The average absolute value of the word's vector representation is 0.13889



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5409) Broken link in documentation

2015-01-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5409.
--
Resolution: Duplicate

Actually, this was already fixed.

 Broken link in documentation
 

 Key: SPARK-5409
 URL: https://issues.apache.org/jira/browse/SPARK-5409
 Project: Spark
  Issue Type: Documentation
Reporter: Mauro Pirrone
Priority: Minor

 https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html
 See the API docs and the example.
 Link to example is broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion

2015-01-26 Thread Mark Khaitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292769#comment-14292769
 ] 

Mark Khaitman commented on SPARK-5395:
--

This may prove to be useful...

I'm watching a currently running spark-submitted job, along with its 
pyspark.daemon processes. The framework is permitted to use only 8 cores on 
each node, with the default Python worker memory of 512 MB per node (not the 
executor memory, which is set higher than this).

Ignoring the exact RDD actions for a moment, it looks like while it transitioned 
from Stage 1 to Stage 2, it spawned 8-10 additional pyspark.daemon processes, 
making the box use more cores than it was even allowed to... A few seconds 
after that, the other 8 processes entered a sleeping state while still holding 
onto the physical memory they ate up in Stage 1. As soon as Stage 2 finished, 
practically all of the pyspark.daemons vanished and freed up the memory. I was 
keeping an eye on 2 random nodes and the exact same thing occurred on both. It 
was also the only job executing at the time, so there was really no other 
interference/contention for resources.

I will try to provide a bit more detail on the exact transformations/actions 
occurring between the 2 stages, although even without inspecting the 
spark-submitted code directly I know that at the very least a partitionBy and a 
cogroup are occurring.

 Large number of Python workers causing resource depletion
 -

 Key: SPARK-5395
 URL: https://issues.apache.org/jira/browse/SPARK-5395
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: AWS ElasticMapReduce
Reporter: Sven Krasser

 During job execution a large number of Python workers accumulates, eventually 
 causing YARN to kill containers for being over their memory allocation (in 
 the case below that is about 8 GB for executors plus 6 GB of overhead per 
 container).
 In this instance, at the time the container was killed, 97 pyspark.daemon 
 processes had accumulated.
 {noformat}
 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler 
 (Logging.scala:logInfo(59)) - Container marked as failed: 
 container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: 
 Container [pid=35211,containerID=container_1421692415636_0052_01_30] is 
 running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB 
 physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing 
 container.
 Dump of the process-tree for container_1421692415636_0052_01_30 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) 
 VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m 
 pyspark.daemon
 |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m 
 pyspark.daemon
 |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m 
 pyspark.daemon
   [...]
 {noformat}
 The configuration uses 64 containers with 2 cores each.
 Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
 Mailinglist discussion: 
 https://www.mail-archive.com/user@spark.apache.org/msg20102.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5419) Fix the logic in Vectors.sqdist

2015-01-26 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5419:


 Summary: Fix the logic in Vectors.sqdist
 Key: SPARK-5419
 URL: https://issues.apache.org/jira/browse/SPARK-5419
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Liang-Chi Hsieh


The current implementation of sqdist tries to convert sparse vectors to dense 
if they are close to dense. This is not efficient because we need to allocate 
temp arrays. We should simply implement sqdist without allocating new memory.

The current implementation also contains a bug in deciding whether to convert a 
sparse vector to dense:

{code}
v1.indices.length / v1.size < 0.5
{code}

which should get removed with the changes described above.
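
The problem with that check (assuming indices.length and size are both Ints, as 
in MLlib's SparseVector) is integer division: the quotient truncates to 0 for 
any vector that is not fully dense, so the 0.5 threshold never behaves as 
intended. A quick illustration:

{code}
val size = 100
// A vector with 80 of its 100 entries nonzero is "close to dense":
println(80.toDouble / size < 0.5)  // false -- correct: do not take the sparse path
println(80 / size < 0.5)           // true  -- buggy: Int / Int gives 0, so everything looks sparse
{code}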



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5421) SparkSql throw OOM at shuffle

2015-01-26 Thread Hong Shen (JIRA)
Hong Shen created SPARK-5421:


 Summary: SparkSql throw OOM at shuffle
 Key: SPARK-5421
 URL: https://issues.apache.org/jira/browse/SPARK-5421
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Hong Shen


ExternalAppendOnlyMap is only used for Spark jobs whose shuffle dependency has an 
aggregator defined, but Spark SQL's ShuffledRDD doesn't define an aggregator, so 
Spark SQL won't spill at shuffle time and can very easily throw an OOM there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3726) RandomForest: Support for bootstrap options

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3726.
--
   Resolution: Fixed
Fix Version/s: (was: 1.2.0)
   1.3.0

Issue resolved by pull request 4073
[https://github.com/apache/spark/pull/4073]

 RandomForest: Support for bootstrap options
 ---

 Key: SPARK-3726
 URL: https://issues.apache.org/jira/browse/SPARK-3726
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar
Priority: Minor
 Fix For: 1.3.0


 RandomForest uses BaggedPoint to simulate bootstrapped samples of the data.  
 The expected size of each sample is the same as the original data (sampling 
 rate = 1.0), and sampling is done with replacement.  Adding support for other 
 sampling rates and for sampling without replacement would be useful.
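
 A rough sketch of what configurable bagging could look like (illustrative only; 
 the real BaggedPoint lives in MLlib's tree internals and differs in detail):

 {code}
 import scala.util.Random

 // Knuth's method for a Poisson draw; fine for small means such as a sampling rate.
 def poisson(rng: Random, mean: Double): Int = {
   val limit = math.exp(-mean)
   var k = 0
   var p = 1.0
   do { k += 1; p *= rng.nextDouble() } while (p > limit)
   k - 1
 }

 // Per-point weight for one bagged sample: with replacement, a Poisson(rate) count
 // approximates bootstrap resampling; without replacement, a Bernoulli(rate) draw
 // keeps or drops each point exactly once.
 def sampleWeight(rng: Random, rate: Double, withReplacement: Boolean): Int =
   if (withReplacement) poisson(rng, rate)
   else if (rng.nextDouble() < rate) 1 else 0

 val rng = new Random(42)
 val weights = Vector.fill(10)(sampleWeight(rng, rate = 1.0, withReplacement = true))
 // expected sum of weights == number of points, matching the current behavior
 {code}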



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5267) Add a streaming module to ingest Apache Camel Messages from configured endpoints

2015-01-26 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293043#comment-14293043
 ] 

Tathagata Das commented on SPARK-5267:
--

Hey, this is a great initiative! However, we are trying to limit the number of 
external dependencies in the Spark umbrella project for better manageability. 
So while we want to add more functionality and integration with other systems, 
we are very cautious about adding these dependencies to the Spark project. A 
better place for such contributions is http://spark-packages.org/, where 
people contribute such functionality and maintain it on their own. I strongly 
encourage you to add your Camel integration to spark-packages.


 Add a streaming module to ingest Apache Camel Messages from configured 
 endpoints
 --

 Key: SPARK-5267
 URL: https://issues.apache.org/jira/browse/SPARK-5267
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Steve Brewin
  Labels: features
   Original Estimate: 120h
  Remaining Estimate: 120h

 The number of input stream protocols supported by Spark Streaming is quite 
 limited, which constrains the number of systems with which it can be 
 integrated.
 This proposal solves the problem by adding an optional module that integrates 
 Apache Camel, which supports many additional input protocols. Our tried and 
 tested implementation of this proposal is spark-streaming-camel. 
 An Apache Camel service is run on a separate Thread, consuming each 
 http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html
  and storing it into Spark's memory. The provider of the Message is specified 
 by any consuming component URI documented at 
 http://camel.apache.org/components.html, making all of these protocols 
 available to Spark Streaming.
 Thoughts?
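
 For reference, a bare-bones sketch of how such an integration typically looks 
 as a Spark Streaming custom Receiver (illustrative only, not the 
 spark-streaming-camel code; it assumes Camel's core API and Spark's Receiver 
 API):

 {code}
 import org.apache.camel.{Exchange, Processor}
 import org.apache.camel.builder.RouteBuilder
 import org.apache.camel.impl.DefaultCamelContext
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.streaming.receiver.Receiver

 // Sketch: run a Camel route inside a receiver and push each message body into Spark.
 class CamelReceiver(endpointUri: String) extends Receiver[String](StorageLevel.MEMORY_ONLY) {

   @transient private var camelContext: DefaultCamelContext = _

   override def onStart(): Unit = {
     camelContext = new DefaultCamelContext()
     camelContext.addRoutes(new RouteBuilder {
       override def configure(): Unit = {
         from(endpointUri).process(new Processor {
           override def process(exchange: Exchange): Unit =
             store(exchange.getIn.getBody(classOf[String]))  // hand the message to Spark
         })
       }
     })
     camelContext.start()  // Camel consumes on its own threads
   }

   override def onStop(): Unit = if (camelContext != null) camelContext.stop()
 }

 // Usage (hypothetical endpoint URI):
 //   val stream = ssc.receiverStream(new CamelReceiver("netty:tcp://0.0.0.0:9999?textline=true"))
 {code}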



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4964) Exactly-once + WAL-free Kafka Support in Spark Streaming

2015-01-26 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293140#comment-14293140
 ] 

Tathagata Das commented on SPARK-4964:
--

[~dibbhatt][~jerryshao][~hshreedharan][~c...@koeninger.org] 
Please take a look at the design doc and comment on it. 
Thank you very much!



 Exactly-once + WAL-free Kafka Support in Spark Streaming
 

 Key: SPARK-4964
 URL: https://issues.apache.org/jira/browse/SPARK-4964
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Cody Koeninger

 There are two issues with the current Kafka support 
  - Use of Write Ahead Logs in Spark Streaming to ensure no data is lost - 
 Causes data replication in both Kafka AND Spark Streaming. 
  - Lack of exactly-once semantics - For background, see 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html
 We want to solve both these problem in JIRA. Please see the following design 
 doc for the solution. 
 https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit#heading=h.itproy77j3p



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4964) Exactly-once + WAL-free Kafka Support in Spark Streaming

2015-01-26 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-4964:
-
Summary: Exactly-once + WAL-free Kafka Support in Spark Streaming  (was: 
Exactly-once semantics for Kafka)

 Exactly-once + WAL-free Kafka Support in Spark Streaming
 

 Key: SPARK-4964
 URL: https://issues.apache.org/jira/browse/SPARK-4964
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Cody Koeninger

 for background, see 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html
 Requirements:
 - allow client code to implement exactly-once end-to-end semantics for Kafka 
 messages, in cases where their output storage is either idempotent or 
 transactional
 - allow client code access to Kafka offsets, rather than automatically 
 committing them
 - do not assume Zookeeper as a repository for offsets (for the transactional 
 case, offsets need to be stored in the same store as the data)
 - allow failure recovery without lost or duplicated messages, even in cases 
 where a checkpoint cannot be restored (for instance, because code must be 
 updated)
 Design:
 The basic idea is to make an rdd where each partition corresponds to a given 
 Kafka topic, partition, starting offset, and ending offset.  That allows for 
 deterministic replay of data from Kafka (as long as there is enough log 
 retention).
 Client code is responsible for committing offsets, either transactionally to 
 the same store that data is being written to, or in the case of idempotent 
 data, after data has been written.
 PR of a sample implementation for both the batch and dstream case is 
 forthcoming.
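
 A minimal sketch of the core idea (names like OffsetRange are illustrative, not 
 the final API): each RDD partition is just a (topic, partition, fromOffset, 
 untilOffset) slice, which makes replay deterministic and leaves offset commits 
 to the caller.

 {code}
 import org.apache.spark.Partition

 // One RDD partition corresponds to one Kafka (topic, partition) slice bounded by offsets.
 case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

 class KafkaRDDPartition(val index: Int, val offsetRange: OffsetRange) extends Partition

 // The RDD builds one partition per range; compute() would open a Kafka consumer and read
 // exactly [fromOffset, untilOffset) for that slice, so a recomputed partition yields the
 // same records (given sufficient log retention).
 def makePartitions(ranges: Seq[OffsetRange]): Array[Partition] =
   ranges.zipWithIndex.map { case (r, i) => new KafkaRDDPartition(i, r): Partition }.toArray
 {code}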



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4964) Exactly-once semantics for Kafka

2015-01-26 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293129#comment-14293129
 ] 

Tathagata Das commented on SPARK-4964:
--

I am renaming this JIRA to Native Kafka Support because there are two 
problems that we are trying to solve, both of which get solved by the associated PR.

 Exactly-once semantics for Kafka
 

 Key: SPARK-4964
 URL: https://issues.apache.org/jira/browse/SPARK-4964
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Cody Koeninger

 for background, see 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html
 Requirements:
 - allow client code to implement exactly-once end-to-end semantics for Kafka 
 messages, in cases where their output storage is either idempotent or 
 transactional
 - allow client code access to Kafka offsets, rather than automatically 
 committing them
 - do not assume Zookeeper as a repository for offsets (for the transactional 
 case, offsets need to be stored in the same store as the data)
 - allow failure recovery without lost or duplicated messages, even in cases 
 where a checkpoint cannot be restored (for instance, because code must be 
 updated)
 Design:
 The basic idea is to make an rdd where each partition corresponds to a given 
 Kafka topic, partition, starting offset, and ending offset.  That allows for 
 deterministic replay of data from Kafka (as long as there is enough log 
 retention).
 Client code is responsible for committing offsets, either transactionally to 
 the same store that data is being written to, or in the case of idempotent 
 data, after data has been written.
 PR of a sample implementation for both the batch and dstream case is 
 forthcoming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4964) Exactly-once + WAL-free Kafka Support in Spark Streaming

2015-01-26 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-4964:
-
Description: 
There are two issues with the current Kafka support 
 - Use of Write Ahead Logs in Spark Streaming to ensure no data is lost - 
Causes data replication in both Kafka AND Spark Streaming. 
 - Lack of exactly-once semantics - For background, see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html

We want to solve both these problem in JIRA. Please see the following design 
doc for the solution. 
https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit#heading=h.itproy77j3p


  was:
for background, see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html

There ar


 Exactly-once + WAL-free Kafka Support in Spark Streaming
 

 Key: SPARK-4964
 URL: https://issues.apache.org/jira/browse/SPARK-4964
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Cody Koeninger

 There are two issues with the current Kafka support 
  - Use of Write Ahead Logs in Spark Streaming to ensure no data is lost - 
 Causes data replication in both Kafka AND Spark Streaming. 
  - Lack of exactly-once semantics - For background, see 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html
 We want to solve both these problem in JIRA. Please see the following design 
 doc for the solution. 
 https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit#heading=h.itproy77j3p



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


