[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291818#comment-14291818 ]

Robert Stupp commented on SPARK-2389:
-------------------------------------

[~srowen] yes, the problem is that drivers cannot share RDDs. IMHO there are a lot of valid scenarios that can benefit from multiple drivers using shared RDDs.

globally shared SparkContext / shared Spark application
-------------------------------------------------------

    Key: SPARK-2389
    URL: https://issues.apache.org/jira/browse/SPARK-2389
    Project: Spark
    Issue Type: Improvement
    Components: Spark Core
    Reporter: Robert Stupp

The documentation (in Cluster Mode Overview) cites:

bq. Each application gets its own executor processes, which *stay up for the duration of the whole application* and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that *data cannot be shared* across different Spark applications (instances of SparkContext) without writing it to an external storage system.

IMO this is a limitation that should be lifted to support any number of -driver- client processes sharing executors and sharing (persistent / cached) data.

This is especially useful if you have a bunch of frontend servers (dumb web app servers) that want to use Spark as a _big computing machine_. Most important is the fact that Spark is quite good at caching/persisting data in memory / on disk, thus removing load from backend data stores.

Means: it would be really great to let different -driver- client JVMs operate on the same RDDs and benefit from Spark's caching/persistence.

It would however require some administration mechanisms to
* start a shared context
* update the executor configuration (# of worker nodes, # of CPUs, etc.) on the fly
* stop a shared context

Even conventional batch MR applications would benefit if run frequently against the same data set. As an implicit requirement, RDD persistence could get a TTL for its materialized state.

With such a feature the overall performance of today's web applications could then be increased by adding more web app servers, more Spark nodes, more NoSQL nodes, etc.
[jira] [Closed] (SPARK-5303) applySchema returns NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mauro Pirrone closed SPARK-5303.
--------------------------------
    Resolution: Not a Problem

applySchema returns NullPointerException
----------------------------------------

    Key: SPARK-5303
    URL: https://issues.apache.org/jira/browse/SPARK-5303
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 1.2.0
    Reporter: Mauro Pirrone

The following code snippet returns NullPointerException:

val result = ...
val rows = result.take(10)
val rowRdd = SparkManager.getContext().parallelize(rows, 1)
val schemaRdd = SparkManager.getSQLContext().applySchema(rowRdd, result.schema)

java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:147)
    at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
    at scala.util.hashing.MurmurHash3.listHash(MurmurHash3.scala:168)
    at scala.util.hashing.MurmurHash3$.seqHash(MurmurHash3.scala:216)
    at scala.collection.LinearSeqLike$class.hashCode(LinearSeqLike.scala:53)
    at scala.collection.immutable.List.hashCode(List.scala:84)
    at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
    at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:63)
    at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:210)
    at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:172)
    at org.apache.spark.sql.execution.LogicalRDD.hashCode(ExistingRDD.scala:58)
    at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
    at scala.collection.mutable.HashTable$HashUtils$class.elemHashCode(HashTable.scala:398)
    at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:39)
    at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:130)
    at scala.collection.mutable.HashMap.findEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap.get(HashMap.scala:69)
    at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:187)
    at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
    at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:329)
    at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327)
    at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105)
    at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:44)
    at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:40)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
    at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
    at org.apache.spark.sql.SchemaRDD.schema$lzycompute(SchemaRDD.scala:135)
    at org.apache.spark.sql.SchemaRDD.schema(SchemaRDD.scala:135)
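An untested sketch of one way around this, assuming (it is only a guess, given the "Not a Problem" resolution) that the NPE stems from handing applySchema a schema whose attributes are still tied to the old query plan: rebuild the StructType from plain field metadata first. Imports follow the Spark 1.2 SQL programming guide.

{code}
import org.apache.spark.sql._

// Copy (name, type, nullable) into a freshly constructed StructType
// instead of passing result.schema through directly.
val freshSchema = StructType(result.schema.fields.map(f =>
  StructField(f.name, f.dataType, f.nullable)))

val rows = result.take(10)
val rowRdd = SparkManager.getContext().parallelize(rows, 1)
val schemaRdd = SparkManager.getSQLContext().applySchema(rowRdd, freshSchema)
{code}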
[jira] [Created] (SPARK-5409) Broken link in documentation
Mauro Pirrone created SPARK-5409:
------------------------------------

    Summary: Broken link in documentation
    Key: SPARK-5409
    URL: https://issues.apache.org/jira/browse/SPARK-5409
    Project: Spark
    Issue Type: Documentation
    Reporter: Mauro Pirrone
    Priority: Minor

https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html

"See the API docs and the example." The link to the example is broken.
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291797#comment-14291797 ]

Robert Stupp commented on SPARK-2389:
-------------------------------------

bq. That aside, why doesn't it scale?

Simply because it's just a single Spark client. If that machine's at its limit for whatever reason (VM memory, OS resources, CPU, network, ...), that's it. Sure, you can run multiple drivers, but each has its own, private set of data.

IMO separate preloading is nice for some applications. But data is usually not immutable. For example:
* Imagine an application that provides offers for flights worldwide. It's a huge amount of data and a huge amount of processing. It cannot simply be preloaded: prices for tickets vary from minute to minute based on booking status etc.
* The overall data set is quite big
* The overall load is too big for a single driver to handle; imagine thousands of offer requests per second
* Failure of a single driver is an absolute no-go
* All clients have to access the same set of data
* Preloading is just impossible during runtime (just at initial deployment)

So a suitable approach would be to have:
* a Spark cluster holding all the RDDs and doing all offer- and booking-related operations
* a set of Spark clients to abstract Spark from the rest of the application
* a huge number of non-uniform frontend clients (could be web app servers, rich clients, SOAP / REST frontends)
* everything (except the data) stateless
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291795#comment-14291795 ]

Murat Eken commented on SPARK-2389:
-----------------------------------

[~sowen], I think Robert is talking about fault tolerance when he mentions scalability. Anyway, as I mentioned in my original comment, Tachyon is not an option, at least for us, due to interprocess serialization/deserialization costs. We haven't tried HDFS, but I would be surprised if that performed differently.
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291788#comment-14291788 ]

Sean Owen commented on SPARK-2389:
----------------------------------

Yes, the SPOF problem makes sense. It doesn't seem to be what this JIRA was about though, which seems to be what the jobserver-style approach addresses.

That aside, why doesn't it scale? Because of work that needs to be done on the driver? You can of course still run a bunch of drivers, just not one per client. The preloading cache issue is what off-heap caching in Tachyon or HDFS is supposed to ameliorate.
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291798#comment-14291798 ]

Robert Stupp commented on SPARK-2389:
-------------------------------------

bq. fault tolerance when he mentions scalability

Both play well together in a stateless application ;)
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291804#comment-14291804 ]

Sean Owen commented on SPARK-2389:
----------------------------------

Yes, makes sense. Maxing out one driver isn't an issue since you can have many drivers (or push work into the cluster). The issue is really that each driver then has its own RDDs, and if you need 100s of drivers to keep up, that just won't work. (Although then I'd question how so much work is being done on the Spark driver?)

The redundancy of all those RDDs is, in theory, what HDFS caching and Tachyon could help with, although those help share outside Spark. Whether that works for a particular use case right now is a different question, although I suspect it makes more sense to make those work than to start yet another solution.

What you are describing, mutating lots of shared in-memory state, doesn't sound like a problem Spark helps solve per se. That is, it doesn't sound like work that has to live in a Spark driver program, even if it needs to ask a Spark driver-based service for some results. Naturally you know your problem better than I do, but I am wondering if the answer here isn't just using Spark differently, for what it's for.
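A rough sketch of the jobserver-style setup mentioned in this thread, i.e. what is already expressible today: one long-running driver owns the SparkContext and its cached RDDs, and many frontends call into it instead of each holding their own context. All names and paths here are illustrative, not a definitive design.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// One long-running driver process owns the context and the cached RDDs;
// frontend servers reach it over whatever RPC/HTTP layer you prefer.
object SharedDriverService {
  private val sc = new SparkContext(new SparkConf().setAppName("shared-driver"))

  // Cached once; reused by every request from every frontend.
  private val offers = sc.textFile("hdfs:///data/offers").cache()

  // Example request handler a frontend would invoke remotely.
  def countMatching(keyword: String): Long =
    offers.filter(_.contains(keyword)).count()
}
{code}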
[jira] [Commented] (SPARK-5409) Broken link in documentation
[ https://issues.apache.org/jira/browse/SPARK-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291823#comment-14291823 ]

Sean Owen commented on SPARK-5409:
----------------------------------

Should just be https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala

Open a PR; this probably doesn't even need a JIRA.
[jira] [Closed] (SPARK-5407) No 1.2 AMI available for ec2
[ https://issues.apache.org/jira/browse/SPARK-5407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Håkan Jonsson closed SPARK-5407.
--------------------------------
    Resolution: Invalid

Error on my side.

No 1.2 AMI available for ec2
----------------------------

    Key: SPARK-5407
    URL: https://issues.apache.org/jira/browse/SPARK-5407
    Project: Spark
    Issue Type: Bug
    Components: EC2
    Affects Versions: 1.2.0
    Reporter: Håkan Jonsson

When I try to launch a standalone cluster on EC2 using the scripts in the ec2 directory for Spark 1.2 (./spark-ec2 -k spark -i k.pem launch my12), I get the following error:

Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm

It seems there is not yet any AMI available on EC2 for Spark 1.2. This works well for Spark 1.1.
[jira] [Created] (SPARK-5410) Error parsing scientific notation in a select statement
Hugo Ferrira created SPARK-5410:
-----------------------------------

    Summary: Error parsing scientific notation in a select statement
    Key: SPARK-5410
    URL: https://issues.apache.org/jira/browse/SPARK-5410
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.2.0
    Reporter: Hugo Ferrira

I am using the Cassandra DB and am attempting a select through the Spark SQL interface:

SELECT * from key_value WHERE f2 > 2.2E10

And get the following error (no error if I remove the E10):

[info] - should be able to select a subset of applicable features *** FAILED ***
[info] java.lang.RuntimeException: [1.39] failure: ``UNION'' expected but identifier E10 found
[info]
[info] SELECT * from key_value WHERE f2 > 2.2E10
[info]                                       ^
[info] at scala.sys.package$.error(package.scala:27)
[info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
[info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
[info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
[info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info] ...
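A hedged workaround sketch, untested against 1.2 and with names taken from the report: until the parser accepts exponent literals, either spell the number out in full or route it through a string cast, both of which the parser should handle.

{code}
// Both forms avoid the bare exponent literal the 1.2 parser rejects.
val r1 = sqlContext.sql("SELECT * FROM key_value WHERE f2 > 22000000000")
val r2 = sqlContext.sql("SELECT * FROM key_value WHERE f2 > CAST('2.2E10' AS DOUBLE)")
{code}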
[jira] [Resolved] (SPARK-3852) Document spark.driver.extra* configs
[ https://issues.apache.org/jira/browse/SPARK-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-3852.
------------------------------
    Resolution: Fixed
    Fix Version/s: 1.3.0
    Assignee: Sean Owen
    Target Version/s: (was: 1.2.0)

Document spark.driver.extra* configs
------------------------------------

    Key: SPARK-3852
    URL: https://issues.apache.org/jira/browse/SPARK-3852
    Project: Spark
    Issue Type: Bug
    Components: Documentation
    Affects Versions: 1.1.0
    Reporter: Andrew Or
    Assignee: Sean Owen
    Fix For: 1.3.0

They are not documented...
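For context, these are the three settings in question; the values below are illustrative, not from the ticket. Note the caveat in the comment: they shape how the driver JVM is launched, so in practice they belong in spark-defaults.conf or on the spark-submit command line rather than in application code.

{code}
import org.apache.spark.SparkConf

// Illustrative values only. spark.driver.extraJavaOptions and friends
// affect how the driver JVM is launched, so they must be supplied via
// spark-defaults.conf or spark-submit --conf; setting them on a SparkConf
// inside an already-running driver comes too late.
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:+UseConcMarkSweepGC")
  .set("spark.driver.extraClassPath", "/opt/libs/custom.jar")
  .set("spark.driver.extraLibraryPath", "/opt/native")
{code}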
[jira] [Resolved] (SPARK-4430) Apache RAT Checks fail spuriously on test files
[ https://issues.apache.org/jira/browse/SPARK-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-4430.
------------------------------
    Resolution: Fixed
    Fix Version/s: 1.3.0
    Assignee: Sean Owen

Apache RAT Checks fail spuriously on test files
-----------------------------------------------

    Key: SPARK-4430
    URL: https://issues.apache.org/jira/browse/SPARK-4430
    Project: Spark
    Issue Type: Bug
    Components: Build
    Affects Versions: 1.1.0
    Reporter: Ryan Williams
    Assignee: Sean Owen
    Fix For: 1.3.0

Several of my recent runs of {{./dev/run-tests}} have failed quickly due to Apache RAT checks, e.g.:

{code}
$ ./dev/run-tests
= Running Apache RAT checks =
Could not find Apache license headers in the following files:
 !? /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/28
 !? /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/29
 !? /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/30
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/10
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/11
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/12
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/13
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/14
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/15
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/16
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/17
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/18
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/19
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/20
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/21
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/22
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/23
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/24
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/25
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/26
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/27
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/28
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/29
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/30
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/7
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/8
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/9
[error] Got a return code of 1 on line 114 of the run-tests script.
{code}

I think it's fair to say that these are not useful errors for {{run-tests}} to crash on. Ideally we could tell the linter which files we care about having it lint and which we don't.
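If memory serves, the RAT wrapper reads exclusion patterns from the top-level .rat-excludes file, so one stopgap along the lines the report asks for would be an entry like the following. The pattern is illustrative, and this may not be the fix that was actually merged.

{code}
# .rat-excludes (repo root): one pattern per line. Illustrative entry that
# would skip the streaming FailureSuite scratch directories above.
FailureSuite
{code}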
[jira] [Commented] (SPARK-595) Document local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292052#comment-14292052 ]

Vladimir Grigor commented on SPARK-595:
---------------------------------------

+1 for reopen

Document local-cluster mode
---------------------------

    Key: SPARK-595
    URL: https://issues.apache.org/jira/browse/SPARK-595
    Project: Spark
    Issue Type: New Feature
    Components: Documentation
    Affects Versions: 0.6.0
    Reporter: Josh Rosen
    Priority: Minor

The 'Spark Standalone Mode' guide describes how to manually launch a standalone cluster, which can be done locally for testing, but it does not mention SparkContext's `local-cluster` option. What are the differences between these approaches? Which one should I prefer for local testing? Can I still use the standalone web interface if I use 'local-cluster' mode? It would be useful to document this.
[jira] [Commented] (SPARK-5324) Results of describe can't be queried
[ https://issues.apache.org/jira/browse/SPARK-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291987#comment-14291987 ]

Yanbo Liang commented on SPARK-5324:
------------------------------------

[~marmbrus] I have opened a pull request for this issue which implements the DESCRIBE [FORMATTED] [db_name.]table_name command for SQLContext. Meanwhile, it needs to have minimal impact on the corresponding command output of HiveContext. I think other metadata commands, like show databases/tables, analyze, and explain, could also take this approach. Can you assign this to me?

Results of describe can't be queried
------------------------------------

    Key: SPARK-5324
    URL: https://issues.apache.org/jira/browse/SPARK-5324
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.2.0
    Reporter: Michael Armbrust

{code}
sql("DESCRIBE TABLE test").registerTempTable("describeTest")
sql("SELECT * FROM describeTest").collect()
{code}
[jira] [Commented] (SPARK-5324) Results of describe can't be queried
[ https://issues.apache.org/jira/browse/SPARK-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292008#comment-14292008 ]

Yanbo Liang commented on SPARK-5324:
------------------------------------

https://github.com/apache/spark/pull/4207
[jira] [Commented] (SPARK-5355) SparkConf is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292048#comment-14292048 ]

Apache Spark commented on SPARK-5355:
-------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4208

SparkConf is not thread-safe
----------------------------

    Key: SPARK-5355
    URL: https://issues.apache.org/jira/browse/SPARK-5355
    Project: Spark
    Issue Type: Bug
    Affects Versions: 1.2.0, 1.3.0
    Reporter: Davies Liu
    Assignee: Davies Liu
    Priority: Blocker
    Fix For: 1.3.0, 1.2.1

SparkConf is not thread-safe, but is accessed by many threads. getAll() could return only part of the configs if another thread is accessing it.
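A minimal sketch of the hazard and of one fix, assuming (without having read the PR) that it swaps the backing map for a concurrent one: with a plain mutable HashMap, getAll can iterate while another thread mutates, yielding a partial snapshot; a ConcurrentHashMap makes each operation safe. SafeConf is a stand-in name, not SparkConf's actual internals.

{code}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Sketch only; SafeConf stands in for SparkConf's internal settings map.
class SafeConf {
  private val settings = new ConcurrentHashMap[String, String]()

  def set(key: String, value: String): SafeConf = {
    settings.put(key, value)
    this
  }

  // Safe to call while other threads are writing: iteration over a
  // ConcurrentHashMap never observes a half-applied update.
  def getAll: Array[(String, String)] =
    settings.entrySet.asScala.map(e => (e.getKey, e.getValue)).toArray
}
{code}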
[jira] [Commented] (SPARK-794) Remove sleep() in ClusterScheduler.stop
[ https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292303#comment-14292303 ]

Brennon York commented on SPARK-794:
------------------------------------

[~joshrosen] How is this PR holding up? I haven't seen any issues on the dev board. Think we can close this JIRA ticket? Trying to help prune the JIRA tree :)

Remove sleep() in ClusterScheduler.stop
---------------------------------------

    Key: SPARK-794
    URL: https://issues.apache.org/jira/browse/SPARK-794
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 0.9.0
    Reporter: Matei Zaharia
    Labels: backport-needed
    Fix For: 1.3.0

This temporary change made a while back slows down the unit tests quite a bit.
[jira] [Reopened] (SPARK-595) Document local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen reopened SPARK-595:
------------------------------

I've re-opened this issue. Folks are using the API in the wild and we're not going to break compatibility for it, so we should document it.
[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292121#comment-14292121 ]

Mark Khaitman commented on SPARK-5395:
--------------------------------------

Having the same issue in standalone deployment mode. A single spark-submitted job is spawning a ton of pyspark.daemon instances and depleting the cluster memory even though the appropriate environment variables have been set.

Large number of Python workers causing resource depletion
----------------------------------------------------------

    Key: SPARK-5395
    URL: https://issues.apache.org/jira/browse/SPARK-5395
    Project: Spark
    Issue Type: Bug
    Components: PySpark
    Affects Versions: 1.2.0
    Environment: AWS ElasticMapReduce
    Reporter: Sven Krasser

During job execution a large number of Python workers accumulates, eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G of overhead per container). In this instance, 97 pyspark.daemon processes had accumulated at the time the container was killed.

{noformat}
2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1421692415636_0052_01_30 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon
|- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon
|- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon
[...]
{noformat}

The configuration used uses 64 containers with 2 cores each.

Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c

Mailinglist discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html
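For anyone triaging this, two PySpark settings bear on how many daemon workers exist and how much memory each uses; whether they fully bound the worker count in 1.2 is unclear, so treat this as a starting point rather than a fix.

{code}
import org.apache.spark.SparkConf

// Scala-side conf for illustration; the same keys work via spark-submit --conf.
val conf = new SparkConf()
  .set("spark.python.worker.reuse", "true")   // reuse workers instead of forking fresh ones per task
  .set("spark.python.worker.memory", "512m")  // per-worker aggregation memory before spilling to disk
{code}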
[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292176#comment-14292176 ]

Joseph K. Bradley commented on SPARK-5400:
------------------------------------------

I agree this could be done either way: Algorithm[Model] or Model[Algorithm]. For users, exposing the model type may be easiest; a person who is new to ML and wants to do some clustering will know the name of a clustering model (KMeans, GMM) but may not want to worry about picking an optimization algorithm. So I'd vote for Model[Algorithm].

That said, internally, I agree that Algorithm[Model] would be handy for generalizing. We could do the combination by having an internal LearningState class:

{code}
class GaussianMixture {
  def setOptimizer(...)  // once we have more than 1 optimization method

  def run() = {
    val opt = new EM(new GMMLearningState(this))
    ...
  }
}

private[mllib] class GMMLearningState extends OurModelAbstraction {
  def this(gm: GaussianMixture) = this(...)
}

class EM(model: OurModelAbstraction)
{code}

Rename GaussianMixtureEM to GaussianMixture
-------------------------------------------

    Key: SPARK-5400
    URL: https://issues.apache.org/jira/browse/SPARK-5400
    Project: Spark
    Issue Type: Improvement
    Components: MLlib
    Affects Versions: 1.3.0
    Reporter: Joseph K. Bradley
    Priority: Minor

GaussianMixtureEM is following the old naming convention of including the optimization algorithm name in the class title. We should probably rename it to GaussianMixture so that it can use other optimization algorithms in the future.
[jira] [Commented] (SPARK-5162) Python yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292069#comment-14292069 ]

Vladimir Grigor commented on SPARK-5162:
----------------------------------------

I second [~jared.holmb...@orchestro.com]. [~lianhuiwang] thank you! I'm going to try your PR.

Related issue: even with this PR, there will be a problem using YARN in cluster mode on Amazon EMR. Normally one submits YARN jobs via the API or the aws command line utility, so paths to files are evaluated later at some remote host, and hence the files are not found. Currently Spark does not support non-local files. One idea would be to add support for non-local (Python) files, e.g. if a file is not local, it is downloaded and made available locally. Something similar to the Distributed Cache described at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-input-distributed-cache.html

So the following code would work:

{code}
aws emr add-steps --cluster-id j-XYWIXMD234 \
--steps Name=SparkPi,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://mybucketat.amazonaws.com/tasks/main.py,main.py,param1],ActionOnFailure=CONTINUE
{code}

What do you think? What is your way of running batch Python Spark scripts on YARN in Amazon?

Python yarn-cluster mode
------------------------

    Key: SPARK-5162
    URL: https://issues.apache.org/jira/browse/SPARK-5162
    Project: Spark
    Issue Type: New Feature
    Components: PySpark, YARN
    Reporter: Dana Klassen
    Labels: cluster, python, yarn

Running pyspark in YARN is currently limited to 'yarn-client' mode. It would be great to be able to submit python applications to the cluster and (just like java classes) have the resource manager set up an AM on any node in the cluster. Does anyone know the issues blocking this feature? I was snooping around with enabling python apps.

Removing the logic stopping python and yarn-cluster from SparkSubmit.scala:

// The following modes are not supported or applicable
(clusterManager, deployMode) match {
  ...
  case (_, CLUSTER) if args.isPython =>
    printErrorAndExit("Cluster deploy mode is currently not supported for python applications.")
  ...
}

... and submitting the application via:

HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster --num-executors 2 --py-files {{insert location of egg here}} --executor-cores 1 ../tools/canary.py

Everything looks to run alright: pythonRunner is picked up as main class, resources get set up, the YARN client gets launched, but it falls flat on its face:

2015-01-08 18:48:03,444 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 1420742868009, FILE, null }, Resource {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed on src filesystem (expected 1420742868009, was 1420742869284)

and

2015-01-08 18:48:03,446 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py) transitioned from DOWNLOADING to FAILED

Tracked this down to the Apache Hadoop code (FSDownload.java line 249) related to container localization of files upon downloading. At this point I thought it would be best to raise the issue here and get input.
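To make the "download non-local files" idea concrete, a rough sketch with an illustrative helper that is not part of any PR; a real version would go through Hadoop's FileSystem API so s3:// and hdfs:// URIs work too.

{code}
import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}

// If the URI is already local, hand it back; otherwise fetch it into a
// working directory and return the local path for YARN localization.
def ensureLocal(uri: String, workDir: String): String =
  if (!uri.contains("://") || uri.startsWith("file:")) {
    uri
  } else {
    val name = uri.substring(uri.lastIndexOf('/') + 1)
    val target = Paths.get(workDir, name)
    val in = new URL(uri).openStream()  // http(s) only; s3 needs Hadoop's FileSystem
    try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    target.toString
  }
{code}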
[jira] [Updated] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
[ https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-5236:
------------------------------
    Description:

{code}
15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
    at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
    at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
    at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241)
    at org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375)
    at org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434)
    at parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237)
    at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353)
    at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402)
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
    ... 27 more
{code}

    was: (the same stack trace, previously without {code} formatting)
[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently
[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292448#comment-14292448 ]

Xuefu Zhang commented on SPARK-2688:
------------------------------------

Yeah. We don't need syntactic sugar, but a transformation that does just one pass over the input RDD. This has performance implications for Hive's multi-insert use cases.

Need a way to run multiple data pipeline concurrently
-----------------------------------------------------

    Key: SPARK-2688
    URL: https://issues.apache.org/jira/browse/SPARK-2688
    Project: Spark
    Issue Type: New Feature
    Components: Spark Core
    Affects Versions: 1.0.1
    Reporter: Xuefu Zhang

Suppose we want to do the following data processing:

{code}
rdd1 -> rdd2 -> rdd3
          | -> rdd4
          | -> rdd5
          \ -> rdd6
{code}

where -> represents a transformation. rdd3 to rdd6 are all derived from an intermediate rdd2. We use foreach(fn) with a dummy function to trigger the execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be recomputed. This is very inefficient. Ideally, we should be able to trigger the execution of the whole graph and reuse rdd2, but there doesn't seem to be a way of doing so. Tez already realized the importance of this (TEZ-391), so I think Spark should provide this too.

This is required for Hive to support multi-insert queries. HIVE-7292.
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292370#comment-14292370 ]

Imran Rashid commented on SPARK-3644:
-------------------------------------

[~joshrosen] Hi Josh, I've got time to implement this now. You can assign it to me if you like (or let me know if there is something else in the works ...)

REST API for Spark application info (jobs / stages / tasks / storage info)
---------------------------------------------------------------------------

    Key: SPARK-3644
    URL: https://issues.apache.org/jira/browse/SPARK-3644
    Project: Spark
    Issue Type: Bug
    Components: Spark Core, Web UI
    Reporter: Josh Rosen

This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status.

There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts.

Let's start a discussion of what a good REST API would look like from first principles. We can discuss what URLs / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, how the API will be documented and tested, etc.

Some links for inspiration:
https://developer.github.com/v3/
http://developer.netflix.com/docs/REST_API_Reference
https://helloreverb.com/developers/swagger
[jira] [Resolved] (SPARK-5339) build/mvn doesn't work because of invalid URL for maven's tgz.
[ https://issues.apache.org/jira/browse/SPARK-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-5339.
------------------------------------
    Resolution: Fixed
    Fix Version/s: 1.3.0
    Assignee: Kousuke Saruta

build/mvn doesn't work because of invalid URL for maven's tgz.
--------------------------------------------------------------

    Key: SPARK-5339
    URL: https://issues.apache.org/jira/browse/SPARK-5339
    Project: Spark
    Issue Type: Bug
    Components: Build
    Affects Versions: 1.3.0
    Reporter: Kousuke Saruta
    Assignee: Kousuke Saruta
    Priority: Blocker
    Fix For: 1.3.0

build/mvn will automatically download a tarball of Maven, but currently the URL is invalid.
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292477#comment-14292477 ]

Reynold Xin commented on SPARK-3789:
------------------------------------

Unfortunately this is not going to make it into 1.3, given the code freeze deadline is in 1 week. [~kdatta1978] thanks for working on this. Can you write a high-level design document for this change?

Python bindings for GraphX
--------------------------

    Key: SPARK-3789
    URL: https://issues.apache.org/jira/browse/SPARK-3789
    Project: Spark
    Issue Type: New Feature
    Components: GraphX, PySpark
    Reporter: Ameet Talwalkar
    Assignee: Kushal Datta
[jira] [Commented] (SPARK-5411) Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext
[ https://issues.apache.org/jira/browse/SPARK-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292494#comment-14292494 ]

Apache Spark commented on SPARK-5411:
-------------------------------------

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4111
[jira] [Created] (SPARK-5411) Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext
Josh Rosen created SPARK-5411:
---------------------------------

    Summary: Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext
    Key: SPARK-5411
    URL: https://issues.apache.org/jira/browse/SPARK-5411
    Project: Spark
    Issue Type: New Feature
    Components: Spark Core
    Reporter: Josh Rosen
    Assignee: Josh Rosen

It would be nice if there was a mechanism to allow SparkListeners to be registered through SparkConf settings. This would allow monitoring frameworks to be easily injected into Spark programs without having to modify those programs' code.

I propose to introduce a new configuration option, {{spark.extraListeners}}, that allows SparkListeners to be specified in SparkConf and registered before the SparkContext is created. Here is the proposed documentation for the new option:

{quote}
A comma-separated list of classes that implement SparkListener; when initializing SparkContext, instances of these classes will be created and registered with Spark's listener bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor will be called; otherwise, a zero-argument constructor will be called. If no valid constructor can be found, the SparkContext creation will fail with an exception.
{quote}
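A sketch of how the proposed option would be used, going only by the documentation quoted above; the class and app names are illustrative, and a real listener would need to be on the classpath under its fully-qualified name.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// Matches the proposal's single-argument constructor form.
class JobEndLogger(conf: SparkConf) extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"[${conf.get("spark.app.name")}] job ${jobEnd.jobId} ended")
}

// The listener is instantiated and registered during SparkContext creation.
val sc = new SparkContext(new SparkConf()
  .setAppName("listener-demo")
  .set("spark.extraListeners", "com.example.JobEndLogger"))
{code}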
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292499#comment-14292499 ] Kushal Datta commented on SPARK-3789: - Sure, I will write up the design document. @Ameet, do you think you can work from another branch that is not on 1.3? Python bindings for GraphX -- Key: SPARK-3789 URL: https://issues.apache.org/jira/browse/SPARK-3789 Project: Spark Issue Type: New Feature Components: GraphX, PySpark Reporter: Ameet Talwalkar Assignee: Kushal Datta -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently
[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292431#comment-14292431 ] Sean Owen commented on SPARK-2688: -- As [~irashid] says, #1 is just syntactic sugar on what you can do already in Spark. I'm not clear, then, how something can need this functionality badly. Either it's really not blocking anything, and let's see that, or let's discuss what beyond #1 is actually needed. What I think people want is a miniature push-based evaluation method inside of Spark's pull-based DAG evaluation: force evaluation of N children of 1 parent at once. The outcome of a sidebar I had with Sandy on this was that it's probably a) fraught with gotchas, given the push-vs-pull mismatch, but not impossible, and b) would force the children to be persisted in the general case, with possible optimizations in other special cases. Is that the kind of thing Hive on Spark needs, and if so, can we hear a concrete elaboration of an example of this, so we can compare with what's possible now? I still sense there's a mismatch between the perception and the reality of what's possible with the current API. Hence, there may be some really good news here. Need a way to run multiple data pipeline concurrently - Key: SPARK-2688 URL: https://issues.apache.org/jira/browse/SPARK-2688 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.1 Reporter: Xuefu Zhang Suppose we want to do the following data processing: {code} rdd1 -> rdd2 -> rdd3 | -> rdd4 | -> rdd5 \ -> rdd6 {code} where -> represents a transformation. rdd3 to rdd6 are all derived from an intermediate rdd2. We use foreach(fn) with a dummy function to trigger the execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be recomputed. This is very inefficient. Ideally, we should be able to trigger the execution of the whole graph and reuse rdd2, but there doesn't seem to be a way of doing so. Tez already realized the importance of this (TEZ-391), so I think Spark should provide this too. This is required for Hive to support multi-insert queries. HIVE-7292. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
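For reference, the "#1 is just syntactic sugar" point corresponds to a pattern that is already expressible today: persist the shared parent, then submit the downstream actions from separate threads, since a single SparkContext accepts jobs from multiple threads. A minimal sketch, with {{map(identity)}} standing in for the real transformations from the diagram above:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def runAllChildren(rdd1: RDD[Int]): Unit = {
  // Cache the shared intermediate so it is computed once, not once per child job.
  val rdd2 = rdd1.map(identity).persist(StorageLevel.MEMORY_AND_DISK)
  val children = Seq.tabulate(4)(_ => rdd2.map(identity)) // stand-ins for rdd3..rdd6

  // The four dummy actions run as concurrent jobs and all reuse the cached
  // rdd2, so rdd2 is not recomputed per child.
  val threads = children.map { rdd =>
    new Thread(new Runnable { override def run(): Unit = rdd.foreach(_ => ()) })
  }
  threads.foreach(_.start())
  threads.foreach(_.join())
}
{code}

This is exactly the kind of thing the comment argues is already possible; the open question is whether an API should do the persist-and-fan-out on the user's behalf.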
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292399#comment-14292399 ] Dmitriy Lyubimov commented on SPARK-5226: - All attempts to parallelize DBSCAN in the recent literature (or similar DeLiClu-type things) that I have read about involve partitioning the task into smaller subtasks, solving each individually, and merging it all back (see the MR.Scan paper for example). Merging is of course the new and tricky part. As far as I understand, they all pretty much limit their scope to Euclidean distances and capitalize on the notions of Euclidean geometry that follow from that, in order to solve the partition and merge problems, which substantially reduces the attractiveness of a general algorithm. However, a naive straightforward port of the simple DBSCAN algorithm is not terribly practical for big data because of the total complexity of the problem (or the impracticality of building something like a huge distributed R-tree index system on shared-nothing programming models). Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib is all k-means now, and I think we should add some new clustering algorithms to it. The first candidate, I think, is DBSCAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently
[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292415#comment-14292415 ] Xuefu Zhang commented on SPARK-2688: #1 above is exactly what Hive needs badly. Need a way to run multiple data pipeline concurrently - Key: SPARK-2688 URL: https://issues.apache.org/jira/browse/SPARK-2688 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.1 Reporter: Xuefu Zhang Suppose we want to do the following data processing: {code} rdd1 -> rdd2 -> rdd3 | -> rdd4 | -> rdd5 \ -> rdd6 {code} where -> represents a transformation. rdd3 to rdd6 are all derived from an intermediate rdd2. We use foreach(fn) with a dummy function to trigger the execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be recomputed. This is very inefficient. Ideally, we should be able to trigger the execution of the whole graph and reuse rdd2, but there doesn't seem to be a way of doing so. Tez already realized the importance of this (TEZ-391), so I think Spark should provide this too. This is required for Hive to support multi-insert queries. HIVE-7292. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292468#comment-14292468 ] Kushal Datta commented on SPARK-3789: - Hi Ameet, Sorry for asking this question again. What's the release plan for 1.3? -Kushal. Python bindings for GraphX -- Key: SPARK-3789 URL: https://issues.apache.org/jira/browse/SPARK-3789 Project: Spark Issue Type: New Feature Components: GraphX, PySpark Reporter: Ameet Talwalkar Assignee: Kushal Datta -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5416) Initialize Executor.threadPool before ExecutorSource
[ https://issues.apache.org/jira/browse/SPARK-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292784#comment-14292784 ] Apache Spark commented on SPARK-5416: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4212 Initialize Executor.threadPool before ExecutorSource Key: SPARK-5416 URL: https://issues.apache.org/jira/browse/SPARK-5416 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor I recently saw some NPEs from [{{ExecutorSource:44}}|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala#L44] in the first couple of seconds of my executors being initialized. I think that {{ExecutorSource}} was trying to report these metrics before its threadpool was initialized; there are a few LoC between the source being registered ([Executor.scala:82|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L82]) and the threadpool being initialized ([Executor.scala:106|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L106]). We should initialize the threadpool before the ExecutorSource is registered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
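The bug class is easy to reproduce in miniature: a Dropwizard gauge registered before the field it reads is initialized. A generic sketch of the fix ordering (not Spark's actual code; the class and metric names here are illustrative):

{code}
import java.util.concurrent.{Executors, ThreadPoolExecutor}
import com.codahale.metrics.{Gauge, MetricRegistry}

class Worker(registry: MetricRegistry) {
  // The fix, in miniature: initialize the pool BEFORE exposing gauges over it,
  // so a metrics poll that fires immediately cannot dereference null.
  val threadPool: ThreadPoolExecutor =
    Executors.newCachedThreadPool().asInstanceOf[ThreadPoolExecutor]

  registry.register("threadpool.activeTasks", new Gauge[Int] {
    override def getValue: Int = threadPool.getActiveCount
  })
}
{code}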
[jira] [Commented] (SPARK-3562) Periodic cleanup event logs
[ https://issues.apache.org/jira/browse/SPARK-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292818#comment-14292818 ] Apache Spark commented on SPARK-3562: - User 'viper-kun' has created a pull request for this issue: https://github.com/apache/spark/pull/4214 Periodic cleanup event logs --- Key: SPARK-3562 URL: https://issues.apache.org/jira/browse/SPARK-3562 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: xukun If we run Spark applications frequently, many event logs will be written into spark.eventLog.dir. After a long time, spark.eventLog.dir will contain many event logs that we no longer care about. Periodic cleanup will ensure that logs older than a given duration are forgotten, so there is no need to clean logs by hand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Attachment: SparkSQLOnHBase_v2.0.docx HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Assignee: Yan Attachments: HBaseOnSpark.docx, SparkSQLOnHBase_v2.0.docx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Attachment: (was: SparkSQLOnHBase_v2.docx) HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Assignee: Yan Attachments: HBaseOnSpark.docx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292874#comment-14292874 ] Andrew Or commented on SPARK-5388: -- Hi Dale, thank you for your comments. Yes, in the design doc I used REST roughly interchangeably with HTTP/JSON. But the goal is not to provide a mechanism for other entities to communicate with the Master as you suggested; it is simply to provide a stable mechanism for Spark to work across multiple versions. For instance, you might have a long-running Master that outlives multiple Spark versions, in which case we want to guarantee that newer versions of Spark will still be able to submit to the long-running Master. I think your proposal to make this more REST-like is potentially a great idea. However, I find the alternative of simply putting the action in the JSON itself easier to reason about. This also allows us to add other messages in the future that are not strictly limited to the semantics of GET, POST, and DELETE. That said, my proposal is also not set in stone yet so if there is a reason compelling enough to change it then I will do so. Also, a first-cut implementation of my design is now posted at: https://github.com/apache/spark/pull/4216. Please take a look if you feel inclined. Provide a stable application submission gateway --- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: Stable Spark Standalone Submission.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. The first cut implementation will target standalone cluster mode because there are very few messages exchanged. The design, however, will be general enough to eventually support this for other cluster managers too. Note that this is not necessarily required in YARN because we already use YARN's stable interface to submit applications there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
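For illustration only, the "action in the JSON itself" alternative might look roughly like the following; the field and message names are hypothetical, not taken from the design doc:

{code}
// Hypothetical submission message: the message type travels as a payload
// field, so new message kinds can be added later without having to map each
// one onto the semantics of an HTTP verb (GET/POST/DELETE).
case class SubmissionMessage(
  action: String,             // e.g. "SubmitDriver", "KillDriver", "RequestDriverStatus"
  clientSparkVersion: String, // lets old and new Spark versions interoperate
  appResource: String,        // path to the application jar
  mainClass: String)
{code}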
[jira] [Comment Edited] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292853#comment-14292853 ] Joseph Tang edited comment on SPARK-4846 at 1/27/15 2:46 AM: - Sorry about the procrastination. I just thought you meant there is no need to implement a dynamic strategy. I'm still working on it and I'd like to quickly fix this issue. Regarding your previous comment, should I throw a customized error in Spark or just an OOM besides the hint about minCount and vectorSize? was (Author: josephtang): Sorry about the procrastination. I'm still working on this. Regarding your previous comment, should I throw a customized error in Spark or just an OOM besides the hint about minCount and vectorSize? When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5422) Support sending to Graphite via UDP
[ https://issues.apache.org/jira/browse/SPARK-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292925#comment-14292925 ] Apache Spark commented on SPARK-5422: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4218 Support sending to Graphite via UDP --- Key: SPARK-5422 URL: https://issues.apache.org/jira/browse/SPARK-5422 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor {{io.dropwizard.metrics-graphite}} version {{3.1.0}} can send metrics to Graphite via UDP or TCP. After upgrading ([SPARK-5413|https://issues.apache.org/jira/browse/SPARK-5413]), we should support using this facility, presumably specified via a protocol field in the metrics config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
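For context, a sketch of what the upgraded library itself offers, using the Dropwizard API directly (the host and port are placeholders); the exact Spark config syntax is left open by this issue:

{code}
import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.graphite.{GraphiteReporter, GraphiteUDP}

val registry = new MetricRegistry()
// GraphiteUDP is the 3.1.0 UDP counterpart of the TCP Graphite sender.
val sender = new GraphiteUDP(new InetSocketAddress("graphite.example.com", 2003))
val reporter = GraphiteReporter.forRegistry(registry)
  .convertRatesTo(TimeUnit.SECONDS)
  .convertDurationsTo(TimeUnit.MILLISECONDS)
  .build(sender)
reporter.start(10, TimeUnit.SECONDS) // report every 10 seconds over UDP
{code}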
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292926#comment-14292926 ] Joseph Tang commented on SPARK-4846: I've added some code at https://github.com/jinntrance/spark/compare/w2v-fix?diff=splitname=w2v-fix If it looks OK, I will send a new PR to the `master` branch. When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292998#comment-14292998 ] Luca Morandini commented on SPARK-1405: --- Indeed, I have a couple of students whose assignments involve Twitter data, and I am considering adding LDA to the mix. I would like to test it on our corpus... provided this feature is usable by a Spark novice: is it? parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5417) Remove redundant executor-ID set() call
Ryan Williams created SPARK-5417: Summary: Remove redundant executor-ID set() call Key: SPARK-5417 URL: https://issues.apache.org/jira/browse/SPARK-5417 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor {{spark.executor.id}} no longer [needs to be set in Executor.scala|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L79], as of [#4194|https://github.com/apache/spark/pull/4194]; it is set upstream in [SparkEnv|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/SparkEnv.scala#L332]. Might as well remove the redundant set() in Executor.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5119) java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model
[ https://issues.apache.org/jira/browse/SPARK-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5119: - Assignee: Kai Sasaki java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model --- Key: SPARK-5119 URL: https://issues.apache.org/jira/browse/SPARK-5119 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 1.1.0, 1.2.0 Environment: Linux Ubuntu 14.04 Reporter: Vivek Kulkarni Assignee: Kai Sasaki Fix For: 1.3.0 First I tried to see if there was a bug raised before with a similar trace. I found https://www.mail-archive.com/user@spark.apache.org/msg13708.html but the suggestion to upgrade to the latest code base (I cloned from the master branch) does not fix this issue. Issue: try to train a decision tree classifier on some data. After training, when it begins to collect, it crashes: 15/01/06 22:28:15 INFO BlockManagerMaster: Updated info of block rdd_52_1 15/01/06 22:28:15 ERROR Executor: Exception in task 1.0 in stage 31.0 (TID 1895) java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.mllib.tree.impurity.GiniAggregator.update(Gini.scala:93) at org.apache.spark.mllib.tree.impl.DTStatsAggregator.update(DTStatsAggregator.scala:100) at org.apache.spark.mllib.tree.DecisionTree$.orderedBinSeqOp(DecisionTree.scala:419) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$nodeBinSeqOp$1(DecisionTree.scala:511) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:536) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:533) at scala.collection.immutable.Map$Map1.foreach(Map.scala:109) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:533) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at scala.collection.Iterator$class.foreach(Iterator.scala:727) Minimal code: data = MLUtils.loadLibSVMFile(sc, '/scratch1/vivek/datasets/private/a1a').cache() model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, maxDepth=5, maxBins=100) Just download the data from: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292853#comment-14292853 ] Joseph Tang commented on SPARK-4846: Sorry about the procrastination. I'm still working on this. Regarding your previous comment, should I throw a customized error in Spark or just an OOM besides the hint about minCount and vectorSize? When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292855#comment-14292855 ] Joseph Tang commented on SPARK-4846: Sorry about the procrastination. I'm still working on this. Regarding your previous comment, should I throw a customized error in Spark or just an OOM besides the hint about minCount and vectorSize? When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4979) Add streaming logistic regression
[ https://issues.apache.org/jira/browse/SPARK-4979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4979: - Assignee: Jeremy Freeman Add streaming logistic regression - Key: SPARK-4979 URL: https://issues.apache.org/jira/browse/SPARK-4979 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Jeremy Freeman Assignee: Jeremy Freeman Priority: Minor We currently support streaming linear regression and k-means clustering. We can add support for streaming logistic regression using a strategy similar to that used in streaming linear regression, applying gradient updates to batches of data from a DStream, and extending the existing mllib methods with minor modifications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
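A sketch of the kind of API this implies, mirroring the existing StreamingLinearRegressionWithSGD; the class name below is the natural analogue and is assumed here, since the feature is not merged yet:

{code}
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.dstream.DStream

def run(training: DStream[LabeledPoint],
        test: DStream[LabeledPoint],
        numFeatures: Int): Unit = {
  val model = new StreamingLogisticRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(numFeatures))
  model.trainOn(training) // one gradient update per incoming batch
  model.predictOnValues(test.map(lp => (lp.label, lp.features))).print()
}
{code}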
[jira] [Updated] (SPARK-5421) SparkSql throw OOM at shuffle
[ https://issues.apache.org/jira/browse/SPARK-5421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-5421: - Description: ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but SparkSQL's shuffledRDD doesn't define an aggregator, so SparkSQL won't spill at shuffle, and it's very easy to throw OOM at shuffle. One of the executors' logs; here is stderr: 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484] 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 non-empty blocks out of 143 blocks 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote fetches in 72 ms 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM here is stdout: 2015-01-27T07:44:43.487+0800: [Full GC 3961343K->3959868K(3961344K), 29.8959290 secs] 2015-01-27T07:45:13.460+0800: [Full GC 3961343K->3959992K(3961344K), 27.9218150 secs] 2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs] 2015-01-27T07:45:52.950+0800: [Full GC 3961343K->3960113K(3961344K), 29.3894670 secs] 2015-01-27T07:46:22.393+0800: [Full GC 3961118K->3960240K(3961344K), 28.9879600 secs] 2015-01-27T07:46:51.393+0800: [Full GC 3960240K->3960213K(3961344K), 34.1530900 secs] # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 9050... 2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs] was: ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but SparkSQL's shuffledRDD doesn't define an aggregator, so SparkSQL won't spill at shuffle, and it's very easy to throw OOM at shuffle. SparkSql throw OOM at shuffle - Key: SPARK-5421 URL: https://issues.apache.org/jira/browse/SPARK-5421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but SparkSQL's shuffledRDD doesn't define an aggregator, so SparkSQL won't spill at shuffle, and it's very easy to throw OOM at shuffle. 
One of the executors' logs; here is stderr: 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484] 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 non-empty blocks out of 143 blocks 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote fetches in 72 ms 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM here is stdout: 2015-01-27T07:44:43.487+0800: [Full GC 3961343K->3959868K(3961344K), 29.8959290 secs] 2015-01-27T07:45:13.460+0800: [Full GC 3961343K->3959992K(3961344K), 27.9218150 secs] 2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs] 2015-01-27T07:45:52.950+0800: [Full GC 3961343K->3960113K(3961344K), 29.3894670 secs] 2015-01-27T07:46:22.393+0800: [Full GC 3961118K->3960240K(3961344K), 28.9879600 secs] 2015-01-27T07:46:51.393+0800: [Full GC 3960240K->3960213K(3961344K), 34.1530900 secs] # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 9050... 2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5421) SparkSql throw OOM at shuffle
[ https://issues.apache.org/jira/browse/SPARK-5421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-5421: - Description: ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but SparkSQL's shuffledRDD doesn't define an aggregator, so SparkSQL won't spill at shuffle, and it's very easy to throw OOM at shuffle. I think SparkSQL also needs to spill at shuffle. One of the executors' logs; here is stderr: 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484] 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 non-empty blocks out of 143 blocks 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote fetches in 72 ms 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM here is stdout: 2015-01-27T07:44:43.487+0800: [Full GC 3961343K->3959868K(3961344K), 29.8959290 secs] 2015-01-27T07:45:13.460+0800: [Full GC 3961343K->3959992K(3961344K), 27.9218150 secs] 2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs] 2015-01-27T07:45:52.950+0800: [Full GC 3961343K->3960113K(3961344K), 29.3894670 secs] 2015-01-27T07:46:22.393+0800: [Full GC 3961118K->3960240K(3961344K), 28.9879600 secs] 2015-01-27T07:46:51.393+0800: [Full GC 3960240K->3960213K(3961344K), 34.1530900 secs] # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 9050... 2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs] was: ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but SparkSQL's shuffledRDD doesn't define an aggregator, so SparkSQL won't spill at shuffle, and it's very easy to throw OOM at shuffle. One of the executors' logs; here is stderr: 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484] 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 non-empty blocks out of 143 blocks 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote fetches in 72 ms 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM here is stdout: 2015-01-27T07:44:43.487+0800: [Full GC 3961343K->3959868K(3961344K), 29.8959290 secs] 2015-01-27T07:45:13.460+0800: [Full GC 3961343K->3959992K(3961344K), 27.9218150 secs] 2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs] 2015-01-27T07:45:52.950+0800: [Full GC 3961343K->3960113K(3961344K), 29.3894670 secs] 2015-01-27T07:46:22.393+0800: [Full GC 3961118K->3960240K(3961344K), 28.9879600 secs] 2015-01-27T07:46:51.393+0800: [Full GC 3960240K->3960213K(3961344K), 34.1530900 secs] # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 9050... 
2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs] SparkSql throw OOM at shuffle - Key: SPARK-5421 URL: https://issues.apache.org/jira/browse/SPARK-5421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but SparkSQL's shuffledRDD doesn't define an aggregator, so SparkSQL won't spill at shuffle, and it's very easy to throw OOM at shuffle. I think SparkSQL also needs to spill at shuffle. One of the executors' logs; here is stderr: 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484] 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 non-empty blocks out of 143 blocks 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote fetches in 72 ms 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM here is stdout: 2015-01-27T07:44:43.487+0800: [Full GC 3961343K->3959868K(3961344K), 29.8959290 secs] 2015-01-27T07:45:13.460+0800: [Full GC 3961343K->3959992K(3961344K), 27.9218150 secs] 2015-01-27T07:45:41.407+0800: [GC
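A simplified sketch of the distinction being reported (not the actual shuffle-reader code): only the aggregating path runs through a spillable external map, so a plain pass-through shuffle keeps everything on the heap.

{code}
// combine is Some(...) when the shuffle dependency defines an aggregator.
def readShuffle[K, V](fetched: Iterator[(K, V)],
                      combine: Option[(V, V) => V]): Iterator[(K, V)] =
  combine match {
    case Some(merge) =>
      // Stand-in for ExternalAppendOnlyMap: values are merged per key, and the
      // real map can spill its contents to disk under memory pressure.
      fetched.foldLeft(Map.empty[K, V]) { case (m, (k, v)) =>
        m.updated(k, m.get(k).map(merge(_, v)).getOrElse(v))
      }.iterator
    case None =>
      fetched // pass-through: nothing can spill, hence the OOM risk described
  }
{code}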
[jira] [Updated] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3726: - Target Version/s: 1.3.0 RandomForest: Support for bootstrap options --- Key: SPARK-3726 URL: https://issues.apache.org/jira/browse/SPARK-3726 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Manoj Kumar Priority: Minor Fix For: 1.3.0 RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. The expected size of each sample is the same as the original data (sampling rate = 1.0), and sampling is done with replacement. Adding support for other sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
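For reference, BaggedPoint-style bootstrapping is typically simulated with per-point Poisson counts, so the requested options reduce to changing the Poisson mean (for other sampling rates) or swapping in a Bernoulli draw (for sampling without replacement). A sketch, assuming commons-math3 is on the classpath:

{code}
import org.apache.commons.math3.distribution.PoissonDistribution

// With replacement: each point's multiplicity in a sample is ~ Poisson(rate);
// rate = 1.0 gives the classic same-expected-size bootstrap described above.
def bootstrapCounts(numPoints: Int, rate: Double, seed: Long): Array[Int] = {
  val poisson = new PoissonDistribution(rate)
  poisson.reseedRandomGenerator(seed)
  Array.fill(numPoints)(poisson.sample())
}

// Without replacement: a Bernoulli draw yields 0/1 multiplicities instead.
def subsampleCounts(numPoints: Int, rate: Double, rng: scala.util.Random): Array[Int] =
  Array.fill(numPoints)(if (rng.nextDouble() < rate) 1 else 0)
{code}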
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293001#comment-14293001 ] Joseph K. Bradley commented on SPARK-1405: -- It has not yet been merged into Spark master, but hopefully will be soon. The initial version should be usable. We will continue to add improvements to the API, especially helper functionality such as prediction and summaries about the learned model. parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5416) Initialize Executor.threadPool before ExecutorSource
Ryan Williams created SPARK-5416: Summary: Initialize Executor.threadPool before ExecutorSource Key: SPARK-5416 URL: https://issues.apache.org/jira/browse/SPARK-5416 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor I recently saw some NPEs from [{{ExecutorSource:44}}|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala#L44] in the first couple of seconds of my executors being initialized. I think that {{ExecutorSource}} was trying to report these metrics before its threadpool was initialized; there are a few LoC between the source being registered ([Executor.scala:82|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L82]) and the threadpool being initialized ([Executor.scala:106|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L106]). We should initialize the threadpool before the ExecutorSource is registered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5341) Support maven coordinates in spark-shell and spark-submit
[ https://issues.apache.org/jira/browse/SPARK-5341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292826#comment-14292826 ] Apache Spark commented on SPARK-5341: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/4215 Support maven coordinates in spark-shell and spark-submit - Key: SPARK-5341 URL: https://issues.apache.org/jira/browse/SPARK-5341 Project: Spark Issue Type: New Feature Components: Deploy, Spark Shell Reporter: Burak Yavuz This feature will allow users to provide the Maven coordinates of jars they wish to use in their Spark application. Coordinates can be a comma-delimited list, supplied like: ```spark-submit --maven org.apache.example.a,org.apache.example.b``` This feature will also be added to spark-shell (where it is even more critical). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5119) java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model
[ https://issues.apache.org/jira/browse/SPARK-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5119: - Target Version/s: 1.3.0 java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model --- Key: SPARK-5119 URL: https://issues.apache.org/jira/browse/SPARK-5119 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 1.1.0, 1.2.0 Environment: Linux Ubuntu 14.04 Reporter: Vivek Kulkarni Assignee: Kai Sasaki Fix For: 1.3.0 First I tried to see if there was a bug raised before with a similar trace. I found https://www.mail-archive.com/user@spark.apache.org/msg13708.html but the suggestion to upgrade to the latest code base (I cloned from the master branch) does not fix this issue. Issue: try to train a decision tree classifier on some data. After training, when it begins to collect, it crashes: 15/01/06 22:28:15 INFO BlockManagerMaster: Updated info of block rdd_52_1 15/01/06 22:28:15 ERROR Executor: Exception in task 1.0 in stage 31.0 (TID 1895) java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.mllib.tree.impurity.GiniAggregator.update(Gini.scala:93) at org.apache.spark.mllib.tree.impl.DTStatsAggregator.update(DTStatsAggregator.scala:100) at org.apache.spark.mllib.tree.DecisionTree$.orderedBinSeqOp(DecisionTree.scala:419) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$nodeBinSeqOp$1(DecisionTree.scala:511) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:536) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:533) at scala.collection.immutable.Map$Map1.foreach(Map.scala:109) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:533) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at scala.collection.Iterator$class.foreach(Iterator.scala:727) Minimal code: data = MLUtils.loadLibSVMFile(sc, '/scratch1/vivek/datasets/private/a1a').cache() model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, maxDepth=5, maxBins=100) Just download the data from: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5418) Output directory for shuffle should consider left space of each directory set in conf
ding created SPARK-5418: --- Summary: Output directory for shuffle should consider left space of each directory set in conf Key: SPARK-5418 URL: https://issues.apache.org/jira/browse/SPARK-5418 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Environment: Ubuntu, others should be similar Reporter: ding Priority: Minor I set multiple directories in conf spark.local.dir as scratch space; one of them (e.g. /mnt/disk1) has 30G of free space while the others (e.g. /mnt/disk2) have 100G. In the current version, Spark uses a hash to figure out which directory is used for scratch space, which means each directory has the same chance of being chosen. After hundreds of iterations of PageRank, there is a No space left exception and the driver crashes. This does not make sense, since there is still 70G+ of free space in the other directories. We should take the free space of each directory into consideration when figuring out which directory should be the map output dir. I will send a PR for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
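A hypothetical sketch of the idea (not the actual PR): weight the directory choice by reported usable space instead of hashing uniformly, so a nearly full disk is picked rarely rather than 1/N of the time.

{code}
import java.io.File
import scala.util.Random

// Choose a scratch directory with probability proportional to its free space.
def chooseLocalDir(dirs: Seq[File], rng: Random = new Random): File = {
  val free = dirs.map(_.getUsableSpace.max(1L))
  var pick = (rng.nextDouble() * free.sum).toLong
  dirs.zip(free)
    .find { case (_, space) => pick -= space; pick < 0 }
    .map(_._1)
    .getOrElse(dirs.last)
}
{code}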
[jira] [Resolved] (SPARK-5119) java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model
[ https://issues.apache.org/jira/browse/SPARK-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5119. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3975 [https://github.com/apache/spark/pull/3975] java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model --- Key: SPARK-5119 URL: https://issues.apache.org/jira/browse/SPARK-5119 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 1.1.0, 1.2.0 Environment: Linux Ubuntu 14.04 Reporter: Vivek Kulkarni Fix For: 1.3.0 First I tried to see if there was a bug raised before with a similar trace. I found https://www.mail-archive.com/user@spark.apache.org/msg13708.html but the suggestion to upgrade to the latest code base (I cloned from the master branch) does not fix this issue. Issue: try to train a decision tree classifier on some data. After training, when it begins to collect, it crashes: 15/01/06 22:28:15 INFO BlockManagerMaster: Updated info of block rdd_52_1 15/01/06 22:28:15 ERROR Executor: Exception in task 1.0 in stage 31.0 (TID 1895) java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.mllib.tree.impurity.GiniAggregator.update(Gini.scala:93) at org.apache.spark.mllib.tree.impl.DTStatsAggregator.update(DTStatsAggregator.scala:100) at org.apache.spark.mllib.tree.DecisionTree$.orderedBinSeqOp(DecisionTree.scala:419) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$nodeBinSeqOp$1(DecisionTree.scala:511) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:536) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:533) at scala.collection.immutable.Map$Map1.foreach(Map.scala:109) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:533) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at scala.collection.Iterator$class.foreach(Iterator.scala:727) Minimal code: data = MLUtils.loadLibSVMFile(sc, '/scratch1/vivek/datasets/private/a1a').cache() model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, maxDepth=5, maxBins=100) Just download the data from: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292866#comment-14292866 ] Apache Spark commented on SPARK-5388: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/4216 Provide a stable application submission gateway --- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: Stable Spark Standalone Submission.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. The first cut implementation will target standalone cluster mode because there are very few messages exchanged. The design, however, will be general enough to eventually support this for other cluster managers too. Note that this is not necessarily required in YARN because we already use YARN's stable interface to submit applications there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293019#comment-14293019 ] Aniket Bhatnagar commented on SPARK-2243: - I am also interested in having this fixed. Can someone please outline what are specific things that need to be fixed to make this work so that interested people can contribute? Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at 
java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at
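The failure above is characteristic of two drivers sharing one JVM: the second context resolves broadcast data against state owned by the first. A minimal sketch of the two-context lifecycle under discussion (the local master and job bodies are illustrative placeholders; the key point is stopping the first context before creating the second):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoContexts {
  def main(args: Array[String]): Unit = {
    // First context: a plain Spark calculation.
    val batchSc = new SparkContext(new SparkConf().setAppName("calc").setMaster("local[2]"))
    println(batchSc.parallelize(1 to 100).sum())
    // Stop the first context before building the second; running two live
    // contexts in one JVM is what produces errors like the broadcast
    // FileNotFoundException in the trace above.
    batchSc.stop()

    // Second context: a streaming job, created only after the first stopped.
    val ssc = new StreamingContext(
      new SparkConf().setAppName("stream").setMaster("local[2]"), Seconds(10))
    // ... attach input streams and output operations here ...
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}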
[jira] [Updated] (SPARK-5420) Cross-language load/store functions for creating and saving DataFrames
[ https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5420: --- Summary: Cross-language load/store functions for creating and saving DataFrames (was: Create cross-language load/store functions for creating and saving DataFrames) Cross-language load/store functions for creating and saving DataFrames -- Key: SPARK-5420 URL: https://issues.apache.org/jira/browse/SPARK-5420 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Patrick Wendell We should have standard APIs for loading or saving a table from a data store. One idea: {code} df = sc.loadTable(path.to.DataSource, {a: b, c: d}) sc.storeTable(path.to.DataSource, {a: b, c: d}) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
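The snippet in the ticket is pseudocode. A rough Scala sketch of the call shape being proposed (the `DataFrame` stand-in and the `TableIO` trait below are illustrative, not actual Spark classes):

{code}
// Illustrative stand-in; not the real Spark SQL DataFrame.
case class DataFrame(rows: Seq[Map[String, Any]])

// Sketch of the proposed cross-language entry points: the data source is
// named by a plain string (resolvable from any language binding) and the
// options travel as a string-to-string map, so Python dicts and Java/Scala
// maps present the same shape.
trait TableIO {
  def loadTable(source: String, options: Map[String, String]): DataFrame
  def storeTable(df: DataFrame, source: String, options: Map[String, String]): Unit
}
{code}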
[jira] [Issue Comment Deleted] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Tang updated SPARK-4846: --- Comment: was deleted (was: Sorry about the procrastination. I'm still working on this. Regarding your previous comment, should I throw an customized error in Spark or just OOM besides the hint about minCount and vectorSize? ) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
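To see why a 10-million-word vocabulary trips the VM array limit, a quick back-of-envelope check (the assumption that Word2Vec keeps Float weight arrays of length vocabSize * vectorSize is mine, inferred from the trace above):

{code}
// Back-of-envelope check for the sizes in this report (Scala REPL snippet).
val vocabSize = 10000000L // ~10 million words, as in the environment above
val vectorSize = 100
val elements = vocabSize * vectorSize // 1.0e9 floats per weight array
println(s"$elements elements, ~${elements * 4 / (1L << 30)} GiB per Float array")
// A single JVM array holds at most ~Integer.MAX_VALUE (~2.1e9) elements.
// Serializing such an array through ByteArrayOutputStream (as in the stack
// trace) needs one contiguous byte[] of the full serialized size, which is
// exactly where "Requested array size exceeds VM limit" is thrown.
{code}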
[jira] [Comment Edited] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292853#comment-14292853 ] Joseph Tang edited comment on SPARK-4846 at 1/27/15 2:44 AM: - Sorry about the procrastination. I'm still working on this. Regarding your previous comment, should I throw a customized error in Spark or just an OOM besides the hint about minCount and vectorSize? was (Author: josephtang): Sorry about the procrastination. I'm still working on this. Regarding your previous comment, should I throw an customized error in Spark or just OOM besides the hint about minCount and vectorSize? When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292886#comment-14292886 ] Joseph Tang commented on SPARK-4846: Hi Xiangrui, there is a problem: PR #3693, which added `setMinCount`, was merged to the `master` branch, while my PR #3697 was sent to `branch-1.1`. Would it be better to close PR #3697 and send a new PR based on PR #3693? When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5420) Create cross-language load/store functions for creating and saving DataFrames
Patrick Wendell created SPARK-5420: -- Summary: Create cross-language load/store functions for creating and saving DataFrames Key: SPARK-5420 URL: https://issues.apache.org/jira/browse/SPARK-5420 Project: Spark Issue Type: Sub-task Reporter: Patrick Wendell We should have standard APIs for loading or saving a table from a data store. One idea: {code} df = sc.loadTable(path.to.DataSource, {a: b, c: d}) sc.storeTable(path.to.DataSource, {a: b, c: d}) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292913#comment-14292913 ] Sven Krasser commented on SPARK-5395: - Some additional findings from my side: I've managed to trigger the problem using a simpler job on production data that basically does a reduceByKey followed by a count action. I get 20 workers (2 cores per executor) before any tasks in the first stage (reduceByKey) complete (i.e., different from the stage-transition behavior you noticed). However, this doesn't occur if I run over a smaller data set, i.e. fewer production data files. Before calling reduceByKey I have a coalesce call; without it, the error does not occur (at least in this smaller script). At first glance this looked potentially spilling-related (more data per task), but attempting to force spills by setting the worker memory very low did not help my attempts to reproduce it on test data. Large number of Python workers causing resource depletion - Key: SPARK-5395 URL: https://issues.apache.org/jira/browse/SPARK-5395 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: AWS ElasticMapReduce Reporter: Sven Krasser During job execution a large number of Python workers accumulates, eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container). In this instance, 97 pyspark.daemon processes had accumulated at the time the container was killed. {noformat} 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container. Dump of the process-tree for container_1421692415636_0052_01_30 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon [...] {noformat} The configuration uses 64 containers with 2 cores each. Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c Mailing list discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
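For reference, the job shape described in this comment looks roughly like the following (the original job is PySpark; this is a hedged Scala rendering with placeholder names, paths, and partition counts):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object WorkerBuildupRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repro"))
    val lines = sc.textFile("hdfs:///path/to/production/files")
    val counts = lines
      .coalesce(64)                     // the coalesce implicated above
      .map(line => (line.take(16), 1L)) // placeholder keying
      .reduceByKey(_ + _)               // first stage boundary
    println(counts.count())             // the count action
    sc.stop()
  }
}
{code}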
[jira] [Resolved] (SPARK-5052) com.google.common.base.Optional binary has a wrong method signature
[ https://issues.apache.org/jira/browse/SPARK-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5052. Resolution: Fixed Fix Version/s: 1.3.0 com.google.common.base.Optional binary has a wrong method signature Key: SPARK-5052 URL: https://issues.apache.org/jira/browse/SPARK-5052 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Elmer Garduno Fix For: 1.3.0 PR https://github.com/apache/spark/pull/1813 shaded the Guava jar file and moved Guava classes to the package org.spark-project.guava when Spark is built by Maven. When a user jar uses the actual com.google.common.base.Optional<V> transform(com.google.common.base.Function<? super T, V>) method from Guava, a java.lang.NoSuchMethodError: com.google.common.base.Optional.transform(Lcom/google/common/base/Function;)Lcom/google/common/base/Optional; is thrown. The reason seems to be that the Optional class included in spark-assembly-1.2.0-hadoop1.0.4.jar has an incorrect method signature that includes the shaded class as an argument: Expected: javap -classpath target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar com.google.common.base.Optional public abstract <V extends java/lang/Object> com.google.common.base.Optional<V> transform(com.google.common.base.Function<? super T, V>); Found: javap -classpath lib/spark-assembly-1.2.0-hadoop1.0.4.jar com.google.common.base.Optional public abstract <V extends java/lang/Object> com.google.common.base.Optional<V> transform(org.spark-project.guava.common.base.Function<? super T, V>); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
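A hedged sketch of user code that hits this (the jar names above are from the report; the snippet compiles against real Guava and then fails at runtime against the assembly's shaded signature):

{code}
import com.google.common.base.{Function, Optional}

object GuavaOptionalRepro {
  def main(args: Array[String]): Unit = {
    val opt: Optional[Integer] = Optional.of(21)
    // Compiled against real Guava this links to
    // transform(com.google.common.base.Function), but the assembly's
    // Optional expects the shaded org.spark-project.guava Function,
    // so this call throws NoSuchMethodError at runtime.
    val doubled = opt.transform(new Function[Integer, Integer] {
      override def apply(x: Integer): Integer = x * 2
    })
    println(doubled.get()) // 42 when the signatures actually match
  }
}
{code}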
[jira] [Updated] (SPARK-5052) com.google.common.base.Optional binary has a wrong method signature
[ https://issues.apache.org/jira/browse/SPARK-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5052: --- Assignee: Elmer Garduno com.google.common.base.Optional binary has a wrong method signature Key: SPARK-5052 URL: https://issues.apache.org/jira/browse/SPARK-5052 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Elmer Garduno Assignee: Elmer Garduno Fix For: 1.3.0 PR https://github.com/apache/spark/pull/1813 shaded the Guava jar file and moved Guava classes to the package org.spark-project.guava when Spark is built by Maven. When a user jar uses the actual com.google.common.base.Optional<V> transform(com.google.common.base.Function<? super T, V>) method from Guava, a java.lang.NoSuchMethodError: com.google.common.base.Optional.transform(Lcom/google/common/base/Function;)Lcom/google/common/base/Optional; is thrown. The reason seems to be that the Optional class included in spark-assembly-1.2.0-hadoop1.0.4.jar has an incorrect method signature that includes the shaded class as an argument: Expected: javap -classpath target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar com.google.common.base.Optional public abstract <V extends java/lang/Object> com.google.common.base.Optional<V> transform(com.google.common.base.Function<? super T, V>); Found: javap -classpath lib/spark-assembly-1.2.0-hadoop1.0.4.jar com.google.common.base.Optional public abstract <V extends java/lang/Object> com.google.common.base.Optional<V> transform(org.spark-project.guava.common.base.Function<? super T, V>); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5261) In some cases, the value of a word's vector representation is too big
[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292823#comment-14292823 ] Guoqiang Li commented on SPARK-5261: [~lewuathe] {code}
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
  -e 's/"/ " /g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9 " "
}

wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2013.en.shuffled > data.txt
{code} In some cases, the value of a word's vector representation is too big --- Key: SPARK-5261 URL: https://issues.apache.org/jira/browse/SPARK-5261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Guoqiang Li {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36) {code} The average absolute value of the word's vector representation is 60731.8 {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(1) {code} The average absolute value of the word's vector representation is 0.13889 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
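For reproduction purposes, one plausible way to compute the "average absolute value" metric quoted above (getVectors is the MLlib Word2VecModel accessor returning Map[String, Array[Float]]; the metric definition itself is my reading of the report):

{code}
import org.apache.spark.mllib.feature.Word2VecModel

// Mean of |component| over every component of every word vector.
def meanAbsValue(model: Word2VecModel): Double = {
  val components = model.getVectors.values.flatten.map(v => math.abs(v.toDouble))
  components.sum / components.size
}
{code}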
[jira] [Commented] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292896#comment-14292896 ] Saisai Shao commented on SPARK-5206: IMHO this is a general problem in Spark Streaming: any variable that must be registered on both the driver and executor sides will lead to an error when recovering from failure if its readObject does not re-register it on the driver. Objects like broadcast variables will also hit exceptions when recovering from a checkpoint, since the actual data is lost on the executor side, and recovery from the driver side is not possible if I understand correctly. Accumulators are not re-registered during recovering from checkpoint Key: SPARK-5206 URL: https://issues.apache.org/jira/browse/SPARK-5206 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: vincent ye I got the following exception while my streaming application restarts from a crash using a checkpoint: 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 4) java.util.NoSuchElementException: key not found: 1 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) I guess that an Accumulator is registered to the singleton Accumulators in line 58 of org.apache.spark.Accumulable: Accumulators.register(this, true). This code needs to be executed in the driver once, but when the application is recovered from a checkpoint it won't be executed in the driver. So when the driver processes it at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938), it can't find the Accumulator because it wasn't re-registered during the recovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
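A hedged sketch of the failure pattern being described (standard streaming-checkpoint APIs; the accumulator usage and paths are illustrative):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AccumulatorRecovery {
  def createContext(dir: String): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("acc"), Seconds(10))
    ssc.checkpoint(dir)
    // The accumulator registers itself with the driver-side singleton
    // here, and only here.
    val counter = ssc.sparkContext.accumulator(0L, "records")
    // ... build DStreams whose tasks do counter += 1 ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // After a crash, getOrCreate restores the graph from the checkpoint and
    // never calls createContext, so tasks deserialize the accumulator but
    // the driver never re-registers it -- the mismatch described above.
    val ssc = StreamingContext.getOrCreate("/tmp/ckpt", () => createContext("/tmp/ckpt"))
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}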
[jira] [Comment Edited] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292926#comment-14292926 ] Joseph Tang edited comment on SPARK-4846 at 1/27/15 3:42 AM: - I've added some code at https://github.com/jinntrance/spark/compare/w2v-fix?diff=split&name=w2v-fix If it's OK, I will send a new PR to the `master` branch. BTW, sorry for the poor readability of the diff caused by the space indentation. was (Author: josephtang): I've added some code at https://github.com/jinntrance/spark/compare/w2v-fix?diff=split&name=w2v-fix If it's OK, I would send a new PR to the branch `master`. When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5417) Remove redundant executor-ID set() call
[ https://issues.apache.org/jira/browse/SPARK-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292790#comment-14292790 ] Apache Spark commented on SPARK-5417: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4213 Remove redundant executor-ID set() call --- Key: SPARK-5417 URL: https://issues.apache.org/jira/browse/SPARK-5417 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor {{spark.executor.id}} no longer [needs to be set in Executor.scala|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L79], as of [#4194|https://github.com/apache/spark/pull/4194]; it is set upstream in [SparkEnv|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/SparkEnv.scala#L332]. Might as well remove the redundant set() in Executor.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5419) Fix the logic in Vectors.sqdist
[ https://issues.apache.org/jira/browse/SPARK-5419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292898#comment-14292898 ] Apache Spark commented on SPARK-5419: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4217 Fix the logic in Vectors.sqdist --- Key: SPARK-5419 URL: https://issues.apache.org/jira/browse/SPARK-5419 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Liang-Chi Hsieh The current implementation of sqdist tries to convert sparse vectors to dense if they are close to dense. This is not efficient because we need to allocate temp arrays. We should simply implement sqdist without allocating new memory. The current implementation also contains a bug in deciding whether to convert a sparse vector to dense: {code} v1.indices.length / v1.size < 0.5 {code} which should get removed with the changes described above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
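For intuition, a minimal allocation-free squared distance between a sparse and a dense vector of the same length (illustrative only, not the MLlib implementation; assumes the sparse indices are sorted ascending):

{code}
// Squared Euclidean distance without materializing a dense copy of the
// sparse side: walk the dense coordinates once, advancing a cursor into
// the sparse representation as its indices come up.
def sqdist(indices: Array[Int], values: Array[Double], dense: Array[Double]): Double = {
  var sum = 0.0
  var k = 0 // cursor into the sparse (indices, values) arrays
  var i = 0
  while (i < dense.length) {
    val sparseVal =
      if (k < indices.length && indices(k) == i) { val v = values(k); k += 1; v }
      else 0.0
    val diff = sparseVal - dense(i)
    sum += diff * diff
    i += 1
  }
  sum
}
{code}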
[jira] [Created] (SPARK-5422) Support sending to Graphite via UDP
Ryan Williams created SPARK-5422: Summary: Support sending to Graphite via UDP Key: SPARK-5422 URL: https://issues.apache.org/jira/browse/SPARK-5422 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor {{io.dropwizard.metrics-graphite}} version {{3.1.0}} can send metrics to Graphite via UDP or TCP. After upgrading ([SPARK-5413|https://issues.apache.org/jira/browse/SPARK-5413]), we should support using this facility, presumably specified via a protocol field in the metrics config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
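For context, the library-level facility in question (class names are from the metrics-graphite 3.1.0 API linked above; how Spark's GraphiteSink would surface the protocol choice is exactly what this ticket leaves open):

{code}
import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.graphite.{GraphiteReporter, GraphiteUDP}

object UdpMetricsSketch {
  def main(args: Array[String]): Unit = {
    val registry = new MetricRegistry()
    // UDP sender: fire-and-forget datagrams instead of a TCP connection.
    val graphite = new GraphiteUDP(new InetSocketAddress("graphite.example.com", 2003))
    val reporter = GraphiteReporter.forRegistry(registry)
      .convertRatesTo(TimeUnit.SECONDS)
      .convertDurationsTo(TimeUnit.MILLISECONDS)
      .build(graphite)
    reporter.start(10, TimeUnit.SECONDS)
  }
}
{code}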
[jira] [Updated] (SPARK-5384) Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths
[ https://issues.apache.org/jira/browse/SPARK-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5384: - Priority: Minor (was: Critical) Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths -- Key: SPARK-5384 URL: https://issues.apache.org/jira/browse/SPARK-5384 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.1 Environment: centos, others should be similar Reporter: yuhao yang Assignee: yuhao yang Priority: Minor Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h For two vectors of different lengths, Vectors.sqdist returns different results depending on whether the vectors are represented as sparse or dense. Sample: {code} val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0)) val s2 = new SparseVector(1, Array(0), Array(9.0)) val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0)) val d2 = new DenseVector(Array(9.0)) println(s1 == d1 && s2 == d2) println(Vectors.sqdist(s1, s2)) println(Vectors.sqdist(d1, d2)) {code} result: true 93.0 64.0 More precisely, for the extra part, Vectors.sqdist includes it for sparse vectors and excludes it for dense vectors. I'll send a PR and we can have more detailed discussion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3439) Add Canopy Clustering Algorithm
[ https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292688#comment-14292688 ] Xiangrui Meng commented on SPARK-3439: -- [~angellandros] The public API and the complexity analysis are more important than implementation details. What is the expected input and output of the algorithm and how does it scale? Add Canopy Clustering Algorithm --- Key: SPARK-3439 URL: https://issues.apache.org/jira/browse/SPARK-3439 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa Assignee: Muhammad-Ali A'rabi Priority: Minor The canopy clustering algorithm is an unsupervised pre-clustering algorithm. It is often used as a preprocessing step for the K-means algorithm or the Hierarchical clustering algorithm. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
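As a strawman answer to the input/output question, a possible API shape in MLlib's idiom (every name and parameter below is hypothetical, not a committed design):

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical model: canopy centers plus membership lookup.
class CanopyModel(val centers: Array[Vector]) extends Serializable {
  /** Indices of canopies whose center lies within distance t1 of the point. */
  def predict(point: Vector, t1: Double): Seq[Int] = ???
}

// Hypothetical estimator: input RDD[Vector], output CanopyModel.
class Canopy(var t1: Double, var t2: Double) {
  require(t1 > t2, "loose threshold t1 must exceed tight threshold t2")
  def run(data: RDD[Vector]): CanopyModel = ???
}
{code}

A single pass with distance checks against the current canopy set scales as O(N x #canopies), which is presumably the kind of complexity story the comment asks for.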
[jira] [Updated] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4587: - Assignee: Joseph K. Bradley Model export/import --- Key: SPARK-4587 URL: https://issues.apache.org/jira/browse/SPARK-4587 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Priority: Critical This is an umbrella JIRA for one of the most requested features on the user mailing list. Model export/import can be done via Java serialization, but that doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we should provide save/load methods for every model. PMML is an option, but it has its limitations. There are a couple of things we need to discuss: 1) data format, 2) how to preserve partitioning, 3) data compatibility between versions and language APIs, etc. UPDATE: [Design doc for model import/export | https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] This document sketches machine learning model import/export plans, including goals, an API, and development plans. The design doc proposes: * Support our own Spark-specific format. ** This is needed to (a) support distributed models and (b) get model import/export support into Spark quickly (while avoiding new dependencies). * Also support PMML. ** This is needed since it is the only thing approaching an industry standard. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
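A minimal sketch of the Spark-specific save/load surface the design doc argues for (trait and method names are my assumption, not a committed API):

{code}
import org.apache.spark.SparkContext

trait Saveable {
  /** Persist this model under `path`, including any distributed parts. */
  def save(sc: SparkContext, path: String): Unit
}

trait Loader[M] {
  /** Reload a model previously written by `Saveable.save`. */
  def load(sc: SparkContext, path: String): M
}
{code}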
[jira] [Updated] (SPARK-1856) Standardize MLlib interfaces
[ https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1856: - Target Version/s: (was: 1.3.0) Standardize MLlib interfaces Key: SPARK-1856 URL: https://issues.apache.org/jira/browse/SPARK-1856 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Blocker Instead of expanding MLlib based on the current class naming scheme (ProblemWithAlgorithm), we should standardize MLlib's interfaces that clearly separate datasets, formulations, algorithms, parameter sets, and models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1856) Standardize MLlib interfaces
[ https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1856: - Priority: Critical (was: Blocker) Standardize MLlib interfaces Key: SPARK-1856 URL: https://issues.apache.org/jira/browse/SPARK-1856 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Instead of expanding MLlib based on the current class naming scheme (ProblemWithAlgorithm), we should standardize MLlib's interfaces that clearly separate datasets, formulations, algorithms, parameter sets, and models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1486: - Target Version/s: (was: 1.3.0) Support multi-model training in MLlib - Key: SPARK-1486 URL: https://issues.apache.org/jira/browse/SPARK-1486 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz Priority: Critical It is rare in practice to train just one model with a given set of parameters. Usually, this is done by training multiple models with different sets of parameters and then select the best based on their performance on the validation set. MLlib should provide native support for multi-model training/scoring. It requires decoupling of concepts like problem, formulation, algorithm, parameter set, and model, which are missing in MLlib now. MLI implements similar concepts, which we can borrow. There are different approaches for multi-model training: 0) Keep one copy of the data, and train models one after another (or maybe in parallel, depending on the scheduler). 1) Keep one copy of the data, and train multiple models at the same time (similar to `runs` in KMeans). 2) Make multiple copies of the data (still stored distributively), and use more cores to distribute the work. 3) Collect the data, make the entire dataset available on workers, and train one or more models on each worker. Users should be able to choose which execution mode they want to use. Note that 3) could cover many use cases in practice when the training data is not huge, e.g., 1GB. This task will be divided into sub-tasks and this JIRA is created to discuss the design and track the overall progress. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3717: - Target Version/s: (was: 1.3.0) DecisionTree, RandomForest: Partition by feature Key: SPARK-3717 URL: https://issues.apache.org/jira/browse/SPARK-3717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Each worker stores: ** a subset of columns (i.e., a subset of features). If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column). ** all labels * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: * Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification. 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle data and train subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. Comparing Partitioning Methods Partitioning features cost < partitioning instances cost when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right-hand side) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
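The example's arithmetic, reproduced as a checkable snippet (8 is bytes per aggregated value, as in the ticket; paste into a Scala REPL):

{code}
def featureCost(levels: Int, m: Long, n: Long): Double =
  levels.toDouble * (m * 8 + n)        // D * (M * 8 + N)

def instanceCost(depth: Int, m: Long, b: Long, c: Long): Double =
  math.pow(2, depth) * m * b * c * 8   // 2^D * M * B * C * 8

// 2M instances, 3500 features, 100 bins, 5 classes, 6 levels (depth 5):
println(featureCost(6, 3500L, 2000000L))   // ~1.2e7
println(instanceCost(5, 3500L, 100L, 5L))  // ~4.5e8
{code}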
[jira] [Updated] (SPARK-5321) Add transpose() method to Matrix
[ https://issues.apache.org/jira/browse/SPARK-5321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5321: - Assignee: Burak Yavuz Add transpose() method to Matrix Key: SPARK-5321 URL: https://issues.apache.org/jira/browse/SPARK-5321 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Burak Yavuz Assignee: Burak Yavuz While we are working on BlockMatrix, it will be nice to add support for transposing matrices. .transpose() will just modify a private flag in local matrices. Operations that follow will be performed based on this flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
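A toy illustration of the flag-based transpose being described, using a hypothetical column-major matrix class (not the MLlib code): reads swap the row/column roles when the flag is set, so no data is ever copied.

{code}
// Column-major storage; a transposed view simply reads it as row-major.
class DenseMat(val numRows: Int, val numCols: Int, val values: Array[Double],
               val isTransposed: Boolean = false) {
  def apply(i: Int, j: Int): Double =
    if (!isTransposed) values(j * numRows + i)
    else values(i * numCols + j) // here numCols is the pre-transpose numRows

  def transpose: DenseMat =
    new DenseMat(numCols, numRows, values, !isTransposed) // O(1), no copy
}
{code}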
[jira] [Updated] (SPARK-5114) Should Evaluator be a PipelineStage
[ https://issues.apache.org/jira/browse/SPARK-5114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5114: - Summary: Should Evaluator be a PipelineStage (was: Should Evaluator by a PipelineStage) Should Evaluator be a PipelineStage --- Key: SPARK-5114 URL: https://issues.apache.org/jira/browse/SPARK-5114 Project: Spark Issue Type: Question Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Pipelines can currently contain Estimators and Transformers. Question for debate: Should Pipelines be able to contain Evaluators? Pros: * Evaluators take input datasets with particular schema, which should perhaps be checked before running a Pipeline. Cons: * Evaluators do not transform datasets. They produce a scalar (or a few values), which makes it hard to say how they fit into a Pipeline or a PipelineModel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5094) Python API for gradient-boosted trees
[ https://issues.apache.org/jira/browse/SPARK-5094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5094: - Assignee: Kazuki Taniguchi Python API for gradient-boosted trees - Key: SPARK-5094 URL: https://issues.apache.org/jira/browse/SPARK-5094 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Kazuki Taniguchi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5413) Upgrade metrics dependency to 3.1.0
[ https://issues.apache.org/jira/browse/SPARK-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-5413: - Description: Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like [batching metrics in TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and [supporting Graphite's UDP interface|https://github.com/dropwizard/metrics/blob/v3.1.0/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteUDP.java]. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. was: Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like [batching metrics in TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. Upgrade metrics dependency to 3.1.0 - Key: SPARK-5413 URL: https://issues.apache.org/jira/browse/SPARK-5413 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like [batching metrics in TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and [supporting Graphite's UDP interface|https://github.com/dropwizard/metrics/blob/v3.1.0/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteUDP.java]. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5413) Upgrade metrics dependency to 3.1.0
[ https://issues.apache.org/jira/browse/SPARK-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-5413: - Description: Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like [batching metrics in TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. was: Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like batching metrics in TCP and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. Upgrade metrics dependency to 3.1.0 - Key: SPARK-5413 URL: https://issues.apache.org/jira/browse/SPARK-5413 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like [batching metrics in TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5413) Upgrade metrics dependency to 3.1.0
Ryan Williams created SPARK-5413: Summary: Upgrade metrics dependency to 3.1.0 Key: SPARK-5413 URL: https://issues.apache.org/jira/browse/SPARK-5413 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like batching metrics in TCP and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292718#comment-14292718 ] Xiangrui Meng commented on SPARK-4846: -- [~josephtang] Are you working on this issue? If not, do you mind me sending a PR that throws an exception if vectorSize is beyond the limit? When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5414) Add SparkListener implementation that allows users to receive all listener events in one method
[ https://issues.apache.org/jira/browse/SPARK-5414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5414: -- Component/s: Spark Core Add SparkListener implementation that allows users to receive all listener events in one method --- Key: SPARK-5414 URL: https://issues.apache.org/jira/browse/SPARK-5414 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Currently, users don't have a very good way to write a SparkListener that receives all SparkListener events and which will be future-compatible (e.g. it will receive events introduced in newer versions of Spark without having to override new methods to process those events). Therefore, I think Spark should include a concrete SparkListener implementation that implements all of the message-handling methods and dispatches all of them to a single {{onEvent}} method. By putting this code in Spark, we isolate users from changes to the listener API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
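A hedged sketch of such an adapter (a few representative callbacks only; the real SparkListener has many more, and the concrete class Spark eventually ships may differ):

{code}
import org.apache.spark.scheduler._

// Every callback funnels into one method, so user code written against
// onEvent keeps working when new event types (and callbacks) appear.
abstract class FirehoseListener extends SparkListener {
  def onEvent(event: Any): Unit

  override def onStageCompleted(e: SparkListenerStageCompleted): Unit = onEvent(e)
  override def onTaskEnd(e: SparkListenerTaskEnd): Unit = onEvent(e)
  override def onJobStart(e: SparkListenerJobStart): Unit = onEvent(e)
  // ... one forwarding override per SparkListener callback ...
}
{code}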
[jira] [Created] (SPARK-5415) Upgrade sbt to 0.13.7
Ryan Williams created SPARK-5415: Summary: Upgrade sbt to 0.13.7 Key: SPARK-5415 URL: https://issues.apache.org/jira/browse/SPARK-5415 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor Spark currently uses sbt {{0.13.6}}, which has a regression related to processing parent POMs in Maven projects. {{0.13.7}} does not have this issue (though it's unclear whether it was fixed intentionally), so I'd like to bump up one version. I ran into this while locally building a Spark assembly against a locally built metrics JAR dependency; {{0.13.6}} could not build Spark but {{0.13.7}} worked fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5261) In some cases, the value of a word's vector representation is too big
[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292750#comment-14292750 ] Kai Sasaki commented on SPARK-5261: --- [~gq] Could you provide us with the data set? I tried several numbers of partitions but could not reproduce it. In some cases, the value of a word's vector representation is too big --- Key: SPARK-5261 URL: https://issues.apache.org/jira/browse/SPARK-5261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Guoqiang Li {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36) {code} The average absolute value of the word's vector representation is 60731.8 {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(1) {code} The average absolute value of the word's vector representation is 0.13889 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5409) Broken link in documentation
[ https://issues.apache.org/jira/browse/SPARK-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5409. -- Resolution: Duplicate Actually this was already fixed Broken link in documentation Key: SPARK-5409 URL: https://issues.apache.org/jira/browse/SPARK-5409 Project: Spark Issue Type: Documentation Reporter: Mauro Pirrone Priority: Minor https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html See the API docs and the example. Link to example is broken. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292769#comment-14292769 ] Mark Khaitman commented on SPARK-5395: -- This may prove to be useful... I'm watching a currently running spark-submitted job while watching the pyspark.daemon processes. The framework is permitted to use only 8 cores on each node, with the default Python worker memory of 512 MB per node (not the executor memory, which is set higher than this). Ignoring the exact RDD actions for a moment, it looks like, while transitioning from Stage 1 to Stage 2, it spawned 8-10 additional pyspark.daemon processes, making the box use more cores than it was even allowed to... A few seconds after that, the other 8 processes entered a sleeping state while still holding onto the physical memory they ate up in Stage 1. As soon as Stage 2 finished, practically all of the pyspark.daemons vanished and freed up the memory. I was keeping an eye on 2 random nodes and the exact same thing occurred on both. It was also the only executing job at the time, so there was really no other interference/contention for resources. I will try to provide a bit more detail on the exact transformations/actions occurring between the 2 stages, although I know at the very least a partitionBy and a cogroup are occurring, without having inspected the spark-submitted code directly. Large number of Python workers causing resource depletion - Key: SPARK-5395 URL: https://issues.apache.org/jira/browse/SPARK-5395 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: AWS ElasticMapReduce Reporter: Sven Krasser During job execution a large number of Python workers accumulates, eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container). In this instance, 97 pyspark.daemon processes had accumulated at the time the container was killed. {noformat} 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container. Dump of the process-tree for container_1421692415636_0052_01_30 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon [...] {noformat} The configuration uses 64 containers with 2 cores each. Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c Mailing list discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5419) Fix the logic in Vectors.sqdist
Xiangrui Meng created SPARK-5419: Summary: Fix the logic in Vectors.sqdist Key: SPARK-5419 URL: https://issues.apache.org/jira/browse/SPARK-5419 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Liang-Chi Hsieh The current implementation of sqdist tries to convert sparse vectors to dense if they are close to dense. This is not efficient because we need to allocate temp arrays. We should simply implement sqdist without allocating new memory. The current implementation also contains a bug in deciding whether to convert a sparse vector to dense: {code} v1.indices.length / v1.size < 0.5 {code} which should get removed with the changes described above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
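(Note that with Int operands the quotient above is 0 whenever the vector has any inactive entries, so the guard is effectively always true; that appears to be the bug being referenced.) For illustration, a minimal sketch of the merge-pass approach the description calls for, computing the squared distance directly over the two sorted index arrays without allocating dense copies. This is a sketch of the idea, not the eventual MLlib patch:

{code}
// Assumes sorted, deduplicated index arrays, as in MLlib's SparseVector.
def sqdistSparse(i1: Array[Int], v1: Array[Double],
                 i2: Array[Int], v2: Array[Double]): Double = {
  var p1 = 0
  var p2 = 0
  var sum = 0.0
  while (p1 < i1.length && p2 < i2.length) {
    if (i1(p1) == i2(p2)) {         // index active in both vectors
      val d = v1(p1) - v2(p2); sum += d * d; p1 += 1; p2 += 1
    } else if (i1(p1) < i2(p2)) {   // active only in the first vector
      sum += v1(p1) * v1(p1); p1 += 1
    } else {                        // active only in the second vector
      sum += v2(p2) * v2(p2); p2 += 1
    }
  }
  while (p1 < i1.length) { sum += v1(p1) * v1(p1); p1 += 1 }
  while (p2 < i2.length) { sum += v2(p2) * v2(p2); p2 += 1 }
  sum
}
{code}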
[jira] [Created] (SPARK-5421) SparkSql throws OOM at shuffle
Hong Shen created SPARK-5421: Summary: SparkSql throws OOM at shuffle Key: SPARK-5421 URL: https://issues.apache.org/jira/browse/SPARK-5421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap is used only for Spark jobs whose aggregator isDefined, but Spark SQL's ShuffledRDD doesn't define an aggregator, so Spark SQL won't spill at shuffle, making it very easy to throw an OOM at shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
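Until spilling is wired into that path, one possible workaround (a sketch, not a fix for the underlying issue) is to shrink each reduce partition so it fits in memory by raising the shuffle partition count; the value below is illustrative and the right number depends on data volume:

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
// More partitions means less shuffled data per partition, lowering OOM risk
// on the non-spilling SQL shuffle path described above. The default is 200.
sqlContext.setConf("spark.sql.shuffle.partitions", "800")
{code}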
[jira] [Resolved] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3726. -- Resolution: Fixed Fix Version/s: (was: 1.2.0) 1.3.0 Issue resolved by pull request 4073 [https://github.com/apache/spark/pull/4073] RandomForest: Support for bootstrap options --- Key: SPARK-3726 URL: https://issues.apache.org/jira/browse/SPARK-3726 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Manoj Kumar Priority: Minor Fix For: 1.3.0 RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. The expected size of each sample is the same as the original data (sampling rate = 1.0), and sampling is done with replacement. Adding support for other sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
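A sketch of what generalized bagging weights could look like: with replacement, a point's multiplicity in an expected-size rate*n bootstrap sample can be drawn from Poisson(rate); without replacement, from Bernoulli(rate). This illustrates the sampling math under those assumptions, not the merged patch itself:

{code}
import org.apache.commons.math3.distribution.PoissonDistribution
import scala.util.Random

// Per-point sample weights for one bagged sample over `numPoints` points.
def sampleWeights(numPoints: Int, subsamplingRate: Double,
                  withReplacement: Boolean, seed: Long): Array[Double] = {
  if (withReplacement) {
    val poisson = new PoissonDistribution(subsamplingRate)
    poisson.reseedRandomGenerator(seed)
    Array.fill(numPoints)(poisson.sample().toDouble)  // multiplicity of each point
  } else {
    val rng = new Random(seed)
    Array.fill(numPoints)(if (rng.nextDouble() < subsamplingRate) 1.0 else 0.0)
  }
}
{code}

The original bootstrap behavior falls out as the special case subsamplingRate = 1.0 with replacement.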
[jira] [Commented] (SPARK-5267) Add a streaming module to ingest Apache Camel Messages from configured endpoints
[ https://issues.apache.org/jira/browse/SPARK-5267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293043#comment-14293043 ] Tathagata Das commented on SPARK-5267: -- Hey, this is a great initiative! However, we are trying to limit the number of external dependencies in the Spark umbrella project for better manageability. So while we want to add more functionality and integration with other systems, we are very cautious about adding these dependencies into the Spark project. A better place for such contributions is http://spark-packages.org/, where people contribute such functionality and maintain it on their own. I strongly encourage you to add your Camel integration to spark-packages. Add a streaming module to ingest Apache Camel Messages from configured endpoints -- Key: SPARK-5267 URL: https://issues.apache.org/jira/browse/SPARK-5267 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Steve Brewin Labels: features Original Estimate: 120h Remaining Estimate: 120h The number of input stream protocols supported by Spark Streaming is quite limited, which constrains the number of systems with which it can be integrated. This proposal solves the problem by adding an optional module that integrates Apache Camel, which supports many additional input protocols. Our tried and tested implementation of this proposal is spark-streaming-camel. An Apache Camel service is run on a separate thread, consuming each Message (http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html) and storing it into Spark's memory. The provider of the Message is specified by any consuming component URI documented at http://camel.apache.org/components.html, making all of these protocols available to Spark Streaming. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
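For readers unfamiliar with the proposal's shape, here is a hypothetical sketch of such an integration using Spark Streaming's public Receiver API and Camel's core API. The class name and endpoint URI are illustrative; the actual spark-streaming-camel implementation is not shown in this ticket:

{code}
import org.apache.camel.{Exchange, Processor}
import org.apache.camel.builder.RouteBuilder
import org.apache.camel.impl.DefaultCamelContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Runs a Camel context alongside the receiver and stores each message body
// into Spark's memory, as the proposal describes.
class CamelReceiver(endpointUri: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  @transient private var camel: DefaultCamelContext = _

  def onStart(): Unit = {
    camel = new DefaultCamelContext()
    camel.addRoutes(new RouteBuilder {
      def configure(): Unit =
        from(endpointUri).process(new Processor {
          def process(exchange: Exchange): Unit =
            store(exchange.getIn.getBody(classOf[String]))  // hand message to Spark
        })
    })
    camel.start()  // Camel manages its own consumer threads
  }

  def onStop(): Unit = if (camel != null) camel.stop()
}
{code}

Wired up with something like ssc.receiverStream(new CamelReceiver("jetty:http://0.0.0.0:8080/inbox")), any Camel consumer component URI would become a DStream source.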
[jira] [Commented] (SPARK-4964) Exactly-once + WAL-free Kafka Support in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293140#comment-14293140 ] Tathagata Das commented on SPARK-4964: -- [~dibbhatt][~jerryshao][~hshreedharan][~c...@koeninger.org] Please take a look at the design doc and comment on it. Thank you very much! Exactly-once + WAL-free Kafka Support in Spark Streaming Key: SPARK-4964 URL: https://issues.apache.org/jira/browse/SPARK-4964 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Cody Koeninger There are two issues with the current Kafka support - Use of Write Ahead Logs in Spark Streaming to ensure no data is lost - Causes data replication in both Kafka AND Spark Streaming. - Lack of exactly-once semantics - For background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html We want to solve both these problems in this JIRA. Please see the following design doc for the solution. https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit#heading=h.itproy77j3p -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4964) Exactly-once + WAL-free Kafka Support in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4964: - Summary: Exactly-once + WAL-free Kafka Support in Spark Streaming (was: Exactly-once semantics for Kafka) Exactly-once + WAL-free Kafka Support in Spark Streaming Key: SPARK-4964 URL: https://issues.apache.org/jira/browse/SPARK-4964 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Cody Koeninger For background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html Requirements: - allow client code to implement exactly-once end-to-end semantics for Kafka messages, in cases where their output storage is either idempotent or transactional - allow client code access to Kafka offsets, rather than automatically committing them - do not assume Zookeeper as a repository for offsets (for the transactional case, offsets need to be stored in the same store as the data) - allow failure recovery without lost or duplicated messages, even in cases where a checkpoint cannot be restored (for instance, because code must be updated) Design: The basic idea is to make an RDD where each partition corresponds to a given Kafka topic, partition, starting offset, and ending offset. That allows for deterministic replay of data from Kafka (as long as there is enough log retention). Client code is responsible for committing offsets, either transactionally to the same store that data is being written to, or in the case of idempotent data, after data has been written. A PR of a sample implementation for both the batch and dstream cases is forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
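A conceptual sketch of that design (the names below are hypothetical, not the eventual Spark API): pin each RDD partition to an exact offset range so its contents are deterministic and replayable, then commit the partition's results and its ending offset in one transaction:

{code}
case class OffsetRange(topic: String, partition: Int,
                       fromOffset: Long, untilOffset: Long)

// Transactional output pattern: results and offsets land atomically, so a
// replayed partition either finds its offsets already committed (skip) or
// redoes both writes together. Table and column names are illustrative.
def saveExactlyOnce(results: Iterator[(String, Int)], range: OffsetRange,
                    db: java.sql.Connection): Unit = {
  db.setAutoCommit(false)
  try {
    val insert = db.prepareStatement("INSERT INTO word_counts(word, cnt) VALUES (?, ?)")
    results.foreach { case (word, cnt) =>
      insert.setString(1, word); insert.setInt(2, cnt); insert.executeUpdate()
    }
    val offsets = db.prepareStatement(
      "UPDATE kafka_offsets SET until_offset = ? WHERE topic = ? AND part = ?")
    offsets.setLong(1, range.untilOffset)
    offsets.setString(2, range.topic)
    offsets.setInt(3, range.partition)
    offsets.executeUpdate()
    db.commit()  // data and offsets become visible atomically
  } catch {
    case e: Exception => db.rollback(); throw e
  }
}
{code}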
[jira] [Commented] (SPARK-4964) Exactly-once semantics for Kafka
[ https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293129#comment-14293129 ] Tathagata Das commented on SPARK-4964: -- I am renaming this JIRA to Native Kafka Support because there are two problems that we are trying to solve, which get solved by the associated PR. Exactly-once semantics for Kafka Key: SPARK-4964 URL: https://issues.apache.org/jira/browse/SPARK-4964 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Cody Koeninger For background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html Requirements: - allow client code to implement exactly-once end-to-end semantics for Kafka messages, in cases where their output storage is either idempotent or transactional - allow client code access to Kafka offsets, rather than automatically committing them - do not assume Zookeeper as a repository for offsets (for the transactional case, offsets need to be stored in the same store as the data) - allow failure recovery without lost or duplicated messages, even in cases where a checkpoint cannot be restored (for instance, because code must be updated) Design: The basic idea is to make an RDD where each partition corresponds to a given Kafka topic, partition, starting offset, and ending offset. That allows for deterministic replay of data from Kafka (as long as there is enough log retention). Client code is responsible for committing offsets, either transactionally to the same store that data is being written to, or in the case of idempotent data, after data has been written. A PR of a sample implementation for both the batch and dstream cases is forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4964) Exactly-once + WAL-free Kafka Support in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4964: - Description: There are two issues with the current Kafka support - Use of Write Ahead Logs in Spark Streaming to ensure no data is lost - Causes data replication in both Kafka AND Spark Streaming. - Lack of exactly-once semantics - For background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html We want to solve both these problems in this JIRA. Please see the following design doc for the solution. https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit#heading=h.itproy77j3p was: for background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html There ar Exactly-once + WAL-free Kafka Support in Spark Streaming Key: SPARK-4964 URL: https://issues.apache.org/jira/browse/SPARK-4964 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Cody Koeninger There are two issues with the current Kafka support - Use of Write Ahead Logs in Spark Streaming to ensure no data is lost - Causes data replication in both Kafka AND Spark Streaming. - Lack of exactly-once semantics - For background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html We want to solve both these problems in this JIRA. Please see the following design doc for the solution. https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit#heading=h.itproy77j3p -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org