[jira] [Assigned] (SPARK-7466) DAG visualization: orphaned nodes are not rendered correctly
[ https://issues.apache.org/jira/browse/SPARK-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7466: --- Assignee: Andrew Or (was: Apache Spark) DAG visualization: orphaned nodes are not rendered correctly Key: SPARK-7466 URL: https://issues.apache.org/jira/browse/SPARK-7466 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Attachments: after.png, before.png If you have an RDD instantiated outside of a scope, it is rendered as a weird badge outside of a stage. This is because we keep the edge but do not inform dagre-d3 of the node, resulting in the library rendering the node for us without the expected styles and labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7466) DAG visualization: orphaned nodes are not rendered correctly
[ https://issues.apache.org/jira/browse/SPARK-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7466: --- Assignee: Apache Spark (was: Andrew Or) DAG visualization: orphaned nodes are not rendered correctly Key: SPARK-7466 URL: https://issues.apache.org/jira/browse/SPARK-7466 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Apache Spark Priority: Critical Attachments: after.png, before.png If you have an RDD instantiated outside of a scope, it is rendered as a weird badge outside of a stage. This is because we keep the edge but do not inform dagre-d3 of the node, resulting in the library rendering the node for us without the expected styles and labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7466) DAG visualization: orphaned nodes are not rendered correctly
[ https://issues.apache.org/jira/browse/SPARK-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534013#comment-14534013 ] Apache Spark commented on SPARK-7466: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/6002 DAG visualization: orphaned nodes are not rendered correctly Key: SPARK-7466 URL: https://issues.apache.org/jira/browse/SPARK-7466 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Attachments: after.png, before.png If you have an RDD instantiated outside of a scope, it is rendered as a weird badge outside of a stage. This is because we keep the edge but do not inform dagre-d3 of the node, resulting in the library rendering the node for us without the expected styles and labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534078#comment-14534078 ] Tathagata Das commented on SPARK-6770: -- Was this problem solved? I think I discuss this explicitly in the Streaming guide here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations If this solves the issue, I am inclined to close this JIRA. Either way, this is not a problem with DirectKafkaInputDStream as the JIRA title seems to indicate. DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I read data from Kafka using the createDirectStream method and save the received logs to MySQL; the code snippet is as follows:
{code}
def functionToCreateContext(): StreamingContext = {
  val sparkConf = new SparkConf()
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(10))
  ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory
  ssc
}
val struct = StructType(StructField("log", StringType) :: Nil)
// Get StreamingContext from checkpoint data or create a new one
val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", functionToCreateContext)
val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
SDB.foreachRDD(rdd => {
  val result = rdd.map(item => {
    println(item)
    val result = item._2 match {
      case e: String => Row.apply(e)
      case _ => Row.apply()
    }
    result
  })
  println(result.count())
  val df = sqlContext.createDataFrame(result, struct)
  df.insertIntoJDBC(url, "test", overwrite = false)
})
ssc.start()
ssc.awaitTermination()
ssc.stop()
{code}
But when I recover the program from the checkpoint, I encounter an exception:
{code}
Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized
at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266)
at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218)
at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89)
at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512)
at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57)
at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
{code}
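For reference, a minimal sketch of the pattern the streaming-guide section linked above describes, under the same assumptions as the snippet in the report (kafkaParams, topics and a JDBC url defined elsewhere): all DStream setup moves inside the function passed to StreamingContext.getOrCreate so the checkpointed graph can be restored intact, and the SQLContext comes from a lazily created singleton rather than being captured from the driver scope. This is an illustration of the approach, not the reporter's exact code.
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Lazily instantiated singleton SQLContext, as in the SqlNetworkWordCount example.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sc: SparkContext): SQLContext = synchronized {
    if (instance == null) instance = new SQLContext(sc)
    instance
  }
}

// kafkaParams, topics and url are placeholders, as in the original snippet.
def createContext(kafkaParams: Map[String, String], topics: Set[String], url: String): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf(), Seconds(10))
  ssc.checkpoint("/tmp/kafka/channel/offset")
  val struct = StructType(StructField("log", StringType) :: Nil)
  // The direct stream is created *inside* this function, so it is part of the
  // DStream graph that gets checkpointed and later restored.
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
  stream.foreachRDD { rdd =>
    val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
    val rows = rdd.map { case (_, log) => Row(log) }
    sqlContext.createDataFrame(rows, struct).insertIntoJDBC(url, "test", false)
  }
  ssc
}

// Recovery: getOrCreate either restores the full graph from the checkpoint
// or builds a fresh one with createContext.
// val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset",
//   () => createContext(kafkaParams, topics, url))
// ssc.start(); ssc.awaitTermination()
{code}
With this structure, getOrCreate deserializes the whole DStream graph from the checkpoint on restart, so the direct Kafka stream is initialized before any jobs are generated.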
[jira] [Created] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
Steve Loughran created SPARK-7481: - Summary: Add Hadoop 2.6+ profile to pull in object store FS accessors Key: SPARK-7481 URL: https://issues.apache.org/jira/browse/SPARK-7481 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1 Reporter: Steve Loughran To keep the s3n classpath right and to add s3a, Swift and Azure support, the Spark dependencies in a Hadoop 2.6+ profile need to include the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure). This adds more to the client bundle, but it means a single Spark package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6091) Add MulticlassMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6091: - Target Version/s: 1.4.0 Add MulticlassMetrics in PySpark/MLlib -- Key: SPARK-6091 URL: https://issues.apache.org/jira/browse/SPARK-6091 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Yanbo Liang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6091) Add MulticlassMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6091: - Assignee: Yanbo Liang Add MulticlassMetrics in PySpark/MLlib -- Key: SPARK-6091 URL: https://issues.apache.org/jira/browse/SPARK-6091 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Yanbo Liang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6092) Add RankingMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6092: - Assignee: Yanbo Liang Add RankingMetrics in PySpark/MLlib --- Key: SPARK-6092 URL: https://issues.apache.org/jira/browse/SPARK-6092 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Yanbo Liang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6092) Add RankingMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6092: - Target Version/s: 1.4.0 Add RankingMetrics in PySpark/MLlib --- Key: SPARK-6092 URL: https://issues.apache.org/jira/browse/SPARK-6092 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Yanbo Liang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7478: - Description: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf was: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
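As a rough sketch of the intended semantics (an assumption about what such a helper could look like, not the actual implementation proposed in the pull request for this ticket), a getOrCreate of this kind can be as small as a lazily initialized holder keyed off a SparkContext:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Hypothetical sketch: return the existing singleton SQLContext if one has
// been created, otherwise create it from the given SparkContext.
object SQLContextHolder {
  @volatile private var instance: SQLContext = _

  def getOrCreate(sc: SparkContext): SQLContext = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = new SQLContext(sc)
        }
      }
    }
    instance
  }
}

// Re-running this line in a REPL, or calling it from a job recovered from a
// DStream checkpoint, always yields the same instance, so registered temp
// tables are not lost:
// val sqlContext = SQLContextHolder.getOrCreate(sc)
{code}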
[jira] [Assigned] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7478: --- Assignee: Tathagata Das (was: Apache Spark) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. Also to get around this problem I had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules
[ https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6889. -- Resolution: Fixed Fix Version/s: 1.4.0 Streamline contribution process with update to Contribution wiki, JIRA rules Key: SPARK-6889 URL: https://issues.apache.org/jira/browse/SPARK-6889 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Sean Owen Assignee: Sean Owen Fix For: 1.4.0 Attachments: ContributingtoSpark.pdf, SparkProjectMechanicsChallenges.pdf, faq.html.patch From about 6 months of intimate experience with the Spark JIRA and the reality of the JIRA / PR flow, I've observed some challenges, problems and growing pains that have begun to encumber the project mechanics. In the attached SparkProjectMechanicsChallenges.pdf document, I've collected these observations and a few statistics that summarize much of what I've seen. From side conversations with several of you, I think some of these will resonate. (Read it first for this to make sense.) I'd like to improve just one aspect to start: the contribution process. A lot of inbound contribution effort gets misdirected, and can burn a lot of cycles for everyone, and that's a barrier to scaling up further and to general happiness. I'd like to propose for discussion a change to the wiki pages, and a change to some JIRA settings. *Wiki* - Replace https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark with proposed text (NewContributingToSpark.pdf) - Delete https://cwiki.apache.org/confluence/display/SPARK/Reviewing+and+Merging+Patches as it is subsumed by the new text - Move the IDE Setup section to https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools - Delete https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme as it's a bit out of date and not all that useful *JIRA* Now: Start by removing everyone from the 'Developer' role and add them to 'Contributor'. Right now Developer has no permission that Contributor doesn't. We may reuse Developer later for some level between Committer and Contributor. Later, with Apache admin assistance: - Make Component and Affects Version required for new JIRAs - Set default priority to Minor and type to Question for new JIRAs. If defaults aren't changed, by default it can't be that important - Only let Committers set Target Version and Fix Version - Only let Committers set Blocker Priority -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534099#comment-14534099 ] Apache Spark commented on SPARK-7478: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/6006 Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. Also to get around this problem I had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7479) SparkR can not work
Weizhong created SPARK-7479: --- Summary: SparkR can not work Key: SPARK-7479 URL: https://issues.apache.org/jira/browse/SPARK-7479 Project: Spark Issue Type: Bug Components: SparkR Reporter: Weizhong Priority: Minor I have built Spark from the master branch and run SparkR, but it failed when I ran pi.R. The error is: Error: could not find function "parallelize" But if I qualify the function with its namespace, for example SparkR:::parallelize, then it works correctly. My cluster info: JDK: 1.8.0_40 Hadoop: 2.6.0 R: 3.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7479) SparkR can not work
[ https://issues.apache.org/jira/browse/SPARK-7479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7479. -- Resolution: Invalid This kind of thing should begin as a question at user@ as I suspect it is a basic problem with your env. SparkR can not work --- Key: SPARK-7479 URL: https://issues.apache.org/jira/browse/SPARK-7479 Project: Spark Issue Type: Bug Components: SparkR Reporter: Weizhong Priority: Minor I have built Spark from the master branch and run SparkR, but it failed when I ran pi.R. The error is: Error: could not find function "parallelize" But if I qualify the function with its namespace, for example SparkR:::parallelize, then it works correctly. My cluster info: JDK: 1.8.0_40 Hadoop: 2.6.0 R: 3.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534168#comment-14534168 ] Sean Owen commented on SPARK-7481: -- Yikes, that seems like a load of stuff to pull in. Can't this / shouldn't this be added by the end user if desired? Add Hadoop 2.6+ profile to pull in object store FS accessors Key: SPARK-7481 URL: https://issues.apache.org/jira/browse/SPARK-7481 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1 Reporter: Steve Loughran To keep the s3n classpath right, to add s3a, swift azure, the dependencies of spark in a 2.6+ profile need to add the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure) this adds more stuff to the client bundle, but will mean a single spark package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7478: - Description: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. was: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770](https://issues.apache.org/jira/browse/SPARK-6770) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5034) Spark on Yarn launch failure on HDInsight on Windows
[ https://issues.apache.org/jira/browse/SPARK-5034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5034. -- Resolution: Cannot Reproduce I don't know what to make of this without more info. I don't think it is a parsing or quoting issue as it just looks like the main class name is incorrect and overwritten in part by some host name. Spark on Yarn launch failure on HDInsight on Windows Key: SPARK-5034 URL: https://issues.apache.org/jira/browse/SPARK-5034 Project: Spark Issue Type: Bug Components: Windows, YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0 Environment: Spark on Yarn within HDInsight on Windows Azure Reporter: Rice Windows Environment I'm trying to run JavaSparkPi example on YARN with master = yarn-client but I have a problem. It runs smoothly with submitting application, first container for Application Master works too. When job is starting and there are some tasks to do I'm getting this warning on console (I'm using windows cmd if this makes any difference): WARN cluster.YarnClientClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory When I'm checking logs for container with Application Masters it is launching containers for executors properly, then goes with: INFO YarnAllocationHandler: Completed container container_1409217202587_0003_01_02 (state: COMPLETE, exit status: 1) INFO YarnAllocationHandler: Container marked as failed: container_1409217202587_0003_01_02 And tries to re-launch them. On failed container log there is only this: Error: Could not find or load main class pwd..sp...@gbv06758291.my.secret.address.net:63680.user.CoarseGrainedScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7478: - Description: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770](https://issues.apache.org/jira/browse/SPARK-6770) was: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770](https://issues.apache.org/jira/browse/SPARK-6770) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7478: - Description: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} was: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {val sqlContext = new SQLContext} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {SQLContext.getOrCreate} Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7478: --- Assignee: Apache Spark (was: Tathagata Das) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Apache Spark Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. Also to get around this problem I had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6876) DataFrame.na.replace value support for Python
[ https://issues.apache.org/jira/browse/SPARK-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534026#comment-14534026 ] Apache Spark commented on SPARK-6876: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6003 DataFrame.na.replace value support for Python - Key: SPARK-6876 URL: https://issues.apache.org/jira/browse/SPARK-6876 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Scala/Java support is in. We should provide the Python version, similar to what Pandas supports. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
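For context, a brief Scala sketch of the existing DataFrame.na.replace API that the Python version would mirror; the DataFrame and its column names and values below are hypothetical examples, not anything defined in this ticket.
{code}
import org.apache.spark.sql.DataFrame

// Replace sentinel values column-by-column, similar to pandas' DataFrame.replace.
// `df` and its columns ("name", "height", "weight") are made up for illustration.
def cleanUp(df: DataFrame): DataFrame = {
  val named = df.na.replace("name", Map("UNKNOWN" -> "unnamed"))
  named.na.replace(Seq("height", "weight"), Map(-1.0 -> Double.NaN))
}
{code}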
[jira] [Assigned] (SPARK-7467) DAG visualization: handle checkpoint correctly
[ https://issues.apache.org/jira/browse/SPARK-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7467: --- Assignee: Apache Spark (was: Andrew Or) DAG visualization: handle checkpoint correctly -- Key: SPARK-7467 URL: https://issues.apache.org/jira/browse/SPARK-7467 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Apache Spark We need to wrap RDD#doCheckpoint in a scope. Otherwise CheckpointRDDs may belong to other operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7467) DAG visualization: handle checkpoint correctly
[ https://issues.apache.org/jira/browse/SPARK-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7467: --- Assignee: Andrew Or (was: Apache Spark) DAG visualization: handle checkpoint correctly -- Key: SPARK-7467 URL: https://issues.apache.org/jira/browse/SPARK-7467 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or We need to wrap RDD#doCheckpoint in a scope. Otherwise CheckpointRDDs may belong to other operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7467) DAG visualization: handle checkpoint correctly
[ https://issues.apache.org/jira/browse/SPARK-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534032#comment-14534032 ] Apache Spark commented on SPARK-7467: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/6004 DAG visualization: handle checkpoint correctly -- Key: SPARK-7467 URL: https://issues.apache.org/jira/browse/SPARK-7467 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or We need to wrap RDD#doCheckpoint in a scope. Otherwise CheckpointRDDs may belong to other operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534094#comment-14534094 ] Tathagata Das edited comment on SPARK-7478 at 5/8/15 8:24 AM: -- [~rxin] [~marmbrus] Thoughts? was (Author: tdas): [~rxin][~marmbrus] Thoughts? Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534094#comment-14534094 ] Tathagata Das edited comment on SPARK-7478 at 5/8/15 8:24 AM: -- [~rxin][~marmbrus] Thoughts? was (Author: tdas): [~rxin] Thoughts? Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534094#comment-14534094 ] Tathagata Das commented on SPARK-7478: -- [~rxin] Thoughts? Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6876) DataFrame.na.replace value support for Python
[ https://issues.apache.org/jira/browse/SPARK-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6876: --- Assignee: Apache Spark DataFrame.na.replace value support for Python - Key: SPARK-6876 URL: https://issues.apache.org/jira/browse/SPARK-6876 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Scala/Java support is in. We should provide the Python version, similar to what Pandas supports. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6876) DataFrame.na.replace value support for Python
[ https://issues.apache.org/jira/browse/SPARK-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6876: --- Assignee: (was: Apache Spark) DataFrame.na.replace value support for Python - Key: SPARK-6876 URL: https://issues.apache.org/jira/browse/SPARK-6876 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Scala/Java support is in. We should provide the Python version, similar to what Pandas supports. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7231) Make SparkR DataFrame API more dplyr friendly
[ https://issues.apache.org/jira/browse/SPARK-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7231: --- Assignee: Apache Spark (was: Shivaram Venkataraman) Make SparkR DataFrame API more dplyr friendly - Key: SPARK-7231 URL: https://issues.apache.org/jira/browse/SPARK-7231 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 1.4.0 Reporter: Shivaram Venkataraman Assignee: Apache Spark Priority: Critical This ticket tracks auditing the SparkR dataframe API and ensuring that the API is friendly to existing R users. Mainly we wish to make sure the DataFrame API we expose has functions similar to those which exist on native R data frames and in popular packages like `dplyr`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7231) Make SparkR DataFrame API more dplyr friendly
[ https://issues.apache.org/jira/browse/SPARK-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534034#comment-14534034 ] Apache Spark commented on SPARK-7231: - User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/6005 Make SparkR DataFrame API more dplyr friendly - Key: SPARK-7231 URL: https://issues.apache.org/jira/browse/SPARK-7231 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 1.4.0 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Critical This ticket tracks auditing the SparkR dataframe API and ensuring that the API is friendly to existing R users. Mainly we wish to make sure the DataFrame API we expose has functions similar to those which exist on native R data frames and in popular packages like `dplyr`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1423) Add scripts for launching Spark on Windows Azure
[ https://issues.apache.org/jira/browse/SPARK-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1423. -- Resolution: Won't Fix Given the lack of activity and resolution of https://issues.apache.org/jira/browse/SPARK-1422 I think that's probably correct. Add scripts for launching Spark on Windows Azure Key: SPARK-1423 URL: https://issues.apache.org/jira/browse/SPARK-1423 Project: Spark Issue Type: Improvement Components: Windows Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7392) Kryo buffer size can not be larger than 2M
[ https://issues.apache.org/jira/browse/SPARK-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7392. -- Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Zhang, Liye Resolved by https://github.com/apache/spark/pull/5934 Kryo buffer size can not be larger than 2M -- Key: SPARK-7392 URL: https://issues.apache.org/jira/browse/SPARK-7392 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Zhang, Liye Assignee: Zhang, Liye Priority: Critical Fix For: 1.4.0 When *spark.kryoserializer.buffer* is set larger than 2048k, an *IllegalArgumentException* is thrown. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
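For illustration, the configuration involved is the Kryo buffer size; with this fix a value above the old 2048k ceiling is accepted (the concrete value below is an arbitrary example, not from the ticket):
{code}
import org.apache.spark.SparkConf

// Before the fix, any Kryo buffer larger than 2048k threw an
// IllegalArgumentException; "4m" here is just an example value above that limit.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "4m")
{code}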
[jira] [Created] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
Tathagata Das created SPARK-7478: Summary: Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7478: - Description: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. Also to get around this problem I had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf was: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. Also to get around this problem I had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7480) Get exception when DataFrame saveAsTable and run sql on the same table at the same time
pin_zhang created SPARK-7480: Summary: Get exception when DataFrame saveAsTable and run sql on the same table at the same time Key: SPARK-7480 URL: https://issues.apache.org/jira/browse/SPARK-7480 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.3.0 Reporter: pin_zhang There is a case: 1) In the main thread, call DataFrame.saveAsTable(table, SaveMode.Overwrite) to save a JSON RDD to a Hive table. 2) In another thread, run SQL against the same table simultaneously. You can see many exceptions indicating that the table does not exist or is not complete. Does Spark SQL support such usage? Thanks
{code}
[Main Thread]
DataFrame df = hiveContext_.jsonFile("test.json");
String table = "UNIT_TEST";
while (true) {
  df = hiveContext_.jsonFile("test.json");
  df.saveAsTable(table, SaveMode.Overwrite);
  System.out.println(new Timestamp(System.currentTimeMillis()) + " [" + Thread.currentThread().getName() + "] override table");
  try {
    Thread.sleep(3000);
  } catch (InterruptedException e) {
    e.printStackTrace();
  }
}

[Query Thread]
DataFrame query = hiveContext_.sql("select * from UNIT_TEST");
Row[] rows = query.collect();
System.out.println(new Timestamp(System.currentTimeMillis()) + " [" + Thread.currentThread().getName() + "] [query result count:] " + rows.length);
{code}
[Exceptions in log]
{code}
15/05/08 16:05:49 ERROR Hive: NoSuchObjectException(message:default.unit_test table not found)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1560)
at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
at com.sun.proxy.$Proxy20.get_table(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997)
at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at com.sun.proxy.$Proxy21.getTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:201)
at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:262)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:161)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:161)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)
at org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:262)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:174)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:186)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:181)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:208)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
{code}
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534174#comment-14534174 ] Steve Loughran commented on SPARK-7481: --- This doesn't contain any endorsement of the use of s3a in Hadoop 2.6; see HADOOP-11571. I'm not planning to add any tests for this, but it's something to consider for regression testing all the object stores; the tests just need to: * be skipped if there are no credentials * make a best effort to stop anyone accidentally checking in their credentials * work on desktop/jenkins rather than just on cloud. * not run up massive bills * not take forever (see the sketch after this message) AWS publishes some free-to-read datasets, such as [this one|http://datasets.elasticmapreduce.s3.amazonaws.com/], which doesn't need credentials, works remotely, and doesn't ring up bills for the read part of the process, but would take a long time to complete on a single executor. Add Hadoop 2.6+ profile to pull in object store FS accessors Key: SPARK-7481 URL: https://issues.apache.org/jira/browse/SPARK-7481 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1 Reporter: Steve Loughran To keep the s3n classpath right, to add s3a, swift azure, the dependencies of spark in a 2.6+ profile need to add the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure) this adds more stuff to the client bundle, but will mean a single spark package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
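A hedged sketch of what such a credential-gated, read-only object-store smoke test could look like. The OBJECT_STORE_TEST_PATH environment variable and the bounded take(10) are illustrative choices, not anything Spark or Hadoop defines; the idea is simply to skip when nothing is configured and to keep the read small and cheap.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object ObjectStoreSmokeTest {
  def main(args: Array[String]): Unit = {
    // Skip the test entirely when no object-store path (and hence no credentials
    // or endpoint) has been configured for this environment.
    val path = sys.env.getOrElse("OBJECT_STORE_TEST_PATH", "")
    if (path.isEmpty) {
      println("No object store test path configured; skipping.")
    } else {
      val sc = new SparkContext(new SparkConf().setAppName("ObjectStoreSmokeTest"))
      try {
        // A bounded read: only look at a handful of lines rather than the whole
        // dataset, so the test neither takes forever nor runs up a bill.
        val firstLines = sc.textFile(path).take(10)
        println(s"Read ${firstLines.length} lines from $path")
      } finally {
        sc.stop()
      }
    }
  }
}
{code}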
[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534209#comment-14534209 ] yangping wu commented on SPARK-6770: Hi [~tdas], I use the code you mentioned, It was successes recovery from checkpoint. It solves the issue, Thank you. DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I am read data from kafka using createDirectStream method and save the received log to Mysql, the code snippets as follows {code} def functionToCreateContext(): StreamingContext = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(10)) ssc.checkpoint(/tmp/kafka/channel/offset) // set checkpoint directory ssc } val struct = StructType(StructField(log, StringType) ::Nil) // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate(/tmp/kafka/channel/offset, functionToCreateContext) val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext) SDB.foreachRDD(rdd = { val result = rdd.map(item = { println(item) val result = item._2 match { case e: String = Row.apply(e) case _ = Row.apply() } result }) println(result.count()) val df = sqlContext.createDataFrame(result, struct) df.insertIntoJDBC(url, test, overwrite = false) }) ssc.start() ssc.awaitTermination() ssc.stop() {code} But when I recovery the program from checkpoint, I encountered an exception: {code} Exception in thread main org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218) at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512) at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57) at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534210#comment-14534210 ] Sean Owen commented on SPARK-7481: -- Maybe I'd be less frightened if I knew the size of these deps and their dependencies was small, and the licenses were all OK, etc. This would need some checking; I know we had a license problem and so forth with Kinesis, and have had jets3t problems, etc. I am maybe needlessly wary of doing this several times over to add more niche FS clients to the main build for everyone. Add Hadoop 2.6+ profile to pull in object store FS accessors Key: SPARK-7481 URL: https://issues.apache.org/jira/browse/SPARK-7481 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1 Reporter: Steve Loughran To keep the s3n classpath right, to add s3a, swift azure, the dependencies of spark in a 2.6+ profile need to add the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure) this adds more stuff to the client bundle, but will mean a single spark package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534240#comment-14534240 ] Tathagata Das commented on SPARK-6770: -- Awesome! I am closing this JIRA then! DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I am read data from kafka using createDirectStream method and save the received log to Mysql, the code snippets as follows {code} def functionToCreateContext(): StreamingContext = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(10)) ssc.checkpoint(/tmp/kafka/channel/offset) // set checkpoint directory ssc } val struct = StructType(StructField(log, StringType) ::Nil) // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate(/tmp/kafka/channel/offset, functionToCreateContext) val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext) SDB.foreachRDD(rdd = { val result = rdd.map(item = { println(item) val result = item._2 match { case e: String = Row.apply(e) case _ = Row.apply() } result }) println(result.count()) val df = sqlContext.createDataFrame(result, struct) df.insertIntoJDBC(url, test, overwrite = false) }) ssc.start() ssc.awaitTermination() ssc.stop() {code} But when I recovery the program from checkpoint, I encountered an exception: {code} Exception in thread main org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at 
org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218) at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512) at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57) at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at
[jira] [Commented] (SPARK-7459) Add Java example for ElementwiseProduct in programming guide
[ https://issues.apache.org/jira/browse/SPARK-7459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534276#comment-14534276 ] Octavian Geagla commented on SPARK-7459: Can do! Add Java example for ElementwiseProduct in programming guide Key: SPARK-7459 URL: https://issues.apache.org/jira/browse/SPARK-7459 Project: Spark Issue Type: Documentation Components: Documentation, Java API, ML Reporter: Joseph K. Bradley Priority: Minor Duplicate Scala example, but in Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
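For context, a rough Scala sketch of the ElementwiseProduct usage that the requested Java example would mirror. This assumes the org.apache.spark.mllib.feature.ElementwiseProduct transformer; the vectors are illustrative, and the official programming guide example should be taken as authoritative.

{code}
import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors

// Scale each element of an input vector by the corresponding element of the scaling vector.
val scalingVec = Vectors.dense(0.0, 1.0, 2.0)
val transformer = new ElementwiseProduct(scalingVec)

val transformed = transformer.transform(Vectors.dense(1.0, 2.0, 3.0))
println(transformed)  // expected: [0.0,2.0,6.0]
{code}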
[jira] [Closed] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das closed SPARK-6770. Resolution: Not A Problem DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I am read data from kafka using createDirectStream method and save the received log to Mysql, the code snippets as follows {code} def functionToCreateContext(): StreamingContext = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(10)) ssc.checkpoint(/tmp/kafka/channel/offset) // set checkpoint directory ssc } val struct = StructType(StructField(log, StringType) ::Nil) // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate(/tmp/kafka/channel/offset, functionToCreateContext) val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext) SDB.foreachRDD(rdd = { val result = rdd.map(item = { println(item) val result = item._2 match { case e: String = Row.apply(e) case _ = Row.apply() } result }) println(result.count()) val df = sqlContext.createDataFrame(result, struct) df.insertIntoJDBC(url, test, overwrite = false) }) ssc.start() ssc.awaitTermination() ssc.stop() {code} But when I recovery the program from checkpoint, I encountered an exception: {code} Exception in thread main org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218) at 
org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512) at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57) at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at
[jira] [Assigned] (SPARK-7482) Rename some DataFrame API methods in SparkR to match their counterparts in Scala
[ https://issues.apache.org/jira/browse/SPARK-7482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7482: --- Assignee: (was: Apache Spark) Rename some DataFrame API methods in SparkR to match their counterparts in Scala Key: SPARK-7482 URL: https://issues.apache.org/jira/browse/SPARK-7482 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Priority: Critical This is a re-consideration of how to solve name conflicts. Previously, we renamed API names from the Scala API if there was a name conflict with base or other commonly-used packages. However, from a long-term perspective, this is not good for API stability, because we can't predict name conflicts; for example, what if in the future a name added to the base package conflicts with an API in SparkR? So the better policy is to keep API names the same as Scala's without worrying about name conflicts. When users use SparkR, they should load SparkR as the last package, so that all API names are effective. Users can explicitly use :: to refer to hidden names from other packages. More discussion can be found at https://issues.apache.org/jira/browse/SPARK-6812 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7482) Rename some DataFrame API methods in SparkR to match their counterparts in Scala
[ https://issues.apache.org/jira/browse/SPARK-7482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534255#comment-14534255 ] Apache Spark commented on SPARK-7482: - User 'sun-rui' has created a pull request for this issue: https://github.com/apache/spark/pull/6007 Rename some DataFrame API methods in SparkR to match their counterparts in Scala Key: SPARK-7482 URL: https://issues.apache.org/jira/browse/SPARK-7482 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Priority: Critical This is a re-consideration of how to solve name conflicts. Previously, we renamed API names from the Scala API if there was a name conflict with base or other commonly-used packages. However, from a long-term perspective, this is not good for API stability, because we can't predict name conflicts; for example, what if in the future a name added to the base package conflicts with an API in SparkR? So the better policy is to keep API names the same as Scala's without worrying about name conflicts. When users use SparkR, they should load SparkR as the last package, so that all API names are effective. Users can explicitly use :: to refer to hidden names from other packages. More discussion can be found at https://issues.apache.org/jira/browse/SPARK-6812 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7482) Rename some DataFrame API methods in SparkR to match their counterparts in Scala
[ https://issues.apache.org/jira/browse/SPARK-7482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7482: --- Assignee: Apache Spark Rename some DataFrame API methods in SparkR to match their counterparts in Scala Key: SPARK-7482 URL: https://issues.apache.org/jira/browse/SPARK-7482 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Assignee: Apache Spark Priority: Critical This is a re-consideration of how to solve name conflicts. Previously, we renamed API names from the Scala API if there was a name conflict with base or other commonly-used packages. However, from a long-term perspective, this is not good for API stability, because we can't predict name conflicts; for example, what if in the future a name added to the base package conflicts with an API in SparkR? So the better policy is to keep API names the same as Scala's without worrying about name conflicts. When users use SparkR, they should load SparkR as the last package, so that all API names are effective. Users can explicitly use :: to refer to hidden names from other packages. More discussion can be found at https://issues.apache.org/jira/browse/SPARK-6812 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7459) Add Java example for ElementwiseProduct in programming guide
[ https://issues.apache.org/jira/browse/SPARK-7459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534276#comment-14534276 ] Octavian Geagla edited comment on SPARK-7459 at 5/8/15 10:26 AM: - Can do! Please assign to me. was (Author: ogeagla): Can do! Add Java example for ElementwiseProduct in programming guide Key: SPARK-7459 URL: https://issues.apache.org/jira/browse/SPARK-7459 Project: Spark Issue Type: Documentation Components: Documentation, Java API, ML Reporter: Joseph K. Bradley Priority: Minor Duplicate Scala example, but in Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7459) Add Java example for ElementwiseProduct in programming guide
[ https://issues.apache.org/jira/browse/SPARK-7459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534303#comment-14534303 ] Sean Owen commented on SPARK-7459: -- You don't need to be assigned; just go ahead. Add Java example for ElementwiseProduct in programming guide Key: SPARK-7459 URL: https://issues.apache.org/jira/browse/SPARK-7459 Project: Spark Issue Type: Documentation Components: Documentation, Java API, ML Reporter: Joseph K. Bradley Priority: Minor Duplicate Scala example, but in Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6154) Support Kafka, JDBC in Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534344#comment-14534344 ] Jianshi Huang commented on SPARK-6154: -- Do you mean we need to upgrade the jline version for both 2.11 and 2.10? Jianshi Support Kafka, JDBC in Scala 2.11 - Key: SPARK-6154 URL: https://issues.apache.org/jira/browse/SPARK-6154 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed when -Phive-thriftserver is enabled. [info] Compiling 9 Scala sources to /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes... [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2 5: object ConsoleReader is not a member of package jline [error] import jline.{ConsoleReader, History} [error]^ [warn] Class jline.Completor not found - continuing with a stub. [warn] Class jline.ConsoleReader not found - continuing with a stub. [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1 65: not found: type ConsoleReader [error] val reader = new ConsoleReader() Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7482) Rename some DataFrame API methods in SparkR to match their counterparts in Scala
Sun Rui created SPARK-7482: -- Summary: Rename some DataFrame API methods in SparkR to match their counterparts in Scala Key: SPARK-7482 URL: https://issues.apache.org/jira/browse/SPARK-7482 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Priority: Critical This is a re-consideration of how to solve name conflicts. Previously, we renamed API names from the Scala API if there was a name conflict with base or other commonly-used packages. However, from a long-term perspective, this is not good for API stability, because we can't predict name conflicts; for example, what if in the future a name added to the base package conflicts with an API in SparkR? So the better policy is to keep API names the same as Scala's without worrying about name conflicts. When users use SparkR, they should load SparkR as the last package, so that all API names are effective. Users can explicitly use :: to refer to hidden names from other packages. More discussion can be found at https://issues.apache.org/jira/browse/SPARK-6812 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7459) Add Java example for ElementwiseProduct in programming guide
[ https://issues.apache.org/jira/browse/SPARK-7459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7459: - Assignee: Octavian Geagla Add Java example for ElementwiseProduct in programming guide Key: SPARK-7459 URL: https://issues.apache.org/jira/browse/SPARK-7459 Project: Spark Issue Type: Documentation Components: Documentation, Java API, ML Reporter: Joseph K. Bradley Assignee: Octavian Geagla Priority: Minor Duplicate Scala example, but in Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6869) Add pyspark archives path to PYTHONPATH
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-6869: - Priority: Blocker (was: Minor) Add pyspark archives path to PYTHONPATH --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Weizhong Priority: Blocker From SPARK-1920 and SPARK-1520 we know PySpark on Yarn cannot work when the assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors via Yarn with --py-files. The pyspark archive name must contain spark-pyspark. 1st: zip pyspark to spark-pyspark_2.10.zip 2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files spark-pyspark_2.10.zip app.py args -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7449) createPhysicalRDD should use RDD output as schema instead of relation.schema
[ https://issues.apache.org/jira/browse/SPARK-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7449: - Component/s: SQL createPhysicalRDD should use RDD output as schema instead of relation.schema Key: SPARK-7449 URL: https://issues.apache.org/jira/browse/SPARK-7449 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhan Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception
[ https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7483: - Component/s: MLlib Priority: Minor (was: Major) [MLLib] Using Kryo with FPGrowth fails with an exception Key: SPARK-7483 URL: https://issues.apache.org/jira/browse/SPARK-7483 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Tomasz Bartczak Priority: Minor When using the FPGrowth algorithm with KryoSerializer, Spark fails with {code} Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Can not set final scala.collection.mutable.ListBuffer field org.apache.spark.mllib.fpm.FPTree$Summary.nodes to scala.collection.mutable.ArrayBuffer Serialization trace: nodes (org.apache.spark.mllib.fpm.FPTree$Summary) org$apache$spark$mllib$fpm$FPTree$$summaries (org.apache.spark.mllib.fpm.FPTree) {code} This can be easily reproduced in the Spark codebase by setting {code} conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") {code} and running FPGrowthSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
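To make the reproduction steps above concrete, here is a hedged, self-contained Scala sketch: enable Kryo serialization as the reporter describes, then run FPGrowth. The transaction data and thresholds are made up; the point is only the serializer setting plus an FPGrowth run that shuffles FPTree objects.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

object FPGrowthKryoRepro {
  def main(args: Array[String]): Unit = {
    // Enable Kryo, as in the report above.
    val conf = new SparkConf()
      .setAppName("FPGrowthKryoRepro")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // A tiny, illustrative transaction dataset.
    val transactions = sc.parallelize(Seq(
      Array("a", "b", "c"),
      Array("a", "b"),
      Array("b", "c")))

    // Multiple partitions force the FPTree partial results through the (Kryo) shuffle path.
    val model = new FPGrowth()
      .setMinSupport(0.5)
      .setNumPartitions(2)
      .run(transactions)

    println(model.freqItemsets.count())
    sc.stop()
  }
}
{code}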
[jira] [Updated] (SPARK-7484) Support passing jdbc connection properties for dataframe.createJDBCTable and insertIntoJDBC
[ https://issues.apache.org/jira/browse/SPARK-7484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7484: --- Issue Type: Improvement (was: Bug) Support passing jdbc connection properties for dataframe.createJDBCTable and insertIntoJDBC --- Key: SPARK-7484 URL: https://issues.apache.org/jira/browse/SPARK-7484 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Venkata Ramana G Priority: Minor A few JDBC drivers, like SybaseIQ, support passing the username and password only through connection properties, so the same needs to be supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
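To illustrate the gap, a hedged Scala sketch: the existing 1.3 methods only take a JDBC URL and table name, so any credentials have to live in the URL, which some drivers do not allow. The properties-taking overloads shown at the end are hypothetical and only indicate the proposed shape; they are not part of the current API.

{code}
import java.util.Properties

import org.apache.spark.sql.DataFrame

object JdbcSaveSketch {
  // Existing API (Spark 1.3): only a url and table name can be passed.
  def saveExisting(df: DataFrame, url: String): Unit = {
    df.createJDBCTable(url, "my_table", false) // allowExisting = false
    df.insertIntoJDBC(url, "my_table", false)  // overwrite = false
  }

  // Proposed shape (hypothetical overloads): pass driver-specific connection
  // properties such as user and password separately from the url.
  def saveProposed(df: DataFrame, url: String): Unit = {
    val props = new Properties()
    props.setProperty("user", "me")         // illustrative values
    props.setProperty("password", "secret")
    // df.createJDBCTable(url, "my_table", false, props)
    // df.insertIntoJDBC(url, "my_table", false, props)
  }
}
{code}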
[jira] [Updated] (SPARK-7435) Make DataFrame.show() consistent with that of Scala and pySpark
[ https://issues.apache.org/jira/browse/SPARK-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7435: --- Priority: Critical (was: Blocker) Make DataFrame.show() consistent with that of Scala and pySpark --- Key: SPARK-7435 URL: https://issues.apache.org/jira/browse/SPARK-7435 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Priority: Critical Currently in SparkR, DataFrame has two methods, show() and showDF(). show() prints the DataFrame column names and types, and showDF() prints the first numRows rows of a DataFrame. In Scala and pySpark, show() is used to print rows of a DataFrame. We should keep the API consistent unless there is some important reason not to. So I propose to interchange the names (show() and showDF()) in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7486) Add the streaming implementation for estimating quantiles and median
Liang-Chi Hsieh created SPARK-7486: -- Summary: Add the streaming implementation for estimating quantiles and median Key: SPARK-7486 URL: https://issues.apache.org/jira/browse/SPARK-7486 Project: Spark Issue Type: New Feature Components: ML, SQL Reporter: Liang-Chi Hsieh Streaming implementations that can estimate quantiles and the median are very useful for ML algorithms and data statistics. Apache DataFu Pig has this kind of implementation. We can port it to Spark. Please refer to: http://datafu.incubator.apache.org/docs/datafu/getting-started.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
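As a rough illustration of the kind of one-pass estimator this issue asks for (not the DataFu algorithm itself, which should be consulted directly), here is a minimal reservoir-sampling sketch in Scala: keep a fixed-size uniform sample of the stream and read quantiles off the sorted sample. The reservoir size and seed are arbitrary.

{code}
import scala.util.Random

// Single-pass approximate quantiles via a fixed-size uniform reservoir sample.
class ReservoirQuantile(reservoirSize: Int, seed: Long = 42L) {
  private val rng = new Random(seed)
  private val reservoir = new Array[Double](reservoirSize)
  private var seen = 0L

  def insert(x: Double): Unit = {
    if (seen < reservoirSize) {
      reservoir(seen.toInt) = x
    } else {
      // Classic reservoir sampling: replace a slot with probability reservoirSize / (seen + 1).
      val j = (rng.nextDouble() * (seen + 1)).toLong
      if (j < reservoirSize) reservoir(j.toInt) = x
    }
    seen += 1
  }

  def quantile(q: Double): Double = {
    val n = math.min(seen, reservoirSize.toLong).toInt
    require(n > 0, "no data seen yet")
    val sorted = reservoir.take(n).sorted
    sorted(math.min(n - 1, (q * n).toInt))
  }

  def median: Double = quantile(0.5)
}
{code}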
[jira] [Commented] (SPARK-7110) when use saveAsNewAPIHadoopFile, sometimes it throws Delegation Token can be issued only with kerberos or web authentication
[ https://issues.apache.org/jira/browse/SPARK-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534494#comment-14534494 ] Thomas Graves commented on SPARK-7110: -- [~gu chi] is some of the stack trace missing from the description? If so, could you attach the rest of it? Could you also provide the context in which NewHadoopRDD.getPartitions is called? Are you calling it directly, or is it being called from another Spark routine? (If so, which interface?) when use saveAsNewAPIHadoopFile, sometimes it throws Delegation Token can be issued only with kerberos or web authentication -- Key: SPARK-7110 URL: https://issues.apache.org/jira/browse/SPARK-7110 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: gu-chi Under yarn-client mode, this issue occurs randomly. The authentication method is set to kerberos, and saveAsNewAPIHadoopFile in PairRDDFunctions is used to save data to HDFS; then the exception comes as: org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token can be issued only with kerberos or web authentication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7393) How to improve Spark SQL performance?
[ https://issues.apache.org/jira/browse/SPARK-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-7393. Resolution: Invalid Hi - thanks for giving feedback on your use of Spark SQL. This type of discussion should take place on the mailing list rather than on our feature issue tracker. How to improve Spark SQL performance? - Key: SPARK-7393 URL: https://issues.apache.org/jira/browse/SPARK-7393 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang Lee We want to use Spark SQL in our project, but we found that Spark SQL performance is not as good as we expected. The details are as follows: 1. We save data as parquet files on HDFS. 2. We just select one or several rows from the parquet file using Spark SQL. 3. When the total record number is 61 million, it needs about 3 seconds to get the result, which is unacceptably long for our scenario. 4. When the total record number is 2 million, it needs about 93 ms to get the result, which is still a little long for us. 5. The query statement is like: SELECT * FROM DBA WHERE COLA=? AND COLB=? And the table is not complex: it has fewer than 10 columns and the content of each column is less than 100 bytes. 6. Does anyone know how to improve the performance or give some other ideas? 7. Can Spark SQL support microsecond-level response? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
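Not a recommendation from the thread above, but one thing that often helps with repeated point lookups like the query in item 5 is caching the table in memory, so later queries avoid re-reading Parquet from HDFS. A hedged sketch against the Spark 1.3 API; the path and predicate values are illustrative.

{code}
import org.apache.spark.sql.SQLContext

object CachedLookupSketch {
  def queryWithCache(sqlContext: SQLContext): Unit = {
    // Load the parquet-backed table and cache it as an in-memory columnar table.
    val df = sqlContext.parquetFile("hdfs:///data/dba.parquet") // illustrative path
    df.registerTempTable("DBA")
    sqlContext.cacheTable("DBA")

    // The first query pays the load cost; subsequent ones hit the cached data.
    val rows = sqlContext.sql("SELECT * FROM DBA WHERE COLA = 'x' AND COLB = 'y'").collect()
    println(rows.length)
  }
}
{code}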
[jira] [Created] (SPARK-7485) Remove python artifacts from the assembly jar
Thomas Graves created SPARK-7485: Summary: Remove python artifacts from the assembly jar Key: SPARK-7485 URL: https://issues.apache.org/jira/browse/SPARK-7485 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: Thomas Graves We changed it so that we distribute the python files via a zip file in SPARK-6869. With that, we should remove the python files from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534582#comment-14534582 ] Thu Kyaw commented on SPARK-3928: - Hello [~lian cheng] please let me know if you want me to work on adding back the glob support. Support wildcard matches on Parquet files - Key: SPARK-3928 URL: https://issues.apache.org/jira/browse/SPARK-3928 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Nicholas Chammas Assignee: Cheng Lian Priority: Minor Fix For: 1.3.0 {{SparkContext.textFile()}} supports patterns like {{part-*}} and {{2014-\?\?-\?\?}}. It would be nice if {{SparkContext.parquetFile()}} did the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
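For reference, the behavior under discussion, sketched in Scala: sc.textFile already expands glob patterns, and the request (per the description) is for parquetFile to accept the same patterns. The paths are illustrative, and the second call shows the desired behavior rather than a guarantee for any particular release.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object GlobSketch {
  def globExamples(sc: SparkContext, sqlContext: SQLContext): Unit = {
    // textFile resolves patterns like part-* and 2014-??-?? today.
    val text = sc.textFile("hdfs:///logs/2014-??-??/part-*")
    println(text.count())

    // The request is for parquetFile to accept the same patterns.
    val parquet = sqlContext.parquetFile("hdfs:///tables/events/part-*")
    println(parquet.count())
  }
}
{code}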
[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534629#comment-14534629 ] Rangarajan Sreenivasan commented on SPARK-5928: --- We are hitting a very similar issue. Job fails during the repartition stage. * Ours is a 10-node r3.8x cluster (119 GB 16-CPU per node) * Running Spark version 1.3.1 in Standalone cluster mode * Tried various parallelism values - 50, 100, 200, 500, 800 Remote Shuffle Blocks cannot be more than 2 GB -- Key: SPARK-5928 URL: https://issues.apache.org/jira/browse/SPARK-5928 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid If a shuffle block is over 2GB, the shuffle fails, with an uninformative exception. The tasks get retried a few times and then eventually the job fails. Here is an example program which can cause the exception: {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} Note that you can't trigger this exception in local mode, it only happens on remote fetches. I triggered these exceptions running with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {noformat} 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at
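While blocks over 2 GB remain unsupported, the usual mitigation is to spread the shuffle over more keys and partitions so that no single remote block approaches the limit. Below is a hedged rework of the reproduction from the description, which deliberately funnels every record into one key; the key spreading and partition counts here are illustrative, and a SparkContext named sc (as in spark-shell) is assumed.

{code}
// Same payload as the original reproduction: incompressible 3 KB byte arrays.
val rdd = sc.parallelize(1 to 1e6.toInt, 100).map { ignore =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n)
  scala.util.Random.nextBytes(arr) // keep the array from compressing away
  arr
}

// The original groups everything under the single key 1, forcing one giant
// shuffle block. Spreading records across many keys and using more reduce
// partitions keeps each remote block well under 2 GB.
rdd.map { x => (scala.util.Random.nextInt(500), x) }
  .groupByKey(500)
  .count()
{code}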
[jira] [Resolved] (SPARK-1920) Spark JAR compiled with Java 7 leads to PySpark not working in YARN
[ https://issues.apache.org/jira/browse/SPARK-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-1920. -- Resolution: Duplicate Spark JAR compiled with Java 7 leads to PySpark not working in YARN --- Key: SPARK-1920 URL: https://issues.apache.org/jira/browse/SPARK-1920 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.0.0 Reporter: Tathagata Das Priority: Blocker The current (Spark 1.0) implementation of PySpark on Yarn requires python to be able to read the Spark assembly JAR. But a Spark assembly JAR compiled with Java 7 can sometimes not be readable by python. This can be due to the fact that JARs created by Java 7 with more than 2^16 files are encoded in Zip64, which python can't read. [SPARK-1911|https://issues.apache.org/jira/browse/SPARK-1911] warns users against using Java 7 when creating a Spark distribution. One way to fix this is to put pyspark in a different, smaller JAR than the rest of Spark so that it is readable by python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6869) Add pyspark archives path to PYTHONPATH
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-6869: - Assignee: Lianhui Wang Add pyspark archives path to PYTHONPATH --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Weizhong Assignee: Lianhui Wang Priority: Blocker Fix For: 1.4.0 From SPARK-1920 and SPARK-1520 we know PySpark on Yarn cannot work when the assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors via Yarn with --py-files. The pyspark archive name must contain spark-pyspark. 1st: zip pyspark to spark-pyspark_2.10.zip 2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files spark-pyspark_2.10.zip app.py args -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6869) Add pyspark archives path to PYTHONPATH
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-6869. -- Resolution: Fixed Fix Version/s: 1.4.0 Add pyspark archives path to PYTHONPATH --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Weizhong Priority: Blocker Fix For: 1.4.0 From SPARK-1920 and SPARK-1520 we know PySpark on Yarn cannot work when the assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors via Yarn with --py-files. The pyspark archive name must contain spark-pyspark. 1st: zip pyspark to spark-pyspark_2.10.zip 2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files spark-pyspark_2.10.zip app.py args -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6869) Add pyspark archives path to PYTHONPATH
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-6869: - Target Version/s: 1.4.0 Add pyspark archives path to PYTHONPATH --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Weizhong Priority: Blocker Fix For: 1.4.0 From SPARK-1920 and SPARK-1520 we know PySpark on Yarn cannot work when the assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors via Yarn with --py-files. The pyspark archive name must contain spark-pyspark. 1st: zip pyspark to spark-pyspark_2.10.zip 2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files spark-pyspark_2.10.zip app.py args -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6961) Cannot save data to parquet files when executing from Windows from a Maven Project
[ https://issues.apache.org/jira/browse/SPARK-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6961: --- Priority: Critical (was: Blocker) Cannot save data to parquet files when executing from Windows from a Maven Project -- Key: SPARK-6961 URL: https://issues.apache.org/jira/browse/SPARK-6961 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Bogdan Niculescu Priority: Critical I have setup a project where I am trying to save a DataFrame into a parquet file. My project is a Maven one with Spark 1.3.0 and Scala 2.11.5 : {code:xml} spark.version1.3.0/spark.version dependency groupIdorg.apache.spark/groupId artifactIdspark-core_2.11/artifactId version${spark.version}/version /dependency dependency groupIdorg.apache.spark/groupId artifactIdspark-sql_2.11/artifactId version${spark.version}/version /dependency {code} A simple version of my code that reproduces consistently the problem that I am seeing is : {code} import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkConf, SparkContext} case class Person(name: String, age: Int) object DataFrameTest extends App { val conf = new SparkConf().setMaster(local[4]).setAppName(DataFrameTest) val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) val persons = List(Person(a, 1), Person(b, 2)) val rdd = sc.parallelize(persons) val dataFrame = sqlContext.createDataFrame(rdd) dataFrame.saveAsParquetFile(test.parquet) } {code} All the time the exception that I am getting is : {code} Exception in thread main java.lang.NullPointerException at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010) at org.apache.hadoop.util.Shell.runCommand(Shell.java:404) at org.apache.hadoop.util.Shell.run(Shell.java:379) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at org.apache.hadoop.util.Shell.execCommand(Shell.java:678) at org.apache.hadoop.util.Shell.execCommand(Shell.java:661) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639) at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:772) at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:409) at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:401) at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:443) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.prepareMetadata(newParquet.scala:240) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:256) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.immutable.List.map(List.scala:285) at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:370) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308) at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123) at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:922) at sparkTest.DataFrameTest$.delayedEndpoint$sparkTest$DataFrameTest$1(DataFrameTest.scala:17) at sparkTest.DataFrameTest$delayedInit$body.apply(DataFrameTest.scala:8) at scala.Function0$class.apply$mcV$sp(Function0.scala:40) at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at
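Not a confirmed fix for this ticket, but the NullPointerException from Shell.runCommand on Windows is commonly reported when Hadoop's winutils.exe is not available locally. A frequently cited workaround is to point hadoop.home.dir at a directory containing bin\winutils.exe before the SparkContext is created; the path below is illustrative. The sketch also restores the string quotes that the issue tracker stripped from the reporter's code.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object DataFrameTest extends App {
  // Commonly reported Windows workaround (assumption, not verified for this ticket):
  // hadoop.home.dir must point at a folder that contains bin\winutils.exe.
  System.setProperty("hadoop.home.dir", "C:\\hadoop")

  val conf = new SparkConf().setMaster("local[4]").setAppName("DataFrameTest")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val persons = List(Person("a", 1), Person("b", 2))
  val dataFrame = sqlContext.createDataFrame(sc.parallelize(persons))
  dataFrame.saveAsParquetFile("test.parquet")
}
{code}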
[jira] [Updated] (SPARK-6869) Add pyspark archives path to PYTHONPATH
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-6869: - Issue Type: Bug (was: Improvement) Add pyspark archives path to PYTHONPATH --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Weizhong Priority: Minor From SPARK-1920 and SPARK-1520 we know PySpark on Yarn cannot work when the assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors via Yarn with --py-files. The pyspark archive name must contain spark-pyspark. 1st: zip pyspark to spark-pyspark_2.10.zip 2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files spark-pyspark_2.10.zip app.py args -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534622#comment-14534622 ] Steve Loughran commented on SPARK-7481: --- * hadoop-openstack: 100K, plus httpclient (400K) * hadoop-aws: 85K, plus jets3t (500K) * s3a needs the aws toolkit @ 11.5MB, so it's the big one * azure is 500K To retain s3n in spark, the hadoop-aws and jets3t dependencies need to go in; s3a is a fairly large addition. Add Hadoop 2.6+ profile to pull in object store FS accessors Key: SPARK-7481 URL: https://issues.apache.org/jira/browse/SPARK-7481 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1 Reporter: Steve Loughran To keep the s3n classpath right, to add s3a, swift azure, the dependencies of spark in a 2.6+ profile need to add the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure) this adds more stuff to the client bundle, but will mean a single spark package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7381) Missing Python API for o.a.s.ml
[ https://issues.apache.org/jira/browse/SPARK-7381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-7381: --- Summary: Missing Python API for o.a.s.ml (was: Python API for Transformers) Missing Python API for o.a.s.ml --- Key: SPARK-7381 URL: https://issues.apache.org/jira/browse/SPARK-7381 Project: Spark Issue Type: Umbrella Components: ML, PySpark Reporter: Burak Yavuz Assignee: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7447) Large Job submission lag when using Parquet w/ Schema Merging
[ https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7447: --- Assignee: (was: Apache Spark) Large Job submission lag when using Parquet w/ Schema Merging - Key: SPARK-7447 URL: https://issues.apache.org/jira/browse/SPARK-7447 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Spark Submit Affects Versions: 1.3.0, 1.3.1 Environment: Spark 1.3.1, aws, persistent hdfs version 2 with ebs storage, pyspark, 8 x c3.8xlarge nodes. spark-conf spark.executor.memory 50g spark.driver.cores 32 spark.driver.memory 50g spark.default.parallelism 512 spark.sql.shuffle.partitions 512 spark.task.maxFailures 30 spark.executor.logs.rolling.maxRetainedFiles 2 spark.executor.logs.rolling.size.maxBytes 102400 spark.executor.logs.rolling.strategy size spark.shuffle.spill false spark.sql.parquet.cacheMetadata true spark.sql.parquet.filterPushdown true spark.sql.codegen true spark.akka.threads = 64 Reporter: Brad Willard I have 2.6 billion rows in parquet format and I'm trying to use the new schema merging feature (I was enforcing a consistent schema manually before in 0.8-1.2 which was annoying). I have approximate 200 parquet files with key=date. When I load the dataframe with the sqlcontext that process is understandably slow because I assume it's reading all the meta data from the parquet files and doing the initial schema merging. So that's ok. However the problem I have is that once I have the dataframe. Doing any operation on the dataframe seems to have a 10-30 second lag before it actually starts processing the Job and shows up as an Active Job in the Spark Manager. This was an instant operation in all previous versions of Spark. Once the job actually is running the performance is fantastic, however this job submission lag is horrible. I'm wondering if there is a bug with recomputing the schema merging. Running top on the master node shows some thread maxed out on 1 cpu during the lagging time which makes me think it's not net i/o but something pre-processing before job submission. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7447) Large Job submission lag when using Parquet w/ Schema Merging
[ https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7447: --- Assignee: Apache Spark Large Job submission lag when using Parquet w/ Schema Merging - Key: SPARK-7447 URL: https://issues.apache.org/jira/browse/SPARK-7447 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Spark Submit Affects Versions: 1.3.0, 1.3.1 Environment: Spark 1.3.1, aws, persistent hdfs version 2 with ebs storage, pyspark, 8 x c3.8xlarge nodes. spark-conf spark.executor.memory 50g spark.driver.cores 32 spark.driver.memory 50g spark.default.parallelism 512 spark.sql.shuffle.partitions 512 spark.task.maxFailures 30 spark.executor.logs.rolling.maxRetainedFiles 2 spark.executor.logs.rolling.size.maxBytes 102400 spark.executor.logs.rolling.strategy size spark.shuffle.spill false spark.sql.parquet.cacheMetadata true spark.sql.parquet.filterPushdown true spark.sql.codegen true spark.akka.threads = 64 Reporter: Brad Willard Assignee: Apache Spark I have 2.6 billion rows in parquet format and I'm trying to use the new schema merging feature (I was enforcing a consistent schema manually before in 0.8-1.2 which was annoying). I have approximate 200 parquet files with key=date. When I load the dataframe with the sqlcontext that process is understandably slow because I assume it's reading all the meta data from the parquet files and doing the initial schema merging. So that's ok. However the problem I have is that once I have the dataframe. Doing any operation on the dataframe seems to have a 10-30 second lag before it actually starts processing the Job and shows up as an Active Job in the Spark Manager. This was an instant operation in all previous versions of Spark. Once the job actually is running the performance is fantastic, however this job submission lag is horrible. I'm wondering if there is a bug with recomputing the schema merging. Running top on the master node shows some thread maxed out on 1 cpu during the lagging time which makes me think it's not net i/o but something pre-processing before job submission. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7447) Large Job submission lag when using Parquet w/ Schema Merging
[ https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535018#comment-14535018 ] Apache Spark commented on SPARK-7447: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/6012 Large Job submission lag when using Parquet w/ Schema Merging - Key: SPARK-7447 URL: https://issues.apache.org/jira/browse/SPARK-7447 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Spark Submit Affects Versions: 1.3.0, 1.3.1 Environment: Spark 1.3.1, aws, persistent hdfs version 2 with ebs storage, pyspark, 8 x c3.8xlarge nodes. spark-conf spark.executor.memory 50g spark.driver.cores 32 spark.driver.memory 50g spark.default.parallelism 512 spark.sql.shuffle.partitions 512 spark.task.maxFailures 30 spark.executor.logs.rolling.maxRetainedFiles 2 spark.executor.logs.rolling.size.maxBytes 102400 spark.executor.logs.rolling.strategy size spark.shuffle.spill false spark.sql.parquet.cacheMetadata true spark.sql.parquet.filterPushdown true spark.sql.codegen true spark.akka.threads = 64 Reporter: Brad Willard I have 2.6 billion rows in parquet format and I'm trying to use the new schema merging feature (I was enforcing a consistent schema manually before in 0.8-1.2, which was annoying). I have approximately 200 parquet files with key=date. When I load the dataframe with the sqlcontext, that process is understandably slow because I assume it's reading all the metadata from the parquet files and doing the initial schema merging. So that's ok. However, the problem is that once I have the dataframe, doing any operation on it seems to have a 10-30 second lag before it actually starts processing and shows up as an Active Job in the Spark Manager. This was an instant operation in all previous versions of Spark. Once the job is actually running the performance is fantastic, but this job submission lag is horrible. I'm wondering if there is a bug with recomputing the schema merging. Running top on the master node shows a thread maxed out on one CPU during the lag, which makes me think it's not network i/o but some pre-processing before job submission. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
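For reference, the workflow described in the report can be sketched roughly as follows. The path and partition layout are illustrative only, and a spark-shell SparkContext named sc is assumed; sqlContext.parquetFile is the Spark 1.3-era entry point for reading a partitioned Parquet directory, which is where the metadata reads and schema merging happen.
{code}
// Minimal sketch of the reported workflow, assuming a date-partitioned layout
// such as /data/events/key=2015-01-01/part-*.parquet (hypothetical path).
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Reading the top-level directory scans the footers of all part files and
// merges their schemas; this step is expected to be slow.
val df = sqlContext.parquetFile("/data/events")

// The reported problem: even a simple action sits idle for 10-30 seconds
// before it appears as an Active Job in the UI.
df.count()
{code}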
[jira] [Commented] (SPARK-7477) TachyonBlockManager Store Block in TRY_CACHE mode which gives BlockNotFoundException when blocks are evicted from cache
[ https://issues.apache.org/jira/browse/SPARK-7477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534805#comment-14534805 ] Dibyendu Bhattacharya commented on SPARK-7477: -- I tried Hierarchical Storage on Tachyon ( http://tachyon-project.org/Hierarchy-Storage-on-Tachyon.html ), and that seems to have worked: I did not see any Spark job fail due to BlockNotFoundException. Below are my hierarchical storage settings: -Dtachyon.worker.hierarchystore.level.max=2 -Dtachyon.worker.hierarchystore.level0.alias=MEM -Dtachyon.worker.hierarchystore.level0.dirs.path=$TACHYON_RAM_FOLDER -Dtachyon.worker.hierarchystore.level0.dirs.quota=$TACHYON_WORKER_MEMORY_SIZE -Dtachyon.worker.hierarchystore.level1.alias=HDD -Dtachyon.worker.hierarchystore.level1.dirs.path=/mnt/tachyon -Dtachyon.worker.hierarchystore.level1.dirs.quota=50GB -Dtachyon.worker.allocate.strategy=MAX_FREE -Dtachyon.worker.evict.strategy=LRU TachyonBlockManager Store Block in TRY_CACHE mode which gives BlockNotFoundException when blocks are evicted from cache --- Key: SPARK-7477 URL: https://issues.apache.org/jira/browse/SPARK-7477 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.4.0 Reporter: Dibyendu Bhattacharya With Spark Streaming on Tachyon as the OFF_HEAP block store, I used the low-level receiver-based Kafka consumer (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) for Spark Streaming to pull from Kafka and write blocks to Tachyon. What I see is that TachyonBlockManager.scala puts the blocks with the WriteType.TRY_CACHE configuration. Because of this, blocks are evicted from the Tachyon cache, and when Spark tries to find such a block it throws BlockNotFoundException. When I modified the WriteType to CACHE_THROUGH, the BlockDropException went away, but it impacts the throughput. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
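For context, the code path being discussed is the one taken when blocks are persisted off-heap, which is what hands them to TachyonBlockManager and makes them subject to the TRY_CACHE eviction behavior. A minimal sketch of that setup follows; the Tachyon master address is hypothetical, and spark.tachyonStore.url is assumed to be the relevant setting in this Spark version.
{code}
// Sketch of routing persisted blocks through Tachyon via OFF_HEAP storage.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("tachyon-offheap-sketch")
  .set("spark.tachyonStore.url", "tachyon://master:19998") // hypothetical address

val sc = new SparkContext(conf)

// OFF_HEAP persistence is the path that goes through TachyonBlockManager;
// evicted blocks are what later surface as BlockNotFoundException.
val data = sc.parallelize(1 to 1000000).persist(StorageLevel.OFF_HEAP)
data.count()
{code}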
[jira] [Resolved] (SPARK-3454) Expose JSON representation of data shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3454. Resolution: Fixed Expose JSON representation of data shown in WebUI - Key: SPARK-3454 URL: https://issues.apache.org/jira/browse/SPARK-3454 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Kousuke Saruta Assignee: Imran Rashid Fix For: 1.4.0 Attachments: sparkmonitoringjsondesign.pdf If the WebUI supported extracting its data as JSON, it would be helpful for users who want to analyse stage / task / executor information. Fortunately, the WebUI already has a renderJson method, so we can implement that method in each subclass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
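Once this feature is available, the UI data can be pulled over HTTP as JSON. A small sketch of querying it is shown below; the driver host and port are assumptions (4040 is the default driver UI port), and the /api/v1 endpoint layout is the one introduced with this work in 1.4.
{code}
// Fetch the JSON view of applications from the driver UI (host/port assumed).
import scala.io.Source

val json = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
println(json)
{code}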
[jira] [Created] (SPARK-7488) Python API for ml.recommendation
Burak Yavuz created SPARK-7488: -- Summary: Python API for ml.recommendation Key: SPARK-7488 URL: https://issues.apache.org/jira/browse/SPARK-7488 Project: Spark Issue Type: Sub-task Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7489) Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true
[ https://issues.apache.org/jira/browse/SPARK-7489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7489: --- Assignee: (was: Apache Spark) Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true Key: SPARK-7489 URL: https://issues.apache.org/jira/browse/SPARK-7489 Project: Spark Issue Type: Bug Components: Spark Shell Reporter: Vinod KC Steps followed export SPARK_PREPEND_CLASSES=true dev/change-version-to-2.11.sh sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean assembly bin/spark-shell 15/05/08 22:31:35 INFO Main: Created spark context.. Spark context available as sc. java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671) at java.lang.Class.getConstructor0(Class.java:3075) at java.lang.Class.getConstructor(Class.java:1825) at org.apache.spark.repl.Main$.createSQLContext(Main.scala:86) ... 45 elided Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 50 more console:11: error: not found: value sqlContext import sqlContext.implicits._ ^ console:11: error: not found: value sqlContext import sqlContext.sql There is a similar Resolved JIRA issue -SPARK-7470 and a PR https://github.com/apache/spark/pull/5997 , which handled same issue only in scala 2.10 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7489) Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true
[ https://issues.apache.org/jira/browse/SPARK-7489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7489: --- Assignee: Apache Spark Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true Key: SPARK-7489 URL: https://issues.apache.org/jira/browse/SPARK-7489 Project: Spark Issue Type: Bug Components: Spark Shell Reporter: Vinod KC Assignee: Apache Spark Steps followed export SPARK_PREPEND_CLASSES=true dev/change-version-to-2.11.sh sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean assembly bin/spark-shell 15/05/08 22:31:35 INFO Main: Created spark context.. Spark context available as sc. java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671) at java.lang.Class.getConstructor0(Class.java:3075) at java.lang.Class.getConstructor(Class.java:1825) at org.apache.spark.repl.Main$.createSQLContext(Main.scala:86) ... 45 elided Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 50 more console:11: error: not found: value sqlContext import sqlContext.implicits._ ^ console:11: error: not found: value sqlContext import sqlContext.sql There is a similar Resolved JIRA issue -SPARK-7470 and a PR https://github.com/apache/spark/pull/5997 , which handled same issue only in scala 2.10 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7489) Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true
[ https://issues.apache.org/jira/browse/SPARK-7489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535053#comment-14535053 ] Apache Spark commented on SPARK-7489: - User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/6013 Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true Key: SPARK-7489 URL: https://issues.apache.org/jira/browse/SPARK-7489 Project: Spark Issue Type: Bug Components: Spark Shell Reporter: Vinod KC Steps followed export SPARK_PREPEND_CLASSES=true dev/change-version-to-2.11.sh sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean assembly bin/spark-shell 15/05/08 22:31:35 INFO Main: Created spark context.. Spark context available as sc. java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671) at java.lang.Class.getConstructor0(Class.java:3075) at java.lang.Class.getConstructor(Class.java:1825) at org.apache.spark.repl.Main$.createSQLContext(Main.scala:86) ... 45 elided Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 50 more console:11: error: not found: value sqlContext import sqlContext.implicits._ ^ console:11: error: not found: value sqlContext import sqlContext.sql There is a similar Resolved JIRA issue -SPARK-7470 and a PR https://github.com/apache/spark/pull/5997 , which handled same issue only in scala 2.10 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7489) Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true
Vinod KC created SPARK-7489: --- Summary: Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true Key: SPARK-7489 URL: https://issues.apache.org/jira/browse/SPARK-7489 Project: Spark Issue Type: Bug Components: Spark Shell Reporter: Vinod KC Steps followed export SPARK_PREPEND_CLASSES=true dev/change-version-to-2.11.sh sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean assembly bin/spark-shell 15/05/08 22:31:35 INFO Main: Created spark context.. Spark context available as sc. java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671) at java.lang.Class.getConstructor0(Class.java:3075) at java.lang.Class.getConstructor(Class.java:1825) at org.apache.spark.repl.Main$.createSQLContext(Main.scala:86) ... 45 elided Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 50 more console:11: error: not found: value sqlContext import sqlContext.implicits._ ^ console:11: error: not found: value sqlContext import sqlContext.sql There is a similar Resolved JIRA issue -SPARK-7470 and a PR https://github.com/apache/spark/pull/5997 , which handled same issue only in scala 2.10 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6091) Add MulticlassMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6091: --- Assignee: Apache Spark (was: Yanbo Liang) Add MulticlassMetrics in PySpark/MLlib -- Key: SPARK-6091 URL: https://issues.apache.org/jira/browse/SPARK-6091 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7487) Python API for ml.regression
Burak Yavuz created SPARK-7487: -- Summary: Python API for ml.regression Key: SPARK-7487 URL: https://issues.apache.org/jira/browse/SPARK-7487 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7448) Implement custom byte array serializer for use in PySpark shuffle
[ https://issues.apache.org/jira/browse/SPARK-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535316#comment-14535316 ] Josh Rosen commented on SPARK-7448: --- This is a change that would be nice to benchmark for performance. It might require a large job, such as a huge flatMap, before we see any significant improvement here. Implement custom byte array serializer for use in PySpark shuffle Key: SPARK-7448 URL: https://issues.apache.org/jira/browse/SPARK-7448 Project: Spark Issue Type: Improvement Components: PySpark, Shuffle Reporter: Josh Rosen PySpark's shuffle typically shuffles Java RDDs that contain byte arrays. We should implement a custom Serializer for use in these shuffles. This will allow us to take advantage of shuffle optimizations like SPARK-7311 for PySpark without requiring users to change the default serializer to KryoSerializer (this is useful for JobServer-type applications). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
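Until such a byte-array serializer exists, the description implies the current way to get those shuffle optimizations is to change the default serializer to Kryo, roughly as sketched here (application name is illustrative):
{code}
// Workaround noted in the description: switch the default serializer to Kryo
// so PySpark's byte-array shuffles avoid Java serialization overhead.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-shuffle-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)
{code}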
[jira] [Created] (SPARK-7490) MapOutputTracker: close input streams to free native memory
Evan Jones created SPARK-7490: - Summary: MapOutputTracker: close input streams to free native memory Key: SPARK-7490 URL: https://issues.apache.org/jira/browse/SPARK-7490 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Evan Jones Priority: Minor GZIPInputStream allocates native memory that is not freed until close() or when the finalizer runs. It is best to close() these streams explicitly to avoid native memory leaks. Pull request here: https://github.com/apache/spark/pull/5982 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
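The fix amounts to closing the stream deterministically instead of relying on finalization. A generic sketch of the pattern (not the actual MapOutputTracker code) is:
{code}
// Close GZIPInputStream explicitly so its native zlib buffers are released
// without waiting for the finalizer.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.GZIPInputStream

def decompress(compressed: Array[Byte]): Array[Byte] = {
  val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
  try {
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](4096)
    var n = in.read(buf)
    while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
    out.toByteArray
  } finally {
    in.close() // releases native memory immediately
  }
}
{code}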
[jira] [Commented] (SPARK-7487) Python API for ml.regression
[ https://issues.apache.org/jira/browse/SPARK-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535324#comment-14535324 ] Apache Spark commented on SPARK-7487: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/6016 Python API for ml.regression Key: SPARK-7487 URL: https://issues.apache.org/jira/browse/SPARK-7487 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7487) Python API for ml.regression
[ https://issues.apache.org/jira/browse/SPARK-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7487: --- Assignee: (was: Apache Spark) Python API for ml.regression Key: SPARK-7487 URL: https://issues.apache.org/jira/browse/SPARK-7487 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7487) Python API for ml.regression
[ https://issues.apache.org/jira/browse/SPARK-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7487: --- Assignee: Apache Spark Python API for ml.regression Key: SPARK-7487 URL: https://issues.apache.org/jira/browse/SPARK-7487 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7491) Handle drivers for Metastore JDBC
Michael Armbrust created SPARK-7491: --- Summary: Handle drivers for Metastore JDBC Key: SPARK-7491 URL: https://issues.apache.org/jira/browse/SPARK-7491 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7410) Add option to avoid broadcasting configuration with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535386#comment-14535386 ] Josh Rosen commented on SPARK-7410: --- We should confirm this, but if I recall the reason that we have to broadcast these separately has something to do with configuration mutability or thread-safety. Based on a quick glance at SPARK-2585, it looks like I tried folding this into the RDD broadcast but this caused performance issues for RDDs with huge numbers of tasks. If you're interested in fixing this, I'd take a closer look through that old JIRA to try to figure out whether its discussion is still relevant. Add option to avoid broadcasting configuration with newAPIHadoopFile Key: SPARK-7410 URL: https://issues.apache.org/jira/browse/SPARK-7410 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Sandy Ryza I'm working with a Spark application that creates thousands of HadoopRDDs and unions them together. Certain details of the way the data is stored require this. Creating ten thousand of these RDDs takes about 10 minutes, even before any of them is used in an action. I dug into why this takes so long and it looks like the overhead of broadcasting the Hadoop configuration is taking up most of the time. In this case, the broadcasting isn't helpful because each HadoopRDD only corresponds to one or two tasks. When I reverted the original change that switched to broadcasting configurations, the time it took to instantiate these RDDs improved 10x. It would be nice if there was a way to turn this broadcasting off. Either through a Spark configuration option, a Hadoop configuration option, or an argument to hadoopFile / newAPIHadoopFile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
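The usage pattern that hits this overhead looks roughly like the sketch below: many small Hadoop RDDs created and unioned, each currently paying the cost of broadcasting its Hadoop Configuration. Paths and counts are hypothetical, and a spark-shell SparkContext named sc is assumed.
{code}
// Sketch of the reported pattern: thousands of small Hadoop RDDs unioned
// together, each broadcasting its own Configuration at creation time.
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val paths: Seq[String] = (0 until 10000).map(i => s"/data/part-$i") // hypothetical layout

val rdds = paths.map { p =>
  sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p).map(_._2.toString)
}
val all = sc.union(rdds) // creation alone is reported to take ~10 minutes
{code}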
[jira] [Resolved] (SPARK-7383) Python API for ml.feature
[ https://issues.apache.org/jira/browse/SPARK-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7383. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5991 [https://github.com/apache/spark/pull/5991] Python API for ml.feature - Key: SPARK-7383 URL: https://issues.apache.org/jira/browse/SPARK-7383 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz Assignee: Burak Yavuz Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7436) Cannot implement nor use custom StandaloneRecoveryModeFactory implementations
[ https://issues.apache.org/jira/browse/SPARK-7436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7436: -- Assignee: Jacek Lewandowski Cannot implement nor use custom StandaloneRecoveryModeFactory implementations - Key: SPARK-7436 URL: https://issues.apache.org/jira/browse/SPARK-7436 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.3.1 Reporter: Jacek Lewandowski Assignee: Jacek Lewandowski Fix For: 1.3.2, 1.4.0 At least, this code fragment is buggy ({{Master.scala}}): {code} case CUSTOM = val clazz = Class.forName(conf.get(spark.deploy.recoveryMode.factory)) val factory = clazz.getConstructor(conf.getClass, Serialization.getClass) .newInstance(conf, SerializationExtension(context.system)) .asInstanceOf[StandaloneRecoveryModeFactory] (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this)) {code} Because here: {{val factory = clazz.getConstructor(conf.getClass, Serialization.getClass)}} it tries to find the constructor which accepts {{org.apache.spark.SparkConf}} and class of companion object of {{akka.serialization.Serialization}} and then it tries to instantiate {{newInstance(conf, SerializationExtension(context.system))}} with instance of {{SparkConf}} and instance of {{Serialization}} class - not the companion objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
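Following the diagnosis in the description, one way to express the intended lookup is to ask for the constructor by the classes of the instances that are actually passed in, rather than the classes of the companion objects. This is a sketch of that idea, not necessarily the merged patch; conf and context refer to the Master's SparkConf and actor context as in the quoted fragment, and the import locations are assumptions.
{code}
// Look up the constructor by the instance classes (SparkConf, Serialization),
// not by conf.getClass / Serialization.getClass (the companion object's class).
import akka.serialization.{Serialization, SerializationExtension}
import org.apache.spark.SparkConf
import org.apache.spark.deploy.master.StandaloneRecoveryModeFactory

val clazz = Class.forName(conf.get("spark.deploy.recoveryMode.factory"))
val factory = clazz
  .getConstructor(classOf[SparkConf], classOf[Serialization])
  .newInstance(conf, SerializationExtension(context.system))
  .asInstanceOf[StandaloneRecoveryModeFactory]
{code}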
[jira] [Created] (SPARK-7493) ALTER TABLE statement
Sergey Semichev created SPARK-7493: -- Summary: ALTER TABLE statement Key: SPARK-7493 URL: https://issues.apache.org/jira/browse/SPARK-7493 Project: Spark Issue Type: Bug Components: SQL Environment: Databricks cloud Reporter: Sergey Semichev Priority: Minor A full table name (database_name.table_name) cannot be used with the ALTER TABLE statement, although it works with CREATE TABLE. For example, ALTER TABLE database_name.table_name ADD PARTITION (source_year='2014', source_month='01') fails with: Error in SQL statement: java.lang.RuntimeException: org.apache.spark.sql.AnalysisException: mismatched input 'ADD' expecting KW_EXCHANGE near 'test_table' in alter exchange partition; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
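A rough reproducer for the reported behavior might look like the following. The database and table names are the reporter's examples, the column and partition definitions in CREATE TABLE are made up for illustration, and a HiveContext is assumed since ALTER TABLE ... ADD PARTITION goes through the Hive parser.
{code}
// Hypothetical reproducer: CREATE TABLE accepts the qualified name,
// while ALTER TABLE ... ADD PARTITION is reported to reject it.
val hc = new org.apache.spark.sql.hive.HiveContext(sc)

hc.sql("CREATE TABLE database_name.table_name (log STRING) " +
  "PARTITIONED BY (source_year STRING, source_month STRING)")

// Reported to fail with: mismatched input 'ADD' expecting KW_EXCHANGE
hc.sql("ALTER TABLE database_name.table_name " +
  "ADD PARTITION (source_year='2014', source_month='01')")
{code}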
[jira] [Resolved] (SPARK-6824) Fill the docs for DataFrame API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-6824. -- Resolution: Fixed Fix Version/s: 1.4.0 1.5.0 Issue resolved by pull request 5969 [https://github.com/apache/spark/pull/5969] Fill the docs for DataFrame API in SparkR - Key: SPARK-6824 URL: https://issues.apache.org/jira/browse/SPARK-6824 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Qian Huang Priority: Blocker Fix For: 1.5.0, 1.4.0 Some of the DataFrame functions in SparkR do not have complete roxygen docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7298) Harmonize style of new UI visualizations
[ https://issues.apache.org/jira/browse/SPARK-7298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-7298. -- Resolution: Fixed Fix Version/s: 1.4.0 Harmonize style of new UI visualizations Key: SPARK-7298 URL: https://issues.apache.org/jira/browse/SPARK-7298 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Patrick Wendell Assignee: Matei Zaharia Priority: Blocker Fix For: 1.4.0 We need to go through all new visualizations in the web UI and make sure they have a consistent style, both with each other and with the rest of the UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7133) Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python
[ https://issues.apache.org/jira/browse/SPARK-7133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-7133. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5744 [https://github.com/apache/spark/pull/5744] Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python -- Key: SPARK-7133 URL: https://issues.apache.org/jira/browse/SPARK-7133 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Wenchen Fan Priority: Blocker Fix For: 1.4.0 Typing {code} df.col[1] {code} and {code} df.col['field'] {code} is so much easier than {code} df.col.getField('field') df.col.getItem(1) {code} This would require us to define (in Column) an apply function in Scala, and a __getitem__ function in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
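With this change merged, the shorthand apply form and the explicit getField / getItem calls should be interchangeable on the Scala side. A small sketch is below; df is an assumed DataFrame with a struct column "event" and an array column "tags" (both hypothetical names).
{code}
// Equivalent accessors after this change: Column.apply vs getField / getItem.
import org.apache.spark.sql.functions.col

val byApply   = df.select(col("event")("user"), col("tags")(0))
val byGetters = df.select(col("event").getField("user"), col("tags").getItem(0))
{code}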
[jira] [Updated] (SPARK-7448) Implement custom byte array serializer for use in PySpark shuffle
[ https://issues.apache.org/jira/browse/SPARK-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7448: -- Priority: Minor (was: Major) Implement custom byte array serializer for use in PySpark shuffle Key: SPARK-7448 URL: https://issues.apache.org/jira/browse/SPARK-7448 Project: Spark Issue Type: Improvement Components: PySpark, Shuffle Reporter: Josh Rosen Priority: Minor PySpark's shuffle typically shuffles Java RDDs that contain byte arrays. We should implement a custom Serializer for use in these shuffles. This will allow us to take advantage of shuffle optimizations like SPARK-7311 for PySpark without requiring users to change the default serializer to KryoSerializer (this is useful for JobServer-type applications). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534629#comment-14534629 ] Rangarajan Sreenivasan edited comment on SPARK-5928 at 5/8/15 5:51 PM: --- We are hitting a very similar issue. Job fails during the repartition stage. * Ours is a 10-node r3.4x cluster (119 GB 16-CPU per node) * Running Spark version 1.3.1 in Standalone cluster mode * Tried various parallelism values - 50, 100, 200, 500, 800 was (Author: sranga): We are hitting a very similar issue. Job fails during the repartition stage. * Ours is a 10-node r3.8x cluster (119 GB 16-CPU per node) * Running Spark version 1.3.1 in Standalone cluster mode * Tried various parallelism values - 50, 100, 200, 500, 800 Remote Shuffle Blocks cannot be more than 2 GB -- Key: SPARK-5928 URL: https://issues.apache.org/jira/browse/SPARK-5928 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid If a shuffle block is over 2GB, the shuffle fails, with an uninformative exception. The tasks get retried a few times and then eventually the job fails. Here is an example program which can cause the exception: {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} Note that you can't trigger this exception in local mode, it only happens on remote fetches. I triggered these exceptions running with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {noformat} 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 3021252889 
- discarded at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at
[jira] [Resolved] (SPARK-7474) ParamGridBuilder's doctest doesn't show up correctly in the generated doc
[ https://issues.apache.org/jira/browse/SPARK-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7474. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6001 [https://github.com/apache/spark/pull/6001] ParamGridBuilder's doctest doesn't show up correctly in the generated doc - Key: SPARK-7474 URL: https://issues.apache.org/jira/browse/SPARK-7474 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 {code} from classification import LogisticRegression lr = LogisticRegression() output = ParamGridBuilder().baseOn({lr.labelCol: 'l'}) .baseOn([lr.predictionCol, 'p']) .addGrid(lr.regParam, [1.0, 2.0, 3.0]) .addGrid(lr.maxIter, [1, 5]) .addGrid(lr.featuresCol, ['f']) .build() expected = [ {lr.regParam: 1.0, lr.featuresCol: 'f', lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, {lr.regParam: 2.0, lr.featuresCol: 'f', lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, {lr.regParam: 3.0, lr.featuresCol: 'f', lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, {lr.regParam: 1.0, lr.featuresCol: 'f', lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}, {lr.regParam: 2.0, lr.featuresCol: 'f', lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}, {lr.regParam: 3.0, lr.featuresCol: 'f', lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}] len(output) == len(expected) True all([m in expected for m in output]) True {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7436) Cannot implement nor use custom StandaloneRecoveryModeFactory implementations
[ https://issues.apache.org/jira/browse/SPARK-7436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-7436. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.2 Issue resolved by pull request 5975 [https://github.com/apache/spark/pull/5975] Cannot implement nor use custom StandaloneRecoveryModeFactory implementations - Key: SPARK-7436 URL: https://issues.apache.org/jira/browse/SPARK-7436 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.3.1 Reporter: Jacek Lewandowski Fix For: 1.3.2, 1.4.0 At least, this code fragment is buggy ({{Master.scala}}): {code} case CUSTOM = val clazz = Class.forName(conf.get(spark.deploy.recoveryMode.factory)) val factory = clazz.getConstructor(conf.getClass, Serialization.getClass) .newInstance(conf, SerializationExtension(context.system)) .asInstanceOf[StandaloneRecoveryModeFactory] (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this)) {code} Because here: {{val factory = clazz.getConstructor(conf.getClass, Serialization.getClass)}} it tries to find the constructor which accepts {{org.apache.spark.SparkConf}} and class of companion object of {{akka.serialization.Serialization}} and then it tries to instantiate {{newInstance(conf, SerializationExtension(context.system))}} with instance of {{SparkConf}} and instance of {{Serialization}} class - not the companion objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7447) Large Job submission lag when using Parquet w/ Schema Merging
[ https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535208#comment-14535208 ] Brad Willard commented on SPARK-7447: - Thanks, you are a hero. Large Job submission lag when using Parquet w/ Schema Merging - Key: SPARK-7447 URL: https://issues.apache.org/jira/browse/SPARK-7447 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Spark Submit Affects Versions: 1.3.0, 1.3.1 Environment: Spark 1.3.1, aws, persistent hdfs version 2 with ebs storage, pyspark, 8 x c3.8xlarge nodes. spark-conf spark.executor.memory 50g spark.driver.cores 32 spark.driver.memory 50g spark.default.parallelism 512 spark.sql.shuffle.partitions 512 spark.task.maxFailures 30 spark.executor.logs.rolling.maxRetainedFiles 2 spark.executor.logs.rolling.size.maxBytes 102400 spark.executor.logs.rolling.strategy size spark.shuffle.spill false spark.sql.parquet.cacheMetadata true spark.sql.parquet.filterPushdown true spark.sql.codegen true spark.akka.threads = 64 Reporter: Brad Willard I have 2.6 billion rows in parquet format and I'm trying to use the new schema merging feature (I was enforcing a consistent schema manually before in 0.8-1.2, which was annoying). I have approximately 200 parquet files with key=date. When I load the dataframe with the sqlcontext, that process is understandably slow because I assume it's reading all the metadata from the parquet files and doing the initial schema merging. So that's ok. However, the problem is that once I have the dataframe, doing any operation on it seems to have a 10-30 second lag before it actually starts processing and shows up as an Active Job in the Spark Manager. This was an instant operation in all previous versions of Spark. Once the job is actually running the performance is fantastic, but this job submission lag is horrible. I'm wondering if there is a bug with recomputing the schema merging. Running top on the master node shows a thread maxed out on one CPU during the lag, which makes me think it's not network i/o but some pre-processing before job submission. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org