[jira] [Updated] (SPARK-6566) Update Spark to use the latest version of Parquet libraries

2015-06-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6566:
-
Assignee: Yash Datta

 Update Spark to use the latest version of Parquet libraries
 ---

 Key: SPARK-6566
 URL: https://issues.apache.org/jira/browse/SPARK-6566
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Konstantin Shaposhnikov
Assignee: Yash Datta
 Fix For: 1.5.0


 There are a lot of bug fixes in the latest version of parquet (1.6.0rc7). 
 E.g. PARQUET-136
 It would be good to update Spark to use the latest parquet version.
 The following changes are required:
 {code}
 diff --git a/pom.xml b/pom.xml
 index 5ad39a9..095b519 100644
 --- a/pom.xml
 +++ b/pom.xml
 @@ -132,7 +132,7 @@
  <!-- Version used for internal directory structure -->
  <hive.version.short>0.13.1</hive.version.short>
  <derby.version>10.10.1.1</derby.version>
 -<parquet.version>1.6.0rc3</parquet.version>
 +<parquet.version>1.6.0rc7</parquet.version>
  <jblas.version>1.2.3</jblas.version>
  <jetty.version>8.1.14.v20131031</jetty.version>
  <orbit.version>3.0.0.v201112011016</orbit.version>
 {code}
 and
 {code}
 --- 
 a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 +++ 
 b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 @@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
  globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
mergedMetadata, globalMetaData.getCreatedBy)
  
 -val readContext = getReadSupport(configuration).init(
 +val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
new InitContext(configuration,
  globalMetaData.getKeyValueMetaData,
  globalMetaData.getSchema))
 {code}
 I am happy to prepare a pull request if necessary.






[jira] [Commented] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version

2015-06-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588064#comment-14588064
 ] 

Juan Rodríguez Hortalá commented on SPARK-8337:
---

Hi, 

I've made some progress. Due to the limited support for data types in pyspark 
and org.apache.spark.api.python.PythonRDD, I think adding a function argument to 
createDirectStream that maps MessageAndMetadata to arbitrary values is not such a 
good idea. In fact, pyspark currently communicates with the Scala API through a 
JavaPairInputDStream[Array[Byte], Array[Byte]] and then decodes those byte arrays 
in Python. So what I propose is adding an argument to choose between returning a 
dstream of (key, value) pairs, as is done so far, and a dstream of dictionaries 
with entries for the key, the value (the message), and also the topic, partition 
and offset. An approximation of that is implemented in 
https://github.com/juanrh/spark/commit/7a824a814f56f839d2f3fbeda7e9f7467e683c6e 
as a Python static method KafkaUtils.createDirectStreamJ, which uses 
KafkaUtilsPythonHelper.createDirectStreamJ. The following Python code can be used 
to try it:

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
ssc = StreamingContext(sc, 1)
topics = ["test"]
kafkaParams = {"metadata.broker.list": "localhost:9092"}
kafkaStream = KafkaUtils.createDirectStreamJ(ssc, topics, kafkaParams)
kafkaStream.pprint()
ssc.start()
ssc.awaitTermination(timeout=5)

which gets the following output

15/06/16 15:31:00 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have 
all completed, from pool
15/06/16 15:31:00 INFO DAGScheduler: ResultStage 8 (start at 
NativeMethodAccessorImpl.java:-2) finished
15/06/16 15:31:00 INFO DAGScheduler: Job 8 finished: start at 
NativeMethodAccessorImpl.java:-2, took 0,0
---
Time: 2015-06-16 15:31:00
---
{'topic': u'test', 'partition': 0, 'value': u'q tal?', 'key': None, 'offset': 
87L}
()
15/06/16 15:31:00 

I have encoded the dictionary with the following Scala type alias, which uses 
types that PythonRDD can understand:

/** Using this weird type due to the limited set of types
  * supported by PythonRDD. This corresponds to 
  *
  * ((key, message), (topic, (partition, offset)))
  *
  * where the key and the message are encoded as Array[Byte], 
  * and topic, partition and offset are encoded as String.
  * Note we cannot even use triples because only pairs are supported
  * (we get an exception Unexpected element type class scala.Tuple3)
  */
  type PyKafkaMsgWrapper = ((Array[Byte], Array[Byte]), (String, (String, String)))
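
For illustration, a minimal sketch (hypothetical, not the code in the commit above) of how a Kafka MessageAndMetadata record could be packed into that shape, assuming byte-array keys and messages:

{code}
import kafka.message.MessageAndMetadata

// Hypothetical packing function for the PyKafkaMsgWrapper alias defined above;
// partition and offset are turned into Strings as described.
def toPyWrapper(mmd: MessageAndMetadata[Array[Byte], Array[Byte]]): PyKafkaMsgWrapper =
  ((mmd.key(), mmd.message()), (mmd.topic, (mmd.partition.toString, mmd.offset.toString)))
{code}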

If this is enough for you, I can refactor things to merge 
KafkaUtils.createDirectStreamJ and KafkaUtils.createDirectStream into a single 
method, with an additional argument that specifies whether the meta info is 
required, defaulting to False so the behaviour stays the same as before.

Looking forward to hearing your opinions on this.

Greetings, 

Juan Rodriguez Hortala


 KafkaUtils.createDirectStream for python is lacking API/feature parity with 
 the Scala/Java version
 --

 Key: SPARK-8337
 URL: https://issues.apache.org/jira/browse/SPARK-8337
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Amit Ramesh
Priority: Critical

 See the following thread for context.
 http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Spark-1-4-Python-API-for-getting-Kafka-offsets-in-direct-mode-tt12714.html






[jira] [Created] (SPARK-8395) spark-submit documentation is incorrect

2015-06-16 Thread Dev Lakhani (JIRA)
Dev Lakhani created SPARK-8395:
--

 Summary: spark-submit documentation is incorrect
 Key: SPARK-8395
 URL: https://issues.apache.org/jira/browse/SPARK-8395
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Dev Lakhani
Priority: Minor


Using a fresh checkout of 1.4.0-bin-hadoop2.6

if you run 
./start-slave.sh  1 spark://localhost:7077

you get
failed to launch org.apache.spark.deploy.worker.Worker:
 Default is conf/spark-defaults.conf.
  15/06/16 13:11:08 INFO Utils: Shutdown hook called

it seems the worker number is not being accepted as described here:
https://spark.apache.org/docs/latest/spark-standalone.html

The documentation says:
./sbin/start-slave.sh <worker#> <master-spark-URL>

but the start-slave.sh script states:
usage="Usage: start-slave.sh <spark-master-URL> where <spark-master-URL> is 
like spark://localhost:7077"

I have checked for similar issues using :
https://issues.apache.org/jira/browse/SPARK-6552?jql=text%20~%20%22start-slave%22

and found nothing similar so am raising this as an issue.






[jira] [Assigned] (SPARK-8333) Spark failed to delete temp directory created by HiveContext

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8333:
---

Assignee: Apache Spark

 Spark failed to delete temp directory created by HiveContext
 

 Key: SPARK-8333
 URL: https://issues.apache.org/jira/browse/SPARK-8333
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
 Environment: Windows7 64bit
Reporter: sheng
Assignee: Apache Spark
Priority: Minor
  Labels: Hive, metastore, sparksql

 Spark 1.4.0 failed to stop SparkContext.
 {code:title=LocalHiveTest.scala|borderStyle=solid}
  val sc = new SparkContext("local", "local-hive-test", new SparkConf())
  val hc = Utils.createHiveContext(sc)
  ... // execute some HiveQL statements
  sc.stop()
 {code}
 sc.stop() failed to execute, it threw the following exception:
 {quote}
 15/06/13 03:19:06 INFO Utils: Shutdown hook called
 15/06/13 03:19:06 INFO Utils: Deleting directory 
 C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
 15/06/13 03:19:06 ERROR Utils: Exception while deleting Spark temp dir: 
 C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
 java.io.IOException: Failed to delete: 
 C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:963)
   at 
 org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:204)
   at 
 org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:201)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at org.apache.spark.util.Utils$$anonfun$1.apply$mcV$sp(Utils.scala:201)
   at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2292)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262)
   at scala.util.Try$.apply(Try.scala:161)
   at 
 org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2244)
   at 
 org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
 {quote}
 It seems this bug was introduced by SPARK-6907. In SPARK-6907, a local 
 Hive metastore is created in a temp directory. The problem is that the local Hive 
 metastore is not shut down correctly. At the end of the application, when 
 SparkContext.stop() is called, it tries to delete the temp directory, which is 
 still in use by the local Hive metastore, and throws an exception.






[jira] [Assigned] (SPARK-8333) Spark failed to delete temp directory created by HiveContext

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8333:
---

Assignee: (was: Apache Spark)

 Spark failed to delete temp directory created by HiveContext
 

 Key: SPARK-8333
 URL: https://issues.apache.org/jira/browse/SPARK-8333
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
 Environment: Windows7 64bit
Reporter: sheng
Priority: Minor
  Labels: Hive, metastore, sparksql

 Spark 1.4.0 failed to stop SparkContext.
 {code:title=LocalHiveTest.scala|borderStyle=solid}
  val sc = new SparkContext("local", "local-hive-test", new SparkConf())
  val hc = Utils.createHiveContext(sc)
  ... // execute some HiveQL statements
  sc.stop()
 {code}
 sc.stop() failed to execute, it threw the following exception:
 {quote}
 15/06/13 03:19:06 INFO Utils: Shutdown hook called
 15/06/13 03:19:06 INFO Utils: Deleting directory 
 C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
 15/06/13 03:19:06 ERROR Utils: Exception while deleting Spark temp dir: 
 C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
 java.io.IOException: Failed to delete: 
 C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:963)
   at 
 org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:204)
   at 
 org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:201)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at org.apache.spark.util.Utils$$anonfun$1.apply$mcV$sp(Utils.scala:201)
   at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2292)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262)
   at scala.util.Try$.apply(Try.scala:161)
   at 
 org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2244)
   at 
 org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
 {quote}
 It seems this bug was introduced by SPARK-6907. In SPARK-6907, a local 
 Hive metastore is created in a temp directory. The problem is that the local Hive 
 metastore is not shut down correctly. At the end of the application, when 
 SparkContext.stop() is called, it tries to delete the temp directory, which is 
 still in use by the local Hive metastore, and throws an exception.






[jira] [Commented] (SPARK-8333) Spark failed to delete temp directory created by HiveContext

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587937#comment-14587937
 ] 

Apache Spark commented on SPARK-8333:
-

User 'navis' has created a pull request for this issue:
https://github.com/apache/spark/pull/6840

 Spark failed to delete temp directory created by HiveContext
 

 Key: SPARK-8333
 URL: https://issues.apache.org/jira/browse/SPARK-8333
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
 Environment: Windows7 64bit
Reporter: sheng
Priority: Minor
  Labels: Hive, metastore, sparksql

 Spark 1.4.0 failed to stop SparkContext.
 {code:title=LocalHiveTest.scala|borderStyle=solid}
  val sc = new SparkContext("local", "local-hive-test", new SparkConf())
  val hc = Utils.createHiveContext(sc)
  ... // execute some HiveQL statements
  sc.stop()
 {code}
 sc.stop() failed to execute, it threw the following exception:
 {quote}
 15/06/13 03:19:06 INFO Utils: Shutdown hook called
 15/06/13 03:19:06 INFO Utils: Deleting directory 
 C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
 15/06/13 03:19:06 ERROR Utils: Exception while deleting Spark temp dir: 
 C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
 java.io.IOException: Failed to delete: 
 C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:963)
   at 
 org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:204)
   at 
 org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:201)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at org.apache.spark.util.Utils$$anonfun$1.apply$mcV$sp(Utils.scala:201)
   at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2292)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262)
   at scala.util.Try$.apply(Try.scala:161)
   at 
 org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2262)
   at 
 org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2244)
   at 
 org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
 {quote}
 It seems this bug was introduced by SPARK-6907. In SPARK-6907, a local 
 Hive metastore is created in a temp directory. The problem is that the local Hive 
 metastore is not shut down correctly. At the end of the application, when 
 SparkContext.stop() is called, it tries to delete the temp directory, which is 
 still in use by the local Hive metastore, and throws an exception.






[jira] [Updated] (SPARK-8396) GraphLoader.edgeListFile does not populate Graph.vertices.

2015-06-16 Thread Matthew Barrett (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Barrett updated SPARK-8396:
---
Summary: GraphLoader.edgeListFile does not populate Graph.vertices.  (was: 
GraphLoader.edgeListFile does not population Graph.vertices.)

 GraphLoader.edgeListFile does not populate Graph.vertices.
 --

 Key: SPARK-8396
 URL: https://issues.apache.org/jira/browse/SPARK-8396
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.4.0
 Environment: Mac OS X.  Spark-1.4.0 pre-compiled binary for 
 Hadoop-2.4.0-bin.
Reporter: Matthew Barrett
Priority: Minor
  Labels: easyfix, newbie
   Original Estimate: 24h
  Remaining Estimate: 24h

 With input data like this
 18090 31237
 31237 31225
 31225 31285
 31285 31200
 31200 31197
 31197 31195
 31195 31346
 31346 54013
 54013 31256
 31256 23121
 The code 
 val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc, hdfsNode + 
 "/data/misc/Sample_DirectedGraphData.ssv")
 graph.vertices.foreach{println}
 graph.vertices.foreach{vertex: (VertexId, Int) => println(vertex._1.toString 
 + " *** " + vertex._2.toString)}
 prints nothing.






[jira] [Resolved] (SPARK-8143) Spark application history cannot be found even for finished jobs

2015-06-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8143.
--
   Resolution: Duplicate
Fix Version/s: (was: 1.4.0)

 Spark application history cannot be found even for finished jobs
 

 Key: SPARK-8143
 URL: https://issues.apache.org/jira/browse/SPARK-8143
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.3.1
Reporter: Dev Lakhani

 Whenever a job is killed or finished, because of an application error or 
 otherwise, and I then click on the Application Detail UI, even though the 
 job state is FINISHED, I get no log results and the message states:
 Application history not found for (app-xyz-abc) 
 Application ABC is still in progress. 
 And no logs are presented.
 I'm using spark.eventLog.enabled=true and spark.eventLog.dir=/tmp/spark, 
 under which I see lots of files like
 app-2015xyz-abc.inprogress
 even though the job has failed or finished.






[jira] [Reopened] (SPARK-8143) Spark application history cannot be found even for finished jobs

2015-06-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-8143:
--

 Spark application history cannot be found even for finished jobs
 

 Key: SPARK-8143
 URL: https://issues.apache.org/jira/browse/SPARK-8143
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.3.1
Reporter: Dev Lakhani

 Whenever a job is killed or finished, because of an application error or 
 otherwise, and I then click on the Application Detail UI, even though the 
 job state is FINISHED, I get no log results and the message states:
 Application history not found for (app-xyz-abc) 
 Application ABC is still in progress. 
 And no logs are presented.
 I'm using spark.eventLog.enabled=true and spark.eventLog.dir=/tmp/spark, 
 under which I see lots of files like
 app-2015xyz-abc.inprogress
 even though the job has failed or finished.






[jira] [Assigned] (SPARK-7799) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7799:
---

Assignee: Apache Spark

 Move StreamingContext.actorStream to a separate project and deprecate it in 
 StreamingContext
 --

 Key: SPARK-7799
 URL: https://issues.apache.org/jira/browse/SPARK-7799
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Shixiong Zhu
Assignee: Apache Spark

 Move {{StreamingContext.actorStream}} to a separate project and deprecate it 
 in {{StreamingContext}}






[jira] [Commented] (SPARK-7799) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588036#comment-14588036
 ] 

Apache Spark commented on SPARK-7799:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/6841

 Move StreamingContext.actorStream to a separate project and deprecate it in 
 StreamingContext
 --

 Key: SPARK-7799
 URL: https://issues.apache.org/jira/browse/SPARK-7799
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Shixiong Zhu

 Move {{StreamingContext.actorStream}} to a separate project and deprecate it 
 in {{StreamingContext}}






[jira] [Assigned] (SPARK-7799) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7799:
---

Assignee: (was: Apache Spark)

 Move StreamingContext.actorStream to a separate project and deprecate it in 
 StreamingContext
 --

 Key: SPARK-7799
 URL: https://issues.apache.org/jira/browse/SPARK-7799
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Shixiong Zhu

 Move {{StreamingContext.actorStream}} to a separate project and deprecate it 
 in {{StreamingContext}}






[jira] [Updated] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException

2015-06-16 Thread Jaromir Vanek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaromir Vanek updated SPARK-8393:
-
Description: 
Call to {{JavaStreamingContext#awaitTermination()}} can throw 
InterruptedException which cannot be caught easily in Java because it's not 
declared in {{@throws(classOf[InterruptedException])}} annotation.

This InterruptedException comes originally from ContextWaiter where Java 
ReentrantLock is used.

  was:
Call to JavaStreamingContext#awaitTermination() can throw InterruptedException 
which cannot be caught easily in Java because it's not declared in 
@throws(classOf[InterruptedException]) annotation.

This InterruptedException comes originally from ContextWaiter where Java 
ReentrantLock is used.


 JavaStreamingContext#awaitTermination() throws non-declared 
 InterruptedException
 

 Key: SPARK-8393
 URL: https://issues.apache.org/jira/browse/SPARK-8393
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1
Reporter: Jaromir Vanek
Priority: Trivial

 Call to {{JavaStreamingContext#awaitTermination()}} can throw 
 InterruptedException which cannot be caught easily in Java because it's not 
 declared in {{@throws(classOf[InterruptedException])}} annotation.
 This InterruptedException comes originally from ContextWaiter where Java 
 ReentrantLock is used.






[jira] [Commented] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException

2015-06-16 Thread Jaromir Vanek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588013#comment-14588013
 ] 

Jaromir Vanek commented on SPARK-8393:
--

It's not a big problem in Java, but it took me quite a bit of time to realize 
where exactly this {{InterruptedException}} comes from.
 
In Java it can be caught as general {{Exception}}:

{code}
try {
  streamingContext.awaitTermination();
} catch (Exception e) {
  if (e instanceof InterruptedException) {
    // handle the interruption
  }
}
{code}

As far as I know {{awaitTerminationOrTimeout}} may throw the same exception as 
well.

 JavaStreamingContext#awaitTermination() throws non-declared 
 InterruptedException
 

 Key: SPARK-8393
 URL: https://issues.apache.org/jira/browse/SPARK-8393
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1
Reporter: Jaromir Vanek
Priority: Trivial

 Call to JavaStreamingContext#awaitTermination() can throw 
 InterruptedException which cannot be caught easily in Java because it's not 
 declared in @throws(classOf[InterruptedException]) annotation.
 This InterruptedException comes originally from ContextWaiter where Java 
 ReentrantLock is used.






[jira] [Updated] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException

2015-06-16 Thread Jaromir Vanek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaromir Vanek updated SPARK-8393:
-
Description: 
Call to {{JavaStreamingContext#awaitTermination()}} can throw 
{{InterruptedException}} which cannot be caught easily in Java because it's not 
declared in {{@throws(classOf[InterruptedException])}} annotation.

This {{InterruptedException}} comes originally from {{ContextWaiter}} where 
Java {{ReentrantLock}} is used.

  was:
Call to {{JavaStreamingContext#awaitTermination()}} can throw 
InterruptedException which cannot be caught easily in Java because it's not 
declared in {{@throws(classOf[InterruptedException])}} annotation.

This {{InterruptedException}} comes originally from {{ContextWaiter}} where 
Java {{ReentrantLock}} is used.


 JavaStreamingContext#awaitTermination() throws non-declared 
 InterruptedException
 

 Key: SPARK-8393
 URL: https://issues.apache.org/jira/browse/SPARK-8393
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1
Reporter: Jaromir Vanek
Priority: Trivial

 Call to {{JavaStreamingContext#awaitTermination()}} can throw 
 {{InterruptedException}} which cannot be caught easily in Java because it's 
 not declared in {{@throws(classOf[InterruptedException])}} annotation.
 This {{InterruptedException}} comes originally from {{ContextWaiter}} where 
 Java {{ReentrantLock}} is used.






[jira] [Updated] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException

2015-06-16 Thread Jaromir Vanek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaromir Vanek updated SPARK-8393:
-
Description: 
Call to {{JavaStreamingContext#awaitTermination()}} can throw 
InterruptedException which cannot be caught easily in Java because it's not 
declared in {{@throws(classOf[InterruptedException])}} annotation.

This {{InterruptedException}} comes originally from {{ContextWaiter}} where 
Java {{ReentrantLock}} is used.

  was:
Call to {{JavaStreamingContext#awaitTermination()}} can throw 
InterruptedException which cannot be caught easily in Java because it's not 
declared in {{@throws(classOf[InterruptedException])}} annotation.

This InterruptedException comes originally from ContextWaiter where Java 
ReentrantLock is used.


 JavaStreamingContext#awaitTermination() throws non-declared 
 InterruptedException
 

 Key: SPARK-8393
 URL: https://issues.apache.org/jira/browse/SPARK-8393
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1
Reporter: Jaromir Vanek
Priority: Trivial

 Call to {{JavaStreamingContext#awaitTermination()}} can throw 
 InterruptedException which cannot be caught easily in Java because it's not 
 declared in {{@throws(classOf[InterruptedException])}} annotation.
 This {{InterruptedException}} comes originally from {{ContextWaiter}} where 
 Java {{ReentrantLock}} is used.






[jira] [Commented] (SPARK-7580) Driver out of memory

2015-06-16 Thread Yuance Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588086#comment-14588086
 ] 

Yuance Li commented on SPARK-7580:
--

Hey, how did you solve the problem? I've also run into this issue.

 Driver out of memory
 

 Key: SPARK-7580
 URL: https://issues.apache.org/jira/browse/SPARK-7580
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0
 Environment: YARN, HDP 2.1, RedHat 6.4
 200 x HP DL185
Reporter: Andrew Rothstein

 My 200 node cluster has an 8k executor capacity. When I submitted a job with 
 2k executors, 2g per executor, and 4g for the driver, the 
 ApplicationMaster/driver quickly became unresponsive. It was making progress, 
 then threw a couple of these exceptions:
 2015-05-12 16:46:41,598 ERROR [Spark Context Cleaner] spark.ContextCleaner: 
 Error cleaning broadcast 4 java.util.concurrent.TimeoutException: Futures 
 timed out after [30 seconds] at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at 
 scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.result(package.scala:107) at 
 org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137)
  at 
 org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227)
  at 
 org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
  at 
 org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
  at 
 org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185) 
 at 
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:147)
  at 
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:138)
  at scala.Option.foreach(Option.scala:236) at 
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:138)
  at 
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
  at 
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617) at 
 org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:133)
  at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)
 Then the job crashed with OOM.
 2015-05-12 16:47:53,566 ERROR [sparkDriver-akka.actor.default-dispatcher-4] 
 actor.ActorSystemImpl: Uncaught fatal error from thread 
 [sparkDriver-akka.remote.default-remote-dispatcher-8] shutting down 
 ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Java heap space at 
 org.spark_project.protobuf.ByteString.copyFrom(ByteString.java:216) at 
 org.spark_project.protobuf.ByteString.copyFrom(ByteString.java:229) at 
 akka.remote.transport.AkkaPduProtobufCodec$.constructPayload(AkkaPduCodec.scala:145)
  at 
 akka.remote.transport.AkkaProtocolHandle.write(AkkaProtocolTransport.scala:182)
  at akka.remote.EndpointWriter.writeSend(Endpoint.scala:760) at 
 akka.remote.EndpointWriter$$anonfun$2.applyOrElse(Endpoint.scala:722) at 
 akka.actor.Actor$class.aroundReceive(Actor.scala:465) at 
 akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415) at 
 akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at 
 akka.actor.ActorCell.invoke(ActorCell.scala:487) at 
 akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at 
 akka.dispatch.Mailbox.run(Mailbox.scala:220) at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 When I reran the job with 3g of memory per executor and 1k executors it ran 
 to completion more quickly than the 2k executor run took to crash. I didn't 
 think I was pushing the envelope by using 2k executors and the stock driver 
 heap size. Is this a scale limitation of the driver? Any suggestions beyond 
 increasing the heap size of the driver and/or using fewer executors?
 Thanks, Andrew




[jira] [Commented] (SPARK-7515) Update documentation for PySpark on YARN with cluster mode

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588098#comment-14588098
 ] 

Apache Spark commented on SPARK-7515:
-

User 'punya' has created a pull request for this issue:
https://github.com/apache/spark/pull/6842

 Update documentation for PySpark on YARN with cluster mode
 --

 Key: SPARK-7515
 URL: https://issues.apache.org/jira/browse/SPARK-7515
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 1.5.0


 Now PySpark on YARN with cluster mode is supported, so let's update the docs.






[jira] [Created] (SPARK-8396) GraphLoader.edgeListFile does not population Graph.vertices.

2015-06-16 Thread Matthew Barrett (JIRA)
Matthew Barrett created SPARK-8396:
--

 Summary: GraphLoader.edgeListFile does not population 
Graph.vertices.
 Key: SPARK-8396
 URL: https://issues.apache.org/jira/browse/SPARK-8396
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.4.0
 Environment: Mac OS X.  Spark-1.4.0 pre-compiled binary for 
Hadoop-2.4.0-bin.
Reporter: Matthew Barrett
Priority: Minor


With input data like this
18090 31237
31237 31225
31225 31285
31285 31200
31200 31197
31197 31195
31195 31346
31346 54013
54013 31256
31256 23121

The code 

val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc, hdfsNode + 
"/data/misc/Sample_DirectedGraphData.ssv")
graph.vertices.foreach{println}
graph.vertices.foreach{vertex: (VertexId, Int) => println(vertex._1.toString + 
" *** " + vertex._2.toString)}

prints nothing.
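
For what it's worth, a driver-side check like the following sketch (assuming sc and hdfsNode as above) shows whether the vertices RDD is really empty, independent of where foreach's println output ends up:

{code}
import org.apache.spark.graphx.{Graph, GraphLoader}

val graph: Graph[Int, Int] =
  GraphLoader.edgeListFile(sc, hdfsNode + "/data/misc/Sample_DirectedGraphData.ssv")
// count() and collect() bring results back to the driver before printing.
println(s"vertex count = ${graph.vertices.count()}")
graph.vertices.collect().foreach { case (id, attr) => println(s"$id *** $attr") }
{code}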






[jira] [Commented] (SPARK-7443) MLlib 1.4 QA plan

2015-06-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587941#comment-14587941
 ] 

Sean Owen commented on SPARK-7443:
--

[~mengxr] This contains 6 subtasks that aren't resolved, but this is a ticket 
for 1.4. Should we close them all?
I'm asking because there are still 76 issues tagged for 1.4.0 that were not 
resolved.

 MLlib 1.4 QA plan
 -

 Key: SPARK-7443
 URL: https://issues.apache.org/jira/browse/SPARK-7443
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Joseph K. Bradley
Priority: Critical

 TODO: create JIRAs for each task and assign them accordingly.
 h2. API
 * Check API compliance using java-compliance-checker (SPARK-7458)
 * Audit new public APIs (from the generated html doc)
 ** Scala (do not forget to check the object doc) (SPARK-7537)
 ** Java compatibility (SPARK-7529)
 ** Python API coverage (SPARK-7536)
 * audit Pipeline APIs (SPARK-7535)
 * graduate spark.ml from alpha (SPARK-7748)
 ** remove AlphaComponent annotations
 ** remove mima excludes for spark.ml
 ** mark concrete classes final wherever reasonable
 h2. Algorithms and performance
 *Performance*
 * _List any other missing performance tests from spark-perf here_
 * LDA online/EM (SPARK-7455)
 * ElasticNet for linear regression and logistic regression (SPARK-7456)
 * Bernoulli naive Bayes (SPARK-7453)
 * PIC (SPARK-7454)
 * ALS.recommendAll (SPARK-7457)
 * perf-tests in Python (SPARK-7539)
 *Correctness*
 * PMML
 ** scoring using PMML evaluator vs. MLlib models (SPARK-7540)
 * model save/load (SPARK-7541)
 h2. Documentation and example code
 * Create JIRAs for the user guide to each new algorithm and assign them to 
 the corresponding author.  Link here as required.
 ** Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed.  We can 
 follow the structure of the spark.mllib user guide.
 *** The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 *** We should not duplicate info in the spark.ml guides.  Since spark.mllib 
 is still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 * Create example code for major components.  Link here as required.
 ** cross validation in python (SPARK-7387)
 ** pipeline with complex feature transformations (scala/java/python) 
 (SPARK-7546)
 ** elastic-net (possibly with cross validation) (SPARK-7547)
 ** kernel density (SPARK-7707)
 * Update Programming Guide for 1.4 (towards end of QA) (SPARK-7715)






[jira] [Updated] (SPARK-8392) the process is hang on when getting cachedNodes

2015-06-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8392:
-
Priority: Minor  (was: Major)

This is not major.

 the process is hang on when getting cachedNodes
 ---

 Key: SPARK-8392
 URL: https://issues.apache.org/jira/browse/SPARK-8392
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Priority: Minor

 def getAllNodes: Seq[RDDOperationNode] = {
   _childNodes ++ _childClusters.flatMap(_.childNodes)
 }
 When _childClusters contains many nodes, this call hangs for a long time. I think 
 we can improve the efficiency here.






[jira] [Resolved] (SPARK-7715) Update MLlib Programming Guide for 1.4

2015-06-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7715.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Joseph K. Bradley

Assuming the umbrella can be closed

 Update MLlib Programming Guide for 1.4
 --

 Key: SPARK-7715
 URL: https://issues.apache.org/jira/browse/SPARK-7715
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
 Fix For: 1.4.0


 Before the release, we need to update the MLlib Programming Guide.  Updates 
 will include:
 * Add migration guide subsection.
 ** Use the results of the QA audit JIRAs.
 * Check phrasing, especially in main sections (for outdated items such as "In 
 this release, ...").






[jira] [Updated] (SPARK-7515) Update documentation for PySpark on YARN with cluster mode

2015-06-16 Thread Punya Biswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punya Biswal updated SPARK-7515:

Fix Version/s: 1.4.1

 Update documentation for PySpark on YARN with cluster mode
 --

 Key: SPARK-7515
 URL: https://issues.apache.org/jira/browse/SPARK-7515
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 1.4.1, 1.5.0


 Now PySpark on YARN with cluster mode is supported, so let's update the docs.






[jira] [Commented] (SPARK-5680) Sum function on all null values, should return zero

2015-06-16 Thread Holman Lan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588163#comment-14588163
 ] 

Holman Lan commented on SPARK-5680:
---

Hello Venkata. Thanks very much for looking into this. Could you kindly let us 
know the JIRA for the patch when you have one created? Thanks.

 Sum function on all null values, should return zero
 ---

 Key: SPARK-5680
 URL: https://issues.apache.org/jira/browse/SPARK-5680
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Venkata Ramana G
Assignee: Venkata Ramana G
Priority: Minor
 Fix For: 1.3.1, 1.4.0


 SELECT  sum('a'),  avg('a'),  variance('a'),  std('a') FROM src;
 Current output:
 NULL  NULL  NULL  NULL
 Expected output:
 0.0   NULL  NULL  NULL
 This fixes hive udaf_number_format.q 






[jira] [Created] (SPARK-8397) Allow custom configuration for TestHive

2015-06-16 Thread Punya Biswal (JIRA)
Punya Biswal created SPARK-8397:
---

 Summary: Allow custom configuration for TestHive
 Key: SPARK-8397
 URL: https://issues.apache.org/jira/browse/SPARK-8397
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.4.0
Reporter: Punya Biswal
Priority: Minor


We encourage people to use {{TestHive}} in unit tests, because it's impossible 
to create more than one {{HiveContext}} within one process. The current 
implementation locks people into using a {{local[2]}} {{SparkContext}} 
underlying their {{HiveContext}}. We should make it possible to override this 
using a system property so that people can test against {{local-cluster}} or 
remote spark clusters to make their tests more realistic.
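
A minimal sketch of the idea; the property name spark.sql.test.master is only a placeholder here, not an agreed-upon key:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Fall back to the currently hard-coded local[2] master when the property is unset.
val master = sys.props.getOrElse("spark.sql.test.master", "local[2]")
val sc = new SparkContext(
  new SparkConf().setMaster(master).setAppName("TestHive")) // app name illustrative
{code}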






[jira] [Commented] (SPARK-7443) MLlib 1.4 QA plan

2015-06-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588265#comment-14588265
 ] 

Joseph K. Bradley commented on SPARK-7443:
--

[~srowen]  Most of the QA items are pretty much ready to be closed, but I'd 
like to check through them, particularly for ones which need to spawn new JIRAs 
for 1.5.  Not all of the documentation was finished, but we can mark it for 
1.4.1, 1.5 and update the website doc ASAP (before 1.4.1).  I'll have some time 
later today to make a pass through the JIRAs.

 MLlib 1.4 QA plan
 -

 Key: SPARK-7443
 URL: https://issues.apache.org/jira/browse/SPARK-7443
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Joseph K. Bradley
Priority: Critical

 TODO: create JIRAs for each task and assign them accordingly.
 h2. API
 * Check API compliance using java-compliance-checker (SPARK-7458)
 * Audit new public APIs (from the generated html doc)
 ** Scala (do not forget to check the object doc) (SPARK-7537)
 ** Java compatibility (SPARK-7529)
 ** Python API coverage (SPARK-7536)
 * audit Pipeline APIs (SPARK-7535)
 * graduate spark.ml from alpha (SPARK-7748)
 ** remove AlphaComponent annotations
 ** remove mima excludes for spark.ml
 ** mark concrete classes final wherever reasonable
 h2. Algorithms and performance
 *Performance*
 * _List any other missing performance tests from spark-perf here_
 * LDA online/EM (SPARK-7455)
 * ElasticNet for linear regression and logistic regression (SPARK-7456)
 * Bernoulli naive Bayes (SPARK-7453)
 * PIC (SPARK-7454)
 * ALS.recommendAll (SPARK-7457)
 * perf-tests in Python (SPARK-7539)
 *Correctness*
 * PMML
 ** scoring using PMML evaluator vs. MLlib models (SPARK-7540)
 * model save/load (SPARK-7541)
 h2. Documentation and example code
 * Create JIRAs for the user guide to each new algorithm and assign them to 
 the corresponding author.  Link here as required.
 ** Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed.  We can 
 follow the structure of the spark.mllib user guide.
 *** The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 *** We should not duplicate info in the spark.ml guides.  Since spark.mllib 
 is still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 * Create example code for major components.  Link here as required.
 ** cross validation in python (SPARK-7387)
 ** pipeline with complex feature transformations (scala/java/python) 
 (SPARK-7546)
 ** elastic-net (possibly with cross validation) (SPARK-7547)
 ** kernel density (SPARK-7707)
 * Update Programming Guide for 1.4 (towards end of QA) (SPARK-7715)






[jira] [Commented] (SPARK-8268) string function: unbase64

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588120#comment-14588120
 ] 

Apache Spark commented on SPARK-8268:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/6843

 string function: unbase64
 -

 Key: SPARK-8268
 URL: https://issues.apache.org/jira/browse/SPARK-8268
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 unbase64(string str): binary
 Converts the argument from a base 64 string to BINARY.
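 A rough sketch of the stated semantics in plain Scala (an illustration, not the actual Spark SQL implementation):
 {code}
 // Assuming a null input yields a null result, as with the other Hive string functions.
 def unbase64(str: String): Array[Byte] =
   if (str == null) null else java.util.Base64.getDecoder.decode(str)
 {code}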






[jira] [Commented] (SPARK-8243) string function: encode

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588118#comment-14588118
 ] 

Apache Spark commented on SPARK-8243:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/6843

 string function: encode
 ---

 Key: SPARK-8243
 URL: https://issues.apache.org/jira/browse/SPARK-8243
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 encode(string src, string charset): binary
 Encodes the first argument into a BINARY using the provided character set 
 (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). 
 If either argument is null, the result will also be null. (As of Hive 0.12.0.)
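 A rough sketch of those semantics in plain Scala (an illustration, not the actual Spark SQL implementation):
 {code}
 // Null in either argument propagates to a null result, per the description above.
 def encode(src: String, charset: String): Array[Byte] =
   if (src == null || charset == null) null
   else src.getBytes(java.nio.charset.Charset.forName(charset))
 {code}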






[jira] [Commented] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python

2015-06-16 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588202#comment-14588202
 ] 

Cody Koeninger commented on SPARK-8389:
---

There's already a ticket for the Python side of things, SPARK-8337.  Not sure 
if you want to combine them.

I'll look at the java side of things to start.

 Expose KafkaRDDs offsetRange in Java and Python
 ---

 Key: SPARK-8389
 URL: https://issues.apache.org/jira/browse/SPARK-8389
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Tathagata Das
Assignee: Cody Koeninger
Priority: Critical

 Probably requires creating a JavaKafkaPairRDD and also using that in the Python 
 APIs.






[jira] [Commented] (SPARK-7122) KafkaUtils.createDirectStream - unreasonable processing time in absence of load

2015-06-16 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588211#comment-14588211
 ] 

Cody Koeninger commented on SPARK-7122:
---

It's certainly your prerogative to wait for an official release.  However, keep 
in mind that the patch in question is just a performance optimization, not 
necessarily a bug fix targeted at whatever your issue is.  Without a minimal 
reproducible case of your problem, or testing patches against your workload, 
there's no way of knowing if the performance optimization solves your problem.  
If it doesn't, you're looking at waiting for yet another release after 1.4.1.

 KafkaUtils.createDirectStream - unreasonable processing time in absence of 
 load
 ---

 Key: SPARK-7122
 URL: https://issues.apache.org/jira/browse/SPARK-7122
 Project: Spark
  Issue Type: Question
  Components: Streaming
Affects Versions: 1.3.1
 Environment: Spark Streaming 1.3.1, standalone mode running on just 1 
 box: Ubuntu 14.04.2 LTS, 4 cores, 8GB RAM, java version 1.8.0_40
Reporter: Platon Potapov
Priority: Minor
 Attachments: 10.second.window.fast.job.txt, 
 5.second.window.slow.job.txt, SparkStreamingJob.scala


 Attached is the complete source code of a test Spark job. No external data 
 generators are run - just the presence of a Kafka topic named "raw" suffices.
 The Spark job is run with no load whatsoever. http://localhost:4040/streaming 
 is checked to obtain the job processing duration.
 * in case the test contains the following transformation:
 {code}
 // dummy transformation
 val temperature = bytes.filter(_._1 == "abc")
 val abc = temperature.window(Seconds(40), Seconds(5))
 abc.print()
 {code}
 the median processing time is 3 seconds 80 ms
 * in case the test contains the following transformation:
 {code}
 // dummy transformation
 val temperature = bytes.filter(_._1 == "abc")
 val abc = temperature.map(x => (1, x))
 abc.print()
 {code}
 the median processing time is just 50 ms
 Please explain why the window transformation introduces such a growth 
 in job duration.
 note: the result is the same regardless of the number of kafka topic 
 partitions (I've tried 1 and 8)
 note2: the result is the same regardless of the window parameters (I've tried 
 (20, 2) and (40, 5))






[jira] [Assigned] (SPARK-8397) Allow custom configuration for TestHive

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8397:
---

Assignee: (was: Apache Spark)

 Allow custom configuration for TestHive
 ---

 Key: SPARK-8397
 URL: https://issues.apache.org/jira/browse/SPARK-8397
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.4.0
Reporter: Punya Biswal
Priority: Minor

 We encourage people to use {{TestHive}} in unit tests, because it's 
 impossible to create more than one {{HiveContext}} within one process. The 
 current implementation locks people into using a {{local[2]}} 
 {{SparkContext}} underlying their {{HiveContext}}. We should make it possible 
 to override this using a system property so that people can test against 
 {{local-cluster}} or remote spark clusters to make their tests more realistic.






[jira] [Commented] (SPARK-8397) Allow custom configuration for TestHive

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588161#comment-14588161
 ] 

Apache Spark commented on SPARK-8397:
-

User 'punya' has created a pull request for this issue:
https://github.com/apache/spark/pull/6844

 Allow custom configuration for TestHive
 ---

 Key: SPARK-8397
 URL: https://issues.apache.org/jira/browse/SPARK-8397
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.4.0
Reporter: Punya Biswal
Priority: Minor

 We encourage people to use {{TestHive}} in unit tests, because it's 
 impossible to create more than one {{HiveContext}} within one process. The 
 current implementation locks people into using a {{local[2]}} 
 {{SparkContext}} underlying their {{HiveContext}}. We should make it possible 
 to override this using a system property so that people can test against 
 {{local-cluster}} or remote spark clusters to make their tests more realistic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8397) Allow custom configuration for TestHive

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8397:
---

Assignee: Apache Spark

 Allow custom configuration for TestHive
 ---

 Key: SPARK-8397
 URL: https://issues.apache.org/jira/browse/SPARK-8397
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.4.0
Reporter: Punya Biswal
Assignee: Apache Spark
Priority: Minor

 We encourage people to use {{TestHive}} in unit tests, because it's 
 impossible to create more than one {{HiveContext}} within one process. The 
 current implementation locks people into using a {{local[2]}} 
 {{SparkContext}} underlying their {{HiveContext}}. We should make it possible 
 to override this using a system property so that people can test against 
 {{local-cluster}} or remote spark clusters to make their tests more realistic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8239) string function: base64

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8239:
---

Assignee: Cheng Hao  (was: Apache Spark)

 string function: base64
 ---

 Key: SPARK-8239
 URL: https://issues.apache.org/jira/browse/SPARK-8239
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 base64(binary bin): string
 Converts the argument from binary to a base 64 string
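 A hypothetical sanity check from the spark-shell (where {{sqlContext}} is already 
 defined), assuming Hive's semantics for the function:
 {code}
 // base64 of the UTF-8 bytes of hello
 sqlContext.sql("SELECT base64(CAST('hello' AS BINARY))").show()
 // expected single value: aGVsbG8=
 {code}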



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8239) string function: base64

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8239:
---

Assignee: Apache Spark  (was: Cheng Hao)

 string function: base64
 ---

 Key: SPARK-8239
 URL: https://issues.apache.org/jira/browse/SPARK-8239
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 base64(binary bin): string
 Converts the argument from binary to a base 64 string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-16 Thread Peter Haumer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588330#comment-14588330
 ] 

Peter Haumer commented on SPARK-8385:
-

Sean, I see the class in the big assembly file of the Spark for Hadoop 2.6 
distributions for 1.3.1 and 1.4.0. However, it seems that with 1.4 a version 
was packaged that has unimplemented methods, which causes the regression. 

 java.lang.UnsupportedOperationException: Not implemented by the TFS 
 FileSystem implementation
 -

 Key: SPARK-8385
 URL: https://issues.apache.org/jira/browse/SPARK-8385
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.4.0
 Environment: RHEL 7.1
Reporter: Peter Haumer

 I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I 
 created a launch and just set the vm var -Dspark.master=local[4].  
 With 1.4 this stopped working when reading files from the OS filesystem. 
 Running the same apps with spark-submit works fine.  Losing the ability to 
 debug that way has a major impact on the usability of Spark.
 The following exception is thrown:
 Exception in thread main java.lang.UnsupportedOperationException: Not 
 implemented by the TFS FileSystem implementation
 at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213)
 at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401)
 at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
 at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
 at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
 at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at scala.Option.map(Option.scala:145)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535)
 at org.apache.spark.rdd.RDD.reduce(RDD.scala:900)
 at org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357)
 at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46)
 at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8242) string function: decode

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8242:
---

Assignee: Apache Spark  (was: Cheng Hao)

 string function: decode
 ---

 Key: SPARK-8242
 URL: https://issues.apache.org/jira/browse/SPARK-8242
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 decode(binary bin, string charset): string
 Decodes the first argument into a String using the provided character set 
 (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). 
 If either argument is null, the result will also be null. (As of Hive 0.12.0.)
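 A hypothetical round-trip check from the spark-shell ({{sqlContext}} predefined), 
 assuming Hive's semantics:
 {code}
 // encode to UTF-8 bytes, then decode back to a string
 sqlContext.sql("SELECT decode(encode('hello', 'UTF-8'), 'UTF-8')").show()
 // expected single value: hello
 {code}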



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8268) string function: unbase64

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8268:
---

Assignee: Apache Spark  (was: Cheng Hao)

 string function: unbase64
 -

 Key: SPARK-8268
 URL: https://issues.apache.org/jira/browse/SPARK-8268
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 unbase64(string str): binary
 Converts the argument from a base 64 string to BINARY.
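 A hypothetical check from the spark-shell ({{sqlContext}} predefined), assuming 
 Hive's semantics:
 {code}
 // unbase64 returns binary, so decode it for display
 sqlContext.sql("SELECT decode(unbase64('aGVsbG8='), 'UTF-8')").show()
 // expected single value: hello
 {code}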



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8268) string function: unbase64

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8268:
---

Assignee: Cheng Hao  (was: Apache Spark)

 string function: unbase64
 -

 Key: SPARK-8268
 URL: https://issues.apache.org/jira/browse/SPARK-8268
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 unbase64(string str): binary
 Converts the argument from a base 64 string to BINARY.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8238) string function: ascii

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8238:
---

Assignee: Cheng Hao  (was: Apache Spark)

 string function: ascii
 --

 Key: SPARK-8238
 URL: https://issues.apache.org/jira/browse/SPARK-8238
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 ascii(string str): int
 Returns the numeric value of the first character of str.
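 A hypothetical check from the spark-shell ({{sqlContext}} predefined), assuming 
 Hive's semantics:
 {code}
 sqlContext.sql("SELECT ascii('Spark')").show()
 // expected single value: 83, the code point of 'S'
 {code}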



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8243) string function: encode

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8243:
---

Assignee: Cheng Hao  (was: Apache Spark)

 string function: encode
 ---

 Key: SPARK-8243
 URL: https://issues.apache.org/jira/browse/SPARK-8243
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 encode(string src, string charset): binary
 Encodes the first argument into a BINARY using the provided character set 
 (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). 
 If either argument is null, the result will also be null. (As of Hive 0.12.0.)
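 A hypothetical check from the spark-shell ({{sqlContext}} predefined), assuming 
 Hive's semantics:
 {code}
 // encode produces binary; base64 it to get a printable value
 sqlContext.sql("SELECT base64(encode('hello', 'UTF-8'))").show()
 // expected single value: aGVsbG8=
 {code}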



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8238) string function: ascii

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588115#comment-14588115
 ] 

Apache Spark commented on SPARK-8238:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/6843

 string function: ascii
 --

 Key: SPARK-8238
 URL: https://issues.apache.org/jira/browse/SPARK-8238
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 ascii(string str): int
 Returns the numeric value of the first character of str.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8242) string function: decode

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588117#comment-14588117
 ] 

Apache Spark commented on SPARK-8242:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/6843

 string function: decode
 ---

 Key: SPARK-8242
 URL: https://issues.apache.org/jira/browse/SPARK-8242
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 decode(binary bin, string charset): string
 Decodes the first argument into a String using the provided character set 
 (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). 
 If either argument is null, the result will also be null. (As of Hive 0.12.0.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8242) string function: decode

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8242:
---

Assignee: Cheng Hao  (was: Apache Spark)

 string function: decode
 ---

 Key: SPARK-8242
 URL: https://issues.apache.org/jira/browse/SPARK-8242
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 decode(binary bin, string charset): string
 Decodes the first argument into a String using the provided character set 
 (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). 
 If either argument is null, the result will also be null. (As of Hive 0.12.0.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8239) string function: base64

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588116#comment-14588116
 ] 

Apache Spark commented on SPARK-8239:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/6843

 string function: base64
 ---

 Key: SPARK-8239
 URL: https://issues.apache.org/jira/browse/SPARK-8239
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 base64(binary bin): string
 Converts the argument from binary to a base 64 string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8238) string function: ascii

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8238:
---

Assignee: Apache Spark  (was: Cheng Hao)

 string function: ascii
 --

 Key: SPARK-8238
 URL: https://issues.apache.org/jira/browse/SPARK-8238
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 ascii(string str): int
 Returns the numeric value of the first character of str.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8243) string function: encode

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8243:
---

Assignee: Apache Spark  (was: Cheng Hao)

 string function: encode
 ---

 Key: SPARK-8243
 URL: https://issues.apache.org/jira/browse/SPARK-8243
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 encode(string src, string charset): binary
 Encodes the first argument into a BINARY using the provided character set 
 (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). 
 If either argument is null, the result will also be null. (As of Hive 0.12.0.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8380) SparkR mis-counts

2015-06-16 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588196#comment-14588196
 ] 

Shivaram Venkataraman commented on SPARK-8380:
--

Thanks for the update. I'm going to mark this issue as resolved. BTW, if there 
are documentation changes that you think will be helpful, feel free to create 
JIRAs / PRs for them.

 SparkR mis-counts
 -

 Key: SPARK-8380
 URL: https://issues.apache.org/jira/browse/SPARK-8380
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Rick Moritz

 On my dataset of ~9 Million rows x 30 columns, queried via Hive, I can 
 perform count operations on the entirety of the dataset and get the correct 
 value, as double-checked against the same code in Scala.
 When I start to add conditions or even do a simple partial ascending 
 histogram, I get discrepancies.
 In particular, there are missing values in SparkR, and massively so:
 A top 6 count of a certain feature in my dataset results in numbers an order of 
 magnitude smaller than I get via Scala.
 The following logic, which I consider equivalent, is the basis for this report:
 counts <- summarize(groupBy(df, df$col_name), count = n(tdf$col_name))
 head(arrange(counts, desc(counts$count)))
 versus:
 val table = sql("SELECT col_name, count(col_name) as value from df group by 
 col_name order by value desc")
 The first, in particular, is taken directly from the SparkR programming 
 guide. Since summarize isn't documented from what I can see, I'd hope it does 
 what the programming guide indicates. In that case this would be a pretty 
 serious logic bug (no errors are thrown). Otherwise, there's the possibility 
 of a lack of documentation and badly worded example in the guide being behind 
 my misperception of SparkR's functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8398) Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats

2015-06-16 Thread koert kuipers (JIRA)
koert kuipers created SPARK-8398:


 Summary: Consistently expose Hadoop Configuration/JobConf 
parameters for Hadoop input/output formats
 Key: SPARK-8398
 URL: https://issues.apache.org/jira/browse/SPARK-8398
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: koert kuipers
Priority: Trivial


Currently a custom Hadoop Configuration or JobConf can be passed into quite a 
few functions that use Hadoop input formats to read or Hadoop output formats to 
write data. The goal of this JIRA is to make this consistent and expose 
Configuration/JobConf for all these methods, which facilitates re-use and 
discourages many additional parameters (that end up changing the 
Configuration/JobConf internally). 

See also:
http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-hadoop-input-output-format-advanced-control-td11168.html
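
As a rough illustration of the current state (the paths and the record-delimiter 
setting below are only placeholders): some read and write methods already accept a 
Configuration, and the goal is to expose this uniformly across all of them.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hadoop-conf-sketch"))

// A custom Configuration can already be handed to this read path...
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\n")
val rdd = sc.newAPIHadoopFile("/tmp/input", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], hadoopConf)

// ...and to this write path; the point of the JIRA is to offer this consistently.
rdd.saveAsNewAPIHadoopFile("/tmp/output", classOf[LongWritable], classOf[Text],
  classOf[TextOutputFormat[LongWritable, Text]], hadoopConf)
{code}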



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6816) Add SparkConf API to configure SparkR

2015-06-16 Thread Rick Moritz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588271#comment-14588271
 ] 

Rick Moritz commented on SPARK-6816:


Apparently this work-around is no longer needed for spark-1.4.0, which invokes 
a shell script instead of going directly to Java as sparkR-pkg did, and fetches 
the required environment parameters.
With spark-defaults being respected, and SPARK_MEM available for memory 
options, there probably isn't a whole lot that needs to be passed via -D to the 
shell script.

 Add SparkConf API to configure SparkR
 -

 Key: SPARK-6816
 URL: https://issues.apache.org/jira/browse/SPARK-6816
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

 Right now the only way to configure SparkR is to pass in arguments to 
 sparkR.init. The goal is to add an API similar to SparkConf on Scala/Python 
 to make configuration easier



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8398) Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8398:
---

Assignee: Apache Spark

 Consistently expose Hadoop Configuration/JobConf parameters for Hadoop 
 input/output formats
 ---

 Key: SPARK-8398
 URL: https://issues.apache.org/jira/browse/SPARK-8398
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: koert kuipers
Assignee: Apache Spark
Priority: Trivial

 Currently a custom Hadoop Configuration or JobConf can be passed into quite a 
 few functions that use Hadoop input formats to read or Hadoop output formats 
 to write data. The goal of this JIRA is to make this consistent and expose 
 Configuration/JobConf for all these methods, which facilitates re-use and 
 discourages many additional parameters (that end up changing the 
 Configuration/JobConf internally). 
 See also:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-hadoop-input-output-format-advanced-control-td11168.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8398) Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588424#comment-14588424
 ] 

Apache Spark commented on SPARK-8398:
-

User 'koertkuipers' has created a pull request for this issue:
https://github.com/apache/spark/pull/6848

 Consistently expose Hadoop Configuration/JobConf parameters for Hadoop 
 input/output formats
 ---

 Key: SPARK-8398
 URL: https://issues.apache.org/jira/browse/SPARK-8398
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: koert kuipers
Priority: Trivial

 Currently a custom Hadoop Configuration or JobConf can be passed into quite a 
 few functions that use Hadoop input formats to read or Hadoop output formats 
 to write data. The goal of this JIRA is to make this consistent and expose 
 Configuration/JobConf for all these methods, which facilitates re-use and 
 discourages many additional parameters (that end up changing the 
 Configuration/JobConf internally). 
 See also:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-hadoop-input-output-format-advanced-control-td11168.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8398) Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8398:
---

Assignee: (was: Apache Spark)

 Consistently expose Hadoop Configuration/JobConf parameters for Hadoop 
 input/output formats
 ---

 Key: SPARK-8398
 URL: https://issues.apache.org/jira/browse/SPARK-8398
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: koert kuipers
Priority: Trivial

 Currently a custom Hadoop Configuration or JobConf can be passed into quite a 
 few functions that use Hadoop input formats to read or Hadoop output formats 
 to write data. The goal of this JIRA is to make this consistent and expose 
 Configuration/JobConf for all these methods, which facilitates re-use and 
 discourages many additional parameters (that end up changing the 
 Configuration/JobConf internally). 
 See also:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-hadoop-input-output-format-advanced-control-td11168.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588540#comment-14588540
 ] 

Sean Owen commented on SPARK-8385:
--

Oh, is TFS Tachyon? Not sure what the status is on that, whether it's supposed 
to work without extra steps and just happened to in the past, or what. 

 java.lang.UnsupportedOperationException: Not implemented by the TFS 
 FileSystem implementation
 -

 Key: SPARK-8385
 URL: https://issues.apache.org/jira/browse/SPARK-8385
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.4.0
 Environment: RHEL 7.1
Reporter: Peter Haumer

 I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I 
 created a launch and just set the vm var -Dspark.master=local[4].  
 With 1.4 this stopped working when reading files from the OS filesystem. 
 Running the same apps with spark-submit works fine.  Losing the ability to 
 debug that way has a major impact on the usability of Spark.
 The following exception is thrown:
 Exception in thread main java.lang.UnsupportedOperationException: Not 
 implemented by the TFS FileSystem implementation
 at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213)
 at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401)
 at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
 at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
 at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
 at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at scala.Option.map(Option.scala:145)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535)
 at org.apache.spark.rdd.RDD.reduce(RDD.scala:900)
 at org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357)
 at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46)
 at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8384) Can not set checkpointDuration or Interval in spark 1.3 and later

2015-06-16 Thread Norman He (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588363#comment-14588363
 ] 

Norman He commented on SPARK-8384:
--

It seems that if the checkpoint interval is the same as the batch duration, you 
will have a lot of checkpoint saving (the old documentation talks about setting 
the checkpoint interval to 5 to 10 times the batch duration).

We are pushing the batch duration to 200ms; if the checkpoint duration is the 
same, I am not sure whether checkpoint saving to HDFS or disk will impact the 
streaming processing.

Is this a limitation now, or will it be improved upon in the future?
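
For what it's worth, the data-checkpoint interval of a particular DStream can be 
set on the stream itself rather than on the StreamingContext; a minimal sketch 
(the socket source, port and paths below are only placeholders):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("checkpoint-interval-sketch")
val ssc = new StreamingContext(conf, Milliseconds(200))  // 200ms batches
ssc.checkpoint("hdfs:///tmp/checkpoints")                // checkpoint directory

val lines = ssc.socketTextStream("localhost", 9999)
// checkpoint this DStream's data every 10 seconds rather than on every 200ms batch
lines.checkpoint(Seconds(10))
{code}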

 Can not set checkpointDuration or Interval in spark 1.3 and later
 -

 Key: SPARK-8384
 URL: https://issues.apache.org/jira/browse/SPARK-8384
 Project: Spark
  Issue Type: Bug
Reporter: Norman He
Priority: Critical

 StreamingContext is missing setCheckpointDuration().
 No way around it for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8399) Overlap between histograms and axis' name in Spark Streaming UI

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8399:
---

Assignee: Apache Spark

 Overlap between histograms and axis' name in Spark Streaming UI
 ---

 Key: SPARK-8399
 URL: https://issues.apache.org/jira/browse/SPARK-8399
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 1.4.0
Reporter: Benjamin Fradet
Assignee: Apache Spark
Priority: Minor

 If you have a histogram skewed towards the maximum of the displayed values, 
 as is the case, for example, with the number of messages processed per 
 batchInterval with the Kafka direct API (since it's a constant), the histogram 
 will overlap with the name of the X axis (#batches).
 Unfortunately, I don't have any screenshots available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException

2015-06-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588546#comment-14588546
 ] 

Sean Owen commented on SPARK-8393:
--

{{awaitTerminationOrTimeout}} will return a {{boolean}} to let you know if it 
timed out, if that's what you're looking for, but I suspect it's not quite. 
Yeah, that's a good workaround for now if you really need to handle it. Hm, can 
you wrap it in a method that {{throws InterruptedException}} and catch it 
as normal around an invocation of that method?

I think it's still a valid API change for later.
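
For reference, one way to express that workaround is a tiny Scala helper annotated 
with {{@throws}}, which Java callers can then invoke and catch against as usual; a 
minimal sketch, not an API proposal:

{code}
import org.apache.spark.streaming.api.java.JavaStreamingContext

object StreamingAwait {
  // Declaring the checked exception lets Java callers catch InterruptedException normally.
  @throws(classOf[InterruptedException])
  def awaitTermination(jssc: JavaStreamingContext): Unit = jssc.awaitTermination()
}
{code}

A Java caller can then wrap StreamingAwait.awaitTermination(jssc) in an ordinary 
try/catch for InterruptedException.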

 JavaStreamingContext#awaitTermination() throws non-declared 
 InterruptedException
 

 Key: SPARK-8393
 URL: https://issues.apache.org/jira/browse/SPARK-8393
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1
Reporter: Jaromir Vanek
Priority: Trivial

 A call to {{JavaStreamingContext#awaitTermination()}} can throw 
 {{InterruptedException}}, which cannot be caught easily in Java because it is 
 not declared via a {{@throws(classOf[InterruptedException])}} annotation.
 This {{InterruptedException}} comes originally from {{ContextWaiter}} where 
 Java {{ReentrantLock}} is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8395) spark-submit documentation is incorrect

2015-06-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588551#comment-14588551
 ] 

Sean Owen commented on SPARK-8395:
--

I think that's right. This looks like a hold-over from when this might have 
been controlled by spark-daemon.sh. You can raise a PR for this.

 spark-submit documentation is incorrect
 ---

 Key: SPARK-8395
 URL: https://issues.apache.org/jira/browse/SPARK-8395
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Dev Lakhani
Priority: Minor

 Using a fresh checkout of 1.4.0-bin-hadoop2.6
 if you run 
 ./start-slave.sh  1 spark://localhost:7077
 you get
 failed to launch org.apache.spark.deploy.worker.Worker:
  Default is conf/spark-defaults.conf.
   15/06/16 13:11:08 INFO Utils: Shutdown hook called
 it seems the worker number is not being accepted as described here:
 https://spark.apache.org/docs/latest/spark-standalone.html
 The documentation says:
 ./sbin/start-slave.sh <worker#> <master-spark-URL>
 but the start-slave.sh script states:
 usage="Usage: start-slave.sh <spark-master-URL> where <spark-master-URL> is 
 like spark://localhost:7077"
 I have checked for similar issues using :
 https://issues.apache.org/jira/browse/SPARK-6552?jql=text%20~%20%22start-slave%22
 and found nothing similar so am raising this as an issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8399) Overlap between histograms and axis' name in Spark Streaming UI

2015-06-16 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-8399:
--

 Summary: Overlap between histograms and axis' name in Spark 
Streaming UI
 Key: SPARK-8399
 URL: https://issues.apache.org/jira/browse/SPARK-8399
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 1.4.0
Reporter: Benjamin Fradet
Priority: Minor


If you have a histogram skewed towards the maximum of the displayed values, as 
is the case, for example, with the number of messages processed per batchInterval 
with the Kafka direct API (since it's a constant), the histogram will overlap 
with the name of the X axis (#batches).

Unfortunately, I don't have any screenshots available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8399) Overlap between histograms and axis' name in Spark Streaming UI

2015-06-16 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588358#comment-14588358
 ] 

Benjamin Fradet commented on SPARK-8399:


I'll submit a patch shortly.

 Overlap between histograms and axis' name in Spark Streaming UI
 ---

 Key: SPARK-8399
 URL: https://issues.apache.org/jira/browse/SPARK-8399
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 1.4.0
Reporter: Benjamin Fradet
Priority: Minor

 If you have a histogram skewed towards the maximum of the displayed values, 
 as is the case, for example, with the number of messages processed per 
 batchInterval with the Kafka direct API (since it's a constant), the histogram 
 will overlap with the name of the X axis (#batches).
 Unfortunately, I don't have any screenshots available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8399) Overlap between histograms and axis' name in Spark Streaming UI

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588379#comment-14588379
 ] 

Apache Spark commented on SPARK-8399:
-

User 'BenFradet' has created a pull request for this issue:
https://github.com/apache/spark/pull/6845

 Overlap between histograms and axis' name in Spark Streaming UI
 ---

 Key: SPARK-8399
 URL: https://issues.apache.org/jira/browse/SPARK-8399
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 1.4.0
Reporter: Benjamin Fradet
Priority: Minor

 If you have a histogram skewed towards the maximum of the displayed values, 
 as is the case, for example, with the number of messages processed per 
 batchInterval with the Kafka direct API (since it's a constant), the histogram 
 will overlap with the name of the X axis (#batches).
 Unfortunately, I don't have any screenshots available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8399) Overlap between histograms and axis' name in Spark Streaming UI

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8399:
---

Assignee: (was: Apache Spark)

 Overlap between histograms and axis' name in Spark Streaming UI
 ---

 Key: SPARK-8399
 URL: https://issues.apache.org/jira/browse/SPARK-8399
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 1.4.0
Reporter: Benjamin Fradet
Priority: Minor

 If you have a histogram skewed towards the maximum of the displayed values, 
 as is the case, for example, with the number of messages processed per 
 batchInterval with the Kafka direct API (since it's a constant), the histogram 
 will overlap with the name of the X axis (#batches).
 Unfortunately, I don't have any screenshots available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8389:
---

Assignee: Apache Spark  (was: Cody Koeninger)

 Expose KafkaRDDs offsetRange in Java and Python
 ---

 Key: SPARK-8389
 URL: https://issues.apache.org/jira/browse/SPARK-8389
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Tathagata Das
Assignee: Apache Spark
Priority: Critical

 Probably requires creating a JavaKafkaPairRDD and also use that in the python 
 APIs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8389:
---

Assignee: Cody Koeninger  (was: Apache Spark)

 Expose KafkaRDDs offsetRange in Java and Python
 ---

 Key: SPARK-8389
 URL: https://issues.apache.org/jira/browse/SPARK-8389
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Tathagata Das
Assignee: Cody Koeninger
Priority: Critical

 Probably requires creating a JavaKafkaPairRDD and also use that in the python 
 APIs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588397#comment-14588397
 ] 

Apache Spark commented on SPARK-8389:
-

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/6846

 Expose KafkaRDDs offsetRange in Java and Python
 ---

 Key: SPARK-8389
 URL: https://issues.apache.org/jira/browse/SPARK-8389
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Tathagata Das
Assignee: Cody Koeninger
Priority: Critical

 Probably requires creating a JavaKafkaPairRDD and also use that in the python 
 APIs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python

2015-06-16 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588451#comment-14588451
 ] 

Tathagata Das commented on SPARK-8389:
--

Then let's at least add it to the examples and the programming guide.
But we definitely need to do something for Python. Gotta brainstorm on that.




 Expose KafkaRDDs offsetRange in Java and Python
 ---

 Key: SPARK-8389
 URL: https://issues.apache.org/jira/browse/SPARK-8389
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Tathagata Das
Assignee: Cody Koeninger
Priority: Critical

 Probably requires creating a JavaKafkaPairRDD and also use that in the python 
 APIs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version

2015-06-16 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588517#comment-14588517
 ] 

Cody Koeninger commented on SPARK-8337:
---

So one thing to keep in mind is that if the Kafka project ends up adding more 
fields to MessageAndMetadata, the Scala interface is going to continue to give 
users access to those fields, without changing anything other than the Kafka 
version.

If you go with the approach of building a Python dict, someone's going to have 
to remember to manually change the code to give access to the new fields.

I don't have enough Python knowledge to comment on whether the approach of 
passing a messageHandler function is feasible... I can try to get up to speed 
on it.  It may be worth trying to get the attention of Davies Liu after the 
Spark conference hubbub has died down.
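
For context, this is roughly what the messageHandler approach looks like on the 
Scala side today; a rough sketch (broker address, topic, offsets and the tuple 
shape are only placeholders):

{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("message-handler-sketch"), Seconds(1))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val fromOffsets = Map(TopicAndPartition("raw", 0) -> 0L)

// The handler runs per record and can surface any MessageAndMetadata field,
// including ones Kafka adds later, without changes to Spark itself.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
    (String, String, String, Int, Long)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) =>
    (mmd.key(), mmd.message(), mmd.topic, mmd.partition, mmd.offset))
{code}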

 KafkaUtils.createDirectStream for python is lacking API/feature parity with 
 the Scala/Java version
 --

 Key: SPARK-8337
 URL: https://issues.apache.org/jira/browse/SPARK-8337
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Amit Ramesh
Priority: Critical

 See the following thread for context.
 http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Spark-1-4-Python-API-for-getting-Kafka-offsets-in-direct-mode-tt12714.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf

2015-06-16 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588579#comment-14588579
 ] 

Benjamin Fradet commented on SPARK-8356:


I've started working on this issue.

 Reconcile callUDF and callUdf
 -

 Key: SPARK-8356
 URL: https://issues.apache.org/jira/browse/SPARK-8356
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical
  Labels: starter

 Right now we have two functions {{callUDF}} and {{callUdf}}.  I think the 
 former is used for calling Java functions (and the documentation is wrong) 
 and the latter is for calling functions by name.  Either way this is 
 confusing and we should unify or pick different names.  Also, let's make sure 
 the docs are right.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning

2015-06-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6910:

Target Version/s: 1.5.0

 Support for pushing predicates down to metastore for partition pruning
 --

 Key: SPARK-6910
 URL: https://issues.apache.org/jira/browse/SPARK-6910
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning

2015-06-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6910:

Priority: Critical  (was: Major)

 Support for pushing predicates down to metastore for partition pruning
 --

 Key: SPARK-6910
 URL: https://issues.apache.org/jira/browse/SPARK-6910
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7665) MLlib Python API breaking changes check between 1.3 1.4

2015-06-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588777#comment-14588777
 ] 

Joseph K. Bradley commented on SPARK-7665:
--

I'm making a final pass before I close this.

 MLlib Python API breaking changes check between 1.3  1.4
 -

 Key: SPARK-7665
 URL: https://issues.apache.org/jira/browse/SPARK-7665
 Project: Spark
  Issue Type: Documentation
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang

 Comparing the MLlib Python APIs between 1.3 and 1.4, so we can note breaking 
 changes. 
 We'll need to note those changes (if any) in the user guide's Migration Guide 
 section.
 If the API change is for an Alpha/Experimental/DeveloperApi component, we 
 need to note that as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7665) MLlib Python API breaking changes check between 1.3 1.4

2015-06-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7665.
--
Resolution: Done
  Assignee: Joseph K. Bradley

 MLlib Python API breaking changes check between 1.3  1.4
 -

 Key: SPARK-7665
 URL: https://issues.apache.org/jira/browse/SPARK-7665
 Project: Spark
  Issue Type: Documentation
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang
Assignee: Joseph K. Bradley

 Comparing the MLlib Python APIs between 1.3 and 1.4, so we can note breaking 
 changes. 
 We'll need to note those changes (if any) in the user guide's Migration Guide 
 section.
 If the API change is for an Alpha/Experimental/DeveloperApi component, we 
 need to note that as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7665) MLlib Python API breaking changes check between 1.3 1.4

2015-06-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588804#comment-14588804
 ] 

Joseph K. Bradley commented on SPARK-7665:
--

I believe everything checks out, so I'm going to mark this as resolved.

 MLlib Python API breaking changes check between 1.3  1.4
 -

 Key: SPARK-7665
 URL: https://issues.apache.org/jira/browse/SPARK-7665
 Project: Spark
  Issue Type: Documentation
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang

 Comparing the MLlib Python APIs between 1.3 and 1.4, so we can note breaking 
 changes. 
 We'll need to note those changes (if any) in the user guide's Migration Guide 
 section.
 If the API change is for an Alpha/Experimental/DeveloperApi component, we 
 need to note that as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7916) MLlib Python doc parity check for classification and regression.

2015-06-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7916.
--
   Resolution: Fixed
Fix Version/s: 1.4.1
   1.5.0

Issue resolved by pull request 6460
[https://github.com/apache/spark/pull/6460]

 MLlib Python doc parity check for classification and regression.
 

 Key: SPARK-7916
 URL: https://issues.apache.org/jira/browse/SPARK-7916
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang
 Fix For: 1.5.0, 1.4.1


 Check then make the MLlib Python classification and regression doc to be as 
 complete as the Scala doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7667) MLlib Python API consistency check

2015-06-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588852#comment-14588852
 ] 

Joseph K. Bradley commented on SPARK-7667:
--

[~yanboliang]  What have you checked through, and what remains for this 
consistency check?

 MLlib Python API consistency check
 --

 Key: SPARK-7667
 URL: https://issues.apache.org/jira/browse/SPARK-7667
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang

 Check and ensure the MLlib Python API (class/method/parameter) is consistent with 
 the Scala API.
 The following APIs are not consistent:
 * class
 * method
 * parameter
 ** feature.StandardScaler.fit()
 ** many transform() functions of the feature module



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7674) R-like stats for ML models

2015-06-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588914#comment-14588914
 ] 

Joseph K. Bradley commented on SPARK-7674:
--

Definitely.  I think the next items to do are:
* confirm whether there is feedback about the general (backend) design in the 
doc linked above
* add functionality to models one-by-one (but thinking about code sharing where 
possible)

 R-like stats for ML models
 --

 Key: SPARK-7674
 URL: https://issues.apache.org/jira/browse/SPARK-7674
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

 This is an umbrella JIRA for supporting ML model summaries and statistics, 
 following the example of R's summary() and plot() functions.
 [Design 
 doc|https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing]
 From the design doc:
 {quote}
 R and its well-established packages provide extensive functionality for 
 inspecting a model and its results.  This inspection is critical to 
 interpreting, debugging and improving models.
 R is arguably a gold standard for a statistics/ML library, so this doc 
 largely attempts to imitate it.  The challenge we face is supporting similar 
 functionality, but on big (distributed) data.  Data size makes both efficient 
 computation and meaningful displays/summaries difficult.
 R model and result summaries generally take 2 forms:
 * summary(model): Display text with information about the model and results 
 on data
 * plot(model): Display plots about the model and results
 We aim to provide both of these types of information.  Visualization for the 
 plottable results will not be supported in MLlib itself, but we can provide 
 results in a form which can be plotted easily with other tools.
 {quote}
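 As a purely illustrative sketch of where this could go (the `summary` and `residuals` members below are hypothetical, not part of the current API or the design doc; `training` is assumed to be a DataFrame of labels and features):
 {code}
 import org.apache.spark.ml.regression.LinearRegression

 // Hypothetical usage sketch only -- these members do not exist yet.
 val model = new LinearRegression().fit(training)
 val summary = model.summary        // summary(model)-style statistics as text/values
 val residuals = summary.residuals  // plottable results returned as a DataFrame for external tools
 {code}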



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python

2015-06-16 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588404#comment-14588404
 ] 

Cody Koeninger commented on SPARK-8389:
---

So on the Java side, just so I'm clear, are we talking about the difference 
between people writing

OffsetRange[] offsets = ((HasOffsetRanges)rdd.rdd()).offsetRanges();

which, as far as I can tell, they can do currently (see attached PR with test 
change) 
versus

OffsetRange[] offsets = ((HasOffsetRanges)rdd).offsetRanges();

I can see how the second is definitely a nicer api...  but I don't know that 
it's a critical bugfix, and I also don't know that it's worth introducing 
additional JavaKafkaRDD and JavaDirectKafkaInputDStream wrappers.  The typecast 
is kind of an ugly hack to begin with, there's only so much we can do to make 
it nicer... short of higher kinded return type parameters for rdd methods in 
Spark 2.0  :)
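
For reference, a hedged Scala sketch of the cast-based pattern as it stands today, which is what any Java/Python wrapper would be exposing (`stream` is assumed to be a DStream created with KafkaUtils.createDirectStream; the printing is purely illustrative):

{code}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // The typecast discussed above: the underlying KafkaRDD implements HasOffsetRanges.
  val offsets: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsets.foreach { o =>
    println(s"${o.topic} ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
  }
}
{code}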

 Expose KafkaRDDs offsetRange in Java and Python
 ---

 Key: SPARK-8389
 URL: https://issues.apache.org/jira/browse/SPARK-8389
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Tathagata Das
Assignee: Cody Koeninger
Priority: Critical

 Probably requires creating a JavaKafkaPairRDD and also use that in the python 
 APIs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version

2015-06-16 Thread Amit Ramesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588439#comment-14588439
 ] 

Amit Ramesh commented on SPARK-8337:


[~juanrh] this looks pretty good to me. And from what I can see it shouldn't add 
much overhead compared to the existing logic. It is perfect in terms of what we 
are in need of :). One stylistic suggestion is that you could return (key, 
value, kafka_offsets), where kafka_offsets is a dict of topic, partition and 
offset. This would keep things a little more consistent with what is returned 
when meta info is False.

Thanks!
Amit


 KafkaUtils.createDirectStream for python is lacking API/feature parity with 
 the Scala/Java version
 --

 Key: SPARK-8337
 URL: https://issues.apache.org/jira/browse/SPARK-8337
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Amit Ramesh
Priority: Critical

 See the following thread for context.
 http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Spark-1-4-Python-API-for-getting-Kafka-offsets-in-direct-mode-tt12714.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7944) Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path

2015-06-16 Thread Alex Baretta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588506#comment-14588506
 ] 

Alex Baretta commented on SPARK-7944:
-

Bug confirmed on Spark 1.4.0 with Scala 2.11.6. The --jars option to 
spark-shell is properly passed on to the SparkSubmit class, and the jars seem 
to be loaded, but the classes are not available in the REPL.

spark-shell --jars commons-csv-1.0.jar
...
15/06/16 17:57:32 INFO SparkContext: Added JAR 
file:/home/alex/commons-csv-1.0.jar at 
http://10.240.57.53:38821/jars/commons-csv-1.0.jar with timestamp 1434477452978
...
scala> org.apache.commons.csv.CSVFormat.DEFAULT
<console>:21: error: object csv is not a member of package org.apache.commons
  org.apache.commons.csv.CSVFormat.DEFAULT
 ^


 Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path
 

 Key: SPARK-7944
 URL: https://issues.apache.org/jira/browse/SPARK-7944
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.1, 1.4.0
 Environment: scala 2.11
Reporter: Alexander Nakos
Priority: Critical
 Attachments: spark_shell_output.txt, spark_shell_output_2.10.txt


 When I run the spark-shell with the --jars argument and supply a path to a 
 single jar file, none of the classes in the jar are available in the REPL.
 I have encountered this same behaviour in both 1.3.1 and 1.4.0_RC-03 builds 
 for scala 2.11. I have yet to do a 1.4.0 RC-03 build for scala 2.10, but the 
 contents of the jar are available in the 1.3.1_2.10 REPL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-7715) Update MLlib Programming Guide for 1.4

2015-06-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reopened SPARK-7715:
--

This should not actually be closed yet.  We need to update the programming 
guide still, mainly to provide a new migration guide (which won't have much 
content) but also to make it easier to find the Pipelines API docs.  (This 
should have happened before the release, but we can at least try to get it done 
ASAP.)  I'm going to finish closing out some other JIRAs before addressing this 
one, since some of those might indicate items to include in the migration guide.

 Update MLlib Programming Guide for 1.4
 --

 Key: SPARK-7715
 URL: https://issues.apache.org/jira/browse/SPARK-7715
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
 Fix For: 1.4.0


 Before the release, we need to update the MLlib Programming Guide.  Updates 
 will include:
 * Add migration guide subsection.
 ** Use the results of the QA audit JIRAs.
 * Check phrasing, especially in main sections (for outdated items such as "In 
 this release, ...").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7666) MLlib Python doc parity check

2015-06-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7666.
--
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.1
 Assignee: Yanbo Liang

I'm resolving this.  I think we can complete this parity check for other parts 
of MLlib during this next release cycle.

 MLlib Python doc parity check
 -

 Key: SPARK-7666
 URL: https://issues.apache.org/jira/browse/SPARK-7666
 Project: Spark
  Issue Type: Documentation
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang
Assignee: Yanbo Liang
 Fix For: 1.4.1, 1.5.0


 Check, then make the MLlib Python doc as complete as the Scala doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7580) Driver out of memory

2015-06-16 Thread Andrew Rothstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588855#comment-14588855
 ] 

Andrew Rothstein commented on SPARK-7580:
-

I haven't heard of any solutions. We basically reduced the number of executors to 
500 or 1000. Upping the memory allocated to the driver will help as well. 
Unfortunately my cluster is configured to limit my driver container to 4g, so I 
suspect it's thrashing.



 Driver out of memory
 

 Key: SPARK-7580
 URL: https://issues.apache.org/jira/browse/SPARK-7580
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0
 Environment: YARN, HDP 2.1, RedHat 6.4
 200 x HP DL185
Reporter: Andrew Rothstein

 My 200-node cluster has an 8k executor capacity. When I submitted a job with 
 2k executors, 2g per executor, and 4g for the driver, the 
 ApplicationMaster/driver quickly became unresponsive. It was making progress, 
 then threw a couple of these exceptions:
 2015-05-12 16:46:41,598 ERROR [Spark Context Cleaner] spark.ContextCleaner: Error cleaning broadcast 4
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
   at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
   at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
   at scala.concurrent.Await$.result(package.scala:107)
   at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137)
   at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227)
   at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
   at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
   at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185)
   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:147)
   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:138)
   at scala.Option.foreach(Option.scala:236)
   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:138)
   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
   at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:133)
   at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)
 Then the job crashed with OOM.
 2015-05-12 16:47:53,566 ERROR [sparkDriver-akka.actor.default-dispatcher-4] actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-8] shutting down ActorSystem [sparkDriver]
 java.lang.OutOfMemoryError: Java heap space
   at org.spark_project.protobuf.ByteString.copyFrom(ByteString.java:216)
   at org.spark_project.protobuf.ByteString.copyFrom(ByteString.java:229)
   at akka.remote.transport.AkkaPduProtobufCodec$.constructPayload(AkkaPduCodec.scala:145)
   at akka.remote.transport.AkkaProtocolHandle.write(AkkaProtocolTransport.scala:182)
   at akka.remote.EndpointWriter.writeSend(Endpoint.scala:760)
   at akka.remote.EndpointWriter$$anonfun$2.applyOrElse(Endpoint.scala:722)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
   at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 When I reran the job with 3g of memory per executor and 1k executors it ran 
 to completion more quickly than the 2k executor run took to crash. I didn't 
 think I was pushing the envelope by using 2k executors and the stock driver 
 heap size. Is this a scale limitation of the driver? Any suggestions beyond 
 increasing the 

[jira] [Updated] (SPARK-7916) MLlib Python doc parity check for classification and regression.

2015-06-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7916:
-
Assignee: Yanbo Liang

 MLlib Python doc parity check for classification and regression.
 

 Key: SPARK-7916
 URL: https://issues.apache.org/jira/browse/SPARK-7916
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang
Assignee: Yanbo Liang
 Fix For: 1.4.1, 1.5.0


 Check, then make the MLlib Python classification and regression doc as complete as the Scala doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8400) ml.ALS doesn't handle -1 block size

2015-06-16 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8400:


 Summary: ml.ALS doesn't handle -1 block size
 Key: SPARK-8400
 URL: https://issues.apache.org/jira/browse/SPARK-8400
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.3.1
Reporter: Xiangrui Meng


Under spark.mllib, if the number of blocks is set to -1, we set the block size 
automatically based on the input partition size. However, this behavior is not 
preserved in the spark.ml API. If a user sets -1 in Spark 1.3, it will not work, 
but no error message is shown.
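
For context, a minimal sketch of the two call sites (parameter values are arbitrary; `ratings` is assumed to be an RDD[Rating] and `trainingDF` a DataFrame with user/item/rating columns). Until -1 is handled, explicit positive block counts are a possible workaround on the spark.ml side:

{code}
import org.apache.spark.mllib.recommendation.{ALS => MllibALS}
import org.apache.spark.ml.recommendation.{ALS => MlALS}

// spark.mllib: blocks = -1 means "choose automatically from the input partitioning".
val mllibModel = MllibALS.train(ratings, 10, 10, 0.01, -1)  // rank, iterations, lambda, blocks

// spark.ml (1.3): -1 does not work and no error is raised, so set explicit positive
// block counts instead as a workaround.
val als = new MlALS()
  .setRank(10)
  .setMaxIter(10)
  .setNumUserBlocks(10)
  .setNumItemBlocks(10)
val mlModel = als.fit(trainingDF)
{code}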



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7944) Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path

2015-06-16 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588740#comment-14588740
 ] 

Iulian Dragos commented on SPARK-7944:
--

I'll have a look tomorrow, I vaguely remember a bug in the Scala REPL that was 
fixed. Since the code is forked, the fix may not be in there...

 Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path
 

 Key: SPARK-7944
 URL: https://issues.apache.org/jira/browse/SPARK-7944
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.1, 1.4.0
 Environment: scala 2.11
Reporter: Alexander Nakos
Priority: Critical
 Attachments: spark_shell_output.txt, spark_shell_output_2.10.txt


 When I run the spark-shell with the --jars argument and supply a path to a 
 single jar file, none of the classes in the jar are available in the REPL.
 I have encountered this same behaviour in both 1.3.1 and 1.4.0_RC-03 builds 
 for scala 2.11. I have yet to do a 1.4.0 RC-03 build for scala 2.10, but the 
 contents of the jar are available in the 1.3.1_2.10 REPL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution

2015-06-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8379:

Target Version/s: 1.5.0
Shepherd: Cheng Lian
Assignee: jeanlyn

 LeaseExpiredException when using dynamic partition with speculative execution
 -

 Key: SPARK-8379
 URL: https://issues.apache.org/jira/browse/SPARK-8379
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: jeanlyn
Assignee: jeanlyn

 When inserting into a table using dynamic partitions with 
 *spark.speculation=true*, and skewed data in some partitions triggers 
 speculative tasks, it throws an exception like:
 {code}
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
  Lease mismatch on 
 /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo
  owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but 
 is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57
 {code}
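 For context, a hedged sketch of the kind of job that can hit this (table and column names are made up; `sqlContext` is assumed to be a HiveContext, and spark.speculation=true is assumed to have been set when submitting, e.g. via --conf):
 {code}
 // Illustrative only: with speculation on, a skewed partition can spawn a second
 // attempt of the writing task, and the two attempts then race on the same
 // dynamic-partition output file, producing the LeaseExpiredException above.
 sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
 sqlContext.sql(
   "INSERT OVERWRITE TABLE target PARTITION (ds, type) SELECT value, ds, type FROM source")
 {code}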



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3258) Python API for streaming MLlib algorithms

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3258:
---

Assignee: (was: Apache Spark)

 Python API for streaming MLlib algorithms
 -

 Key: SPARK-3258
 URL: https://issues.apache.org/jira/browse/SPARK-3258
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib, PySpark, Streaming
Reporter: Xiangrui Meng

 This is an umbrella JIRA to track Python port of streaming MLlib algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3258) Python API for streaming MLlib algorithms

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3258:
---

Assignee: Apache Spark

 Python API for streaming MLlib algorithms
 -

 Key: SPARK-3258
 URL: https://issues.apache.org/jira/browse/SPARK-3258
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib, PySpark, Streaming
Reporter: Xiangrui Meng
Assignee: Apache Spark

 This is an umbrella JIRA to track Python port of streaming MLlib algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3258) Python API for streaming MLlib algorithms

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588585#comment-14588585
 ] 

Apache Spark commented on SPARK-3258:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/6849

 Python API for streaming MLlib algorithms
 -

 Key: SPARK-3258
 URL: https://issues.apache.org/jira/browse/SPARK-3258
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib, PySpark, Streaming
Reporter: Xiangrui Meng

 This is an umbrella JIRA to track Python port of streaming MLlib algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7944) Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path

2015-06-16 Thread Vincent Ohprecio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588588#comment-14588588
 ] 

Vincent Ohprecio commented on SPARK-7944:
-

Just compiled version 1.5.0-SNAPSHOT (using Scala version 2.10.4) from GitHub.
~/dev/spark(master) $build/mvn -DskipTests clean package
[INFO] BUILD SUCCESS ...

~/dev/spark(master) $bin/spark-shell --jars 
/Users/antigen/Downloads/algebird-core_2.10-0.10.2.jar

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> import com.twitter.algebird._
import com.twitter.algebird._

 Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path
 

 Key: SPARK-7944
 URL: https://issues.apache.org/jira/browse/SPARK-7944
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.1, 1.4.0
 Environment: scala 2.11
Reporter: Alexander Nakos
Priority: Critical
 Attachments: spark_shell_output.txt, spark_shell_output_2.10.txt


 When I run the spark-shell with the --jars argument and supply a path to a 
 single jar file, none of the classes in the jar are available in the REPL.
 I have encountered this same behaviour in both 1.3.1 and 1.4.0_RC-03 builds 
 for scala 2.11. I have yet to do a 1.4.0 RC-03 build for scala 2.10, but the 
 contents of the jar are available in the 1.3.1_2.10 REPL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7633) Streaming Logistic Regression- Python bindings

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7633:
---

Assignee: (was: Apache Spark)

 Streaming Logistic Regression- Python bindings
 --

 Key: SPARK-7633
 URL: https://issues.apache.org/jira/browse/SPARK-7633
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang

 Add Python API for StreamingLogisticRegressionWithSGD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7633) Streaming Logistic Regression- Python bindings

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7633:
---

Assignee: Apache Spark

 Streaming Logistic Regression- Python bindings
 --

 Key: SPARK-7633
 URL: https://issues.apache.org/jira/browse/SPARK-7633
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang
Assignee: Apache Spark

 Add Python API for StreamingLogisticRegressionWithSGD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7633) Streaming Logistic Regression- Python bindings

2015-06-16 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588600#comment-14588600
 ] 

Manoj Kumar commented on SPARK-7633:


I'm extremely sorry. I was halfway through when you commented (just the tests 
were remaining).

 Streaming Logistic Regression- Python bindings
 --

 Key: SPARK-7633
 URL: https://issues.apache.org/jira/browse/SPARK-7633
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang

 Add Python API for StreamingLogisticRegressionWithSGD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7633) Streaming Logistic Regression- Python bindings

2015-06-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588595#comment-14588595
 ] 

Apache Spark commented on SPARK-7633:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/6849

 Streaming Logistic Regression- Python bindings
 --

 Key: SPARK-7633
 URL: https://issues.apache.org/jira/browse/SPARK-7633
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang

 Add Python API for StreamingLogisticRegressionWithSGD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all

2015-06-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8387.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6834
[https://github.com/apache/spark/pull/6834]

 [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all
 -

 Key: SPARK-8387
 URL: https://issues.apache.org/jira/browse/SPARK-8387
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.4.0
Reporter: SuYan
Priority: Minor
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all

2015-06-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8387:
-
Assignee: SuYan

 [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all
 -

 Key: SPARK-8387
 URL: https://issues.apache.org/jira/browse/SPARK-8387
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.4.0
Reporter: SuYan
Assignee: SuYan
Priority: Minor
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3665) Java API for GraphX

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3665:
---

Assignee: Ankur Dave  (was: Apache Spark)

 Java API for GraphX
 ---

 Key: SPARK-3665
 URL: https://issues.apache.org/jira/browse/SPARK-3665
 Project: Spark
  Issue Type: Improvement
  Components: GraphX, Java API
Affects Versions: 1.0.0
Reporter: Ankur Dave
Assignee: Ankur Dave

 The Java API will wrap the Scala API in a similar manner as JavaRDD. 
 Components will include:
 # JavaGraph
 #- removes optional param from persist, subgraph, mapReduceTriplets, 
 Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply
 #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices
 #- merges multiple parameters lists
 #- incorporates GraphOps
 # JavaVertexRDD
 # JavaEdgeRDD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3665) Java API for GraphX

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3665:
---

Assignee: Apache Spark  (was: Ankur Dave)

 Java API for GraphX
 ---

 Key: SPARK-3665
 URL: https://issues.apache.org/jira/browse/SPARK-3665
 Project: Spark
  Issue Type: Improvement
  Components: GraphX, Java API
Affects Versions: 1.0.0
Reporter: Ankur Dave
Assignee: Apache Spark

 The Java API will wrap the Scala API in a similar manner as JavaRDD. 
 Components will include:
 # JavaGraph
 #- removes optional param from persist, subgraph, mapReduceTriplets, 
 Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply
 #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices
 #- merges multiple parameters lists
 #- incorporates GraphOps
 # JavaVertexRDD
 # JavaEdgeRDD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5362:
---

Assignee: (was: Apache Spark)

 Gradient and Optimizer to support generic output (instead of label) and data 
 batches
 

 Key: SPARK-5362
 URL: https://issues.apache.org/jira/browse/SPARK-5362
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Alexander Ulanov
   Original Estimate: 24h
  Remaining Estimate: 24h

 Currently, the Gradient and Optimizer interfaces support data in the form of 
 RDD[Double, Vector], which refers to a label and features. This limits their 
 application to classification problems. For example, an artificial neural 
 network demands a Vector as output (instead of label: Double). Moreover, the 
 current interface does not support data batches. I propose to replace label: 
 Double with output: Vector. This enables passing a generic output instead of a 
 label, and also passing data and output batches stored in corresponding 
 vectors.
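 A rough sketch of the signature change being proposed (the current shape is simplified from mllib.optimization.Gradient; the proposed trait name and shape are illustrative only, not a settled design):
 {code}
 import org.apache.spark.mllib.linalg.Vector

 // Current shape (simplified): one Double label per example.
 trait Gradient {
   def compute(data: Vector, label: Double, weights: Vector): (Vector, Double)
 }

 // Proposed shape (illustrative): a Vector output, so e.g. an artificial neural network
 // can train against a full output vector, and batches can be packed into data/output.
 trait VectorOutputGradient {
   def compute(data: Vector, output: Vector, weights: Vector): (Vector, Double)
 }
 {code}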



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches

2015-06-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5362:
---

Assignee: Apache Spark

 Gradient and Optimizer to support generic output (instead of label) and data 
 batches
 

 Key: SPARK-5362
 URL: https://issues.apache.org/jira/browse/SPARK-5362
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Alexander Ulanov
Assignee: Apache Spark
   Original Estimate: 24h
  Remaining Estimate: 24h

 Currently, the Gradient and Optimizer interfaces support data in the form of 
 RDD[Double, Vector], which refers to a label and features. This limits their 
 application to classification problems. For example, an artificial neural 
 network demands a Vector as output (instead of label: Double). Moreover, the 
 current interface does not support data batches. I propose to replace label: 
 Double with output: Vector. This enables passing a generic output instead of a 
 label, and also passing data and output batches stored in corresponding 
 vectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


