[jira] [Commented] (SPARK-7324) Add DataFrame.dropDuplicates
[ https://issues.apache.org/jira/browse/SPARK-7324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525686#comment-14525686 ] Apache Spark commented on SPARK-7324: - User 'kaka1992' has created a pull request for this issue: https://github.com/apache/spark/pull/5870 Add DataFrame.dropDuplicates Key: SPARK-7324 URL: https://issues.apache.org/jira/browse/SPARK-7324 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Similar to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html def dropDuplicates(): DataFrame def dropDuplicates(subset: Seq[String]): DataFrame We can turn this into groupBy(cols).agg(first(...)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
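For reference, a minimal sketch of the groupBy(cols).agg(first(...)) rewrite the description suggests. The DataFrame {{df}} and its columns "a", "b", "c" are illustrative assumptions, not from the ticket:

{code}
import org.apache.spark.sql.functions.first

// Deduplicate on the subset ("a", "b"); the remaining column keeps the
// first value seen per group, in the spirit of df.dropDuplicates(Seq("a", "b")).
val deduped = df.groupBy("a", "b").agg(first("c").as("c"))
{code}

The no-argument overload would correspond to grouping on every column of the DataFrame.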
[jira] [Assigned] (SPARK-7324) Add DataFrame.dropDuplicates
[ https://issues.apache.org/jira/browse/SPARK-7324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7324: --- Assignee: Apache Spark Add DataFrame.dropDuplicates Key: SPARK-7324 URL: https://issues.apache.org/jira/browse/SPARK-7324 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Similar to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html def dropDuplicates(): DataFrame def dropDuplicates(subset: Seq[String]): DataFrame We can turn this into groupBy(cols).agg(first(...)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time
Wesley Miao created SPARK-7326: -- Summary: Performing window() on a WindowedDStream doesn't work all the time Key: SPARK-7326 URL: https://issues.apache.org/jira/browse/SPARK-7326 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Wesley Miao Someone reported similar issues before but got no response. http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html I hit the same issue recently, and it can be reproduced in 1.3.1 by the following piece of code:

{code}
def main(args: Array[String]) {
  val batchInterval = 1234

  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")

  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {
    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)
  (1 to 300) foreach { i =>
    rddQueue.enqueue(createRDD(i))
  }

  val l1 = rawLogStream
    .window(Milliseconds(batchInterval.toInt) * 5, Milliseconds(batchInterval.toInt) * 5)
    .reduceByKey(_ + _)
  val l2 = l1
    .window(Milliseconds(batchInterval.toInt) * 15, Milliseconds(batchInterval.toInt) * 15)
    .reduceByKey(_ + _)

  l1.print()
  l2.print()

  ssc.start()
  ssc.awaitTermination()
}
{code}

Here we have two windowed DStream instances, l1 and l2. l1 is the result of performing window() on the source DStream, with both window and slide duration 5 times the batch interval of the source stream. l2 is the result of performing window() on l1, with both window and slide duration 3 times l1's batch interval, i.e. 15 times that of the source stream. From the output of this simple streaming app, I only see data printed from l1 and none from l2. Diving into the source code, the problem most likely resides in the DStream.slice() implementation, shown below:

{code}
def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  }
  if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  val alignedToTime = toTime.floor(slideDuration)
  val alignedFromTime = fromTime.floor(slideDuration)
  logInfo("Slicing from " + fromTime + " to " + toTime +
    " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
  alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
    if (time >= zeroTime) getOrCompute(time) else None
  })
}
{code}

After performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, so the isTimeValid check fails for all the remaining computation. The fix would be to add a new floor() function in Time.scala that respects zeroTime while performing the floor:

{code}
def floor(that: Duration, zeroTime: Time): Time = {
  val t = that.milliseconds
  new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
}
{code}

And then change DStream.slice to call this new floor function, passing in its zeroTime:
{code}
val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
{code}

This way alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not necessarily 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
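A small worked example of the misalignment (the numbers are illustrative, not from the report):

{code}
// zeroTime = 500 ms, slideDuration = 1000 ms, toTime = 3500 ms (= zeroTime + 3 slides)
val zero = 500L
val slide = 1000L
val to = 3500L

val plainFloor = (to / slide) * slide                      // 3000
val offA = (plainFloor - zero) % slide                     // 500: not a multiple, so isTimeValid fails

val zeroAwareFloor = ((to - zero) / slide) * slide + zero  // 3500
val offB = (zeroAwareFloor - zero) % slide                 // 0: aligned with respect to zeroTime
{code}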
[jira] [Assigned] (SPARK-7324) Add DataFrame.dropDuplicates
[ https://issues.apache.org/jira/browse/SPARK-7324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7324: --- Assignee: (was: Apache Spark) Add DataFrame.dropDuplicates Key: SPARK-7324 URL: https://issues.apache.org/jira/browse/SPARK-7324 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Similar to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html def dropDuplicates(): DataFrame def dropDuplicates(subset: Seq[String]): DataFrame We can turn this into groupBy(cols).agg(first(...)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6026) Eliminate the bypassMergeThreshold parameter and associated hash-ish shuffle within the Sort shuffle code
[ https://issues.apache.org/jira/browse/SPARK-6026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6026: -- Component/s: Shuffle Eliminate the bypassMergeThreshold parameter and associated hash-ish shuffle within the Sort shuffle code - Key: SPARK-6026 URL: https://issues.apache.org/jira/browse/SPARK-6026 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.3.0 Reporter: Kay Ousterhout The bypassMergeThreshold parameter (and associated use of a hash-ish shuffle when the number of partitions is less than this) is basically a workaround for SparkSQL, because the fact that the sort-based shuffle stores non-serialized objects is a deal-breaker for SparkSQL, which re-uses objects. Once the sort-based shuffle is changed to store serialized objects, we should never be secretly doing hash-ish shuffle even when the user has specified to use sort-based shuffle (because of its otherwise worse performance). [~rxin][~adav], masters of shuffle, it would be helpful to get agreement from you on this proposal (and also a sanity check that I've correctly characterized the issue). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
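For context, a simplified sketch of the bypass decision being discussed; the shape follows the description (the threshold is exposed as spark.shuffle.sort.bypassMergeThreshold), not the exact Spark source:

{code}
// Hash-ish bypass inside the sort-based shuffle: with few reduce partitions
// and no map-side aggregation, write one file per partition and concatenate
// them instead of sorting records.
def useBypassMergeSort(numPartitions: Int,
                       mapSideCombine: Boolean,
                       bypassMergeThreshold: Int): Boolean =
  !mapSideCombine && numPartitions <= bypassMergeThreshold
{code}

Eliminating this branch, as proposed, would make the sort path unconditional once it stores serialized objects.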
[jira] [Commented] (SPARK-6411) PySpark DataFrames can't be created if any datetimes have timezones
[ https://issues.apache.org/jira/browse/SPARK-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526291#comment-14526291 ] Xiangrui Meng commented on SPARK-6411: -- [~airhorns] I'm testing Spark with Pyrolite master branch. With your patch, it is possible to create DFs with datetimes that contain tzinfo. However, I found two issues: 1. A tz-unaware date (or maybe datetime) object becomes tz-aware after a round trip, which makes `test_apply_schema` in `sql/tests.py` fail. 2. The tzinfo does not remain the same after a round trip. Test code:

{code}
def test_datetime_with_timezone(self):
    """SPARK-6411"""
    try:
        import pytz
        has_pytz = True
    except:
        has_pytz = False
    if has_pytz:
        tz = pytz.timezone('America/Los_Angeles')
        date = datetime.datetime(2014, 7, 8, 11, 10, tzinfo=tz)
        first = self.sqlCtx.createDataFrame([(date,)]).first()[0]
        self.assertEqual(date, first)
{code}

Output:

{code}
======================================================================
FAIL: test_datetime_with_timezone (__main__.SQLTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/meng/src/spark/python/pyspark/sql/tests.py", line 536, in test_datetime_with_timezone
    self.assertEqual(date, first)
AssertionError: datetime.datetime(2014, 7, 8, 11, 10, tzinfo=<DstTzInfo 'America/Los_Angeles' PST-1 day, 16:00:00 STD>) != datetime.datetime(2014, 7, 8, 11, 10, tzinfo=<DstTzInfo 'America/Los_Angeles' PDT-1 day, 17:00:00 DST>)
{code}

I will check the conversion for date/datetime. It would be really helpful if you could provide insights. PySpark DataFrames can't be created if any datetimes have timezones --- Key: SPARK-6411 URL: https://issues.apache.org/jira/browse/SPARK-6411 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Reporter: Harry Brundage Assignee: Xiangrui Meng I am unable to create a DataFrame with PySpark if any of the {{datetime}} objects that pass through the conversion process have a {{tzinfo}} property set. This works fine:

{code}
In [9]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10),)]).toDF().collect()
Out[9]: [Row(_1=datetime.datetime(2014, 7, 8, 11, 10))]
{code}

As expected, the tuple's schema is inferred as having one anonymous column with a datetime field, and the datetime round-trips through the Java side, Python deserialization, and back into Python land upon {{collect}}. This, however:

{code}
In [5]: from dateutil.tz import tzutc

In [10]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10, tzinfo=tzutc()),)]).toDF().collect()
{code}

explodes with {code} Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 12, localhost): net.razorvine.pickle.PickleException: invalid pickle data for datetime; expected 1 or 7 args, got 2
	at net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69)
	at net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:154)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
[jira] [Closed] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions
[ https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly closed SPARK-4184. --- Resolution: Duplicate we'll incorporate changes incrementally Improve Spark Streaming documentation to address commonly-asked questions -- Key: SPARK-4184 URL: https://issues.apache.org/jira/browse/SPARK-4184 Project: Spark Issue Type: Documentation Components: Streaming Reporter: Chris Fregly Labels: documentation, streaming Improve Streaming documentation including API descriptions, concurrency/thread safety, fault tolerance, replication, checkpointing, scalability, resource allocation and utilization, back pressure, and monitoring. Also, add a section to the Kinesis streaming guide describing how to use IAM roles with the Spark Kinesis Receiver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library
[ https://issues.apache.org/jira/browse/SPARK-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-6654: Priority: Major (was: Blocker) Target Version/s: 1.5.0 (was: 1.4.0) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library Key: SPARK-6654 URL: https://issues.apache.org/jira/browse/SPARK-6654 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6907) Create an isolated classloader for the Hive Client.
[ https://issues.apache.org/jira/browse/SPARK-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6907. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5851 [https://github.com/apache/spark/pull/5851] Create an isolated classloader for the Hive Client. --- Key: SPARK-6907 URL: https://issues.apache.org/jira/browse/SPARK-6907 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7302) SPARK building documentation still mentions building for yarn 0.23
[ https://issues.apache.org/jira/browse/SPARK-7302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7302. -- Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Sean Owen Resolved by https://github.com/apache/spark/pull/5863 SPARK building documentation still mentions building for yarn 0.23 -- Key: SPARK-7302 URL: https://issues.apache.org/jira/browse/SPARK-7302 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.3.1 Reporter: Thomas Graves Assignee: Sean Owen Fix For: 1.4.0 as of SPARK-3445 we deprecated using hadoop 0.23. It looks like the building documentation still references it though. We should remove that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time
[ https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526062#comment-14526062 ] Wesley Miao commented on SPARK-7326: What I'd like to achieve is multi-level aggregation on the source logging stream. For example, say the source logging stream is at a 1-second interval. The first-level aggregation would be every 1 minute, which is 60 intervals of the source stream. The second level would be every 1 hour, the third level every 1 day, and we can add more levels if we want. What I hope is that at each level the reduceByKey(_ + _) can be done against its immediate previous level, instead of always aggregating against the source stream: level 3's reduceByKey is based on level 2's result, level 2 on level 1, and level 1 on the source stream. I would think this approach is more efficient than always reducing over the source stream, particularly for the higher-level (daily and weekly) aggregations. Performing window() on a WindowedDStream doesn't work all the time -- Key: SPARK-7326 URL: https://issues.apache.org/jira/browse/SPARK-7326 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Wesley Miao Someone reported similar issues before but got no response. http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html I hit the same issue recently, and it can be reproduced in 1.3.1 by the following piece of code:

{code}
def main(args: Array[String]) {
  val batchInterval = 1234

  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")

  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {
    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)
  (1 to 300) foreach { i =>
    rddQueue.enqueue(createRDD(i))
  }

  val l1 = rawLogStream
    .window(Milliseconds(batchInterval.toInt) * 5, Milliseconds(batchInterval.toInt) * 5)
    .reduceByKey(_ + _)
  val l2 = l1
    .window(Milliseconds(batchInterval.toInt) * 15, Milliseconds(batchInterval.toInt) * 15)
    .reduceByKey(_ + _)

  l1.print()
  l2.print()

  ssc.start()
  ssc.awaitTermination()
}
{code}

Here we have two windowed DStream instances, l1 and l2. l1 is the result of performing window() on the source DStream, with both window and slide duration 5 times the batch interval of the source stream. l2 is the result of performing window() on l1, with both window and slide duration 3 times l1's batch interval, i.e. 15 times that of the source stream. From the output of this simple streaming app, I only see data printed from l1 and none from l2. Diving into the source code, the problem most likely resides in the DStream.slice() implementation, shown below:

{code}
def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  }
  if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  val alignedToTime = toTime.floor(slideDuration)
  val alignedFromTime = fromTime.floor(slideDuration)
  logInfo("Slicing from " + fromTime + " to " + toTime +
    " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
  alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
    if (time >= zeroTime) getOrCompute(time) else None
  })
}
{code}

After performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, so the isTimeValid() check fails for all the remaining computation. The fix would be to add a new floor() function in Time.scala that respects zeroTime while performing the floor:

{code}
def floor(that: Duration, zeroTime: Time): Time = {
  val t = that.milliseconds
  new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
}
{code}

And then change the DStream.slice to call this new floor function by passing in its zeroTime.
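For illustration, a sketch of the cascade described in this comment; the stream name and intervals are made up, and this is exactly the window-on-window pattern that SPARK-7326 currently breaks:

{code}
import org.apache.spark.streaming.{Seconds, Minutes}

// source is assumed to be a DStream[(String, Int)] at a 1-second batch interval.
// Each level reduces over its immediate parent instead of re-aggregating the raw source.
val perMinute = source.window(Seconds(60), Seconds(60)).reduceByKey(_ + _)
val perHour   = perMinute.window(Minutes(60), Minutes(60)).reduceByKey(_ + _)
val perDay    = perHour.window(Minutes(60 * 24), Minutes(60 * 24)).reduceByKey(_ + _)
{code}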
[jira] [Assigned] (SPARK-6908) Refactor existing code to use the isolated client
[ https://issues.apache.org/jira/browse/SPARK-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6908: --- Assignee: (was: Apache Spark) Refactor existing code to use the isolated client - Key: SPARK-6908 URL: https://issues.apache.org/jira/browse/SPARK-6908 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6908) Refactor existing code to use the isolated client
[ https://issues.apache.org/jira/browse/SPARK-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526104#comment-14526104 ] Apache Spark commented on SPARK-6908: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/5876 Refactor existing code to use the isolated client - Key: SPARK-6908 URL: https://issues.apache.org/jira/browse/SPARK-6908 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6908) Refactor existing code to use the isolated client
[ https://issues.apache.org/jira/browse/SPARK-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6908: --- Assignee: Apache Spark Refactor existing code to use the isolated client - Key: SPARK-6908 URL: https://issues.apache.org/jira/browse/SPARK-6908 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7302) SPARK building documentation still mentions building for yarn 0.23
[ https://issues.apache.org/jira/browse/SPARK-7302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7302: - Priority: Minor (was: Major) SPARK building documentation still mentions building for yarn 0.23 -- Key: SPARK-7302 URL: https://issues.apache.org/jira/browse/SPARK-7302 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.3.1 Reporter: Thomas Graves Assignee: Sean Owen Priority: Minor Fix For: 1.4.0 as of SPARK-3445 we deprecated using hadoop 0.23. It looks like the building documentation still references it though. We should remove that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6873) Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements
[ https://issues.apache.org/jira/browse/SPARK-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526019#comment-14526019 ] Sean Owen commented on SPARK-6873: -- Sorry for the late reply, missed this. Yes, HashMap ordering did change in Java 8. Hm, so it sounds like the test can't pass on both Java 7 and 8 at the same time then, since the golden output would fail to match one or the other? Or are you saying there's a way to ignore this part of the output? Otherwise we can track this for a later version like Spark 2.x where we may require Java 8. Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements -- Key: SPARK-6873 URL: https://issues.apache.org/jira/browse/SPARK-6873 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.3.1 Reporter: Sean Owen Assignee: Cheng Lian Priority: Minor As I mentioned, I've been seeing 4 test failures in Hive tests for a while, and actually it still affects master. I think it's a superficial problem that only turns up when running on Java 8, but still, would probably be an easy fix and good to fix. Specifically, here are four tests and the bit that fails the comparison, below. I tried to diagnose this but had trouble even finding where some of this occurs, like the list of synonyms?

{code}
- show_tblproperties *** FAILED ***
  Results do not match for show_tblproperties:
  ...
  !== HIVE - 2 row(s) ==   == CATALYST - 2 row(s) ==
  !tmp true                bar bar value
  !bar bar value           tmp true (HiveComparisonTest.scala:391)
{code}

{code}
- show_create_table_serde *** FAILED ***
  Results do not match for show_create_table_serde:
  ...
   WITH SERDEPROPERTIES (            WITH SERDEPROPERTIES (
  ! 'serialization.format'='$',       'field.delim'=',',
  ! 'field.delim'=',')                'serialization.format'='$')
{code}

{code}
- udf_std *** FAILED ***
  Results do not match for udf_std:
  ...
  !== HIVE - 2 row(s) ==   == CATALYST - 2 row(s) ==
   std(x) - Returns the standard deviation of a set of numbers   std(x) - Returns the standard deviation of a set of numbers
  !Synonyms: stddev_pop, stddev   Synonyms: stddev, stddev_pop (HiveComparisonTest.scala:391)
{code}

{code}
- udf_stddev *** FAILED ***
  Results do not match for udf_stddev:
  ...
  !== HIVE - 2 row(s) ==   == CATALYST - 2 row(s) ==
   stddev(x) - Returns the standard deviation of a set of numbers   stddev(x) - Returns the standard deviation of a set of numbers
  !Synonyms: stddev_pop, std   Synonyms: std, stddev_pop (HiveComparisonTest.scala:391)
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
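If the golden files themselves can't be regenerated per JVM, one workaround is to normalize the order-sensitive fragments before comparing. A hedged sketch of that idea, not the project's actual fix:

{code}
// Sort the comma-separated synonym list so HashMap iteration order
// (which changed in Java 8) cannot affect the golden-output diff.
def normalizeSynonyms(line: String): String =
  if (line.startsWith("Synonyms: "))
    "Synonyms: " + line.stripPrefix("Synonyms: ").split(",\\s*").sorted.mkString(", ")
  else line
{code}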
[jira] [Commented] (SPARK-7013) Add unit test for spark.ml StandardScaler
[ https://issues.apache.org/jira/browse/SPARK-7013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526042#comment-14526042 ] Glenn Weidner commented on SPARK-7013: -- I would like to work on this. I've started looking into the differences between spark.ml StandardScaler and mllib StandardScaler, including mllib's existing test StandardScalerSuite. Add unit test for spark.ml StandardScaler - Key: SPARK-7013 URL: https://issues.apache.org/jira/browse/SPARK-7013 Project: Spark Issue Type: Test Components: ML Affects Versions: 1.3.1 Reporter: Joseph K. Bradley Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7275) Make LogicalRelation public
[ https://issues.apache.org/jira/browse/SPARK-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526098#comment-14526098 ] Glenn Weidner commented on SPARK-7275: -- I checked the history of the file org.apache.spark.sql.sources.LogicalRelation and it looks like LogicalRelation has always been declared private[sql]. Can you provide more details as to how this makes it harder to work with full logical plans from third party packages? Make LogicalRelation public --- Key: SPARK-7275 URL: https://issues.apache.org/jira/browse/SPARK-7275 Project: Spark Issue Type: Improvement Components: SQL Reporter: Santiago M. Mola Priority: Minor It seems LogicalRelation is the only part of the LogicalPlan that is not public. This makes it harder to work with full logical plans from third party packages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526153#comment-14526153 ] Sen Fang commented on SPARK-2336: - Hey Longbao, great to hear from you. To the best of my understanding, in the paper I cited above they distribute the input points by pushing them through the top tree (figure 4). Of course, for a precise result, this means we need to backtrack, which isn't ideal. What they propose is to use a buffer boundary like a spill tree's; however, unlike a spill tree, here you push an input target to both children if it falls within the buffer zone, because the top tree was built as a metric tree (the reason given being that a spill tree as the top tree has a high memory penalty). So every input might end up in multiple subtrees, and you need a reduceByKey at the end to keep the top K neighbors. Is your implementation available somewhere? I'm having a hard time finding time to finish my implementation this month. It would be great if we could eventually compare, validate, and benchmark our implementations. Approximate k-NN Models for MLLib - Key: SPARK-2336 URL: https://issues.apache.org/jira/browse/SPARK-2336 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Brian Gawalt Priority: Minor Labels: clustering, features After tackling the general k-Nearest Neighbor model as per https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to also offer approximate k-Nearest Neighbor. A promising approach would involve building a kd-tree variant within each partition, a la http://www.autonlab.org/autonweb/14714.html?branch=1&language=2 This could offer a simple non-linear ML model that can label new data with much lower latency than the plain-vanilla kNN versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6943) Graphically show RDD's included in a stage
[ https://issues.apache.org/jira/browse/SPARK-6943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6943: - Attachment: (was: with-stack-trace.png) Graphically show RDD's included in a stage -- Key: SPARK-6943 URL: https://issues.apache.org/jira/browse/SPARK-6943 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Patrick Wendell Assignee: Andrew Or Attachments: DAGvisualizationintheSparkWebUI.pdf, job-page.png, stage-page.png -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6943) Graphically show RDD's included in a stage
[ https://issues.apache.org/jira/browse/SPARK-6943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6943: - Attachment: (was: with-closures.png) Graphically show RDD's included in a stage -- Key: SPARK-6943 URL: https://issues.apache.org/jira/browse/SPARK-6943 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Patrick Wendell Assignee: Andrew Or Attachments: DAGvisualizationintheSparkWebUI.pdf, with-stack-trace.png -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7330) JDBC RDD could lead to NPE when the date field is null
[ https://issues.apache.org/jira/browse/SPARK-7330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7330: --- Assignee: Apache Spark JDBC RDD could lead to NPE when the date field is null -- Key: SPARK-7330 URL: https://issues.apache.org/jira/browse/SPARK-7330 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Assignee: Apache Spark because we call DateUtils.fromDate(rs.getDate(xx)) whether it is null or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
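A minimal sketch of the null guard the description implies; the wrapper function and the `pos` parameter are hypothetical, and DateUtils.fromDate is the Spark-internal helper named above (assumed to be in scope):

{code}
import java.sql.ResultSet

// ResultSet.getDate returns null for SQL NULL, so only convert when non-null.
def safeDate(rs: ResultSet, pos: Int): Any = {
  val date = rs.getDate(pos)
  if (date != null) DateUtils.fromDate(date) else null
}
{code}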
[jira] [Commented] (SPARK-6943) Graphically show RDD's included in a stage
[ https://issues.apache.org/jira/browse/SPARK-6943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526181#comment-14526181 ] Andrew Or commented on SPARK-6943: -- Hi [~kayousterhout] I updated the patch to include both the job view and the stage view. Please have a look. Graphically show RDD's included in a stage -- Key: SPARK-6943 URL: https://issues.apache.org/jira/browse/SPARK-6943 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Patrick Wendell Assignee: Andrew Or Attachments: DAGvisualizationintheSparkWebUI.pdf, job-page.png, stage-page.png -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3524) remove workaround to pickle array of float for Pyrolite
[ https://issues.apache.org/jira/browse/SPARK-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-3524: Assignee: Xiangrui Meng remove workaround to pickle array of float for Pyrolite --- Key: SPARK-3524 URL: https://issues.apache.org/jira/browse/SPARK-3524 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Xiangrui Meng After Pyrolite releases a new version with PR https://github.com/irmen/Pyrolite/pull/11, we should remove the workaround introduced in PR https://github.com/apache/spark/pull/2365 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3524) remove workaround to pickle array of float for Pyrolite
[ https://issues.apache.org/jira/browse/SPARK-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3524: - Target Version/s: 1.4.0 (was: 1.2.0) remove workaround to pickle array of float for Pyrolite --- Key: SPARK-3524 URL: https://issues.apache.org/jira/browse/SPARK-3524 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Xiangrui Meng After Pyrolite releases a new version with PR https://github.com/irmen/Pyrolite/pull/11, we should remove the workaround introduced in PR https://github.com/apache/spark/pull/2365 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7327) DataFrame show() method doesn't like empty dataframes
[ https://issues.apache.org/jira/browse/SPARK-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526231#comment-14526231 ] Olivier Girardot commented on SPARK-7327: - ok thx DataFrame show() method doesn't like empty dataframes - Key: SPARK-7327 URL: https://issues.apache.org/jira/browse/SPARK-7327 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Olivier Girardot Priority: Minor For an empty DataFrame (for example after a filter) any call to show() ends up with:

{code}
java.util.MissingFormatWidthException: -0s
	at java.util.Formatter$FormatSpecifier.checkGeneral(Formatter.java:2906)
	at java.util.Formatter$FormatSpecifier.<init>(Formatter.java:2680)
	at java.util.Formatter.parse(Formatter.java:2528)
	at java.util.Formatter.format(Formatter.java:2469)
	at java.util.Formatter.format(Formatter.java:2423)
	at java.lang.String.format(String.java:2790)
	at org.apache.spark.sql.DataFrame$$anonfun$showString$2$$anonfun$apply$4.apply(DataFrame.scala:200)
	at org.apache.spark.sql.DataFrame$$anonfun$showString$2$$anonfun$apply$4.apply(DataFrame.scala:199)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
	at org.apache.spark.sql.DataFrame$$anonfun$showString$2.apply(DataFrame.scala:199)
	at org.apache.spark.sql.DataFrame$$anonfun$showString$2.apply(DataFrame.scala:198)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
	at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:198)
	at org.apache.spark.sql.DataFrame.show(DataFrame.scala:314)
	at org.apache.spark.sql.DataFrame.show(DataFrame.scala:320)
{code}

If no-one takes it by next Friday, I'll fix it. The problem seems to come from the colWidths method:

{code}
// Compute the width of each column
val colWidths = Array.fill(numCols)(0)
for (row <- rows) {
  for ((cell, i) <- row.zipWithIndex) {
    colWidths(i) = math.max(colWidths(i), cell.length)
  }
}
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
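For context: with zero rows every colWidths entry stays 0, and the per-cell format string becomes "%-0s"; java.util.Formatter rejects a zero width when the '-' flag is present, which is exactly the MissingFormatWidthException above. One possible guard, as a hedged sketch (not necessarily the fix that landed), with numCols and rows taken from the quoted method:

{code}
// Keep each column width strictly positive so the per-cell format string
// (roughly s"%-${colWidths(i)}s") never sees a width of 0.
val colWidths = Array.fill(numCols)(1)
for (row <- rows) {
  for ((cell, i) <- row.zipWithIndex) {
    colWidths(i) = math.max(colWidths(i), cell.length)
  }
}
{code}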
[jira] [Created] (SPARK-7332) RpcCallContext.sender has a different name from the original sender's name
Qiping Li created SPARK-7332: Summary: RpcCallContext.sender has a different name from the original sender's name Key: SPARK-7332 URL: https://issues.apache.org/jira/browse/SPARK-7332 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Qiping Li Priority: Critical Fix For: 1.3.2, 1.4.0 In the function {{receiveAndReply}} of {{RpcEndpoint}}, we get the sender of the received message through {{context.sender}}. But this doesn't work because we don't get the right {{RpcEndpointRef}}: its name is different from the original sender's name, so the path is different. Here is the code to test it:

{code}
case class Greeting(who: String)

class GreetingActor(override val rpcEnv: RpcEnv) extends RpcEndpoint with Logging {
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case Greeting(who) =>
      logInfo("Hello " + who)
      logInfo(s"${context.sender.name}")
  }
}

class ToSend(override val rpcEnv: RpcEnv, greeting: RpcEndpointRef) extends RpcEndpoint with Logging {
  override def onStart(): Unit = {
    logInfo(s"${self.name}")
    greeting.ask(Greeting("Charlie Parker"))
  }
}

object RpcEndpointNameTest {
  def main(args: Array[String]): Unit = {
    val actorSystemName = "driver"
    val conf = new SparkConf
    val rpcEnv = RpcEnv.create(actorSystemName, "localhost", 0, conf, new SecurityManager(conf))
    val greeter = rpcEnv.setupEndpoint("greeter", new GreetingActor(rpcEnv))
    rpcEnv.setupEndpoint("toSend", new ToSend(rpcEnv, greeter))
  }
}
{code}

The result was:

{code}
toSend
Hello Charlie Parker
$a
{code}

I tested the same scenario with Akka directly, using the following code:

{code}
case class Greeting(who: String)

class GreetingActor extends Actor with ActorLogging {
  def receive = {
    case Greeting(who) =>
      println("Hello " + who)
      println(s"${sender.path} ${sender.path.name}")
  }
}

class ToSend(greeting: ActorRef) extends Actor with ActorLogging {
  override def preStart(): Unit = {
    println(s"${self.path} ${self.path.name}")
    greeting ! Greeting("Charlie Parker")
  }

  def receive = {
    case _ => log.info("here")
  }
}

object HelloWorld {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("MySystem")
    val greeter = system.actorOf(Props[GreetingActor], name = "greeter")
    println(s"${greeter.path} ${greeter.path.name}")
    val system2 = ActorSystem("MySystem2")
    system2.actorOf(Props(classOf[ToSend], greeter), name = "toSend_2")
  }
}
{code}

And the result was:

{code}
akka://MySystem/user/greeter greeter
akka://MySystem2/user/toSend_2 toSend_2
Hello Charlie Parker
akka://MySystem2/user/toSend_2 toSend_2
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6602) Replace direct use of Akka with Spark RPC interface
[ https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6602: --- Priority: Critical (was: Major) Replace direct use of Akka with Spark RPC interface --- Key: SPARK-6602 URL: https://issues.apache.org/jira/browse/SPARK-6602 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5293) Enable Spark user applications to use different versions of Akka
[ https://issues.apache.org/jira/browse/SPARK-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5293: --- Target Version/s: 1.6.0 Enable Spark user applications to use different versions of Akka Key: SPARK-5293 URL: https://issues.apache.org/jira/browse/SPARK-5293 Project: Spark Issue Type: Umbrella Components: Spark Core Affects Versions: 1.3.0 Reporter: Reynold Xin Assignee: Shixiong Zhu A lot of Spark user applications are using (or want to use) Akka. Akka as a whole can contribute great architectural simplicity and uniformity. However, because Spark depends on Akka, it is not possible for users to rely on different versions, and we have received many requests in the past asking for help about this specific issue. For example, Spark Streaming might be used as the receiver of Akka messages - but our dependency on Akka requires the upstream Akka actors to also use the identical version of Akka. Since our usage of Akka is limited (mainly for RPC and single-threaded event loop), we can replace it with alternative RPC implementations and a common event loop in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6602) Replace direct use of Akka with Spark RPC interface
[ https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6602: --- Target Version/s: 1.5.0 (was: 1.4.0) Replace direct use of Akka with Spark RPC interface --- Key: SPARK-6602 URL: https://issues.apache.org/jira/browse/SPARK-6602 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6944) Mechanism to associate generic operator scope with RDD's
[ https://issues.apache.org/jira/browse/SPARK-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6944: --- Assignee: Andrew Or (was: Apache Spark) Mechanism to associate generic operator scope with RDD's Key: SPARK-6944 URL: https://issues.apache.org/jira/browse/SPARK-6944 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Patrick Wendell Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7330) JDBC RDD could lead to NPE when the date field is null
Adrian Wang created SPARK-7330: -- Summary: JDBC RDD could lead to NPE when the date field is null Key: SPARK-7330 URL: https://issues.apache.org/jira/browse/SPARK-7330 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang because we call DateUtils.fromDate(rs.getDate(xx)) whether it is null or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6944) Mechanism to associate generic operator scope with RDD's
[ https://issues.apache.org/jira/browse/SPARK-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526178#comment-14526178 ] Apache Spark commented on SPARK-6944: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/5729 Mechanism to associate generic operator scope with RDD's Key: SPARK-6944 URL: https://issues.apache.org/jira/browse/SPARK-6944 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Patrick Wendell Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7113) Add the direct stream related information to the streaming listener and web UI
[ https://issues.apache.org/jira/browse/SPARK-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7113: --- Assignee: Apache Spark Add the direct stream related information to the streaming listener and web UI -- Key: SPARK-7113 URL: https://issues.apache.org/jira/browse/SPARK-7113 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Saisai Shao Assignee: Apache Spark Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7113) Add the direct stream related information to the streaming listener and web UI
[ https://issues.apache.org/jira/browse/SPARK-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526202#comment-14526202 ] Apache Spark commented on SPARK-7113: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/5879 Add the direct stream related information to the streaming listener and web UI -- Key: SPARK-7113 URL: https://issues.apache.org/jira/browse/SPARK-7113 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Saisai Shao Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7113) Add the direct stream related information to the streaming listener and web UI
[ https://issues.apache.org/jira/browse/SPARK-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7113: --- Assignee: (was: Apache Spark) Add the direct stream related information to the streaming listener and web UI -- Key: SPARK-7113 URL: https://issues.apache.org/jira/browse/SPARK-7113 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Saisai Shao Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala
[ https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7331: --- Assignee: Apache Spark Create HiveConf per application instead of per query in HiveQl.scala Key: SPARK-7331 URL: https://issues.apache.org/jira/browse/SPARK-7331 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Nitin Goyal Assignee: Apache Spark Priority: Minor A new HiveConf is created per query in getAst method in HiveQl.scala def getAst(sql: String): ASTNode = { /* * Context has to be passed in hive0.13.1. * Otherwise, there will be Null pointer exception, * when retrieving properties form HiveConf. */ val hContext = new Context(new HiveConf()) val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext)) hContext.clear() node } Creating hiveConf adds a minimum of 90ms delay per query. So moving its creation in Object such that it gets initialised once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala
[ https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7331: --- Assignee: (was: Apache Spark) Create HiveConf per application instead of per query in HiveQl.scala Key: SPARK-7331 URL: https://issues.apache.org/jira/browse/SPARK-7331 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Nitin Goyal Priority: Minor A new HiveConf is created per query in getAst method in HiveQl.scala def getAst(sql: String): ASTNode = { /* * Context has to be passed in hive0.13.1. * Otherwise, there will be Null pointer exception, * when retrieving properties form HiveConf. */ val hContext = new Context(new HiveConf()) val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext)) hContext.clear() node } Creating hiveConf adds a minimum of 90ms delay per query. So moving its creation in Object such that it gets initialised once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala
[ https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526223#comment-14526223 ] Apache Spark commented on SPARK-7331: - User 'nitin2goyal' has created a pull request for this issue: https://github.com/apache/spark/pull/5880 Create HiveConf per application instead of per query in HiveQl.scala Key: SPARK-7331 URL: https://issues.apache.org/jira/browse/SPARK-7331 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Nitin Goyal Priority: Minor A new HiveConf is created per query in getAst method in HiveQl.scala def getAst(sql: String): ASTNode = { /* * Context has to be passed in hive0.13.1. * Otherwise, there will be Null pointer exception, * when retrieving properties form HiveConf. */ val hContext = new Context(new HiveConf()) val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext)) hContext.clear() node } Creating hiveConf adds a minimum of 90ms delay per query. So moving its creation in Object such that it gets initialised once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala
[ https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nitin Goyal updated SPARK-7331: --- Description: A new HiveConf is created per query in getAst method in HiveQl.scala def getAst(sql: String): ASTNode = { /* * Context has to be passed in hive0.13.1. * Otherwise, there will be Null pointer exception, * when retrieving properties form HiveConf. */ val hContext = new Context(new HiveConf()) val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext)) hContext.clear() node } Creating hiveConf adds a minimum of 90ms delay per query. So moving its creation in Object such that it gets created once. was: A new HiveConf is created per query in getAst method in HiveQl.scala def getAst(sql: String): ASTNode = { /* * Context has to be passed in hive0.13.1. * Otherwise, there will be Null pointer exception, * when retrieving properties form HiveConf. */ val hContext = new Context(new HiveConf()) val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext)) hContext.clear() node } Creating hiveConf adds a minimum of 90ms delay per query. So moving its creation in Object such that it gets initialised once. Create HiveConf per application instead of per query in HiveQl.scala Key: SPARK-7331 URL: https://issues.apache.org/jira/browse/SPARK-7331 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Nitin Goyal Priority: Minor A new HiveConf is created per query in getAst method in HiveQl.scala def getAst(sql: String): ASTNode = { /* * Context has to be passed in hive0.13.1. * Otherwise, there will be Null pointer exception, * when retrieving properties form HiveConf. */ val hContext = new Context(new HiveConf()) val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext)) hContext.clear() node } Creating hiveConf adds a minimum of 90ms delay per query. So moving its creation in Object such that it gets created once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
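The change proposed above is small; a hedged sketch of the shape it could take, with names taken from the snippet in the description and the exact placement being an assumption (this also assumes HiveConf is safe to share across queries here):

{code}
// Build the HiveConf once per application rather than once per query.
private[this] lazy val hiveConf = new HiveConf()

def getAst(sql: String): ASTNode = {
  /* Context still has to be passed in hive 0.13.1. */
  val hContext = new Context(hiveConf)
  val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
  hContext.clear()
  node
}
{code}

This would save the roughly 90 ms per-query construction cost cited in the report.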
[jira] [Assigned] (SPARK-7275) Make LogicalRelation public
[ https://issues.apache.org/jira/browse/SPARK-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7275: --- Assignee: (was: Apache Spark) Make LogicalRelation public --- Key: SPARK-7275 URL: https://issues.apache.org/jira/browse/SPARK-7275 Project: Spark Issue Type: Improvement Components: SQL Reporter: Santiago M. Mola Priority: Minor It seems LogicalRelation is the only part of the LogicalPlan that is not public. This makes it harder to work with full logical plans from third party packages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7275) Make LogicalRelation public
[ https://issues.apache.org/jira/browse/SPARK-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526227#comment-14526227 ] Apache Spark commented on SPARK-7275: - User 'gweidner' has created a pull request for this issue: https://github.com/apache/spark/pull/5881 Make LogicalRelation public --- Key: SPARK-7275 URL: https://issues.apache.org/jira/browse/SPARK-7275 Project: Spark Issue Type: Improvement Components: SQL Reporter: Santiago M. Mola Priority: Minor It seems LogicalRelation is the only part of the LogicalPlan that is not public. This makes it harder to work with full logical plans from third party packages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7275) Make LogicalRelation public
[ https://issues.apache.org/jira/browse/SPARK-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7275: --- Assignee: Apache Spark Make LogicalRelation public --- Key: SPARK-7275 URL: https://issues.apache.org/jira/browse/SPARK-7275 Project: Spark Issue Type: Improvement Components: SQL Reporter: Santiago M. Mola Assignee: Apache Spark Priority: Minor It seems LogicalRelation is the only part of the LogicalPlan that is not public. This makes it harder to work with full logical plans from third party packages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7322) Add DataFrame DSL for window function support
[ https://issues.apache.org/jira/browse/SPARK-7322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526166#comment-14526166 ] Reynold Xin commented on SPARK-7322: Yup. Add DataFrame DSL for window function support - Key: SPARK-7322 URL: https://issues.apache.org/jira/browse/SPARK-7322 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin A good reference implementation: http://www.jooq.org/doc/3.6/manual-single-page/#window-functions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7330) JDBC RDD could lead to NPE when the date field is null
[ https://issues.apache.org/jira/browse/SPARK-7330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526180#comment-14526180 ] Apache Spark commented on SPARK-7330: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/5877 JDBC RDD could lead to NPE when the date field is null -- Key: SPARK-7330 URL: https://issues.apache.org/jira/browse/SPARK-7330 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang This happens because we call DateUtils.fromDate(rs.getDate(xx)) whether the value is null or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7330) JDBC RDD could lead to NPE when the date field is null
[ https://issues.apache.org/jira/browse/SPARK-7330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7330: --- Assignee: (was: Apache Spark) JDBC RDD could lead to NPE when the date field is null -- Key: SPARK-7330 URL: https://issues.apache.org/jira/browse/SPARK-7330 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang This happens because we call DateUtils.fromDate(rs.getDate(xx)) whether the value is null or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
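For context, a minimal sketch of the null-safe pattern the fix needs. The helper name is invented and this is not the code in the PR above; java.sql.ResultSet.getDate returns null for a SQL NULL, so the conversion has to be guarded before DateUtils.fromDate (Spark SQL's internal date helper mentioned in the ticket) is applied.
{code}
import java.sql.ResultSet

// Illustrative helper, not the actual JDBCRDD code: guard the conversion so
// that a NULL date column yields null instead of an NPE inside fromDate.
def safeFromDate(rs: ResultSet, column: Int): Any = {
  val date = rs.getDate(column)
  if (date == null) null else DateUtils.fromDate(date)
}
{code}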
[jira] [Commented] (SPARK-6944) Mechanism to associate generic operator scope with RDD's
[ https://issues.apache.org/jira/browse/SPARK-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526179#comment-14526179 ] Andrew Or commented on SPARK-6944: -- https://github.com/apache/spark/pull/5729 Mechanism to associate generic operator scope with RDD's Key: SPARK-6944 URL: https://issues.apache.org/jira/browse/SPARK-6944 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Patrick Wendell Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6944) Mechanism to associate generic operator scope with RDD's
[ https://issues.apache.org/jira/browse/SPARK-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6944: --- Assignee: Apache Spark (was: Andrew Or) Mechanism to associate generic operator scope with RDD's Key: SPARK-6944 URL: https://issues.apache.org/jira/browse/SPARK-6944 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Patrick Wendell Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6944) Mechanism to associate generic operator scope with RDD's
[ https://issues.apache.org/jira/browse/SPARK-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6944: - Comment: was deleted (was: https://github.com/apache/spark/pull/5729) Mechanism to associate generic operator scope with RDD's Key: SPARK-6944 URL: https://issues.apache.org/jira/browse/SPARK-6944 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Patrick Wendell Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala
[ https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nitin Goyal updated SPARK-7331: --- Description: A new HiveConf is created per query in the getAst method in HiveQl.scala:
{code}
def getAst(sql: String): ASTNode = {
  /*
   * Context has to be passed in Hive 0.13.1.
   * Otherwise there will be a NullPointerException
   * when retrieving properties from HiveConf.
   */
  val hContext = new Context(new HiveConf())
  val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
  hContext.clear()
  node
}
{code}
Creating a HiveConf adds a minimum of 90 ms of delay per query, so we should move its creation into an object so that it gets initialised once.

was: A new HiveConf is created per query in the getAst method in HiveQl.scala:
{code}
def getAst(sql: String): ASTNode = {
  /*
   * Context has to be passed in Hive 0.13.1.
   * Otherwise there will be a NullPointerException
   * when retrieving properties from HiveConf.
   */
  val hContext = new Context(new HiveConf())
  val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
  hContext.clear()
  node
}
{code}
Creating a HiveConf adds a minimum of 90 ms of delay per query, so we should move its creation into an object so that it gets initialised once.

Create HiveConf per application instead of per query in HiveQl.scala Key: SPARK-7331 URL: https://issues.apache.org/jira/browse/SPARK-7331 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Nitin Goyal Priority: Minor

A new HiveConf is created per query in the getAst method in HiveQl.scala:
{code}
def getAst(sql: String): ASTNode = {
  /*
   * Context has to be passed in Hive 0.13.1.
   * Otherwise there will be a NullPointerException
   * when retrieving properties from HiveConf.
   */
  val hContext = new Context(new HiveConf())
  val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
  hContext.clear()
  node
}
{code}
Creating a HiveConf adds a minimum of 90 ms of delay per query, so we should move its creation into an object so that it gets initialised once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala
Nitin Goyal created SPARK-7331: -- Summary: Create HiveConf per application instead of per query in HiveQl.scala Key: SPARK-7331 URL: https://issues.apache.org/jira/browse/SPARK-7331 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.2.0 Reporter: Nitin Goyal Priority: Minor A new HiveConf is created per query in the getAst method in HiveQl.scala:
{code}
def getAst(sql: String): ASTNode = {
  /*
   * Context has to be passed in Hive 0.13.1.
   * Otherwise there will be a NullPointerException
   * when retrieving properties from HiveConf.
   */
  val hContext = new Context(new HiveConf())
  val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
  hContext.clear()
  node
}
{code}
Creating a HiveConf adds a minimum of 90 ms of delay per query, so we should move its creation into an object so that it gets initialised once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
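A minimal sketch of the proposed change, under the assumption that a single shared configuration is safe to reuse across queries; this is illustrative only, not the merged patch. Hoisting the HiveConf out of getAst makes the JVM initialise it once per application.
{code}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.Context
import org.apache.hadoop.hive.ql.parse.{ASTNode, ParseDriver, ParseUtils}

object HiveQl {
  // Built once per application; the ~90 ms construction cost is paid only here.
  private lazy val hiveConf = new HiveConf()

  def getAst(sql: String): ASTNode = {
    // Context still has to be passed on Hive 0.13.1 to avoid an NPE
    // when properties are read from the conf.
    val hContext = new Context(hiveConf)
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }
}
{code}
Whether one HiveConf can be shared safely across concurrent queries is an assumption here; if it is mutated per session, a thread-local or a per-query copy of the shared conf would be the fallback.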
[jira] [Commented] (SPARK-3524) remove workaround to pickle array of float for Pyrolite
[ https://issues.apache.org/jira/browse/SPARK-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526229#comment-14526229 ] Apache Spark commented on SPARK-3524: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5850 remove workaround to pickle array of float for Pyrolite --- Key: SPARK-3524 URL: https://issues.apache.org/jira/browse/SPARK-3524 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Xiangrui Meng After Pyrolite releases a new version with PR https://github.com/irmen/Pyrolite/pull/11, we should remove the workaround introduced in PR https://github.com/apache/spark/pull/2365 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3524) remove workaround to pickle array of float for Pyrolite
[ https://issues.apache.org/jira/browse/SPARK-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3524: --- Assignee: Apache Spark (was: Xiangrui Meng) remove workaround to pickle array of float for Pyrolite --- Key: SPARK-3524 URL: https://issues.apache.org/jira/browse/SPARK-3524 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Apache Spark After Pyrolite releases a new version with PR https://github.com/irmen/Pyrolite/pull/11, we should remove the workaround introduced in PR https://github.com/apache/spark/pull/2365 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3524) remove workaround to pickle array of float for Pyrolite
[ https://issues.apache.org/jira/browse/SPARK-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3524: --- Assignee: Xiangrui Meng (was: Apache Spark) remove workaround to pickle array of float for Pyrolite --- Key: SPARK-3524 URL: https://issues.apache.org/jira/browse/SPARK-3524 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Xiangrui Meng After Pyrolite releases a new version with PR https://github.com/irmen/Pyrolite/pull/11, we should remove the workaround introduced in PR https://github.com/apache/spark/pull/2365 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7241) Pearson correlation for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-7241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7241. Resolution: Fixed Fix Version/s: 1.4.0 Pearson correlation for DataFrames -- Key: SPARK-7241 URL: https://issues.apache.org/jira/browse/SPARK-7241 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Xiangrui Meng Assignee: Burak Yavuz Fix For: 1.4.0 This JIRA is for computing the Pearson linear correlation between two numerical columns in a DataFrame. The method `corr` should live under `df.stat`:
{code}
df.stat.corr(col1, col2, method="pearson"): Double
{code}
`method` will be used when we add other correlations. Similar to SPARK-7240, a UDAF will be added later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
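As a quick illustration of the API sketched above (assuming the final signature matches the ticket; the data here is invented and sqlContext is assumed in scope):
{code}
// Two perfectly linearly related columns should give a Pearson correlation of ~1.0.
val df = sqlContext.range(0, 100).selectExpr("id AS x", "(id * 2) AS y")
val r: Double = df.stat.corr("x", "y", "pearson")
println(r) // expected: 1.0 (up to floating-point error)
{code}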
[jira] [Assigned] (SPARK-6411) PySpark DataFrames can't be created if any datetimes have timezones
[ https://issues.apache.org/jira/browse/SPARK-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6411: --- Assignee: Apache Spark (was: Xiangrui Meng) PySpark DataFrames can't be created if any datetimes have timezones --- Key: SPARK-6411 URL: https://issues.apache.org/jira/browse/SPARK-6411 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Reporter: Harry Brundage Assignee: Apache Spark I am unable to create a DataFrame with PySpark if any of the {{datetime}} objects that pass through the conversion process have a {{tzinfo}} property set. This works fine: {code} In [9]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10),)]).toDF().collect() Out[9]: [Row(_1=datetime.datetime(2014, 7, 8, 11, 10))] {code} as expected, the tuple's schema is inferred as having one anonymous column with a datetime field, and the datetime roundtrips through to the Java side python deserialization and then back into python land upon {{collect}}. This however: {code} In [5]: from dateutil.tz import tzutc In [10]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10, tzinfo=tzutc()),)]).toDF().collect() {code} explodes with {code} Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 12, localhost): net.razorvine.pickle.PickleException: invalid pickle data for datetime; expected 1 or 7 args, got 2 at net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69) at net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:154) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:114) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:114) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:114) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211)
[jira] [Commented] (SPARK-6411) PySpark DataFrames can't be created if any datetimes have timezones
[ https://issues.apache.org/jira/browse/SPARK-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526244#comment-14526244 ] Apache Spark commented on SPARK-6411: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5850 PySpark DataFrames can't be created if any datetimes have timezones --- Key: SPARK-6411 URL: https://issues.apache.org/jira/browse/SPARK-6411 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Reporter: Harry Brundage Assignee: Xiangrui Meng I am unable to create a DataFrame with PySpark if any of the {{datetime}} objects that pass through the conversion process have a {{tzinfo}} property set. This works fine: {code} In [9]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10),)]).toDF().collect() Out[9]: [Row(_1=datetime.datetime(2014, 7, 8, 11, 10))] {code} as expected, the tuple's schema is inferred as having one anonymous column with a datetime field, and the datetime roundtrips through to the Java side python deserialization and then back into python land upon {{collect}}. This however: {code} In [5]: from dateutil.tz import tzutc In [10]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10, tzinfo=tzutc()),)]).toDF().collect() {code} explodes with {code} Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 12, localhost): net.razorvine.pickle.PickleException: invalid pickle data for datetime; expected 1 or 7 args, got 2 at net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69) at net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:154) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:114) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:114) at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:114) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Driver stacktrace: at
[jira] [Assigned] (SPARK-6411) PySpark DataFrames can't be created if any datetimes have timezones
[ https://issues.apache.org/jira/browse/SPARK-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6411: --- Assignee: Xiangrui Meng (was: Apache Spark) PySpark DataFrames can't be created if any datetimes have timezones --- Key: SPARK-6411 URL: https://issues.apache.org/jira/browse/SPARK-6411 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Reporter: Harry Brundage Assignee: Xiangrui Meng I am unable to create a DataFrame with PySpark if any of the {{datetime}} objects that pass through the conversion process have a {{tzinfo}} property set. This works fine: {code} In [9]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10),)]).toDF().collect() Out[9]: [Row(_1=datetime.datetime(2014, 7, 8, 11, 10))] {code} as expected, the tuple's schema is inferred as having one anonymous column with a datetime field, and the datetime roundtrips through to the Java side python deserialization and then back into python land upon {{collect}}. This however: {code} In [5]: from dateutil.tz import tzutc In [10]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10, tzinfo=tzutc()),)]).toDF().collect() {code} explodes with {code} Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 12, localhost): net.razorvine.pickle.PickleException: invalid pickle data for datetime; expected 1 or 7 args, got 2 at net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69) at net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:154) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:114) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:114) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:114) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211)
[jira] [Assigned] (SPARK-6411) PySpark DataFrames can't be created if any datetimes have timezones
[ https://issues.apache.org/jira/browse/SPARK-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-6411: Assignee: Xiangrui Meng (was: Davies Liu) PySpark DataFrames can't be created if any datetimes have timezones --- Key: SPARK-6411 URL: https://issues.apache.org/jira/browse/SPARK-6411 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Reporter: Harry Brundage Assignee: Xiangrui Meng I am unable to create a DataFrame with PySpark if any of the {{datetime}} objects that pass through the conversion process have a {{tzinfo}} property set. This works fine: {code} In [9]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10),)]).toDF().collect() Out[9]: [Row(_1=datetime.datetime(2014, 7, 8, 11, 10))] {code} as expected, the tuple's schema is inferred as having one anonymous column with a datetime field, and the datetime roundtrips through to the Java side python deserialization and then back into python land upon {{collect}}. This however: {code} In [5]: from dateutil.tz import tzutc In [10]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10, tzinfo=tzutc()),)]).toDF().collect() {code} explodes with {code} Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 12, localhost): net.razorvine.pickle.PickleException: invalid pickle data for datetime; expected 1 or 7 args, got 2 at net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69) at net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:154) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:114) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:114) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:114) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211)
[jira] [Commented] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
[ https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526261#comment-14526261 ] Apache Spark commented on SPARK-6514: - User 'cfregly' has created a pull request for this issue: https://github.com/apache/spark/pull/5882 For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself Key: SPARK-6514 URL: https://issues.apache.org/jira/browse/SPARK-6514 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Chris Fregly context: i started the original Kinesis impl with KCL 1.0 (not supported), then finished on KCL 1.1 (supported) without realizing that it's supported. also, we should upgrade to the latest Kinesis Client Library (KCL) which is currently v1.2 right now, i believe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526262#comment-14526262 ] Apache Spark commented on SPARK-5960: - User 'cfregly' has created a pull request for this issue: https://github.com/apache/spark/pull/5882 Allow AWS credentials to be passed to KinesisUtils.createStream() - Key: SPARK-5960 URL: https://issues.apache.org/jira/browse/SPARK-5960 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly Assignee: Chris Fregly Priority: Blocker While IAM roles are preferable, we're seeing a lot of cases where we need to pass AWS credentials when creating the KinesisReceiver. Notes: * Make sure we don't log the credentials anywhere * Maintain compatibility with existing KinesisReceiver-based code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
[ https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6514: --- Assignee: Apache Spark For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself Key: SPARK-6514 URL: https://issues.apache.org/jira/browse/SPARK-6514 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Chris Fregly Assignee: Apache Spark context: i started the original Kinesis impl with KCL 1.0 (not supported), then finished on KCL 1.1 (supported) without realizing that it's supported. also, we should upgrade to the latest Kinesis Client Library (KCL) which is currently v1.2 right now, i believe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6656) Allow the application name to be passed in versus pulling from SparkContext.getAppName()
[ https://issues.apache.org/jira/browse/SPARK-6656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6656: --- Assignee: Apache Spark Allow the application name to be passed in versus pulling from SparkContext.getAppName() - Key: SPARK-6656 URL: https://issues.apache.org/jira/browse/SPARK-6656 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly Assignee: Apache Spark this is useful for the scenario where Kinesis Spark Streaming is being invoked from the Spark Shell. in this case, the application name in the SparkContext is pre-set to Spark Shell. this isn't a common or recommended use case, but it's best to make this configurable outside of SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6656) Allow the application name to be passed in versus pulling from SparkContext.getAppName()
[ https://issues.apache.org/jira/browse/SPARK-6656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526263#comment-14526263 ] Apache Spark commented on SPARK-6656: - User 'cfregly' has created a pull request for this issue: https://github.com/apache/spark/pull/5882 Allow the application name to be passed in versus pulling from SparkContext.getAppName() - Key: SPARK-6656 URL: https://issues.apache.org/jira/browse/SPARK-6656 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly this is useful for the scenario where Kinesis Spark Streaming is being invoked from the Spark Shell. in this case, the application name in the SparkContext is pre-set to Spark Shell. this isn't a common or recommended use case, but it's best to make this configurable outside of SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
[ https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6514: --- Assignee: (was: Apache Spark) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself Key: SPARK-6514 URL: https://issues.apache.org/jira/browse/SPARK-6514 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Chris Fregly context: i started the original Kinesis impl with KCL 1.0 (not supported), then finished on KCL 1.1 (supported) without realizing that it's supported. also, we should upgrade to the latest Kinesis Client Library (KCL) which is currently v1.2 right now, i believe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6656) Allow the application name to be passed in versus pulling from SparkContext.getAppName()
[ https://issues.apache.org/jira/browse/SPARK-6656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6656: --- Assignee: (was: Apache Spark) Allow the application name to be passed in versus pulling from SparkContext.getAppName() - Key: SPARK-6656 URL: https://issues.apache.org/jira/browse/SPARK-6656 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly this is useful for the scenario where Kinesis Spark Streaming is being invoked from the Spark Shell. in this case, the application name in the SparkContext is pre-set to Spark Shell. this isn't a common or recommended use case, but it's best to make this configurable outside of SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module
[ https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6028: --- Priority: Critical (was: Major) Target Version/s: 1.5.0 Provide an alternative RPC implementation based on the network transport module --- Key: SPARK-6028 URL: https://issues.apache.org/jira/browse/SPARK-6028 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Priority: Critical Network transport module implements a low level RPC interface. We can build a new RPC implementation on top of that to replace Akka's. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6280) Remove Akka systemName from Spark
[ https://issues.apache.org/jira/browse/SPARK-6280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6280: --- Target Version/s: 1.5.0 (was: 1.4.0) Remove Akka systemName from Spark - Key: SPARK-6280 URL: https://issues.apache.org/jira/browse/SPARK-6280 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu `systemName` is an Akka concept. An RPC implementation does not need to support it. We can hard-code the system name in Spark and hide it in the internal Akka RPC implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7329) Use itertools.product in ParamGridBuilder
[ https://issues.apache.org/jira/browse/SPARK-7329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7329. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5873 [https://github.com/apache/spark/pull/5873] Use itertools.product in ParamGridBuilder - Key: SPARK-7329 URL: https://issues.apache.org/jira/browse/SPARK-7329 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Fix For: 1.4.0 justinuang suggested the following on https://github.com/apache/spark/pull/5601: {code} [dict(zip(self._param_grid.keys(), prod)) for prod in itertools.product(*self._param_grid.values())] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
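The Python snippet above builds the cross product of the parameter grid in a single comprehension. The same idea in Scala, for readers comparing with the JVM side; this is purely illustrative (the grid contents are invented) and is not how ParamGridBuilder is actually implemented:
{code}
// Cross product of a parameter grid: each resulting Map pairs every param
// name with one value drawn from that param's candidate list.
val grid: Map[String, Seq[Any]] =
  Map("maxIter" -> Seq(10, 100), "regParam" -> Seq(0.1, 0.01))

val combos: Seq[Map[String, Any]] =
  grid.foldLeft(Seq(Map.empty[String, Any])) { case (acc, (name, values)) =>
    for (partial <- acc; v <- values) yield partial + (name -> v)
  }
// combos.size == 4: one Map per (maxIter, regParam) combination
{code}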
[jira] [Updated] (SPARK-6943) Graphically show RDD's included in a stage
[ https://issues.apache.org/jira/browse/SPARK-6943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6943: - Attachment: job-page.png stage-page.png Graphically show RDD's included in a stage -- Key: SPARK-6943 URL: https://issues.apache.org/jira/browse/SPARK-6943 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Patrick Wendell Assignee: Andrew Or Attachments: DAGvisualizationintheSparkWebUI.pdf, job-page.png, stage-page.png -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7327) DataFrame show() method doesn't like empty dataframes
[ https://issues.apache.org/jira/browse/SPARK-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526191#comment-14526191 ] Chen Song commented on SPARK-7327: -- I am working on upgrading the show function. I'll run the empty df test and fix it. DataFrame show() method doesn't like empty dataframes - Key: SPARK-7327 URL: https://issues.apache.org/jira/browse/SPARK-7327 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Olivier Girardot Priority: Minor For an empty DataFrame (for example after a filter), any call to show() ends up with:
{code}
java.util.MissingFormatWidthException: -0s
  at java.util.Formatter$FormatSpecifier.checkGeneral(Formatter.java:2906)
  at java.util.Formatter$FormatSpecifier.<init>(Formatter.java:2680)
  at java.util.Formatter.parse(Formatter.java:2528)
  at java.util.Formatter.format(Formatter.java:2469)
  at java.util.Formatter.format(Formatter.java:2423)
  at java.lang.String.format(String.java:2790)
  at org.apache.spark.sql.DataFrame$$anonfun$showString$2$$anonfun$apply$4.apply(DataFrame.scala:200)
  at org.apache.spark.sql.DataFrame$$anonfun$showString$2$$anonfun$apply$4.apply(DataFrame.scala:199)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.sql.DataFrame$$anonfun$showString$2.apply(DataFrame.scala:199)
  at org.apache.spark.sql.DataFrame$$anonfun$showString$2.apply(DataFrame.scala:198)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:198)
  at org.apache.spark.sql.DataFrame.show(DataFrame.scala:314)
  at org.apache.spark.sql.DataFrame.show(DataFrame.scala:320)
{code}
If no-one takes it by next Friday, I'll fix it; the problem seems to come from the colWidths computation:
{code}
// Compute the width of each column
val colWidths = Array.fill(numCols)(0)
for (row <- rows) {
  for ((cell, i) <- row.zipWithIndex) {
    colWidths(i) = math.max(colWidths(i), cell.length)
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
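One plausible shape of the fix, sketched under the assumption that the column headers are rendered with the same width array: never let a width reach the formatter as 0. This is a sketch, not the patch that was eventually merged, and `fieldNames` stands in for however the column names are available at that point.
{code}
// Seed each column width with a floor (here: the header length, minimum 1)
// so that an empty DataFrame still yields a positive "%-Ns" format width.
val colWidths = fieldNames.map(name => math.max(name.length, 1)).toArray
for (row <- rows) {
  for ((cell, i) <- row.zipWithIndex) {
    colWidths(i) = math.max(colWidths(i), cell.length)
  }
}
{code}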
[jira] [Closed] (SPARK-7024) Improve performance of function containsStar
[ https://issues.apache.org/jira/browse/SPARK-7024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yadong Qi closed SPARK-7024. Resolution: Not A Problem Improve performance of function containsStar Key: SPARK-7024 URL: https://issues.apache.org/jira/browse/SPARK-7024 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Yadong Qi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6873) Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements
[ https://issues.apache.org/jira/browse/SPARK-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526218#comment-14526218 ] Cheng Lian commented on SPARK-6873: --- The UDF synonyms lines are easy to filter out. The table properties lines can be fixed by sorting the results. The SerDe lines can be problematic; maybe we can ignore that case for now. If we plan to add a Java 8 Jenkins builder, I can try to find some time to work on this after the 1.4 release; otherwise it's not that harmful for now. The time-consuming part is regenerating all the golden files, since we need to change the golden file generation and comparison strategies. Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements -- Key: SPARK-6873 URL: https://issues.apache.org/jira/browse/SPARK-6873 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.3.1 Reporter: Sean Owen Assignee: Cheng Lian Priority: Minor As I mentioned, I've been seeing 4 test failures in Hive tests for a while, and it actually still affects master. I think it's a superficial problem that only turns up when running on Java 8, but still, it would probably be an easy fix and good to fix. Specifically, here are the four tests and the bit of output that fails the comparison. I tried to diagnose this but had trouble even finding where some of this occurs, like the list of synonyms.
{code}
- show_tblproperties *** FAILED ***
  Results do not match for show_tblproperties:
  ...
  !== HIVE - 2 row(s) ==    == CATALYST - 2 row(s) ==
  !tmp  true                bar  bar value
  !bar  bar value           tmp  true
  (HiveComparisonTest.scala:391)
{code}
{code}
- show_create_table_serde *** FAILED ***
  Results do not match for show_create_table_serde:
  ...
   WITH SERDEPROPERTIES (            WITH SERDEPROPERTIES (
  !  'serialization.format'='$',       'field.delim'=',',
  !  'field.delim'=',')                'serialization.format'='$')
{code}
{code}
- udf_std *** FAILED ***
  Results do not match for udf_std:
  ...
  !== HIVE - 2 row(s) ==    == CATALYST - 2 row(s) ==
   std(x) - Returns the standard deviation of a set of numbers    std(x) - Returns the standard deviation of a set of numbers
  !Synonyms: stddev_pop, stddev    Synonyms: stddev, stddev_pop
  (HiveComparisonTest.scala:391)
{code}
{code}
- udf_stddev *** FAILED ***
  Results do not match for udf_stddev:
  ...
  !== HIVE - 2 row(s) ==    == CATALYST - 2 row(s) ==
   stddev(x) - Returns the standard deviation of a set of numbers    stddev(x) - Returns the standard deviation of a set of numbers
  !Synonyms: stddev_pop, std    Synonyms: std, stddev_pop
  (HiveComparisonTest.scala:391)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
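A sketch of the normalization idea from the comment above, with invented names (the real change would live in HiveComparisonTest's answer post-processing): drop the synonym lines whose ordering differs across JVMs, and sort the remaining lines so property ordering is stable before diffing.
{code}
// Normalize one query's answer lines before comparing Hive and Catalyst
// output. Only safe for tests whose row order is not significant.
def normalizeAnswer(lines: Seq[String]): Seq[String] =
  lines
    .filterNot(_.startsWith("Synonyms:")) // synonym order differs on Java 8
    .sorted                               // make table-property order stable
{code}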
[jira] [Commented] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time
[ https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526278#comment-14526278 ] Sean Owen commented on SPARK-7326: -- Makes sense; I was just interested in whether this was simplifiable, but that is beside the point. Performing window() on a WindowedDStream doesn't work all the time -- Key: SPARK-7326 URL: https://issues.apache.org/jira/browse/SPARK-7326 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Wesley Miao

Someone reported similar issues before but got no response. http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html I met the same issue recently, and it can be reproduced in 1.3.1 by the following piece of code:

def main(args: Array[String]) {
  val batchInterval = 1234
  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {
    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)
  (1 to 300) foreach { i => rddQueue.enqueue(createRDD(i)) }

  val l1 = rawLogStream.window(Milliseconds(batchInterval.toInt) * 5, Milliseconds(batchInterval.toInt) * 5).reduceByKey(_ + _)
  val l2 = l1.window(Milliseconds(batchInterval.toInt) * 15, Milliseconds(batchInterval.toInt) * 15).reduceByKey(_ + _)

  l1.print()
  l2.print()

  ssc.start()
  ssc.awaitTermination()
}

Here we have two windowed DStream instances, l1 and l2. l1 is the DStream that results from performing window() on the source DStream, with both window and slide duration 5 times the batch interval of the source stream. l2 is the DStream that results from performing window() on l1, with both window and slide duration 3 times l1's batch interval, which is 15 times that of the source stream. From the output of this simple streaming app, I can only see data printed from l1 and no data printed from l2. Diving into the source code, I found that the problem most likely resides in the DStream.slice() implementation, shown below.

def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  }
  if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  val alignedToTime = toTime.floor(slideDuration, zeroTime)
  val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
  logInfo("Slicing from " + fromTime + " to " + toTime + " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
  alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
    if (time >= zeroTime) getOrCompute(time) else None
  })
}

Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, which makes the isTimeValid() check fail for all the remaining computation. The fix would be to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:

def floor(that: Duration, zeroTime: Time): Time = {
  val t = that.milliseconds
  new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
}

And then change DStream.slice to call this new floor function, passing in its zeroTime:

val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)

This way the alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not really 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
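To make the alignment bug concrete, a small worked example with invented numbers (these are not from the ticket): take zeroTime = 1234 ms and slideDuration = 6170 ms (the 5-batch window above). Slicing at t = 10000 ms must align to 7404 ms, a real slide boundary relative to zeroTime, not to the absolute multiple 6170 ms.
{code}
// zeroTime-unaware floor vs. the proposed zeroTime-aware floor, in plain Longs.
val zero  = 1234L  // zeroTime in ms (invented)
val slide = 6170L  // slideDuration in ms = 5 * 1234 (invented)
val t     = 10000L // the time being aligned

val oldFloor = (t / slide) * slide
// oldFloor == 6170; (6170 - 1234) is not a multiple of 6170, so isTimeValid fails

val newFloor = ((t - zero) / slide) * slide + zero
// newFloor == 7404; (7404 - 1234) == 6170, a valid boundary relative to zeroTime
{code}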
[jira] [Updated] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time
[ https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wesley Miao updated SPARK-7326: --- Description: Someone reported similar issues before but got no response. http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html I met the same issue recently, and it can be reproduced in 1.3.1 by the following piece of code:

def main(args: Array[String]) {
  val batchInterval = 1234
  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {
    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)
  (1 to 300) foreach { i => rddQueue.enqueue(createRDD(i)) }

  val l1 = rawLogStream.window(Milliseconds(batchInterval.toInt) * 5, Milliseconds(batchInterval.toInt) * 5).reduceByKey(_ + _)
  val l2 = l1.window(Milliseconds(batchInterval.toInt) * 15, Milliseconds(batchInterval.toInt) * 15).reduceByKey(_ + _)

  l1.print()
  l2.print()

  ssc.start()
  ssc.awaitTermination()
}

Here we have two windowed DStream instances, l1 and l2. l1 is the DStream that results from performing window() on the source DStream, with both window and slide duration 5 times the batch interval of the source stream. l2 is the DStream that results from performing window() on l1, with both window and slide duration 3 times l1's batch interval, which is 15 times that of the source stream. From the output of this simple streaming app, I can only see data printed from l1 and no data printed from l2. Diving into the source code, I found that the problem most likely resides in the DStream.slice() implementation, shown below.

def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  }
  if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  val alignedToTime = toTime.floor(slideDuration, zeroTime)
  val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
  logInfo("Slicing from " + fromTime + " to " + toTime + " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
  alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
    if (time >= zeroTime) getOrCompute(time) else None
  })
}

Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, which makes the isTimeValid() check fail for all the remaining computation. The fix would be to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:

def floor(that: Duration, zeroTime: Time): Time = {
  val t = that.milliseconds
  new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
}

And then change DStream.slice to call this new floor function, passing in its zeroTime:

val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)

This way the alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not really a 0.

was: Someone reported similar issues before but got no response. http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html I met the same issue recently, and it can be reproduced in 1.3.1 by the following piece of code:

def main(args: Array[String]) {
  val batchInterval = 1234
  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {
    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)
  (1 to 300) foreach { i => rddQueue.enqueue(createRDD(i)) }

  val l1 =
[jira] [Updated] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time
[ https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7326: - Description: Someone reported similar issues before but got no response. http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html I met the same issue recently, and it can be reproduced in 1.3.1 by the following piece of code:
{code}
def main(args: Array[String]) {
  val batchInterval = 1234
  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {
    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)
  (1 to 300) foreach { i => rddQueue.enqueue(createRDD(i)) }

  val l1 = rawLogStream.window(Milliseconds(batchInterval.toInt) * 5, Milliseconds(batchInterval.toInt) * 5).reduceByKey(_ + _)
  val l2 = l1.window(Milliseconds(batchInterval.toInt) * 15, Milliseconds(batchInterval.toInt) * 15).reduceByKey(_ + _)

  l1.print()
  l2.print()

  ssc.start()
  ssc.awaitTermination()
}
{code}
Here we have two windowed DStream instances, l1 and l2. l1 is the DStream that results from performing window() on the source DStream, with both window and slide duration 5 times the batch interval of the source stream. l2 is the DStream that results from performing window() on l1, with both window and slide duration 3 times l1's batch interval, which is 15 times that of the source stream. From the output of this simple streaming app, I can only see data printed from l1 and no data printed from l2. Diving into the source code, I found that the problem most likely resides in the DStream.slice() implementation, shown below.
{code}
def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  }
  if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  val alignedToTime = toTime.floor(slideDuration, zeroTime)
  val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
  logInfo("Slicing from " + fromTime + " to " + toTime + " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
  alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
    if (time >= zeroTime) getOrCompute(time) else None
  })
}
{code}
Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, which makes the isTimeValid() check fail for all the remaining computation. The fix would be to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:
{code}
def floor(that: Duration, zeroTime: Time): Time = {
  val t = that.milliseconds
  new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
}
{code}
And then change DStream.slice to call this new floor function, passing in its zeroTime:
{code}
val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
{code}
This way the alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not really a 0.

was: Someone reported similar issues before but got no response. http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html I met the same issue recently, and it can be reproduced in 1.3.1 by the following piece of code:

def main(args: Array[String]) {
  val batchInterval = 1234
  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {
    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)
  (1 to 300) foreach { i =>
[jira] [Updated] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time
[ https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wesley Miao updated SPARK-7326: --- Description:

Someone reported similar issues before but got no response. http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html

And I met the same issue recently; it can be reproduced in 1.3.1 by the following piece of code:

def main(args: Array[String]) {
  val batchInterval = 1234

  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")

  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {
    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)

  (1 to 300) foreach { i =>
    rddQueue.enqueue(createRDD(i))
  }

  val l1 = rawLogStream.window(Milliseconds(batchInterval.toInt) * 5, Milliseconds(batchInterval.toInt) * 5).reduceByKey(_ + _)
  val l2 = l1.window(Milliseconds(batchInterval.toInt) * 15, Milliseconds(batchInterval.toInt) * 15).reduceByKey(_ + _)

  l1.print()
  l2.print()

  ssc.start()
  ssc.awaitTermination()
}

Here we have two windowed DStream instances, l1 and l2. l1 is the DStream produced by calling window() on the source DStream, with both window and sliding duration 5 times the batch interval of the source stream. l2 is the DStream produced by calling window() on l1, with both window and sliding duration 3 times l1's batch interval, i.e. 15 times that of the source stream.

From the output of this simple streaming app, I can only see data printed from l1; nothing is printed from l2.

Diving into the source code, the problem most likely resides in the DStream.slice() implementation, shown below:

def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  }
  if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("toTime (" + toTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  val alignedToTime = toTime.floor(slideDuration)
  val alignedFromTime = fromTime.floor(slideDuration)

  logInfo("Slicing from " + fromTime + " to " + toTime + " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")

  alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
    if (time >= zeroTime) getOrCompute(time) else None
  })
}

Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, causing the isTimeValid() check to fail for all the remaining computation.

The fix would be to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:

def floor(that: Duration, zeroTime: Time): Time = {
  val t = that.milliseconds
  new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
}

And then change DStream.slice to call this new floor function, passing in its zeroTime:

val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)

This way alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not necessarily 0.

was:
Someone reported similar issues before but got no response. http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html

And I met the same issue recently; it can be reproduced in 1.3.1 by the following piece of code:

{code}
def main(args: Array[String]) {
  val batchInterval = 1234

  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")

  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {
    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)

  (1 to 300) foreach { i =>
    rddQueue.enqueue(createRDD(i))
  }

  val l1 =
[jira] [Updated] (SPARK-7249) Updated Hadoop dependencies due to inconsistency in the versions
[ https://issues.apache.org/jira/browse/SPARK-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7249: - Priority: Blocker (was: Minor) Target Version/s: 1.4.0 Affects Version/s: 1.3.1 Shepherd: (was: Antonio Parcero) Labels: (was: Build HADOOP dependencies upgrade) I'm marking this a blocker simply because I think it's not only useful to get this in before the next minor release, but kind of important, as the Hadoop 2.2 default we have in the build today is actually incomplete. I'm not against pushing this back or toning it down, but I want it on the radar for a look before releasing 1.4.0. Updated Hadoop dependencies due to inconsistency in the versions - Key: SPARK-7249 URL: https://issues.apache.org/jira/browse/SPARK-7249 Project: Spark Issue Type: Dependency upgrade Components: Build Affects Versions: 1.3.1 Environment: Ubuntu 14.04. Apache Mesos in cluster mode with HDFS from Cloudera 2.5.0-cdh5.3.3. Reporter: Favio Vázquez Priority: Blocker Updated the Hadoop dependencies due to an inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards-compatibility reasons. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time
[ https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525808#comment-14525808 ] Apache Spark commented on SPARK-7326: - User 'wesleymiao' has created a pull request for this issue: https://github.com/apache/spark/pull/5871

Performing window() on a WindowedDStream doesn't work all the time -- Key: SPARK-7326 URL: https://issues.apache.org/jira/browse/SPARK-7326 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Wesley Miao

Someone reported similar issues before but got no response. http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html

And I met the same issue recently; it can be reproduced in 1.3.1 by the following piece of code:

def main(args: Array[String]) {
  val batchInterval = 1234

  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")

  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {
    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)

  (1 to 300) foreach { i =>
    rddQueue.enqueue(createRDD(i))
  }

  val l1 = rawLogStream.window(Milliseconds(batchInterval.toInt) * 5, Milliseconds(batchInterval.toInt) * 5).reduceByKey(_ + _)
  val l2 = l1.window(Milliseconds(batchInterval.toInt) * 15, Milliseconds(batchInterval.toInt) * 15).reduceByKey(_ + _)

  l1.print()
  l2.print()

  ssc.start()
  ssc.awaitTermination()
}

Here we have two windowed DStream instances, l1 and l2. l1 is the DStream produced by calling window() on the source DStream, with both window and sliding duration 5 times the batch interval of the source stream. l2 is the DStream produced by calling window() on l1, with both window and sliding duration 3 times l1's batch interval, i.e. 15 times that of the source stream.

From the output of this simple streaming app, I can only see data printed from l1; nothing is printed from l2.

Diving into the source code, the problem most likely resides in the DStream.slice() implementation, shown below:

def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  }
  if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("toTime (" + toTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  val alignedToTime = toTime.floor(slideDuration)
  val alignedFromTime = fromTime.floor(slideDuration)

  logInfo("Slicing from " + fromTime + " to " + toTime + " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")

  alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
    if (time >= zeroTime) getOrCompute(time) else None
  })
}

Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, causing the isTimeValid() check to fail for all the remaining computation.

The fix would be to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:

def floor(that: Duration, zeroTime: Time): Time = {
  val t = that.milliseconds
  new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
}

And then change DStream.slice to call this new floor function, passing in its zeroTime:

val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)

This way alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not necessarily 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
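To make the misalignment concrete, here is a small self-contained Scala sketch. The numbers are hypothetical, chosen to mirror the report's 1234 ms batch interval and 5-batch window/slide; zeroTime is assumed to be a nonzero stream start time that is not itself a multiple of the slide duration:

{code}
object FloorAlignmentSketch {
  def main(args: Array[String]): Unit = {
    val zeroTime = 5000L                  // assumed stream start time, in ms
    val slide    = 5 * 1234L              // 6170 ms window/slide duration
    val toTime   = zeroTime + 10 * slide  // a valid batch time: zeroTime plus a multiple of slide

    // floor() rounding toward absolute time 0 (the current behavior):
    val aligned = (toTime / slide) * slide
    println((aligned - zeroTime) % slide) // 1170 -- no longer a multiple, so isTimeValid fails

    // floor(that, zeroTime) as proposed above, rounding relative to zeroTime:
    val fixed = ((toTime - zeroTime) / slide) * slide + zeroTime
    println((fixed - zeroTime) % slide)   // 0 -- alignment is preserved
  }
}
{code}

Any zeroTime that is not itself a multiple of the slide duration reproduces the failure this way.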
[jira] [Commented] (SPARK-1437) Jenkins should build with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525761#comment-14525761 ] Steve Loughran commented on SPARK-1437: --- It'd be good for the pull request test runs to use Java 6 too; I've already had to purge some Java-7-isms that didn't get picked up. Jenkins should build with Java 6 - Key: SPARK-1437 URL: https://issues.apache.org/jira/browse/SPARK-1437 Project: Spark Issue Type: Bug Components: Build, Project Infra Affects Versions: 0.9.0 Reporter: Sean Owen Priority: Minor Labels: javac, jenkins Attachments: Screen Shot 2014-04-07 at 22.53.56.png Apologies if this was already on someone's to-do list, but I wanted to track this, as it bit two commits in the last few weeks. Spark is intended to work with Java 6, and so compiles with source/target 1.6. Java 7 can correctly enforce Java 6 language rules and emit Java 6 bytecode. However, unless otherwise configured with -bootclasspath, javac will use its own (Java 7) library classes. This means code that uses classes new in Java 7 will be allowed to compile, but the result will fail when run on Java 6. This is why you get warnings like ... Using /usr/java/jdk1.7.0_51 as default JAVA_HOME. ... [warn] warning: [options] bootstrap class path not set in conjunction with -source 1.6 The solution is just to tell Jenkins to use Java 6. This may be stating the obvious, but it should just be a setting under Configure for SparkPullRequestBuilder. In our Jenkinses, JDK 6/7/8 are set up; if it's not an option already, I'm guessing it's not too hard to get Java 6 configured on the Amplab machines. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
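A minimal sketch of the failure mode described above (the class name is arbitrary). Compiled on a JDK 7 machine with source/target 1.6 but without -bootclasspath pointing at a Java 6 runtime, this builds cleanly, because the compiler resolves java.nio.file.Paths against the JDK 7 class library; run on a Java 6 JRE, it dies at runtime instead of failing at compile time:

{code}
import java.nio.file.Paths  // class added in Java 7; absent from the Java 6 class library

object Java7Ism {
  def main(args: Array[String]): Unit = {
    // On Java 6 this line throws java.lang.NoClassDefFoundError: java/nio/file/Paths
    println(Paths.get("/tmp").toAbsolutePath)
  }
}
{code}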
[jira] [Commented] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time
[ https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525815#comment-14525815 ] Sean Owen commented on SPARK-7326: -- Out of curiosity, why do you have a window + slide of 5x the batch duration? That is simply the same as making an underlying stream with 5x the batch duration. Maybe this is just a toy example. I know it's not directly relevant to what you're reporting.

Performing window() on a WindowedDStream doesn't work all the time -- Key: SPARK-7326 URL: https://issues.apache.org/jira/browse/SPARK-7326 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Wesley Miao

Someone reported similar issues before but got no response. http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html

And I met the same issue recently; it can be reproduced in 1.3.1 by the following piece of code:

def main(args: Array[String]) {
  val batchInterval = 1234

  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")

  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {
    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)

  (1 to 300) foreach { i =>
    rddQueue.enqueue(createRDD(i))
  }

  val l1 = rawLogStream.window(Milliseconds(batchInterval.toInt) * 5, Milliseconds(batchInterval.toInt) * 5).reduceByKey(_ + _)
  val l2 = l1.window(Milliseconds(batchInterval.toInt) * 15, Milliseconds(batchInterval.toInt) * 15).reduceByKey(_ + _)

  l1.print()
  l2.print()

  ssc.start()
  ssc.awaitTermination()
}

Here we have two windowed DStream instances, l1 and l2. l1 is the DStream produced by calling window() on the source DStream, with both window and sliding duration 5 times the batch interval of the source stream. l2 is the DStream produced by calling window() on l1, with both window and sliding duration 3 times l1's batch interval, i.e. 15 times that of the source stream.

From the output of this simple streaming app, I can only see data printed from l1; nothing is printed from l2.

Diving into the source code, the problem most likely resides in the DStream.slice() implementation, shown below:

def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  }
  if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("toTime (" + toTime + ") is not a multiple of slideDuration (" + slideDuration + ")")
  }
  val alignedToTime = toTime.floor(slideDuration)
  val alignedFromTime = fromTime.floor(slideDuration)

  logInfo("Slicing from " + fromTime + " to " + toTime + " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")

  alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
    if (time >= zeroTime) getOrCompute(time) else None
  })
}

Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, causing the isTimeValid() check to fail for all the remaining computation.

The fix would be to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:

def floor(that: Duration, zeroTime: Time): Time = {
  val t = that.milliseconds
  new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
}

And then change DStream.slice to call this new floor function, passing in its zeroTime:

val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)

This way alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not necessarily 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525903#comment-14525903 ] Josh Rosen commented on SPARK-4105: --- While working on some new shuffle code, I managed to trigger a {{FAILED_TO_UNCOMPRESS(5)}} error by trying to decompress data which was not compressed or which was compressed with the wrong compression codec. This is kind of a long shot, but I wonder if there's a rarely-hit branch in the sort-shuffle write path that doesn't properly wrap an output stream for compression (or that does so with the wrong compression codec).

FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker

We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read. Here's a sample stacktrace from an executor:

{code}
14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053)
java.io.IOException: FAILED_TO_UNCOMPRESS(5)
  at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
  at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
  at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
  at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
  at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
  at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
  at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
  at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
  at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
  at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
  at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
  at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
  at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
  at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
  at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
  at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
  at
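Josh's hypothesis -- bytes that were never Snappy-compressed (or were compressed with a different codec) being handed to the Snappy decompressor -- can be checked in isolation. Below is a hedged sketch using the same org.xerial.snappy classes that appear in the stacktrace; the exact exception message may vary by snappy-java version:

{code}
import java.io.{ByteArrayInputStream, IOException}
import org.xerial.snappy.SnappyInputStream

object SnappyMismatchSketch {
  def main(args: Array[String]): Unit = {
    // Plain bytes with no Snappy framing, standing in for a shuffle block
    // written without (or with the wrong) compression codec.
    val notCompressed = "plain, uncompressed bytes".getBytes("UTF-8")
    try {
      val in = new SnappyInputStream(new ByteArrayInputStream(notCompressed))
      println(in.read())
    } catch {
      // Typically surfaces as java.io.IOException: FAILED_TO_UNCOMPRESS(5)
      case e: IOException => println("decode failed: " + e.getMessage)
    }
  }
}
{code}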
[jira] [Updated] (SPARK-4106) Shuffle write and spill to disk metrics are incorrect
[ https://issues.apache.org/jira/browse/SPARK-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4106: -- Component/s: Shuffle Shuffle write and spill to disk metrics are incorrect - Key: SPARK-4106 URL: https://issues.apache.org/jira/browse/SPARK-4106 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Priority: Critical I have encountered a job that shows nonzero shuffle spill (memory) but a shuffle spill (disk) of 0, as well as a shuffle write of 0. If I switch to hash-based shuffle, where there happens to be no disk spilling, then the shuffle write is correct. I can get more info on a workload to repro this situation, but perhaps that description of the symptoms is sufficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7329) Use itertools.product in ParamGridBuilder
Xiangrui Meng created SPARK-7329: Summary: Use itertools.product in ParamGridBuilder Key: SPARK-7329 URL: https://issues.apache.org/jira/browse/SPARK-7329 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor justinuang suggested the following on https://github.com/apache/spark/pull/5601: {code} [dict(zip(self._param_grid.keys(), prod)) for prod in itertools.product(*self._param_grid.values())] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7022) PySpark is missing ParamGridBuilder
[ https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7022. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5601 [https://github.com/apache/spark/pull/5601] PySpark is missing ParamGridBuilder --- Key: SPARK-7022 URL: https://issues.apache.org/jira/browse/SPARK-7022 Project: Spark Issue Type: New Feature Components: ML, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Omede Firouz Assignee: Omede Firouz Priority: Critical Fix For: 1.4.0 PySpark is missing the entirety of ML.Tuning (see: https://issues.apache.org/jira/browse/SPARK-6940) This is a subticket specifically to track the ParamGridBuilder. The CrossValidator will be dealt with in a followup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7329) Use itertools.product in ParamGridBuilder
[ https://issues.apache.org/jira/browse/SPARK-7329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525962#comment-14525962 ] Apache Spark commented on SPARK-7329: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5873 Use itertools.product in ParamGridBuilder - Key: SPARK-7329 URL: https://issues.apache.org/jira/browse/SPARK-7329 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor justinuang suggested the following on https://github.com/apache/spark/pull/5601: {code} [dict(zip(self._param_grid.keys(), prod)) for prod in itertools.product(*self._param_grid.values())] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7329) Use itertools.product in ParamGridBuilder
[ https://issues.apache.org/jira/browse/SPARK-7329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7329: --- Assignee: Xiangrui Meng (was: Apache Spark) Use itertools.product in ParamGridBuilder - Key: SPARK-7329 URL: https://issues.apache.org/jira/browse/SPARK-7329 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor justinuang suggested the following on https://github.com/apache/spark/pull/5601: {code} [dict(zip(self._param_grid.keys(), prod)) for prod in itertools.product(*self._param_grid.values())] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4112) Have a reserved copy of Sorter/SortDataFormat
[ https://issues.apache.org/jira/browse/SPARK-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4112: -- Component/s: Shuffle Have a reserved copy of Sorter/SortDataFormat - Key: SPARK-4112 URL: https://issues.apache.org/jira/browse/SPARK-4112 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Xiangrui Meng After SPARK-4084, developers can use Sorter with their own SortDataFormat. However, if multiple subclasses of SortDataFormat are instantiated, the JIT won't inline the SortDataFormat methods, and virtual method table lookup is slow. One solution could be to make two copies of the code, reserving one for shuffle only and exposing the other to developers. Before we do that, we should compare the performance with and without JIT inlining and check whether it is worth the extra code complexity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5392) Shuffle spill size is shown as negative
[ https://issues.apache.org/jira/browse/SPARK-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5392: -- Component/s: Shuffle Shuffle spill size is shown as negative --- Key: SPARK-5392 URL: https://issues.apache.org/jira/browse/SPARK-5392 Project: Spark Issue Type: Bug Components: Shuffle, Web UI Affects Versions: 1.2.0 Reporter: Sven Krasser Priority: Minor Attachments: Screen Shot 2015-01-23 at 5.13.55 PM.png The Shuffle Spill (Memory) metric on the Stage Detail Web UI shows as negative for some executors (e.g. -2097152.0 B), see attached screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7328) Add missing items to pyspark.mllib.linalg.Vectors
[ https://issues.apache.org/jira/browse/SPARK-7328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525921#comment-14525921 ] Apache Spark commented on SPARK-7328: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/5872 Add missing items to pyspark.mllib.linalg.Vectors - Key: SPARK-7328 URL: https://issues.apache.org/jira/browse/SPARK-7328 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Add:
1. Class methods squared_dist, dot
2. toString
3. parse
4. norm
5. numNonzeros
6. copy
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7328) Add missing items to pyspark.mllib.linalg.Vectors
[ https://issues.apache.org/jira/browse/SPARK-7328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7328: --- Assignee: Apache Spark Add missing items to pyspark.mllib.linalg.Vectors - Key: SPARK-7328 URL: https://issues.apache.org/jira/browse/SPARK-7328 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Assignee: Apache Spark Add:
1. Class methods squared_dist, dot
2. toString
3. parse
4. norm
5. numNonzeros
6. copy
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7329) Use itertools.product in ParamGridBuilder
[ https://issues.apache.org/jira/browse/SPARK-7329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7329: --- Assignee: Apache Spark (was: Xiangrui Meng) Use itertools.product in ParamGridBuilder - Key: SPARK-7329 URL: https://issues.apache.org/jira/browse/SPARK-7329 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Apache Spark Priority: Minor justinuang suggested the following on https://github.com/apache/spark/pull/5601: {code} [dict(zip(self._param_grid.keys(), prod)) for prod in itertools.product(*self._param_grid.values())] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6980) Akka timeout exceptions indicate which conf controls them
[ https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525927#comment-14525927 ] Bryan Cutler commented on SPARK-6980: - I added another commit to the PR and some basic unit tests. Akka timeout exceptions indicate which conf controls them - Key: SPARK-6980 URL: https://issues.apache.org/jira/browse/SPARK-6980 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Imran Rashid Assignee: Harsh Gupta Priority: Minor Labels: starter Attachments: Spark-6980-Test.scala If you hit one of the akka timeouts, you just get an exception like {code} java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] {code} The exception doesn't indicate how to change the timeout, though there is usually (always?) a corresponding setting in {{SparkConf}}. It would be nice if the exception included the relevant setting. I think this should be pretty easy to do -- we just need to create something like a {{NamedTimeout}}. It would have its own {{await}} method that catches the akka timeout and throws its own exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
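A rough sketch of what such a {{NamedTimeout}} could look like; the names and the conf key below are hypothetical, not an actual Spark API:

{code}
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Awaitable}
import scala.concurrent.duration.FiniteDuration

// Pairs a timeout with the SparkConf key that controls it, so the rethrown
// exception tells the user which setting to bump.
case class NamedTimeout(confKey: String, duration: FiniteDuration) {
  def await[T](awaitable: Awaitable[T]): T =
    try {
      Await.result(awaitable, duration)
    } catch {
      case _: TimeoutException =>
        throw new TimeoutException(
          s"Futures timed out after [$duration]; this timeout is controlled by $confKey")
    }
}
{code}

Call sites would then read e.g. NamedTimeout("spark.akka.askTimeout", 30.seconds).await(future) instead of a bare Await.result.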
[jira] [Commented] (SPARK-5989) Model import/export for LDAModel
[ https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525931#comment-14525931 ] Manoj Kumar commented on SPARK-5989: - Oh, just saw this, but there are a few other PRs of mine in line for review. Will get back to this after they are done. :) Model import/export for LDAModel - Key: SPARK-5989 URL: https://issues.apache.org/jira/browse/SPARK-5989 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar Add save/load for LDAModel and its local and distributed variants. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org