[jira] [Commented] (SPARK-9228) Combine unsafe and codegen into a single option
[ https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706315#comment-14706315 ] Yi Zhou commented on SPARK-9228: Thanks [~davies] ! > Combine unsafe and codegen into a single option > --- > > Key: SPARK-9228 > URL: https://issues.apache.org/jira/browse/SPARK-9228 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.5.0 > > > Before QA, lets flip on features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'
[ https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706306#comment-14706306 ] Saisai Shao commented on SPARK-10122: - Thanks :). > AttributeError: 'RDD' object has no attribute 'offsetRanges' > > > Key: SPARK-10122 > URL: https://issues.apache.org/jira/browse/SPARK-10122 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming >Reporter: Amit Ramesh > Labels: kafka > > SPARK-8389 added the offsetRanges interface to Kafka direct streams. This > however appears to break when chaining operations after a transform > operation. Following is example code that would result in an error (stack > trace below). Note that if the 'count()' operation is taken out of the > example code then this error does not occur anymore, and the Kafka data is > printed. > {code:title=kafka_test.py|collapse=true} > from pyspark import SparkContext > from pyspark.streaming import StreamingContext > from pyspark.streaming.kafka import KafkaUtils > def attach_kafka_metadata(kafka_rdd): > offset_ranges = kafka_rdd.offsetRanges() > return kafka_rdd > if __name__ == "__main__": > sc = SparkContext(appName='kafka-test') > ssc = StreamingContext(sc, 10) > kafka_stream = KafkaUtils.createDirectStream( > ssc, > [TOPIC], > kafkaParams={ > 'metadata.broker.list': BROKERS, > }, > ) > kafka_stream.transform(attach_kafka_metadata).count().pprint() > ssc.start() > ssc.awaitTermination() > {code} > {code:title=Stack trace|collapse=true} > Traceback (most recent call last): > File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", > line 62, in call > r = self.func(t, *rdds) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", > line 332, in > func = lambda t, rdd: oldfunc(rdd) > File "/home/spark/ad_realtime/batch/kafka_test.py", line 7, in > attach_kafka_metadata > offset_ranges = kafka_rdd.offsetRanges() > AttributeError: 'RDD' object has no attribute 'offsetRanges' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9228) Combine unsafe and codegen into a single option
[ https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706289#comment-14706289 ] Davies Liu commented on SPARK-9228: --- Right now, it's an internal configuration (it could be changed or removed in the next release); we keep it only for debugging purposes. > Combine unsafe and codegen into a single option > --- > > Key: SPARK-9228 > URL: https://issues.apache.org/jira/browse/SPARK-9228 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.5.0 > > > Before QA, lets flip on features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
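The consolidated switch referred to above is kept as an internal, debug-only setting; in the 1.5 codebase it appears to be spark.sql.tungsten.enabled, though the exact name is an assumption here since the comment deliberately does not document it. A minimal PySpark sketch of toggling it for debugging:
{code:title=tungsten_flag_demo.py}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="tungsten-flag-demo")
sqlContext = SQLContext(sc)

# Assumed name of the consolidated unsafe+codegen switch in Spark 1.5. It is
# internal and meant for debugging only, so it may change or disappear in a
# later release; do not rely on it in production code.
sqlContext.setConf("spark.sql.tungsten.enabled", "false")   # fall back to the old code path
print(sqlContext.getConf("spark.sql.tungsten.enabled", "true"))
sqlContext.setConf("spark.sql.tungsten.enabled", "true")    # restore the default
{code}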
[jira] [Assigned] (SPARK-10148) Display active and inactive receiver numbers in Streaming page
[ https://issues.apache.org/jira/browse/SPARK-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10148: Assignee: Apache Spark > Display active and inactive receiver numbers in Streaming page > -- > > Key: SPARK-10148 > URL: https://issues.apache.org/jira/browse/SPARK-10148 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Displaying active and inactive receiver numbers in Streaming page is helpful > to understand whether receivers have started or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10148) Display active and inactive receiver numbers in Streaming page
[ https://issues.apache.org/jira/browse/SPARK-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10148: Assignee: (was: Apache Spark) > Display active and inactive receiver numbers in Streaming page > -- > > Key: SPARK-10148 > URL: https://issues.apache.org/jira/browse/SPARK-10148 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu > > Displaying active and inactive receiver numbers in Streaming page is helpful > to understand whether receivers have started or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10148) Display active and inactive receiver numbers in Streaming page
[ https://issues.apache.org/jira/browse/SPARK-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706288#comment-14706288 ] Apache Spark commented on SPARK-10148: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/8351 > Display active and inactive receiver numbers in Streaming page > -- > > Key: SPARK-10148 > URL: https://issues.apache.org/jira/browse/SPARK-10148 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu > > Displaying active and inactive receiver numbers in Streaming page is helpful > to understand whether receivers have started or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'
[ https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706286#comment-14706286 ] Amit Ramesh commented on SPARK-10122: - [~jerryshao] thanks for jumping onto this right away! I tried your patch with the example I have provided in this ticket, and also with the original, more involved, code that we first witnessed this issue in and they both seem to be working fine :). > AttributeError: 'RDD' object has no attribute 'offsetRanges' > > > Key: SPARK-10122 > URL: https://issues.apache.org/jira/browse/SPARK-10122 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming >Reporter: Amit Ramesh > Labels: kafka > > SPARK-8389 added the offsetRanges interface to Kafka direct streams. This > however appears to break when chaining operations after a transform > operation. Following is example code that would result in an error (stack > trace below). Note that if the 'count()' operation is taken out of the > example code then this error does not occur anymore, and the Kafka data is > printed. > {code:title=kafka_test.py|collapse=true} > from pyspark import SparkContext > from pyspark.streaming import StreamingContext > from pyspark.streaming.kafka import KafkaUtils > def attach_kafka_metadata(kafka_rdd): > offset_ranges = kafka_rdd.offsetRanges() > return kafka_rdd > if __name__ == "__main__": > sc = SparkContext(appName='kafka-test') > ssc = StreamingContext(sc, 10) > kafka_stream = KafkaUtils.createDirectStream( > ssc, > [TOPIC], > kafkaParams={ > 'metadata.broker.list': BROKERS, > }, > ) > kafka_stream.transform(attach_kafka_metadata).count().pprint() > ssc.start() > ssc.awaitTermination() > {code} > {code:title=Stack trace|collapse=true} > Traceback (most recent call last): > File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", > line 62, in call > r = self.func(t, *rdds) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", > line 332, in > func = lambda t, rdd: oldfunc(rdd) > File "/home/spark/ad_realtime/batch/kafka_test.py", line 7, in > attach_kafka_metadata > offset_ranges = kafka_rdd.offsetRanges() > AttributeError: 'RDD' object has no attribute 'offsetRanges' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10148) Display active and inactive receiver numbers in Streaming page
Shixiong Zhu created SPARK-10148: Summary: Display active and inactive receiver numbers in Streaming page Key: SPARK-10148 URL: https://issues.apache.org/jira/browse/SPARK-10148 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu Displaying active and inactive receiver numbers in Streaming page is helpful to understand whether receivers have started or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs
[ https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula closed SPARK-10147. Resolution: Not A Problem > App shouldn't show in HistoryServer web when the event file has been deleted > on hdfs > > > Key: SPARK-10147 > URL: https://issues.apache.org/jira/browse/SPARK-10147 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: meiyoula > > Phenomenon:App still shows in HistoryServer web when the event file has been > deleted on hdfs. > Cause: It is because *log-replay-executor* thread and *clean log* thread both > will write value to object *application*, so it has synchronization problem -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9669) Support PySpark with Mesos Cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9669: --- Assignee: Apache Spark > Support PySpark with Mesos Cluster mode > --- > > Key: SPARK-9669 > URL: https://issues.apache.org/jira/browse/SPARK-9669 > Project: Spark > Issue Type: Improvement > Components: Mesos, PySpark >Reporter: Timothy Chen >Assignee: Apache Spark > > PySpark with cluster mode with Mesos is not yet supported. > We need to enable it and make sure it's able to launch Pyspark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9669) Support PySpark with Mesos Cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9669: --- Assignee: (was: Apache Spark) > Support PySpark with Mesos Cluster mode > --- > > Key: SPARK-9669 > URL: https://issues.apache.org/jira/browse/SPARK-9669 > Project: Spark > Issue Type: Improvement > Components: Mesos, PySpark >Reporter: Timothy Chen > > PySpark with cluster mode with Mesos is not yet supported. > We need to enable it and make sure it's able to launch Pyspark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9669) Support PySpark with Mesos Cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706262#comment-14706262 ] Apache Spark commented on SPARK-9669: - User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/8349 > Support PySpark with Mesos Cluster mode > --- > > Key: SPARK-9669 > URL: https://issues.apache.org/jira/browse/SPARK-9669 > Project: Spark > Issue Type: Improvement > Components: Mesos, PySpark >Reporter: Timothy Chen > > PySpark with cluster mode with Mesos is not yet supported. > We need to enable it and make sure it's able to launch Pyspark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8467) Add LDAModel.describeTopics() in Python
[ https://issues.apache.org/jira/browse/SPARK-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706257#comment-14706257 ] Hrishikesh commented on SPARK-8467: --- [~yuu.ishik...@gmail.com], are you still working on this? > Add LDAModel.describeTopics() in Python > --- > > Key: SPARK-8467 > URL: https://issues.apache.org/jira/browse/SPARK-8467 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Yu Ishikawa > > Add LDAModel. describeTopics() in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706244#comment-14706244 ] Reynold Xin commented on SPARK-9999: This needs to be designed first. I'm not sure if static code analysis is a great idea since such analyses often fail. I'm open to ideas though. > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK-9999 > URL: https://issues.apache.org/jira/browse/SPARK-9999 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin > > The RDD API is very flexible, and as a result harder to optimize its > execution in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > As a Spark user, I want an API that sits somewhere in the middle of the > spectrum so I can write most of my applications with that API, and yet it can > be optimized well by Spark to achieve performance and stability. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
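For readers unfamiliar with the UDF friction the description mentions, a small PySpark comparison may help; it is purely illustrative and is not a sketch of the new API being discussed:
{code:title=rdd_vs_dataframe_udf.py}
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

sc = SparkContext(appName="rdd-vs-dataframe-udf")
sqlContext = SQLContext(sc)

rows = sc.parallelize([Row(name="a", age=30), Row(name="b", age=40)])

# RDD style: any Python function can be applied directly, but Spark cannot
# optimize through it.
doubled_rdd = rows.map(lambda r: r.age * 2)

# DataFrame style: the same logic has to be wrapped in a typed UDF, which the
# optimizer also treats as a black box; hence the wish for a middle ground.
df = sqlContext.createDataFrame(rows)
double_age = udf(lambda age: age * 2, IntegerType())
doubled_df = df.select(double_age(df.age).alias("age_doubled"))

print(doubled_rdd.collect())
doubled_df.show()
{code}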
[jira] [Updated] (SPARK-9983) Local physical operators for query execution
[ https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9983: --- Description: In distributed query execution, there are two kinds of operators: (1) operators that exchange data between different executors or threads: examples include broadcast, shuffle. (2) operators that process data in a single thread: examples include project, filter, group by, etc. This ticket proposes clearly differentiating them and create local operators in Spark. This leads to a lot of benefits: easier to test, easier to optimize data exchange, better design (single responsibility), and potentially even having a hyper-optimized single-node version of DataFrame. was: In distributed query execution, there are two kinds of operators: (1) operators that exchange data between different executors or threads: examples include broadcast, shuffle. (2) operators that process data in a single thread: examples include project, filter, group by, etc. This ticket proposes clearly differentiating them and create local operators in Spark. This leads to a lot of benefits: easier to test, easier to optimize data exchange, and better design (single responsibility). > Local physical operators for query execution > > > Key: SPARK-9983 > URL: https://issues.apache.org/jira/browse/SPARK-9983 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Shixiong Zhu > > In distributed query execution, there are two kinds of operators: > (1) operators that exchange data between different executors or threads: > examples include broadcast, shuffle. > (2) operators that process data in a single thread: examples include project, > filter, group by, etc. > This ticket proposes clearly differentiating them and create local operators > in Spark. This leads to a lot of benefits: easier to test, easier to optimize > data exchange, better design (single responsibility), and potentially even > having a hyper-optimized single-node version of DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
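A toy, single-threaded sketch of the "local operator" half of that split; this is not Spark's actual operator interface, only an illustration of why purely local operators are easy to unit test (no shuffle, no cluster):
{code:title=local_operators_sketch.py}
# Illustrative only: two "local" operators that pull rows from an iterator in
# a single thread. Exchange operators (broadcast, shuffle) would sit between
# chains like this one in a real distributed plan.
class Filter(object):
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def __iter__(self):
        return (row for row in self.child if self.predicate(row))

class Project(object):
    def __init__(self, child, columns):
        self.child = child
        self.columns = columns

    def __iter__(self):
        return ({c: row[c] for c in self.columns} for row in self.child)

rows = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]
plan = Project(Filter(rows, lambda r: r["a"] > 1), ["b"])
assert list(plan) == [{"b": 20}, {"b": 30}]   # testable with plain Python data
{code}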
[jira] [Assigned] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs
[ https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10147: Assignee: Apache Spark > App shouldn't show in HistoryServer web when the event file has been deleted > on hdfs > > > Key: SPARK-10147 > URL: https://issues.apache.org/jira/browse/SPARK-10147 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: meiyoula >Assignee: Apache Spark > > Phenomenon:App still shows in HistoryServer web when the event file has been > deleted on hdfs. > Cause: It is because *log-replay-executor* thread and *clean log* thread both > will write value to object *application*, so it has synchronization problem -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs
[ https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10147: Assignee: (was: Apache Spark) > App shouldn't show in HistoryServer web when the event file has been deleted > on hdfs > > > Key: SPARK-10147 > URL: https://issues.apache.org/jira/browse/SPARK-10147 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: meiyoula > > Phenomenon:App still shows in HistoryServer web when the event file has been > deleted on hdfs. > Cause: It is because *log-replay-executor* thread and *clean log* thread both > will write value to object *application*, so it has synchronization problem -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs
[ https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706217#comment-14706217 ] Apache Spark commented on SPARK-10147: -- User 'XuTingjun' has created a pull request for this issue: https://github.com/apache/spark/pull/8348 > App shouldn't show in HistoryServer web when the event file has been deleted > on hdfs > > > Key: SPARK-10147 > URL: https://issues.apache.org/jira/browse/SPARK-10147 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: meiyoula > > Phenomenon:App still shows in HistoryServer web when the event file has been > deleted on hdfs. > Cause: It is because *log-replay-executor* thread and *clean log* thread both > will write value to object *application*, so it has synchronization problem -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs
[ https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-10147: - Description: Phenomenon:App still shows in HistoryServer web when the event file has been deleted on hdfs. Cause: It is because *log-replay-executor* thread and *clean log* thread both will write value to object *application*, so it has synchronization problem was: It is because *log-replay-executor* thread and *clean log* thread both will write value to object *application*, so it has synchronization problem > App shouldn't show in HistoryServer web when the event file has been deleted > on hdfs > > > Key: SPARK-10147 > URL: https://issues.apache.org/jira/browse/SPARK-10147 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: meiyoula > > Phenomenon:App still shows in HistoryServer web when the event file has been > deleted on hdfs. > Cause: It is because *log-replay-executor* thread and *clean log* thread both > will write value to object *application*, so it has synchronization problem -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs
[ https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-10147: - Summary: App shouldn't show in HistoryServer web when the event file has been deleted on hdfs (was: App still shows in HistoryServer web when the event file has been deleted on hdfs) > App shouldn't show in HistoryServer web when the event file has been deleted > on hdfs > > > Key: SPARK-10147 > URL: https://issues.apache.org/jira/browse/SPARK-10147 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: meiyoula > > It is because *log-replay-executor* thread and *clean log* thread both will > write value to object *application*, so it has synchronization problem -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs
[ https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-10147: - Description: It is because *log-replay-executor* thread and *clean log* thread both will write value to object *application*, so it has synchronization problem was:It is because *log-replay-executor* thread and *clean log* thread both will write value to object *application*, so it has synchronization problem > App shouldn't show in HistoryServer web when the event file has been deleted > on hdfs > > > Key: SPARK-10147 > URL: https://issues.apache.org/jira/browse/SPARK-10147 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: meiyoula > > It is because *log-replay-executor* thread and *clean log* thread both will > write value to object *application*, so it has synchronization problem -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10147) App still shows in HistoryServer web when the event file has been deleted on hdfs
meiyoula created SPARK-10147: Summary: App still shows in HistoryServer web when the event file has been deleted on hdfs Key: SPARK-10147 URL: https://issues.apache.org/jira/browse/SPARK-10147 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula It is because *log-replay-executor* thread and *clean log* thread both will write value to object *application*, so it has synchronization problem -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10146) Have an easy way to set data source reader/writer specific confs
[ https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706205#comment-14706205 ] Yin Huai edited comment on SPARK-10146 at 8/21/15 3:42 AM: --- One possible way is that every data source defines a list of confs that can be applied to its reader/writer and we let users set those confs in SQLConf or through data source options. Then, we propagate those confs to the reader/writer. was (Author: yhuai): One possible way to do it is that every data source defines a list of confs that can be applied to its reader/writer and we let users set those confs in SQLConf or through data source options. Then, we propagate those confs to the reader/writer. > Have an easy way to set data source reader/writer specific confs > > > Key: SPARK-10146 > URL: https://issues.apache.org/jira/browse/SPARK-10146 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Right now, it is hard to set data source reader/writer specifics confs > correctly (e.g. parquet's row group size). Users need to set those confs in > hadoop conf before start the application or through > {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It will be > great if we can have an easy to set those confs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10146) Have an easy way to set data source reader/writer specific confs
[ https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706205#comment-14706205 ] Yin Huai commented on SPARK-10146: -- One possible way to do it is that every data source defines a list of confs that can be applied to its reader/writer and we let users set those confs in SQLConf or through data source options. Then, we propagate those confs to the reader/writer. > Have an easy way to set data source reader/writer specific confs > > > Key: SPARK-10146 > URL: https://issues.apache.org/jira/browse/SPARK-10146 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Right now, it is hard to set data source reader/writer specifics confs > correctly (e.g. parquet's row group size). Users need to set those confs in > hadoop conf before start the application or through > {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It will be > great if we can have an easy to set those confs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
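To make the current pain point concrete: from PySpark, the runtime workaround the description alludes to looks roughly like the snippet below, where the shared Hadoop configuration is mutated before writing. The key name parquet.block.size (Parquet's row group size) and the 64 MB value are assumptions used for illustration only.
{code:title=datasource_conf_workaround.py}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="datasource-conf-workaround")
sqlContext = SQLContext(sc)

# Today's workaround: reach into the global Hadoop Configuration at runtime
# (the Scala-side equivalent is the SparkHadoopUtil.get.conf mentioned above).
# This affects every Parquet write in the application, not just one writer.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.setInt("parquet.block.size", 64 * 1024 * 1024)  # assumed row group size knob

df = sqlContext.range(0, 1000000)
df.write.parquet("/tmp/demo_parquet")

# What this ticket asks for would look more like a per-writer option, e.g.
# (hypothetical; option() exists but is not currently propagated to Parquet):
# df.write.option("parquet.block.size", str(64 * 1024 * 1024)).parquet("/tmp/demo_parquet")
{code}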
[jira] [Updated] (SPARK-10146) Have an easy way to set data source reader/writer specific confs
[ https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10146: - Issue Type: Improvement (was: Bug) > Have an easy way to set data source reader/writer specific confs > > > Key: SPARK-10146 > URL: https://issues.apache.org/jira/browse/SPARK-10146 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Right now, it is hard to set data source reader/writer specifics confs > correctly (e.g. parquet's row group size). Users need to set those confs in > hadoop conf before start the application or through > {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It will be > great if we can have an easy to set those confs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10146) Have an easy way to set data source reader/writer specific confs
Yin Huai created SPARK-10146: Summary: Have an easy way to set data source reader/writer specific confs Key: SPARK-10146 URL: https://issues.apache.org/jira/browse/SPARK-10146 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Critical Right now, it is hard to set data source reader/writer specifics confs correctly (e.g. parquet's row group size). Users need to set those confs in hadoop conf before start the application or through {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It will be great if we can have an easy to set those confs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'
[ https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706202#comment-14706202 ] Apache Spark commented on SPARK-10122: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/8347 > AttributeError: 'RDD' object has no attribute 'offsetRanges' > > > Key: SPARK-10122 > URL: https://issues.apache.org/jira/browse/SPARK-10122 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming >Reporter: Amit Ramesh > Labels: kafka > > SPARK-8389 added the offsetRanges interface to Kafka direct streams. This > however appears to break when chaining operations after a transform > operation. Following is example code that would result in an error (stack > trace below). Note that if the 'count()' operation is taken out of the > example code then this error does not occur anymore, and the Kafka data is > printed. > {code:title=kafka_test.py|collapse=true} > from pyspark import SparkContext > from pyspark.streaming import StreamingContext > from pyspark.streaming.kafka import KafkaUtils > def attach_kafka_metadata(kafka_rdd): > offset_ranges = kafka_rdd.offsetRanges() > return kafka_rdd > if __name__ == "__main__": > sc = SparkContext(appName='kafka-test') > ssc = StreamingContext(sc, 10) > kafka_stream = KafkaUtils.createDirectStream( > ssc, > [TOPIC], > kafkaParams={ > 'metadata.broker.list': BROKERS, > }, > ) > kafka_stream.transform(attach_kafka_metadata).count().pprint() > ssc.start() > ssc.awaitTermination() > {code} > {code:title=Stack trace|collapse=true} > Traceback (most recent call last): > File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", > line 62, in call > r = self.func(t, *rdds) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", > line 332, in > func = lambda t, rdd: oldfunc(rdd) > File "/home/spark/ad_realtime/batch/kafka_test.py", line 7, in > attach_kafka_metadata > offset_ranges = kafka_rdd.offsetRanges() > AttributeError: 'RDD' object has no attribute 'offsetRanges' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10143: - Component/s: SQL > Parquet changed the behavior of calculating splits > -- > > Key: SPARK-10143 > URL: https://issues.apache.org/jira/browse/SPARK-10143 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Priority: Critical > > When Parquet's task side metadata is enabled (by default it is enabled and it > needs to be enabled to deal with tables with many files), Parquet delegates > the work of calculating initial splits to FileInputFormat (see > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). > If filesystem's block size is smaller than the row group size and users do > not set min split size, splits in the initial split list will have lots of > dummy splits and they contribute to empty tasks (because the starting point > and ending point of a split does not cover the starting point of a row > group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
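Until this is addressed, one mitigation implied by the description is to raise the FileInputFormat minimum split size to at least the Parquet row group size, so the dummy splits (and the resulting empty tasks) go away. A hedged PySpark sketch; the 128 MB value and the path are placeholders:
{code:title=parquet_split_mitigation.py}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-split-mitigation")
sqlContext = SQLContext(sc)

# If the filesystem block size is smaller than the Parquet row group size,
# force splits to be at least one row group wide. Standard Hadoop 2 key;
# the value below is only an example and should match the table's row groups.
sc._jsc.hadoopConfiguration().set(
    "mapreduce.input.fileinputformat.split.minsize", str(128 * 1024 * 1024))

df = sqlContext.read.parquet("/path/to/parquet/table")   # placeholder path
print(df.rdd.getNumPartitions())   # expect fewer, non-empty tasks
{code}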
[jira] [Comment Edited] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming
[ https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706139#comment-14706139 ] Baogang Wang edited comment on SPARK-10145 at 8/21/15 3:27 AM: --- spark.serializer org.apache.spark.serializer.KryoSerializer spark.akka.frameSize1024 spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0–2041 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0–2041 spark.akka.timeout 900 spark.storage.memoryFraction0.4 spark.rdd.compress true spark.shuffle.blockTransferService nio spark.yarn.executor.memoryOverhead 1024 was (Author: heayin): # Default system properties included when running spark-submit. # This is useful for setting default environmental settings. # Example: # spark.master spark://master:7077 # spark.eventLog.enabled true # spark.eventLog.dir hdfs://namenode:8021/directory spark.serializer org.apache.spark.serializer.KryoSerializer # spark.driver.memory 5g # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three" #spark.core.connection.ack.wait.timeout 3600 #spark.core.connection.auth.wait.timeout3600 spark.akka.frameSize1024 spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0–2041 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0–2041 spark.akka.timeout 900 spark.storage.memoryFraction0.4 spark.rdd.compress true spark.shuffle.blockTransferService nio spark.yarn.executor.memoryOverhead 1024 > Executor exit without useful messages when spark runs in spark-streaming > > > Key: SPARK-10145 > URL: https://issues.apache.org/jira/browse/SPARK-10145 > Project: Spark > Issue Type: Bug > Components: Streaming, YARN > Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 > cores and 32g memory >Reporter: Baogang Wang >Priority: Critical > Original Estimate: 168h > Remaining Estimate: 168h > > Each node is allocated 30g memory by Yarn. > My application receives messages from Kafka by directstream. Each application > consists of 4 dstream window > Spark application is submitted by this command: > spark-submit --class spark_security.safe.SafeSockPuppet --driver-memory 3g > --executor-memory 3g --num-executors 3 --executor-cores 4 --name > safeSparkDealerUser --master yarn --deploy-mode cluster > spark_Security-1.0-SNAPSHOT.jar.nocalse > hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties > After about 1 hours, some executor exits. There is no more yarn logs after > the executor exits and there is no stack when the executor exits. 
> When I see the yarn node manager log, it shows as follows : > 2015-08-17 17:25:41,550 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Start request for container_1439803298368_0005_01_01 by user root > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Creating a new application reference for app application_1439803298368_0005 > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root > IP=172.19.160.102 OPERATION=Start Container Request > TARGET=ContainerManageImpl RESULT=SUCCESS > APPID=application_1439803298368_0005 > CONTAINERID=container_1439803298368_0005_01_01 > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Application application_1439803298368_0005 transitioned from NEW to INITING > 2015-08-17 17:25:41,552 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Adding container_1439803298368_0005_01_01 to application > application_1439803298368_0005 > 2015-08-17 17:25:41,557 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > rollingMonitorInterval is set as -1. The log rolling mornitoring interval is > disabled. The logs will be aggregated after this application is finished. > 2015-08-17 17:25:41,663 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Application application_1439803298368_0005 transitioned from INITING to > RUNNING > 2015-08-17 17:25:41,664 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1439803298368_0005_01_01 transitioned from NEW to > LOCALIZING > 2015-08-17 17:25:41,664 INFO > org.apache
[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706197#comment-14706197 ] Xiangrui Meng commented on SPARK-6192: -- [~srblakcHwak] As I mentioned above, it would be great if you can start with some small features or helping review others' PRs. We need to know each other before we can plan a GSoC project. This is a good place to start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > Enhance MLlib's Python API (GSoC 2015) > -- > > Key: SPARK-6192 > URL: https://issues.apache.org/jira/browse/SPARK-6192 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Manoj Kumar > Labels: gsoc, gsoc2015, mentor > > This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme > is to enhance MLlib's Python API, to make it on par with the Scala/Java API. > The main tasks are: > 1. For all models in MLlib, provide save/load method. This also > includes save/load in Scala. > 2. Python API for evaluation metrics. > 3. Python API for streaming ML algorithms. > 4. Python API for distributed linear algebra. > 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use > customized serialization, making MLLibPythonAPI hard to maintain. It > would be nice to use the DataFrames for serialization. > I'll link the JIRAs for each of the tasks. > Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. > The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming
[ https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706195#comment-14706195 ] Baogang Wang commented on SPARK-10145: -- Streaming batch is 1 second The width of Windows are 60 seconds, 180 seconds, 300 seconds and 600 seconds > Executor exit without useful messages when spark runs in spark-streaming > > > Key: SPARK-10145 > URL: https://issues.apache.org/jira/browse/SPARK-10145 > Project: Spark > Issue Type: Bug > Components: Streaming, YARN > Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 > cores and 32g memory >Reporter: Baogang Wang >Priority: Critical > Original Estimate: 168h > Remaining Estimate: 168h > > Each node is allocated 30g memory by Yarn. > My application receives messages from Kafka by directstream. Each application > consists of 4 dstream window > Spark application is submitted by this command: > spark-submit --class spark_security.safe.SafeSockPuppet --driver-memory 3g > --executor-memory 3g --num-executors 3 --executor-cores 4 --name > safeSparkDealerUser --master yarn --deploy-mode cluster > spark_Security-1.0-SNAPSHOT.jar.nocalse > hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties > After about 1 hours, some executor exits. There is no more yarn logs after > the executor exits and there is no stack when the executor exits. > When I see the yarn node manager log, it shows as follows : > 2015-08-17 17:25:41,550 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Start request for container_1439803298368_0005_01_01 by user root > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Creating a new application reference for app application_1439803298368_0005 > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root > IP=172.19.160.102 OPERATION=Start Container Request > TARGET=ContainerManageImpl RESULT=SUCCESS > APPID=application_1439803298368_0005 > CONTAINERID=container_1439803298368_0005_01_01 > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Application application_1439803298368_0005 transitioned from NEW to INITING > 2015-08-17 17:25:41,552 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Adding container_1439803298368_0005_01_01 to application > application_1439803298368_0005 > 2015-08-17 17:25:41,557 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > rollingMonitorInterval is set as -1. The log rolling mornitoring interval is > disabled. The logs will be aggregated after this application is finished. 
> 2015-08-17 17:25:41,663 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Application application_1439803298368_0005 transitioned from INITING to > RUNNING > 2015-08-17 17:25:41,664 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1439803298368_0005_01_01 transitioned from NEW to > LOCALIZING > 2015-08-17 17:25:41,664 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got > event CONTAINER_INIT for appId application_1439803298368_0005 > 2015-08-17 17:25:41,664 INFO > org.apache.spark.network.yarn.YarnShuffleService: Initializing container > container_1439803298368_0005_01_01 > 2015-08-17 17:25:41,665 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar > transitioned from INIT to DOWNLOADING > 2015-08-17 17:25:41,665 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark_Security-1.0-SNAPSHOT.jar > transitioned from INIT to DOWNLOADING > 2015-08-17 17:25:41,665 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Created localizer for container_1439803298368_0005_01_01 > 2015-08-17 17:25:41,668 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Writing credentials to the nmPrivate file > /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_01.tokens. > Credentials list: > 2015-08-17 17:25:41,682 INFO > org.apache.hadoop.ya
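For context, the topology described in the comment above (1-second batches feeding 60/180/300/600-second windows over a Kafka direct stream) would be declared roughly as follows in PySpark Streaming; the topic, brokers, and per-window logic are placeholders since they are not given in the report:
{code:title=window_topology_sketch.py}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="window-topology-sketch")
ssc = StreamingContext(sc, 1)   # 1-second batch interval, as reported

stream = KafkaUtils.createDirectStream(
    ssc, ["TOPIC"], {"metadata.broker.list": "broker1:9092"})   # placeholders

# Four sliding windows with the widths mentioned in the comment; count() is a
# stand-in for the real per-window processing, which the report does not show.
for width in (60, 180, 300, 600):
    stream.window(width).count().pprint()

ssc.start()
ssc.awaitTermination()
{code}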
[jira] [Updated] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6192: - Target Version/s: (was: 1.5.0) > Enhance MLlib's Python API (GSoC 2015) > -- > > Key: SPARK-6192 > URL: https://issues.apache.org/jira/browse/SPARK-6192 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Manoj Kumar > Labels: gsoc, gsoc2015, mentor > > This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme > is to enhance MLlib's Python API, to make it on par with the Scala/Java API. > The main tasks are: > 1. For all models in MLlib, provide save/load method. This also > includes save/load in Scala. > 2. Python API for evaluation metrics. > 3. Python API for streaming ML algorithms. > 4. Python API for distributed linear algebra. > 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use > customized serialization, making MLLibPythonAPI hard to maintain. It > would be nice to use the DataFrames for serialization. > I'll link the JIRAs for each of the tasks. > Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. > The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6192: - Target Version/s: 1.5.0 > Enhance MLlib's Python API (GSoC 2015) > -- > > Key: SPARK-6192 > URL: https://issues.apache.org/jira/browse/SPARK-6192 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Manoj Kumar > Labels: gsoc, gsoc2015, mentor > > This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme > is to enhance MLlib's Python API, to make it on par with the Scala/Java API. > The main tasks are: > 1. For all models in MLlib, provide save/load method. This also > includes save/load in Scala. > 2. Python API for evaluation metrics. > 3. Python API for streaming ML algorithms. > 4. Python API for distributed linear algebra. > 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use > customized serialization, making MLLibPythonAPI hard to maintain. It > would be nice to use the DataFrames for serialization. > I'll link the JIRAs for each of the tasks. > Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. > The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706191#comment-14706191 ] Xiangrui Meng commented on SPARK-6192: -- Not yet, officially. > Enhance MLlib's Python API (GSoC 2015) > -- > > Key: SPARK-6192 > URL: https://issues.apache.org/jira/browse/SPARK-6192 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Manoj Kumar > Labels: gsoc, gsoc2015, mentor > > This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme > is to enhance MLlib's Python API, to make it on par with the Scala/Java API. > The main tasks are: > 1. For all models in MLlib, provide save/load method. This also > includes save/load in Scala. > 2. Python API for evaluation metrics. > 3. Python API for streaming ML algorithms. > 4. Python API for distributed linear algebra. > 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use > customized serialization, making MLLibPythonAPI hard to maintain. It > would be nice to use the DataFrames for serialization. > I'll link the JIRAs for each of the tasks. > Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. > The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8530) Add Python API for MinMaxScaler
[ https://issues.apache.org/jira/browse/SPARK-8530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8530: - Component/s: PySpark > Add Python API for MinMaxScaler > --- > > Key: SPARK-8530 > URL: https://issues.apache.org/jira/browse/SPARK-8530 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8050) Make Savable and Loader Java-friendly.
[ https://issues.apache.org/jira/browse/SPARK-8050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8050: - Target Version/s: 1.6.0 (was: 1.5.0) > Make Savable and Loader Java-friendly. > -- > > Key: SPARK-8050 > URL: https://issues.apache.org/jira/browse/SPARK-8050 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0, 1.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > > Should overload save/load to accept JavaSparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8530) Add Python API for MinMaxScaler
[ https://issues.apache.org/jira/browse/SPARK-8530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8530: - Target Version/s: 1.6.0 (was: 1.5.0) > Add Python API for MinMaxScaler > --- > > Key: SPARK-8530 > URL: https://issues.apache.org/jira/browse/SPARK-8530 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'
[ https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706186#comment-14706186 ] Saisai Shao commented on SPARK-10122: - Hi [~aramesh], thanks a lot for pointing this out. This is actually a bug, sorry for not covering it in the unit test. The problem is Python will compact a series of {{TransformedDStream}} into one: {code} if (isinstance(prev, TransformedDStream) and not prev.is_cached and not prev.is_checkpointed): prev_func = prev.func self.func = lambda t, rdd: func(t, prev_func(t, rdd)) self.prev = prev.prev {code} As {{KafkaTransformedDStream}} is a subclass of {{TransformedDStream}}, so it will be compacted to replace with its parent DStream, as the code shows {{self.prev = prev.prev}}, which is a DStream, get offset ranges on DStream will throw an exception as you mentioned before. I will submit a PR to fix this, so you could try with the patch to see if it is fixed. > AttributeError: 'RDD' object has no attribute 'offsetRanges' > > > Key: SPARK-10122 > URL: https://issues.apache.org/jira/browse/SPARK-10122 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming >Reporter: Amit Ramesh > Labels: kafka > > SPARK-8389 added the offsetRanges interface to Kafka direct streams. This > however appears to break when chaining operations after a transform > operation. Following is example code that would result in an error (stack > trace below). Note that if the 'count()' operation is taken out of the > example code then this error does not occur anymore, and the Kafka data is > printed. > {code:title=kafka_test.py|collapse=true} > from pyspark import SparkContext > from pyspark.streaming import StreamingContext > from pyspark.streaming.kafka import KafkaUtils > def attach_kafka_metadata(kafka_rdd): > offset_ranges = kafka_rdd.offsetRanges() > return kafka_rdd > if __name__ == "__main__": > sc = SparkContext(appName='kafka-test') > ssc = StreamingContext(sc, 10) > kafka_stream = KafkaUtils.createDirectStream( > ssc, > [TOPIC], > kafkaParams={ > 'metadata.broker.list': BROKERS, > }, > ) > kafka_stream.transform(attach_kafka_metadata).count().pprint() > ssc.start() > ssc.awaitTermination() > {code} > {code:title=Stack trace|collapse=true} > Traceback (most recent call last): > File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", > line 62, in call > r = self.func(t, *rdds) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File > "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line > 616, in > self.func = lambda t, rdd: func(t, prev_func(t, rdd)) > File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", > line 332, in > func = lambda t, rdd: oldfunc(rdd) > File "/home/spark/ad_realtime/batch/kafka_test.py", line 7, in > attach_kafka_metadata > offset_ranges = kafka_rdd.offsetRanges() > AttributeError: 'RDD' object has no attribute 'offsetRanges' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
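To make the proposed direction concrete, here is a minimal sketch of the kind of guard that could sit in {{TransformedDStream.__init__}} in {{pyspark/streaming/dstream.py}}. This is an illustration only, not the submitted PR: fold the previous stage in only when it is exactly a plain {{TransformedDStream}}, so subclasses that carry extra state (such as the Kafka transformed stream) keep their own RDD wrappers and {{offsetRanges()}} stays reachable.
{code}
# Hypothetical guard, adapted from the snippet quoted above (not the actual patch):
# merge only plain TransformedDStream instances and leave subclasses intact.
if (type(prev) is TransformedDStream and
        not prev.is_cached and not prev.is_checkpointed):
    prev_func = prev.func
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
    self.prev = prev.prev
else:
    self.func = func
    self.prev = prev
{code}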
[jira] [Resolved] (SPARK-8854) Documentation for Association Rules
[ https://issues.apache.org/jira/browse/SPARK-8854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8854. -- Resolution: Duplicate > Documentation for Association Rules > --- > > Key: SPARK-8854 > URL: https://issues.apache.org/jira/browse/SPARK-8854 > Project: Spark > Issue Type: Documentation > Components: MLlib >Reporter: Feynman Liang >Priority: Minor > > Documentation describing how to generate association rules from frequent > itemsets needs to be provided. The relevant method is > {{FPGrowthModel.generateAssociationRules}}. This will likely be added to the > existing section for frequent-itemsets using FPGrowth. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8521) Feature Transformers in 1.5
[ https://issues.apache.org/jira/browse/SPARK-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8521: - Fix Version/s: 1.5.0 > Feature Transformers in 1.5 > --- > > Key: SPARK-8521 > URL: https://issues.apache.org/jira/browse/SPARK-8521 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.5.0 > > > This is a list of feature transformers we plan to add in Spark 1.5. Feel free > to propose useful transformers that are not on the list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7455) Perf test for LDA (EM/online)
[ https://issues.apache.org/jira/browse/SPARK-7455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7455: - Fix Version/s: 1.5.0 > Perf test for LDA (EM/online) > - > > Key: SPARK-7455 > URL: https://issues.apache.org/jira/browse/SPARK-7455 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: yuhao yang > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7536) Audit MLlib Python API for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7536: - Fix Version/s: 1.5.0 > Audit MLlib Python API for 1.4 > -- > > Key: SPARK-7536 > URL: https://issues.apache.org/jira/browse/SPARK-7536 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > Fix For: 1.5.0 > > > **NOTE: This is targeted at 1.5.0 because it has so many useful links for > JIRAs targeted for 1.5.0. In the future, we should create a _new_ JIRA for > linking future items.** > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? SPARK-7667 > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. [SPARK-7666], [SPARK-6173] > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. SPARK-7665 > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python. > ** classification > *** StreamingLogisticRegressionWithSGD SPARK-7633 > ** clustering > *** GaussianMixture SPARK-6258 > *** LDA SPARK-6259 > *** Power Iteration Clustering SPARK-5962 > *** StreamingKMeans SPARK-4118 > ** evaluation > *** MultilabelMetrics SPARK-6094 > ** feature > *** ElementwiseProduct SPARK-7605 > *** PCA SPARK-7604 > ** linalg > *** Distributed linear algebra SPARK-6100 > ** pmml.export SPARK-7638 > ** regression > *** StreamingLinearRegressionWithSGD SPARK-4127 > ** stat > *** KernelDensity SPARK-7639 > ** util > *** MLUtils SPARK-6263 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8633) List missing model methods in Python Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8633: - Fix Version/s: 1.5.0 > List missing model methods in Python Pipeline API > - > > Key: SPARK-8633 > URL: https://issues.apache.org/jira/browse/SPARK-8633 > Project: Spark > Issue Type: Task > Components: ML, PySpark >Reporter: Xiangrui Meng >Assignee: Manoj Kumar > Fix For: 1.5.0 > > > Most Python models under the pipeline API are implemented as JavaModel > wrappers. However, we didn't provide methods to extract information from > model. In SPARK-7647, we added weights and intercept to linear models. This > JIRA is to list all missing model methods, create JIRAs for each, and link > them here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9712) List source compatibility issues in Scala API from scaladocs
[ https://issues.apache.org/jira/browse/SPARK-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9712: - Fix Version/s: 1.5.0 > List source compatibility issues in Scala API from scaladocs > > > Key: SPARK-9712 > URL: https://issues.apache.org/jira/browse/SPARK-9712 > Project: Spark > Issue Type: Task > Components: ML, MLlib >Reporter: Feynman Liang >Assignee: Feynman Liang > Fix For: 1.5.0 > > Attachments: scaladoc-compare.diff > > > Generate raw scaladocs and use {{scala/tools/scaladoc-compare}} to show > changes to public APIs and documentations. These results are access-modifier > aware since they run on the Scala source rather than generated classfiles, > but will include documentation changes which may not affect behavior. > Results attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9706) List Binary and Source Compatibility Issues with japi-compliance checker
[ https://issues.apache.org/jira/browse/SPARK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9706: - Fix Version/s: 1.5.0 > List Binary and Source Compatibility Issues with japi-compliance checker > > > Key: SPARK-9706 > URL: https://issues.apache.org/jira/browse/SPARK-9706 > Project: Spark > Issue Type: Task > Components: ML, MLlib >Reporter: Feynman Liang >Assignee: Feynman Liang > Fix For: 1.5.0 > > Attachments: compat_reports.zip > > > To identify potential API issues, list public API changes which affect binary > and source incompatibility by using command: > {code} > japi-compliance-checker spark-mllib_2.10-1.4.2-SNAPSHOT.jar > spark-mllib_2.10-1.5.0-SNAPSHOT.jar > {code} > Report result attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8400) ml.ALS doesn't handle -1 block size
[ https://issues.apache.org/jira/browse/SPARK-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706182#comment-14706182 ] Xiangrui Meng commented on SPARK-8400: -- Sorry for my late reply! We check numBlocks in LocalIndexEncoder. However, I'm not sure whether this happens before any data shuffling. It might be better to check numUserBlocks and numItemBlocks directly. > ml.ALS doesn't handle -1 block size > --- > > Key: SPARK-8400 > URL: https://issues.apache.org/jira/browse/SPARK-8400 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.3.1 >Reporter: Xiangrui Meng >Assignee: Bryan Cutler > > Under spark.mllib, if number blocks is set to -1, we set the block size > automatically based on the input partition size. However, this behavior is > not preserved in the spark.ml API. If user sets -1 in Spark 1.3, it will not > work, but no error messages will show. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
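A minimal validation sketch along those lines (purely illustrative and standalone; the parameter names mirror the block-count settings discussed here, and the check itself is an assumption rather than the eventual fix):
{code}
# Hypothetical up-front check for spark.ml ALS block counts: fail fast on -1
# instead of silently misbehaving, since per this issue the -1 auto-sizing
# convention only exists in spark.mllib.
def check_block_params(num_user_blocks, num_item_blocks):
    for name, value in [("numUserBlocks", num_user_blocks),
                        ("numItemBlocks", num_item_blocks)]:
        if value < 1:
            raise ValueError("%s must be >= 1, got %d" % (name, value))

check_block_params(10, 10)    # fine
# check_block_params(-1, 10)  # would raise ValueError before any shuffling
{code}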
[jira] [Updated] (SPARK-10137) Avoid to restart receivers if scheduleReceivers returns balanced results
[ https://issues.apache.org/jira/browse/SPARK-10137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10137: -- Assignee: Shixiong Zhu > Avoid to restart receivers if scheduleReceivers returns balanced results > > > Key: SPARK-10137 > URL: https://issues.apache.org/jira/browse/SPARK-10137 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Critical > > In some cases, even if scheduleReceivers returns balanced results, > ReceiverTracker still may reject some receivers and force them to restart. > See my PR for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10137) Avoid to restart receivers if scheduleReceivers returns balanced results
[ https://issues.apache.org/jira/browse/SPARK-10137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10137: -- Priority: Critical (was: Major) > Avoid to restart receivers if scheduleReceivers returns balanced results > > > Key: SPARK-10137 > URL: https://issues.apache.org/jira/browse/SPARK-10137 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Priority: Critical > > In some cases, even if scheduleReceivers returns balanced results, > ReceiverTracker still may reject some receivers and force them to restart. > See my PR for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9848) Add @Since annotation to new public APIs in 1.5
[ https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706178#comment-14706178 ] Xiangrui Meng edited comment on SPARK-9848 at 8/21/15 3:10 AM: --- No, that would be too much for this release. We plan to do that after 1.5. If you cannot find more new public APIs under spark.mllib, we can mark this as resolved. was (Author: mengxr): No, that would be too much for this release. We plan to do that after 1.5. > Add @Since annotation to new public APIs in 1.5 > --- > > Key: SPARK-9848 > URL: https://issues.apache.org/jira/browse/SPARK-9848 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Manoj Kumar >Priority: Critical > Labels: starter > > We should get a list of new APIs from SPARK-9660. cc: [~fliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8665) Update ALS documentation to include performance tips
[ https://issues.apache.org/jira/browse/SPARK-8665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8665: - Target Version/s: 1.6.0 (was: 1.5.0) > Update ALS documentation to include performance tips > > > Key: SPARK-8665 > URL: https://issues.apache.org/jira/browse/SPARK-8665 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Original Estimate: 1h > Remaining Estimate: 1h > > With the new ALS implementation, users still need to deal with > computation/communication trade-offs. It would be nice to document this > clearly based on the issues on the mailing list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9848) Add @Since annotation to new public APIs in 1.5
[ https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706178#comment-14706178 ] Xiangrui Meng commented on SPARK-9848: -- No, that would be too much for this release. We plan to do that after 1.5. > Add @Since annotation to new public APIs in 1.5 > --- > > Key: SPARK-9848 > URL: https://issues.apache.org/jira/browse/SPARK-9848 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Manoj Kumar >Priority: Critical > Labels: starter > > We should get a list of new APIs from SPARK-9660. cc: [~fliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9846) User guide for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9846. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8262 [https://github.com/apache/spark/pull/8262] > User guide for Multilayer Perceptron Classifier > --- > > Key: SPARK-9846 > URL: https://issues.apache.org/jira/browse/SPARK-9846 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng >Assignee: Alexander Ulanov > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10140) Add target fields to @Since annotation
[ https://issues.apache.org/jira/browse/SPARK-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10140. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8344 [https://github.com/apache/spark/pull/8344] > Add target fields to @Since annotation > -- > > Key: SPARK-10140 > URL: https://issues.apache.org/jira/browse/SPARK-10140 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.5.0 > > > Add target fields to @Since so constructor params and fields also get > annotated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9228) Combine unsafe and codegen into a single option
[ https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706152#comment-14706152 ] Yi Zhou commented on SPARK-9228: Hi [~davies], after introducing 'spark.sql.tungsten.enabled', does it mean that the previous two settings (spark.sql.unsafe.enabled and spark.sql.codegen) will both be deprecated or removed? Currently I can still see these parameters in the Spark SQL CLI, as shown below: 15/08/21 10:28:54 INFO DAGScheduler: Job 6 finished: processCmd at CliDriver.java:376, took 0.191960 s spark.sql.unsafe.enabled true Time taken: 0.253 seconds, Fetched 1 row(s) 15/08/21 10:34:10 INFO DAGScheduler: Job 7 finished: processCmd at CliDriver.java:376, took 0.284666 s spark.sql.codegen true Time taken: 0.336 seconds, Fetched 1 row(s) 15/08/21 10:34:10 INFO CliDriver: Time taken: 0.336 seconds, Fetched 1 row(s) > Combine unsafe and codegen into a single option > --- > > Key: SPARK-9228 > URL: https://issues.apache.org/jira/browse/SPARK-9228 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.5.0 > > > Before QA, lets flip on features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
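For anyone who wants to inspect the consolidated flag from PySpark rather than the SQL CLI, here is a small illustrative snippet. It only assumes the key names quoted in this thread; it does not answer whether the two older settings are still honored, which is exactly the question above.
{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="tungsten-flag-check")
sqlContext = SQLContext(sc)

# Equivalent to running "SET spark.sql.tungsten.enabled=true;" in the SQL CLI.
sqlContext.setConf("spark.sql.tungsten.enabled", "true")
for key in ("spark.sql.tungsten.enabled",
            "spark.sql.unsafe.enabled",
            "spark.sql.codegen"):
    print("%s=%s" % (key, sqlContext.getConf(key, "<unset>")))
{code}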
[jira] [Commented] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming
[ https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706143#comment-14706143 ] Baogang Wang commented on SPARK-10145: -- 4 other applications run at the same time and each of them has the same count of executor. And 3g memory is allocated to each executor > Executor exit without useful messages when spark runs in spark-streaming > > > Key: SPARK-10145 > URL: https://issues.apache.org/jira/browse/SPARK-10145 > Project: Spark > Issue Type: Bug > Components: Streaming, YARN > Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 > cores and 32g memory >Reporter: Baogang Wang >Priority: Critical > Original Estimate: 168h > Remaining Estimate: 168h > > Each node is allocated 30g memory by Yarn. > My application receives messages from Kafka by directstream. Each application > consists of 4 dstream window > Spark application is submitted by this command: > spark-submit --class spark_security.safe.SafeSockPuppet --driver-memory 3g > --executor-memory 3g --num-executors 3 --executor-cores 4 --name > safeSparkDealerUser --master yarn --deploy-mode cluster > spark_Security-1.0-SNAPSHOT.jar.nocalse > hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties > After about 1 hours, some executor exits. There is no more yarn logs after > the executor exits and there is no stack when the executor exits. > When I see the yarn node manager log, it shows as follows : > 2015-08-17 17:25:41,550 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Start request for container_1439803298368_0005_01_01 by user root > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Creating a new application reference for app application_1439803298368_0005 > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root > IP=172.19.160.102 OPERATION=Start Container Request > TARGET=ContainerManageImpl RESULT=SUCCESS > APPID=application_1439803298368_0005 > CONTAINERID=container_1439803298368_0005_01_01 > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Application application_1439803298368_0005 transitioned from NEW to INITING > 2015-08-17 17:25:41,552 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Adding container_1439803298368_0005_01_01 to application > application_1439803298368_0005 > 2015-08-17 17:25:41,557 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > rollingMonitorInterval is set as -1. The log rolling mornitoring interval is > disabled. The logs will be aggregated after this application is finished. 
> 2015-08-17 17:25:41,663 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Application application_1439803298368_0005 transitioned from INITING to > RUNNING > 2015-08-17 17:25:41,664 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1439803298368_0005_01_01 transitioned from NEW to > LOCALIZING > 2015-08-17 17:25:41,664 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got > event CONTAINER_INIT for appId application_1439803298368_0005 > 2015-08-17 17:25:41,664 INFO > org.apache.spark.network.yarn.YarnShuffleService: Initializing container > container_1439803298368_0005_01_01 > 2015-08-17 17:25:41,665 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar > transitioned from INIT to DOWNLOADING > 2015-08-17 17:25:41,665 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark_Security-1.0-SNAPSHOT.jar > transitioned from INIT to DOWNLOADING > 2015-08-17 17:25:41,665 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Created localizer for container_1439803298368_0005_01_01 > 2015-08-17 17:25:41,668 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Writing credentials to the nmPrivate file > /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_01.tokens. > Credentials list: > 2015-08-17 17:25:4
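As a rough sanity check on the memory figures quoted in this thread, here is a back-of-the-envelope estimate (my arithmetic only, not a diagnosis; real container sizes also depend on YARN's minimum-allocation rounding, and the {{spark.yarn.executor.memoryOverhead 1024}} value comes from the configuration shown later in the thread):
{code}
# Back-of-the-envelope YARN memory estimate from the numbers in this thread.
executor_container_mb = 3 * 1024 + 1024   # --executor-memory 3g + 1024 MB overhead
executors_per_app = 3                     # --num-executors 3
apps = 1 + 4                              # this application plus the four others mentioned above
total_executor_mb = apps * executors_per_app * executor_container_mb
print(total_executor_mb)                  # 61440 MB (~60g) for executors, plus one ~3-4g
                                          # driver/AM container per app, versus roughly
                                          # 6 nodes * 30g allocatable by YARN.
{code}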
[jira] [Comment Edited] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming
[ https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706139#comment-14706139 ] Baogang Wang edited comment on SPARK-10145 at 8/21/15 2:34 AM: --- # Default system properties included when running spark-submit. # This is useful for setting default environmental settings. # Example: # spark.master spark://master:7077 # spark.eventLog.enabled true # spark.eventLog.dir hdfs://namenode:8021/directory spark.serializer org.apache.spark.serializer.KryoSerializer # spark.driver.memory 5g # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three" #spark.core.connection.ack.wait.timeout 3600 #spark.core.connection.auth.wait.timeout3600 spark.akka.frameSize1024 spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0–2041 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0–2041 spark.akka.timeout 900 spark.storage.memoryFraction0.4 spark.rdd.compress true spark.shuffle.blockTransferService nio spark.yarn.executor.memoryOverhead 1024 was (Author: heayin): the spark-defaults.conf is as follows: Default system properties included when running spark-submit. # This is useful for setting default environmental settings. # Example: # spark.master spark://master:7077 # spark.eventLog.enabled true # spark.eventLog.dir hdfs://namenode:8021/directory spark.serializer org.apache.spark.serializer.KryoSerializer # spark.driver.memory 5g # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three" #spark.core.connection.ack.wait.timeout 3600 #spark.core.connection.auth.wait.timeout3600 spark.akka.frameSize1024 spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0–2041 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0–2041 spark.akka.timeout 900 spark.storage.memoryFraction0.4 spark.rdd.compress true spark.shuffle.blockTransferService nio spark.yarn.executor.memoryOverhead 1024 > Executor exit without useful messages when spark runs in spark-streaming > > > Key: SPARK-10145 > URL: https://issues.apache.org/jira/browse/SPARK-10145 > Project: Spark > Issue Type: Bug > Components: Streaming, YARN > Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 > cores and 32g memory >Reporter: Baogang Wang >Priority: Critical > Original Estimate: 168h > Remaining Estimate: 168h > > Each node is allocated 30g memory by Yarn. > My application receives messages from Kafka by directstream. Each application > consists of 4 dstream window > Spark application is submitted by this command: > spark-submit --class spark_security.safe.SafeSockPuppet --driver-memory 3g > --executor-memory 3g --num-executors 3 --executor-cores 4 --name > safeSparkDealerUser --master yarn --deploy-mode cluster > spark_Security-1.0-SNAPSHOT.jar.nocalse > hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties > After about 1 hours, some executor exits. There is no more yarn logs after > the executor exits and there is no stack when the executor exits. 
> When I see the yarn node manager log, it shows as follows : > 2015-08-17 17:25:41,550 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Start request for container_1439803298368_0005_01_01 by user root > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Creating a new application reference for app application_1439803298368_0005 > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root > IP=172.19.160.102 OPERATION=Start Container Request > TARGET=ContainerManageImpl RESULT=SUCCESS > APPID=application_1439803298368_0005 > CONTAINERID=container_1439803298368_0005_01_01 > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Application application_1439803298368_0005 transitioned from NEW to INITING > 2015-08-17 17:25:41,552 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Adding container_1439803298368_0005_01_01 to application > application_1439803298368_0005 > 2015-08-17 17:25:41,557 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > rollingMonitorInterval is set as -1. The
[jira] [Commented] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming
[ https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706139#comment-14706139 ] Baogang Wang commented on SPARK-10145: -- the spark-defaults.conf is as follows: Default system properties included when running spark-submit. # This is useful for setting default environmental settings. # Example: # spark.master spark://master:7077 # spark.eventLog.enabled true # spark.eventLog.dir hdfs://namenode:8021/directory spark.serializer org.apache.spark.serializer.KryoSerializer # spark.driver.memory 5g # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three" #spark.core.connection.ack.wait.timeout 3600 #spark.core.connection.auth.wait.timeout3600 spark.akka.frameSize1024 spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0–2041 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0–2041 spark.akka.timeout 900 spark.storage.memoryFraction0.4 spark.rdd.compress true spark.shuffle.blockTransferService nio spark.yarn.executor.memoryOverhead 1024 > Executor exit without useful messages when spark runs in spark-streaming > > > Key: SPARK-10145 > URL: https://issues.apache.org/jira/browse/SPARK-10145 > Project: Spark > Issue Type: Bug > Components: Streaming, YARN > Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 > cores and 32g memory >Reporter: Baogang Wang >Priority: Critical > Original Estimate: 168h > Remaining Estimate: 168h > > Each node is allocated 30g memory by Yarn. > My application receives messages from Kafka by directstream. Each application > consists of 4 dstream window > Spark application is submitted by this command: > spark-submit --class spark_security.safe.SafeSockPuppet --driver-memory 3g > --executor-memory 3g --num-executors 3 --executor-cores 4 --name > safeSparkDealerUser --master yarn --deploy-mode cluster > spark_Security-1.0-SNAPSHOT.jar.nocalse > hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties > After about 1 hours, some executor exits. There is no more yarn logs after > the executor exits and there is no stack when the executor exits. > When I see the yarn node manager log, it shows as follows : > 2015-08-17 17:25:41,550 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Start request for container_1439803298368_0005_01_01 by user root > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Creating a new application reference for app application_1439803298368_0005 > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root > IP=172.19.160.102 OPERATION=Start Container Request > TARGET=ContainerManageImpl RESULT=SUCCESS > APPID=application_1439803298368_0005 > CONTAINERID=container_1439803298368_0005_01_01 > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Application application_1439803298368_0005 transitioned from NEW to INITING > 2015-08-17 17:25:41,552 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Adding container_1439803298368_0005_01_01 to application > application_1439803298368_0005 > 2015-08-17 17:25:41,557 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > rollingMonitorInterval is set as -1. The log rolling mornitoring interval is > disabled. 
The logs will be aggregated after this application is finished. > 2015-08-17 17:25:41,663 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Application application_1439803298368_0005 transitioned from INITING to > RUNNING > 2015-08-17 17:25:41,664 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1439803298368_0005_01_01 transitioned from NEW to > LOCALIZING > 2015-08-17 17:25:41,664 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got > event CONTAINER_INIT for appId application_1439803298368_0005 > 2015-08-17 17:25:41,664 INFO > org.apache.spark.network.yarn.YarnShuffleService: Initializing container > container_1439803298368_0005_01_01 > 2015-08-17 17:25:41,665 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/
[jira] [Created] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming
Baogang Wang created SPARK-10145: Summary: Executor exit without useful messages when spark runs in spark-streaming Key: SPARK-10145 URL: https://issues.apache.org/jira/browse/SPARK-10145 Project: Spark Issue Type: Bug Components: Streaming, YARN Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 cores and 32g memory Reporter: Baogang Wang Priority: Critical Each node is allocated 30g memory by Yarn. My application receives messages from Kafka by directstream. Each application consists of 4 dstream window Spark application is submitted by this command: spark-submit --class spark_security.safe.SafeSockPuppet --driver-memory 3g --executor-memory 3g --num-executors 3 --executor-cores 4 --name safeSparkDealerUser --master yarn --deploy-mode cluster spark_Security-1.0-SNAPSHOT.jar.nocalse hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties After about 1 hours, some executor exits. There is no more yarn logs after the executor exits and there is no stack when the executor exits. When I see the yarn node manager log, it shows as follows : 2015-08-17 17:25:41,550 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1439803298368_0005_01_01 by user root 2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1439803298368_0005 2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=172.19.160.102 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1439803298368_0005 CONTAINERID=container_1439803298368_0005_01_01 2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1439803298368_0005 transitioned from NEW to INITING 2015-08-17 17:25:41,552 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1439803298368_0005_01_01 to application application_1439803298368_0005 2015-08-17 17:25:41,557 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished. 
2015-08-17 17:25:41,663 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1439803298368_0005 transitioned from INITING to RUNNING 2015-08-17 17:25:41,664 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_01 transitioned from NEW to LOCALIZING 2015-08-17 17:25:41,664 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1439803298368_0005 2015-08-17 17:25:41,664 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1439803298368_0005_01_01 2015-08-17 17:25:41,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar transitioned from INIT to DOWNLOADING 2015-08-17 17:25:41,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark_Security-1.0-SNAPSHOT.jar transitioned from INIT to DOWNLOADING 2015-08-17 17:25:41,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1439803298368_0005_01_01 2015-08-17 17:25:41,668 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_01.tokens. Credentials list: 2015-08-17 17:25:41,682 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Initializing user root 2015-08-17 17:25:41,686 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying from /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_01.tokens to /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005/container_1439803298368_0005_01_01.tokens 2015-08-17 17:25:41,686 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Localizer CWD set t
[jira] [Assigned] (SPARK-9853) Optimize shuffle fetch of contiguous partition IDs
[ https://issues.apache.org/jira/browse/SPARK-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-9853: Assignee: Matei Zaharia > Optimize shuffle fetch of contiguous partition IDs > -- > > Key: SPARK-9853 > URL: https://issues.apache.org/jira/browse/SPARK-9853 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Reporter: Matei Zaharia >Assignee: Matei Zaharia >Priority: Minor > > On the map side, we should be able to serve a block representing multiple > partition IDs in one block manager request -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10143: Assignee: (was: Apache Spark) > Parquet changed the behavior of calculating splits > -- > > Key: SPARK-10143 > URL: https://issues.apache.org/jira/browse/SPARK-10143 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Yin Huai >Priority: Critical > > When Parquet's task side metadata is enabled (by default it is enabled and it > needs to be enabled to deal with tables with many files), Parquet delegates > the work of calculating initial splits to FileInputFormat (see > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). > If filesystem's block size is smaller than the row group size and users do > not set min split size, splits in the initial split list will have lots of > dummy splits and they contribute to empty tasks (because the starting point > and ending point of a split does not cover the starting point of a row > group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706045#comment-14706045 ] Apache Spark commented on SPARK-10143: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/8346 > Parquet changed the behavior of calculating splits > -- > > Key: SPARK-10143 > URL: https://issues.apache.org/jira/browse/SPARK-10143 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Yin Huai >Priority: Critical > > When Parquet's task side metadata is enabled (by default it is enabled and it > needs to be enabled to deal with tables with many files), Parquet delegates > the work of calculating initial splits to FileInputFormat (see > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). > If filesystem's block size is smaller than the row group size and users do > not set min split size, splits in the initial split list will have lots of > dummy splits and they contribute to empty tasks (because the starting point > and ending point of a split does not cover the starting point of a row > group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10143: Assignee: Apache Spark > Parquet changed the behavior of calculating splits > -- > > Key: SPARK-10143 > URL: https://issues.apache.org/jira/browse/SPARK-10143 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Critical > > When Parquet's task side metadata is enabled (by default it is enabled and it > needs to be enabled to deal with tables with many files), Parquet delegates > the work of calculating initial splits to FileInputFormat (see > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). > If filesystem's block size is smaller than the row group size and users do > not set min split size, splits in the initial split list will have lots of > dummy splits and they contribute to empty tasks (because the starting point > and ending point of a split does not cover the starting point of a row > group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility
[ https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706041#comment-14706041 ] Apache Spark commented on SPARK-8580: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8341 > Add Parquet files generated by different systems to test interoperability and > compatibility > --- > > Key: SPARK-8580 > URL: https://issues.apache.org/jira/browse/SPARK-8580 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 > to improve interoperability with other systems (reading non-standard Parquet > files they generate, and generating standard Parquet files), it would be good > to have a set of standard test Parquet files generated by various > systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old > versions of Spark SQL) to ensure compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility
[ https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8580: --- Assignee: Cheng Lian (was: Apache Spark) > Add Parquet files generated by different systems to test interoperability and > compatibility > --- > > Key: SPARK-8580 > URL: https://issues.apache.org/jira/browse/SPARK-8580 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 > to improve interoperability with other systems (reading non-standard Parquet > files they generate, and generating standard Parquet files), it would be good > to have a set of standard test Parquet files generated by various > systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old > versions of Spark SQL) to ensure compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility
[ https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8580: --- Assignee: Apache Spark (was: Cheng Lian) > Add Parquet files generated by different systems to test interoperability and > compatibility > --- > > Key: SPARK-8580 > URL: https://issues.apache.org/jira/browse/SPARK-8580 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 > to improve interoperability with other systems (reading non-standard Parquet > files they generate, and generating standard Parquet files), it would be good > to have a set of standard test Parquet files generated by various > systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old > versions of Spark SQL) to ensure compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10144) Actually show peak execution memory on UI by default
[ https://issues.apache.org/jira/browse/SPARK-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10144: Assignee: Apache Spark (was: Andrew Or) > Actually show peak execution memory on UI by default > > > Key: SPARK-10144 > URL: https://issues.apache.org/jira/browse/SPARK-10144 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Apache Spark > > The peak execution memory metric was introduced in SPARK-8735. That was > before Tungsten was enabled by default, so it assumed that > `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer > the case... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10144) Actually show peak execution memory on UI by default
[ https://issues.apache.org/jira/browse/SPARK-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706033#comment-14706033 ] Apache Spark commented on SPARK-10144: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/8345 > Actually show peak execution memory on UI by default > > > Key: SPARK-10144 > URL: https://issues.apache.org/jira/browse/SPARK-10144 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Andrew Or > > The peak execution memory metric was introduced in SPARK-8735. That was > before Tungsten was enabled by default, so it assumed that > `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer > the case... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10144) Actually show peak execution memory on UI by default
[ https://issues.apache.org/jira/browse/SPARK-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10144: Assignee: Andrew Or (was: Apache Spark) > Actually show peak execution memory on UI by default > > > Key: SPARK-10144 > URL: https://issues.apache.org/jira/browse/SPARK-10144 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Andrew Or > > The peak execution memory metric was introduced in SPARK-8735. That was > before Tungsten was enabled by default, so it assumed that > `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer > the case... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10144) Actually show peak execution memory on UI by default
[ https://issues.apache.org/jira/browse/SPARK-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10144: -- Summary: Actually show peak execution memory on UI by default (was: Actually show peak execution memory by default) > Actually show peak execution memory on UI by default > > > Key: SPARK-10144 > URL: https://issues.apache.org/jira/browse/SPARK-10144 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Andrew Or > > The peak execution memory metric was introduced in SPARK-8735. That was > before Tungsten was enabled by default, so it assumed that > `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer > the case... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10144) Actually show peak execution memory by default
Andrew Or created SPARK-10144: - Summary: Actually show peak execution memory by default Key: SPARK-10144 URL: https://issues.apache.org/jira/browse/SPARK-10144 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer the case... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8805) Spark shell not working
[ https://issues.apache.org/jira/browse/SPARK-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706025#comment-14706025 ] Amir Gur commented on SPARK-8805: - You are right, thanks for the pointer. That was the latest Git release I had been using, and it ships with bash 3.x - https://github.com/msysgit/msysgit/releases/tag/Git-1.9.5-preview20150319. The newer distribution is at https://git-for-windows.github.io/, and https://github.com/git-for-windows/git/releases/tag/v2.5.0.windows.1 gives: {quote} $ bash --version GNU bash, version 4.3.39(3)-release (x86_64-pc-msys) Copyright (C) 2013 Free Software Foundation, Inc. {quote} It does not have that issue, which looks good. Still, it is not OK to depend on bash 4.x without a proper check that bails out right away and prints an appropriate message for bash 3.x users. Now, after checking out the latest branch-1.4 and running a successful build/mvn clean compile on JDK 1.7, I get the following: {quote} $ bin/spark-shell ls: cannot access /c/dev/github/apache/spark/assembly/target/scala-2.10: No such file or directory Failed to find Spark assembly in /c/dev/github/apache/spark/assembly/target/scala-2.10. You need to build Spark before running this program. {quote} It looks like something didn't get built even though the build was marked as passed; I am not sure why. I will check further how to solve it. > Spark shell not working > --- > > Key: SPARK-8805 > URL: https://issues.apache.org/jira/browse/SPARK-8805 > Project: Spark > Issue Type: Brainstorming > Components: Spark Core, Windows >Reporter: Perinkulam I Ganesh > > I am using Git Bash on windows. Installed Open jdk1.8.0_45 and spark 1.4.0 > I am able to build spark and install it. But when ever I execute spark shell > it gives me the following error: > $ spark-shell > /c/.../spark/bin/spark-class: line 76: conditional binary operator expected -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4134) Dynamic allocation: tone down scary executor lost messages when killing on purpose
[ https://issues.apache.org/jira/browse/SPARK-4134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4134: - Target Version/s: 1.6.0 (was: 1.5.0) > Dynamic allocation: tone down scary executor lost messages when killing on > purpose > -- > > Key: SPARK-4134 > URL: https://issues.apache.org/jira/browse/SPARK-4134 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Andrew Or >Assignee: Andrew Or > > After SPARK-3822 goes in, we are now able to dynamically kill executors after > an application has started. However, when we do that we get a ton of scary > error messages telling us that we've done wrong somehow. It would be good to > detect when this is the case and prevent these messages from surfacing. > This maybe difficult, however, because the connection manager tends to be > quite verbose in unconditionally logging disconnection messages. This is a > very nice-to-have for 1.2 but certainly not a blocker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10143: - Comment: was deleted (was: For something quick, we can use the row group size set in hadoop conf to set the min split size.) > Parquet changed the behavior of calculating splits > -- > > Key: SPARK-10143 > URL: https://issues.apache.org/jira/browse/SPARK-10143 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Yin Huai >Priority: Critical > > When Parquet's task side metadata is enabled (by default it is enabled and it > needs to be enabled to deal with tables with many files), Parquet delegates > the work of calculating initial splits to FileInputFormat (see > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). > If filesystem's block size is smaller than the row group size and users do > not set min split size, splits in the initial split list will have lots of > dummy splits and they contribute to empty tasks (because the starting point > and ending point of a split does not cover the starting point of a row > group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705988#comment-14705988 ] Yin Huai commented on SPARK-10143: -- [~rdblue] Can you confirm the behavior change of Parquet? It looks like we are just asking FileInputFormat to give us the initial splits. I am thinking of using the current Parquet row group size setting as the filesystem min split size for the job. What do you think? Thanks :) > Parquet changed the behavior of calculating splits > -- > > Key: SPARK-10143 > URL: https://issues.apache.org/jira/browse/SPARK-10143 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Yin Huai >Priority: Critical > > When Parquet's task side metadata is enabled (by default it is enabled and it > needs to be enabled to deal with tables with many files), Parquet delegates > the work of calculating initial splits to FileInputFormat (see > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). > If filesystem's block size is smaller than the row group size and users do > not set min split size, splits in the initial split list will have lots of > dummy splits and they contribute to empty tasks (because the starting point > and ending point of a split does not cover the starting point of a row > group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
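Until a fix lands, one possible stop-gap along the lines suggested above is to raise the minimum split size yourself. The sketch below is illustrative only and rests on two assumptions: that the table was written with the default 128 MB {{parquet.block.size}}, and that {{spark.hadoop.*}} properties are copied into the job's Hadoop {{Configuration}}; the table path is hypothetical.
{code}
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Align FileInputFormat's minimum split size with the Parquet row group size
# so the initial split list does not degenerate into lots of empty tasks.
row_group_bytes = 128 * 1024 * 1024  # assumed default parquet.block.size
conf = (SparkConf()
        .setAppName("parquet-split-workaround")
        .set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize",
             str(row_group_bytes)))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.read.parquet("/path/to/parquet/table")  # hypothetical path
print(df.rdd.getNumPartitions())  # expect fewer, non-empty partitions
{code}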
[jira] [Commented] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705985#comment-14705985 ] Yin Huai commented on SPARK-10143: -- For something quick, we can use the row group size set in hadoop conf to set the min split size. > Parquet changed the behavior of calculating splits > -- > > Key: SPARK-10143 > URL: https://issues.apache.org/jira/browse/SPARK-10143 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Yin Huai >Priority: Critical > > When Parquet's task side metadata is enabled (by default it is enabled and it > needs to be enabled to deal with tables with many files), Parquet delegates > the work of calculating initial splits to FileInputFormat (see > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). > If filesystem's block size is smaller than the row group size and users do > not set min split size, splits in the initial split list will have lots of > dummy splits and they contribute to empty tasks (because the starting point > and ending point of a split does not cover the starting point of a row > group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10143) Parquet changed the behavior of calculating splits
Yin Huai created SPARK-10143: Summary: Parquet changed the behavior of calculating splits Key: SPARK-10143 URL: https://issues.apache.org/jira/browse/SPARK-10143 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Reporter: Yin Huai Priority: Critical When Parquet's task-side metadata is enabled (it is enabled by default, and it needs to be enabled to handle tables with many files), Parquet delegates the work of calculating the initial splits to FileInputFormat (see https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). If the filesystem's block size is smaller than the row group size and users do not set a min split size, the initial split list will contain many dummy splits that turn into empty tasks (because such a split's start and end points do not cover the starting point of any row group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10142) Python checkpoint recovery does not work with non-local file path
Tathagata Das created SPARK-10142: - Summary: Python checkpoint recovery does not work with non-local file path Key: SPARK-10142 URL: https://issues.apache.org/jira/browse/SPARK-10142 Project: Spark Issue Type: Bug Components: PySpark, Streaming Affects Versions: 1.4.1, 1.3.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9811) Test updated Kinesis Receiver
[ https://issues.apache.org/jira/browse/SPARK-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-9811. -- Resolution: Fixed > Test updated Kinesis Receiver > - > > Key: SPARK-9811 > URL: https://issues.apache.org/jira/browse/SPARK-9811 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-9811) Test updated Kinesis Receiver
[ https://issues.apache.org/jira/browse/SPARK-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reopened SPARK-9811: -- > Test updated Kinesis Receiver > - > > Key: SPARK-9811 > URL: https://issues.apache.org/jira/browse/SPARK-9811 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9811) Test updated Kinesis Receiver
[ https://issues.apache.org/jira/browse/SPARK-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-9811. -- Resolution: Done > Test updated Kinesis Receiver > - > > Key: SPARK-9811 > URL: https://issues.apache.org/jira/browse/SPARK-9811 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10141) Number of tasks on executors still become negative after failures
[ https://issues.apache.org/jira/browse/SPARK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10141: -- Description: I hit this failure when running LDA on EC2 (after I made the model size really big). I was using the LDAExample.scala code on an EC2 cluster with 16 workers (r3.2xlarge), on a Wikipedia dataset: {code} Training set size (documents) 4534059 Vocabulary size (terms) 1 Training set size (tokens) 895575317 EM optimizer 1K topics {code} Failure message: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in stage 22.0 failed 4 times, most recent failure: Lost task 55.3 in stage 22.0 (TID 2881, 10.0.202.128): java.io.IOException: Failed to connect to /10.0.202.128:54740 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.net.ConnectException: Connection refused: /10.0.202.128:54740 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ... 
1 more Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1805) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925) at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1059) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.RDD.fold(RDD.scala:1053) at org.apache.spark.mllib.clustering.EMLDAOptimizer.computeGlobalTopicTotals(LDAOptimizer.scala:205) at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.sc
[jira] [Commented] (SPARK-10141) Number of tasks on executors still become negative after failures
[ https://issues.apache.org/jira/browse/SPARK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705881#comment-14705881 ] Joseph K. Bradley commented on SPARK-10141: --- Note this bug was found with a 1.5 version which includes the fix from [SPARK-8560] > Number of tasks on executors still become negative after failures > - > > Key: SPARK-10141 > URL: https://issues.apache.org/jira/browse/SPARK-10141 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Joseph K. Bradley >Priority: Minor > Attachments: Screen Shot 2015-08-20 at 3.14.49 PM.png > > > I hit this failure when running LDA on EC2 (after I made the model size > really big). > Failure message: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in > stage 22.0 failed 4 times, most recent failure: Lost task 55.3 in stage 22.0 > (TID 2881, 10.0.202.128): java.io.IOException: Failed to connect to > /10.0.202.128:54740 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.net.ConnectException: Connection refused: /10.0.202.128:54740 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) > at > io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > ... 
1 more > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1805) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925) > at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1059) > at > org.apache.spark.rdd.RDDOperationScop
[jira] [Updated] (SPARK-10141) Number of tasks on executors still become negative after failures
[ https://issues.apache.org/jira/browse/SPARK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10141: -- Attachment: Screen Shot 2015-08-20 at 3.14.49 PM.png > Number of tasks on executors still become negative after failures > - > > Key: SPARK-10141 > URL: https://issues.apache.org/jira/browse/SPARK-10141 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Joseph K. Bradley >Priority: Minor > Attachments: Screen Shot 2015-08-20 at 3.14.49 PM.png > > > I hit this failure when running LDA on EC2 (after I made the model size > really big). > Failure message: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in > stage 22.0 failed 4 times, most recent failure: Lost task 55.3 in stage 22.0 > (TID 2881, 10.0.202.128): java.io.IOException: Failed to connect to > /10.0.202.128:54740 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.net.ConnectException: Connection refused: /10.0.202.128:54740 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) > at > io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > ... 
1 more > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1805) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925) > at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1059) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperat
[jira] [Updated] (SPARK-10141) Number of tasks on executors still become negative after failures
[ https://issues.apache.org/jira/browse/SPARK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10141: -- Affects Version/s: 1.5.0 > Number of tasks on executors still become negative after failures > - > > Key: SPARK-10141 > URL: https://issues.apache.org/jira/browse/SPARK-10141 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Joseph K. Bradley >Priority: Minor > Attachments: Screen Shot 2015-08-20 at 3.14.49 PM.png > > > I hit this failure when running LDA on EC2 (after I made the model size > really big). > Failure message: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in > stage 22.0 failed 4 times, most recent failure: Lost task 55.3 in stage 22.0 > (TID 2881, 10.0.202.128): java.io.IOException: Failed to connect to > /10.0.202.128:54740 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.net.ConnectException: Connection refused: /10.0.202.128:54740 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) > at > io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > ... 
1 more > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1805) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925) > at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1059) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOpera
[jira] [Created] (SPARK-10141) Number of tasks on executors still become negative after failures
Joseph K. Bradley created SPARK-10141: - Summary: Number of tasks on executors still become negative after failures Key: SPARK-10141 URL: https://issues.apache.org/jira/browse/SPARK-10141 Project: Spark Issue Type: Bug Components: Web UI Reporter: Joseph K. Bradley Priority: Minor I hit this failure when running LDA on EC2 (after I made the model size really big). Failure message: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in stage 22.0 failed 4 times, most recent failure: Lost task 55.3 in stage 22.0 (TID 2881, 10.0.202.128): java.io.IOException: Failed to connect to /10.0.202.128:54740 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.net.ConnectException: Connection refused: /10.0.202.128:54740 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ... 
1 more Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1805) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925) at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1059) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.RDD.fold(RDD.scala:1053) at org.apache.spark.mllib.clustering.EMLDAOptimizer.computeGlobalTopicTotals(LDAOptimizer.scala:205) at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:192) at org.apache.spark.mllib.clustering.E
[jira] [Resolved] (SPARK-9400) Implement code generation for StringLocate
[ https://issues.apache.org/jira/browse/SPARK-9400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9400. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8330 [https://github.com/apache/spark/pull/8330] > Implement code generation for StringLocate > -- > > Key: SPARK-9400 > URL: https://issues.apache.org/jira/browse/SPARK-9400 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9245) DistributedLDAModel predict top topic per doc-term instance
[ https://issues.apache.org/jira/browse/SPARK-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9245: - Target Version/s: (was: 1.6.0) > DistributedLDAModel predict top topic per doc-term instance > --- > > Key: SPARK-9245 > URL: https://issues.apache.org/jira/browse/SPARK-9245 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 1.5.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > For each (document, term) pair, return top topic. Note that instances of > (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we > should provide an estimate per document-term, rather than per token. > Synopsis for DistributedLDAModel: > {code} > /** @return RDD of (doc ID, vector of top topic index for each term) */ > def topTopicAssignments: RDD[(Long, Vector)] > {code} > Note that using Vector will let us have a sparse encoding which is > Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4223) Support * (meaning all users) as part of the acls
[ https://issues.apache.org/jira/browse/SPARK-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705851#comment-14705851 ] Zhuo Liu commented on SPARK-4223: - Hi, everyone, I am working on this and will submit a pull request for this soon. > Support * (meaning all users) as part of the acls > - > > Key: SPARK-4223 > URL: https://issues.apache.org/jira/browse/SPARK-4223 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Thomas Graves > > Currently we support setting view and modify acls but you have to specify a > list of users. It would be nice to support * meaning all users have access. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
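For context, a hedged sketch of how the ACL properties are set today and how the proposed wildcard might look; the user names are placeholders and the "*" form is hypothetical until the pull request mentioned above lands.
{code}
import org.apache.spark.SparkConf

// Current behavior: view/modify ACLs take an explicit comma-separated user list.
val conf = new SparkConf()
  .set("spark.acls.enable", "true")
  .set("spark.ui.view.acls", "alice,bob") // placeholder users allowed to view the UI
  .set("spark.modify.acls", "alice")      // placeholder users allowed to modify (e.g. kill) jobs

// Proposed improvement (hypothetical until implemented): "*" would mean all users,
// e.g. conf.set("spark.ui.view.acls", "*")
{code}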
[jira] [Resolved] (SPARK-9245) DistributedLDAModel predict top topic per doc-term instance
[ https://issues.apache.org/jira/browse/SPARK-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9245. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8329 [https://github.com/apache/spark/pull/8329] > DistributedLDAModel predict top topic per doc-term instance > --- > > Key: SPARK-9245 > URL: https://issues.apache.org/jira/browse/SPARK-9245 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 1.5.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > For each (document, term) pair, return top topic. Note that instances of > (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we > should provide an estimate per document-term, rather than per token. > Synopsis for DistributedLDAModel: > {code} > /** @return RDD of (doc ID, vector of top topic index for each term) */ > def topTopicAssignments: RDD[(Long, Vector)] > {code} > Note that using Vector will let us have a sparse encoding which is > Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
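A usage sketch written against the synopsis quoted in this ticket; the method name and return type below are taken from that synopsis, and the merged API may differ, so treat this as illustrative only.
{code}
import org.apache.spark.mllib.clustering.DistributedLDAModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Illustrative only: print the estimated top topic per term for a few documents,
// assuming the ticket's synopsis `def topTopicAssignments: RDD[(Long, Vector)]`.
def printTopTopicAssignments(model: DistributedLDAModel): Unit = {
  val assignments: RDD[(Long, Vector)] = model.topTopicAssignments
  assignments.take(3).foreach { case (docId, topTopicPerTerm) =>
    // topTopicPerTerm(i) is the top topic index estimated for term i in this document.
    println(s"doc $docId -> ${topTopicPerTerm.toArray.mkString(", ")}")
  }
}
{code}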
[jira] [Resolved] (SPARK-10108) Add @Since annotation to mllib.feature
[ https://issues.apache.org/jira/browse/SPARK-10108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10108. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8309 [https://github.com/apache/spark/pull/8309] > Add @Since annotation to mllib.feature > -- > > Key: SPARK-10108 > URL: https://issues.apache.org/jira/browse/SPARK-10108 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Reporter: Manoj Kumar >Assignee: Manoj Kumar >Priority: Minor > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10138) Setters do not return self type in Java MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-10138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10138. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8342 [https://github.com/apache/spark/pull/8342] > Setters do not return self type in Java MultilayerPerceptronClassifier > -- > > Key: SPARK-10138 > URL: https://issues.apache.org/jira/browse/SPARK-10138 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Blocker > Fix For: 1.5.0 > > > We need to move setters to the final class instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
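A generic sketch of the pattern behind the fix, with made-up names rather than Spark's actual classes: when a setter returning this.type is only defined in a base trait, Java callers see the erased base return type and lose builder-style chaining; defining (or overriding) the setter in the final concrete class gives Java a concrete return type again.
{code}
// Made-up names for illustration; not Spark's real API.
trait HasMaxIter {
  protected var maxIter: Int = 100
  // From Java, this erases to a method returning HasMaxIter.
  def setMaxIter(value: Int): this.type = { maxIter = value; this }
}

final class DemoClassifier extends HasMaxIter {
  private var layers: Array[Int] = Array.empty
  // Overriding in the final class makes Java see a DemoClassifier return type,
  // so chained calls like new DemoClassifier().setMaxIter(10).setLayers(...) compile.
  override def setMaxIter(value: Int): this.type = super.setMaxIter(value)
  def setLayers(value: Array[Int]): this.type = { layers = value; this }
}
{code}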
[jira] [Assigned] (SPARK-10140) Add target fields to @Since annotation
[ https://issues.apache.org/jira/browse/SPARK-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10140: Assignee: Xiangrui Meng (was: Apache Spark) > Add target fields to @Since annotation > -- > > Key: SPARK-10140 > URL: https://issues.apache.org/jira/browse/SPARK-10140 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Add target fields to @Since so constructor params and fields also get > annotated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10140) Add target fields to @Since annotation
[ https://issues.apache.org/jira/browse/SPARK-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705819#comment-14705819 ] Apache Spark commented on SPARK-10140: -- User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/8344 > Add target fields to @Since annotation > -- > > Key: SPARK-10140 > URL: https://issues.apache.org/jira/browse/SPARK-10140 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Add target fields to @Since so constructor params and fields also get > annotated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10140) Add target fields to @Since annotation
[ https://issues.apache.org/jira/browse/SPARK-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10140: Assignee: Apache Spark (was: Xiangrui Meng) > Add target fields to @Since annotation > -- > > Key: SPARK-10140 > URL: https://issues.apache.org/jira/browse/SPARK-10140 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > > Add target fields to @Since so constructor params and fields also get > annotated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10140) Add target fields to @Since annotation
Xiangrui Meng created SPARK-10140: - Summary: Add target fields to @Since annotation Key: SPARK-10140 URL: https://issues.apache.org/jira/browse/SPARK-10140 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Add target fields to @Since so constructor params and fields also get annotated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
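A hedged sketch of what "target fields" could mean in Scala, using a demo annotation rather than Spark's actual @Since definition: meta-annotation targets placed on the annotation class control whether an annotation written on a constructor parameter is also attached to the generated field and getter.
{code}
import scala.annotation.StaticAnnotation
import scala.annotation.meta.{field, getter, param}

// Demo annotation (not Spark's real @Since): the meta-annotation targets make it
// stick to the constructor parameter, the underlying field, and the getter.
@param @field @getter
class SinceDemo(version: String) extends StaticAnnotation

// Annotating a constructor parameter now also annotates the generated field/getter.
class DemoModel(@SinceDemo("1.5.0") val numFeatures: Int)
{code}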
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705810#comment-14705810 ] Jason Dai commented on SPARK-5556: -- [~pedrorodriguez] We'll try to make a Spark package based on our repo; please take a look at the code and provide your feedback. Please let us know if there is anything we can collaborate on for LDA/topic modeling on Spark. > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez > Attachments: LDA_test.xlsx, spark-summit.pptx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10139) java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dan young updated SPARK-10139: -- Description: cloned Spark repo and am running the spark-ec2/spark_ec2.py on branch-1.4. Startup a Spark 1.4.1 cluster, and then try to run/connect to the ThriftServer2. When I try to connect via Beeline, or via JDBC thru the ThriftServer2, I'm getting an java.util.NoSuchElementException with any command. Example: !sql show tables Also, If I just connect via SparkSQL, i.e ./spark/bin/spark-sql I'm able to run show tables, queries,etc Here is the ThriftServer Log: https://www.refheap.com/c043705ce7978c16c9cda4e12 I run the same code/examples with Spark 1.3.1 and I'm able to connect via BeeLine and/or JDBC thru the ThriftServer2, and run queries, etc Someone else seems to have the same issue: http://stackoverflow.com/questions/31984057/unable-to-see-hive-tables-from-beeline-in-spark-version-1-4-0 was: cloned Spark repo and am running the spark-ec2/spark_ec2.py on branch-1.4. Startup a Spark 1.4.1 cluster, and then try to run/connect to the ThriftServer2. When I try to connect via Beeline, or via JDBC thru the ThriftServer2, I'm getting an java.util.NoSuchElementException with any command. Example: !sql show tables Here is the ThriftServer Log: https://www.refheap.com/c043705ce7978c16c9cda4e12 I run the same code/examples with Spark 1.3.1 and I'm able to connect via BeeLine and/or JDBC thru the ThriftServer2. Also, If I just connect via SparkSQL, i.e ./spark/bin/spark-sql I'm able to run show tables, queries,etc Someone else seems to have the same issue: http://stackoverflow.com/questions/31984057/unable-to-see-hive-tables-from-beeline-in-spark-version-1-4-0 > java.util.NoSuchElementException > > > Key: SPARK-10139 > URL: https://issues.apache.org/jira/browse/SPARK-10139 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 > Environment: using the spark-ec2/spark_ec2.py scripts to launch spark > cluster in AWS. >Reporter: dan young > > cloned Spark repo and am running the spark-ec2/spark_ec2.py on branch-1.4. > Startup a Spark 1.4.1 cluster, and then try to run/connect to the > ThriftServer2. > When I try to connect via Beeline, or via JDBC thru the ThriftServer2, I'm > getting an java.util.NoSuchElementException with any command. Example: !sql > show tables > Also, If I just connect via SparkSQL, i.e ./spark/bin/spark-sql I'm > able to run show tables, queries,etc > Here is the ThriftServer Log: > https://www.refheap.com/c043705ce7978c16c9cda4e12 > I run the same code/examples with Spark 1.3.1 and I'm able to connect via > BeeLine and/or JDBC thru the ThriftServer2, and run queries, etc > Someone else seems to have the same issue: > http://stackoverflow.com/questions/31984057/unable-to-see-hive-tables-from-beeline-in-spark-version-1-4-0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10139) java.util.NoSuchElementException
dan young created SPARK-10139: - Summary: java.util.NoSuchElementException Key: SPARK-10139 URL: https://issues.apache.org/jira/browse/SPARK-10139 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Environment: using the spark-ec2/spark_ec2.py scripts to launch a Spark cluster in AWS. Reporter: dan young I cloned the Spark repo and am running spark-ec2/spark_ec2.py on branch-1.4. I start up a Spark 1.4.1 cluster and then try to run/connect to ThriftServer2. When I try to connect via Beeline, or via JDBC through ThriftServer2, I get a java.util.NoSuchElementException with any command. Example: !sql show tables Here is the ThriftServer log: https://www.refheap.com/c043705ce7978c16c9cda4e12 I run the same code/examples with Spark 1.3.1 and am able to connect via Beeline and/or JDBC through ThriftServer2. Also, if I just connect via Spark SQL, i.e. ./spark/bin/spark-sql, I am able to run show tables, queries, etc. Someone else seems to have the same issue: http://stackoverflow.com/questions/31984057/unable-to-see-hive-tables-from-beeline-in-spark-version-1-4-0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8952) JsonFile() of SQLContext displays an improper warning message for an S3 path
[ https://issues.apache.org/jira/browse/SPARK-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8952: --- Assignee: (was: Apache Spark) > JsonFile() of SQLContext displays an improper warning message for an S3 path > --- > > Key: SPARK-8952 > URL: https://issues.apache.org/jira/browse/SPARK-8952 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Sun Rui > > This is an issue reported by Ben Spark. > {quote} > Spark 1.4 deployed on AWS EMR > "jsonFile" is working though with some warning message > Warning message: > In normalizePath(path) : > > path[1]="s3://rea-consumer-data-dev/cbr/profiler/output/20150618/part-0": > No such file or directory > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8952) JsonFile() of SQLContext displays an improper warning message for an S3 path
[ https://issues.apache.org/jira/browse/SPARK-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705616#comment-14705616 ] Apache Spark commented on SPARK-8952: - User 'lresende' has created a pull request for this issue: https://github.com/apache/spark/pull/8343 > JsonFile() of SQLContext displays an improper warning message for an S3 path > --- > > Key: SPARK-8952 > URL: https://issues.apache.org/jira/browse/SPARK-8952 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Sun Rui > > This is an issue reported by Ben Spark. > {quote} > Spark 1.4 deployed on AWS EMR > "jsonFile" is working though with some warning message > Warning message: > In normalizePath(path) : > > path[1]="s3://rea-consumer-data-dev/cbr/profiler/output/20150618/part-0": > No such file or directory > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org