[jira] [Commented] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225886#comment-14225886 ] Xiangrui Meng commented on SPARK-3995: -- I think we can backport SPARK-4477, which removes numpy from sampling. [PYSPARK] PySpark's sample methods do not work with NumPy 1.9 - Key: SPARK-3995 URL: https://issues.apache.org/jira/browse/SPARK-3995 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Jeremy Freeman Assignee: Jeremy Freeman Priority: Critical Fix For: 1.2.0 There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are:
{code:python}
foo = sc.parallelize(range(1000), 5)
foo.takeSample(False, 10)
{code}
Returns:
{code}
PythonException: Traceback (most recent call last):
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func
    if self.getUniformSample(split) <= self._fraction:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample
    self.initRandomGenerator(split)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator
    self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}
In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:
{code:python}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}
In previous versions of NumPy, a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But sampling from {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which effectively breaks sampling operations in PySpark (unless the seed is set manually). I am putting a PR together now (the fix is very simple!). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4619) Double ms in ShuffleBlockFetcherIterator log
maji2014 created SPARK-4619: --- Summary: Double ms in ShuffleBlockFetcherIterator log Key: SPARK-4619 URL: https://issues.apache.org/jira/browse/SPARK-4619 Project: Spark Issue Type: Bug Affects Versions: 1.1.2 Reporter: maji2014 Priority: Minor The log reads as follows: ShuffleBlockFetcherIterator: Got local blocks in 8 ms ms Reason: logInfo("Got local blocks in " + Utils.getUsedTimeMs(startTime) + " ms") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
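A minimal Scala demonstration of the doubled unit, assuming (as the log above suggests) that Utils.getUsedTimeMs already appends "ms" to the elapsed time; the fix is simply to drop the extra literal:
{code:scala}
// Stand-in for Utils.getUsedTimeMs, which (per the log above) already
// appends the unit to the elapsed time.
def getUsedTimeMs(startTimeMs: Long): String =
  (System.currentTimeMillis - startTimeMs) + " ms"

val startTime = System.currentTimeMillis - 8
println("Got local blocks in " + getUsedTimeMs(startTime) + " ms") // bug: "... 8 ms ms"
println("Got local blocks in " + getUsedTimeMs(startTime))         // fix: "... 8 ms"
{code}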
[jira] [Commented] (SPARK-4584) 2x Performance regression for Spark-on-YARN
[ https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225899#comment-14225899 ] Sean Owen commented on SPARK-4584: -- {{SecurityManager}} is something that loads of JVM code consults, if it exists:
{code}
SecurityManager sm = System.getSecurityManager();
if (sm != null) {
  ...
}
{code}
Setting any {{SecurityManager}} is like turning on a whole lot of not-cheap permission checks throughout the JDK. I think setting one is pretty undesirable from a performance perspective. It also precludes the possibility of enabling a real SecurityManager for contexts that want to, although I find it unlikely that would ever really work with Spark. How about just documenting and telling users "don't System.exit in your code", which is widely accepted as a no-no in Java/Scala anyway? 2x Performance regression for Spark-on-YARN --- Key: SPARK-4584 URL: https://issues.apache.org/jira/browse/SPARK-4584 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Nishkam Ravi Assignee: Marcelo Vanzin Priority: Blocker Significant performance regression observed for Spark-on-YARN (up to 2x) after the 1.2 rebase. The offending commit is 70e824f750aa8ed446eec104ba158b0503ba58a9 from Oct 7th. The problem can be reproduced with JavaWordCount against a large enough input dataset in YARN cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4619) Double ms in ShuffleBlockFetcherIterator log
[ https://issues.apache.org/jira/browse/SPARK-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225910#comment-14225910 ] Apache Spark commented on SPARK-4619: - User 'maji2014' has created a pull request for this issue: https://github.com/apache/spark/pull/3475 Double ms in ShuffleBlockFetcherIterator log -- Key: SPARK-4619 URL: https://issues.apache.org/jira/browse/SPARK-4619 Project: Spark Issue Type: Bug Affects Versions: 1.1.2 Reporter: maji2014 Priority: Minor The log reads as follows: ShuffleBlockFetcherIterator: Got local blocks in 8 ms ms Reason: logInfo("Got local blocks in " + Utils.getUsedTimeMs(startTime) + " ms") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4620) Add unpersist in Graph/GraphImpl
Takeshi Yamamuro created SPARK-4620: --- Summary: Add unpersist in Graph/GraphImpl Key: SPARK-4620 URL: https://issues.apache.org/jira/browse/SPARK-4620 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Trivial Add an IF to uncache both vertices and edges of this graph. This IF is useful when iterative graph operations build a new graph in each iteration, and the vertices and edges of previous iterations are no longer needed for following iterations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
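The "IF" above is shorthand for an interface. A sketch of how the proposed method would be used in an iterative algorithm, assuming a signature like {{unpersist(blocking: Boolean)}} on {{Graph}} (the names follow GraphX conventions but are illustrative, not necessarily the merged API):
{code:scala}
import org.apache.spark.graphx.Graph

// Illustrative driver loop: each iteration builds a new graph, so the
// previous iteration's cached vertices and edges can be released.
def iterate[VD, ED](initial: Graph[VD, ED], numIters: Int)(
    step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
  var g = initial.cache()
  for (_ <- 1 to numIters) {
    val next = step(g).cache()
    next.vertices.count()          // materialize before releasing the old graph
    g.unpersist(blocking = false)  // proposed: uncache both vertices and edges
    g = next
  }
  g
}
{code}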
[jira] [Commented] (SPARK-4620) Add unpersist in Graph/GraphImpl
[ https://issues.apache.org/jira/browse/SPARK-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225924#comment-14225924 ] Apache Spark commented on SPARK-4620: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/3476 Add unpersist in Graph/GraphImpl Key: SPARK-4620 URL: https://issues.apache.org/jira/browse/SPARK-4620 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Trivial Add an IF to uncache both vertices and edges of this graph. This IF is useful when iterative graph operations build a new graph in each iteration, and the vertices and edges of previous iterations are no longer needed for following iterations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4620) Add unpersist in Graph/GraphImpl
[ https://issues.apache.org/jira/browse/SPARK-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-4620: Description: Add an IF to uncache both vertices and edges of Graph/GraphImpl. This IF is useful when iterative graph operations build a new graph in each iteration, and the vertices and edges of previous iterations are no longer needed for following iterations. was: Add an IF to uncache both vertices and edges of this graph. This IF is useful when iterative graph operations build a new graph in each iteration, and the vertices and edges of previous iterations are no longer needed for following iterations. Add unpersist in Graph/GraphImpl Key: SPARK-4620 URL: https://issues.apache.org/jira/browse/SPARK-4620 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Trivial Add an IF to uncache both vertices and edges of Graph/GraphImpl. This IF is useful when iterative graph operations build a new graph in each iteration, and the vertices and edges of previous iterations are no longer needed for following iterations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4537) Add 'processing delay' and 'totalDelay' to the metrics reported by the Spark Streaming subsystem
[ https://issues.apache.org/jira/browse/SPARK-4537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225945#comment-14225945 ] Gerard Maas commented on SPARK-4537: I was waiting for internal clearance to put a PR out, but this is probably much faster. Thanks. Add 'processing delay' and 'totalDelay' to the metrics reported by the Spark Streaming subsystem Key: SPARK-4537 URL: https://issues.apache.org/jira/browse/SPARK-4537 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Gerard Maas Labels: metrics As the Spark Streaming tuning guide indicates, the key indicators of a healthy streaming job are:
- Processing Time
- Total Delay
The Spark UI page for the Streaming job [1] shows these two indicators, but the metrics source for Spark Streaming (StreamingSource.scala) [2] does not. Adding these metrics will allow external monitoring systems that consume the Spark metrics interface to track these two critical pieces of information about a streaming job's performance. [1] https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala#L127 [2] https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
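A sketch of what exposing the two indicators from StreamingSource.scala could look like, following the Codahale {{Gauge}} pattern Spark's metrics sources use; the metric name and the way the delay value is obtained are assumptions, not the merged change:
{code:scala}
import com.codahale.metrics.{Gauge, MetricRegistry}

// Register a gauge under the "streaming" namespace; `value` would pull the
// delay in ms from the streaming listener, e.g. the last completed batch's
// processingDelay or totalDelay.
def registerDelayGauge(registry: MetricRegistry, name: String)(value: () => Long): Unit = {
  registry.register(MetricRegistry.name("streaming", name), new Gauge[Long] {
    override def getValue: Long = value()
  })
}
{code}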
[jira] [Resolved] (SPARK-4002) KafkaStreamSuite Kafka input stream case fails on OSX
[ https://issues.apache.org/jira/browse/SPARK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4002. -- Resolution: Cannot Reproduce Yes, I haven't seen this in a while, so let's mark it Cannot Reproduce. KafkaStreamSuite Kafka input stream case fails on OSX --- Key: SPARK-4002 URL: https://issues.apache.org/jira/browse/SPARK-4002 Project: Spark Issue Type: Bug Components: Streaming Environment: Mac OSX 10.9.5. Reporter: Ryan Williams Attachments: unit-tests.log [~sowen] mentioned this on spark-dev [here|http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccamassdjs0fmsdc-k-4orgbhbfz2vvrmm0hfyifeeal-spft...@mail.gmail.com%3E] and I just reproduced it on {{master}} ([7e63bb4|https://github.com/apache/spark/commit/7e63bb49c526c3f872619ae14e4b5273f4c535e9]). The relevant output I get when running {{./dev/run-tests}} is:
{code}
[info] KafkaStreamSuite:
[info] - Kafka input stream *** FAILED ***
[info]   3 did not equal 0 (KafkaStreamSuite.scala:135)
[info] Test run started
[info] Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream started
[error] Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed: junit.framework.AssertionFailedError: expected:<3> but was:<0>
[error]     at junit.framework.Assert.fail(Assert.java:50)
[error]     at junit.framework.Assert.failNotEquals(Assert.java:287)
[error]     at junit.framework.Assert.assertEquals(Assert.java:67)
[error]     at junit.framework.Assert.assertEquals(Assert.java:199)
[error]     at junit.framework.Assert.assertEquals(Assert.java:205)
[error]     at org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream(JavaKafkaStreamSuite.java:129)
[error]     ...
[info] Test run finished: 1 failed, 0 ignored, 1 total, 14.451s
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1g; support was removed in 8.0
[info] ScalaTest
[info] Run completed in 11 minutes, 39 seconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
[info] *** 1 TEST FAILED ***
[error] Failed: Total 2, Failed 2, Errors 0, Passed 0
[error] Failed tests:
[error]   org.apache.spark.streaming.kafka.JavaKafkaStreamSuite
[error]   org.apache.spark.streaming.kafka.KafkaStreamSuite
{code}
The simplest command I know of that reproduces this test failure is:
{code}
mvn test -Dsuites='*KafkaStreamSuite'
{code}
Often I have to {{mvn clean}} before or as part of running that command, otherwise I get other spurious compile errors or crashes, but that is another story. Seems like this test should be {{@Ignore}}'d, or some note about this made in the {{README.md}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4621) For sort-based shuffle, caching recently used shuffle indexes can reduce index file I/O
Lianhui Wang created SPARK-4621: --- Summary: For sort-based shuffle, caching recently used shuffle indexes can reduce index file I/O Key: SPARK-4621 URL: https://issues.apache.org/jira/browse/SPARK-4621 Project: Spark Issue Type: Improvement Components: Shuffle Reporter: Lianhui Wang Priority: Critical In IndexShuffleBlockManager, we can use an LRU cache to store recently finished shuffle indexes, which can reduce index file I/O. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
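A minimal sketch of the kind of LRU cache the description suggests, built on java.util.LinkedHashMap's access order; the class and type parameters are illustrative, not IndexShuffleBlockManager's actual code:
{code:scala}
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Keeps at most `capacity` entries, evicting the least recently accessed
// one; shuffle index entries would be keyed by something like (shuffleId, mapId).
class LruCache[K, V](capacity: Int) {
  private val underlying = new JLinkedHashMap[K, V](capacity, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
      size() > capacity
  }
  def get(key: K): Option[V] = Option(underlying.get(key))
  def put(key: K, value: V): Unit = underlying.put(key, value)
}
{code}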
[jira] [Updated] (SPARK-4621) For sort-based shuffle, caching recently finished shuffle indexes can reduce index file I/O
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-4621: Summary: For sort-based shuffle, caching recently finished shuffle indexes can reduce index file I/O (was: For sort-based shuffle, caching recently used shuffle indexes can reduce index file I/O) For sort-based shuffle, caching recently finished shuffle indexes can reduce index file I/O - Key: SPARK-4621 URL: https://issues.apache.org/jira/browse/SPARK-4621 Project: Spark Issue Type: Improvement Components: Shuffle Reporter: Lianhui Wang Priority: Critical In IndexShuffleBlockManager, we can use an LRU cache to store recently finished shuffle indexes, which can reduce index file I/O. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225960#comment-14225960 ] zzc commented on SPARK-2468: Ah, Aaron Davidson, with patch #3465 I can successfully run the previously failed application, and my configuration is the same as before. It's great. Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context switches between kernel and user space. It also creates unnecessary buffers in the JVM, which increases GC pressure. Instead, we should use FileChannel.transferTo, which handles this in kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
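For reference, a minimal sketch of the zero-copy mechanism the description refers to, FileChannel.transferTo; the helper itself is illustrative, not Spark's actual code:
{code:scala}
import java.io.RandomAccessFile
import java.nio.channels.WritableByteChannel

// Copies a file to `target` without pulling the bytes into user space:
// transferTo hands the work to the kernel (sendfile on Linux).
def transferFile(path: String, target: WritableByteChannel): Long = {
  val file = new RandomAccessFile(path, "r")
  try {
    val channel = file.getChannel
    val size = channel.size()
    var pos = 0L
    while (pos < size) {
      pos += channel.transferTo(pos, size - pos, target)
    }
    pos
  } finally {
    file.close()
  }
}
{code}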
[jira] [Commented] (SPARK-4528) [SQL] add comment support for Spark SQL CLI
[ https://issues.apache.org/jira/browse/SPARK-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225989#comment-14225989 ] Apache Spark commented on SPARK-4528: - User 'tsingfu' has created a pull request for this issue: https://github.com/apache/spark/pull/3477 [SQL] add comment support for Spark SQL CLI Key: SPARK-4528 URL: https://issues.apache.org/jira/browse/SPARK-4528 Project: Spark Issue Type: Improvement Components: SQL Reporter: Fuqing Yang Priority: Minor Labels: features While using spark-sql, I found it does not support comments when writing SQL. It returns an error if we add some comments, for example:
spark-sql> show tables; --list tables in current db;
14/11/21 11:30:37 INFO parse.ParseDriver: Parsing command: show tables
14/11/21 11:30:37 INFO parse.ParseDriver: Parse Completed
14/11/21 11:30:37 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/11/21 11:30:37 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
..
14/11/21 11:30:38 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_tables: db=default pat=.*
14/11/21 11:30:38 INFO ql.Driver: </PERFLOG method=task.DDL.Stage-0 start=1416540638191 end=1416540638202 duration=11>
14/11/21 11:30:38 INFO ql.Driver: </PERFLOG method=runTasks start=1416540638191 end=1416540638202 duration=11>
14/11/21 11:30:38 INFO ql.Driver: </PERFLOG method=Driver.execute start=1416540638190 end=1416540638203 duration=13>
OK
14/11/21 11:30:38 INFO ql.Driver: OK
14/11/21 11:30:38 INFO ql.Driver: <PERFLOG method=releaseLocks>
14/11/21 11:30:38 INFO ql.Driver: </PERFLOG method=releaseLocks start=1416540638203 end=1416540638203 duration=0>
14/11/21 11:30:38 INFO ql.Driver: </PERFLOG method=Driver.run start=1416540637998 end=1416540638203 duration=205>
14/11/21 11:30:38 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/21 11:30:38 INFO ql.Driver: <PERFLOG method=releaseLocks>
14/11/21 11:30:38 INFO ql.Driver: </PERFLOG method=releaseLocks start=1416540638207 end=1416540638208 duration=1>
14/11/21 11:30:38 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/11/21 11:30:38 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/11/21 11:30:38 INFO analysis.Analyzer: Max iterations (2) reached for batch Check Analysis
14/11/21 11:30:38 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/11/21 11:30:38 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
dummy records tab_sogou10
Time taken: 0.412 seconds
14/11/21 11:30:38 INFO CliDriver: Time taken: 0.412 seconds
14/11/21 11:30:38 INFO parse.ParseDriver: Parsing command: --list tables in current db
NoViableAltException(-1@[])
    at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:902)
Comment support is widely used in projects; it is necessary to add this feature. This implementation can be achieved in the source org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.scala. We may need to support three comment styles:
- From a ‘#’ character to the end of the line.
- From a ‘-- ’ sequence to the end of the line.
- From a /* sequence to the following */ sequence, as in the C programming language. This syntax allows a comment to extend over multiple lines because the beginning and closing sequences need not be on the same line.
What do you think?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
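A deliberately naive Scala sketch of stripping the three comment styles listed in the description from a single input line; the helper name is hypothetical, and it ignores quoted strings and multi-line /* ... */ blocks, which a real patch to SparkSQLCLIDriver would have to handle:
{code:scala}
// Strip single-line /* ... */, then "-- " and '#' end-of-line comments.
def stripComments(line: String): String = {
  val noBlock = line.replaceAll("""/\*.*?\*/""", " ")  // /* ... */ on one line
  val noDash  = noBlock.split("-- ", 2)(0)             // "-- " to end of line
  noDash.split("#", 2)(0).trim                         // '#' to end of line
}

// stripComments("show tables; -- list tables in current db") == "show tables;"
{code}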
[jira] [Commented] (SPARK-4613) Make JdbcRDD easier to use from Java
[ https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226007#comment-14226007 ] Apache Spark commented on SPARK-4613: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3478 Make JdbcRDD easier to use from Java Key: SPARK-4613 URL: https://issues.apache.org/jira/browse/SPARK-4613 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Cheng Lian We might eventually deprecate it, but for now it would be nice to expose a Java wrapper that allows users to create this using the Java function interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
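A sketch of the shape such a Java-friendly factory could take; {{ConnectionFactory}} and {{createJdbcRDD}} are illustrative names, not necessarily the API the pull request adds:
{code:scala}
import java.sql.{Connection, ResultSet}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

// Java code cannot easily pass Scala closures, so the connection is
// abstracted behind a serializable single-method type.
trait ConnectionFactory extends Serializable {
  def getConnection: Connection
}

def createJdbcRDD[T: ClassTag](
    sc: SparkContext,
    connectionFactory: ConnectionFactory,
    sql: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int,
    mapRow: ResultSet => T): JdbcRDD[T] =
  new JdbcRDD[T](sc, () => connectionFactory.getConnection, sql,
    lowerBound, upperBound, numPartitions, mapRow)
{code}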
[jira] [Updated] (SPARK-3546) InputStream of ManagedBuffer is not closed and causes running out of file descriptor
[ https://issues.apache.org/jira/browse/SPARK-3546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3546: -- Target Version/s: 1.2.0 (was: 1.1.1, 1.2.0) InputStream of ManagedBuffer is not closed and causes running out of file descriptor Key: SPARK-3546 URL: https://issues.apache.org/jira/browse/SPARK-3546 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Critical Fix For: 1.2.0 If an application has lots of shuffle blocks, a resource leak (running out of file descriptors) occurs. The following text shows the file descriptors of an executor which has lots of blocks:
{code}
・・・
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9980 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/13/shuffle_0_340_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9981 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/37/shuffle_0_355_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9982 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/10/shuffle_0_370_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9983 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0c/shuffle_0_385_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9984 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0e/shuffle_0_390_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9985 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/1d/shuffle_0_405_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9986 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/36/shuffle_0_420_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9987 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/1b/shuffle_0_425_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9988 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0b/shuffle_0_430_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9989 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0d/shuffle_0_450_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:11 2014 999 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/28/shuffle_1_630_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9990 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/29/shuffle_0_465_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9991 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/14/shuffle_0_495_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9992 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/3c/shuffle_0_525_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9993 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/2a/shuffle_0_530_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9994 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/05/shuffle_0_535_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9995 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/15/shuffle_0_540_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9996 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/2c/shuffle_0_550_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9997 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/13/shuffle_0_560_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9998 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/12/shuffle_0_570_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:27 2014 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/2f/shuffle_0_580_0.data
{code}
This is caused by not closing InputStream generated by ManagedBuffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-3546) InputStream of ManagedBuffer is not closed and causes running out of file descriptor
[ https://issues.apache.org/jira/browse/SPARK-3546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226051#comment-14226051 ] Kousuke Saruta commented on SPARK-3546: --- [~andrewor14] I don't think we need this for branch-1.1, because ManagedBuffer is not implemented in that branch. I'll remove branch-1.1 from the target version. InputStream of ManagedBuffer is not closed and causes running out of file descriptor Key: SPARK-3546 URL: https://issues.apache.org/jira/browse/SPARK-3546 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Critical Fix For: 1.2.0 If an application has lots of shuffle blocks, a resource leak (running out of file descriptors) occurs. The following text shows the file descriptors of an executor which has lots of blocks:
{code}
・・・
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9980 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/13/shuffle_0_340_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9981 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/37/shuffle_0_355_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9982 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/10/shuffle_0_370_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9983 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0c/shuffle_0_385_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9984 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0e/shuffle_0_390_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9985 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/1d/shuffle_0_405_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9986 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/36/shuffle_0_420_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9987 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/1b/shuffle_0_425_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9988 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0b/shuffle_0_430_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9989 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0d/shuffle_0_450_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:11 2014 999 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/28/shuffle_1_630_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9990 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/29/shuffle_0_465_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9991 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/14/shuffle_0_495_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9992 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/3c/shuffle_0_525_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9993 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/2a/shuffle_0_530_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9994 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/05/shuffle_0_535_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9995 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/15/shuffle_0_540_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9996 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/2c/shuffle_0_550_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9997 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/13/shuffle_0_560_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:28 2014 9998 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/12/shuffle_0_570_0.data
lr-x------ 1 yarn yarn 64 9月 16 20:27 2014 -> /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/2f/shuffle_0_580_0.data
{code}
This is caused by not closing InputStream generated by ManagedBuffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
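A minimal sketch of the shape of the fix: any InputStream obtained from a ManagedBuffer must be closed after it is consumed, or its file descriptor leaks. The loan-pattern helper below is illustrative, not Spark's actual code:
{code:scala}
import java.io.InputStream

// Run `body` on a freshly opened stream and always close it afterwards,
// releasing the underlying file descriptor even if `body` throws.
def withStream[T](open: => InputStream)(body: InputStream => T): T = {
  val in = open
  try body(in)
  finally in.close()
}
{code}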
[jira] [Commented] (SPARK-4615) Cannot disconnect from spark-shell
[ https://issues.apache.org/jira/browse/SPARK-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226097#comment-14226097 ] Sean Owen commented on SPARK-4615: -- I can't reproduce this, although I'm using Java 7, Ubuntu, and HEAD. Same on OS X, Java 8. Maybe you can kill -QUIT the Java process to see what threads are waiting on what? Is any of your code starting a thread or adding a shutdown hook? Cannot disconnect from spark-shell -- Key: SPARK-4615 URL: https://issues.apache.org/jira/browse/SPARK-4615 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Environment: Ubuntu 12/14 Reporter: Thomas Omans Priority: Minor When running spark-shell built from the `v1.2.0-snapshot1` tag, using the instructions at http://spark.apache.org/docs/latest/building-with-maven.html, attempting to disconnect from the spark shell (C-c) locks the terminal and does not interrupt the process. The spark-shell was built with: ./make-distribution.sh --tgz -Pyarn -Phive -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.1.0 -DskipTests using Oracle JDK 6 and Maven 3.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4622) Add some error information if using spark-sql in yarn-cluster mode
hzw created SPARK-4622: -- Summary: Add some error information if using spark-sql in yarn-cluster mode Key: SPARK-4622 URL: https://issues.apache.org/jira/browse/SPARK-4622 Project: Spark Issue Type: Bug Components: Deploy Reporter: hzw If using spark-sql in yarn-cluster mode, print an error message, just as the Spark shell does in yarn-cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4623) Add some error information if using spark-sql in yarn-cluster mode
hzw created SPARK-4623: -- Summary: Add some error information if using spark-sql in yarn-cluster mode Key: SPARK-4623 URL: https://issues.apache.org/jira/browse/SPARK-4623 Project: Spark Issue Type: Bug Components: Deploy Reporter: hzw If using spark-sql in yarn-cluster mode, print an error message, just as the Spark shell does in yarn-cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4624) Errors when reading/writing large object files to S3
Kriton Tsintaris created SPARK-4624: --- Summary: Errors when reading/writing large object files to S3 Key: SPARK-4624 URL: https://issues.apache.org/jira/browse/SPARK-4624 Project: Spark Issue Type: Bug Components: EC2, Input/Output, Mesos Affects Versions: 1.1.0 Environment: manually set-up Mesos cluster in EC2 made of 30 c3.4xlarge nodes Reporter: Kriton Tsintaris Priority: Critical My cluster is not configured to use hdfs. Instead, the local disk of each node is used. I've got a number of huge RDD object files (each made of ~600 part files, each of ~60 GB). They are updated extremely rarely. An example of the model of the data stored in these RDDs is the following: (Long, Array[Long]). When I load them into my cluster, using val page_users = sc.objectFile[(Long,Array[Long])]("s3n://mybucket/path/myrdd.obj.rdd") or equivalent, sometimes data is missing (as if 1 or 2 of the part files were not successfully loaded). What is more frustrating is that I get no errors indicating that this has happened! Sometimes reading from S3 times out or hits some errors, but eventually the auto-retries do succeed. Furthermore, if I attempt to write an RDD back into S3, using myrdd.saveAsObjectFile("s3n://..."), the operation will again terminate before it has completed, without any warning or indication of error. More specifically, what will happen is that the object file's parts will be left under a _temporary folder and only a few of them will have been moved to the correct path in S3. This only happens when I am writing huge object files. If my object file is just a few GB, everything is fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2192) Examples Data Not in Binary Distribution
[ https://issues.apache.org/jira/browse/SPARK-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226151#comment-14226151 ] Sean Owen commented on SPARK-2192: -- Oops, on further inspection I see that the file is not MovieLens data, but merely in the same format. The comments do say this in MovieLensALS.scala. I'll cook up a PR to add the example data to the distro. Examples Data Not in Binary Distribution Key: SPARK-2192 URL: https://issues.apache.org/jira/browse/SPARK-2192 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Pat McDonough The data used by examples is not packaged up with the binary distribution. The data subdirectory of Spark should make its way into the distribution somewhere so the examples can use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2192) Examples Data Not in Binary Distribution
[ https://issues.apache.org/jira/browse/SPARK-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226175#comment-14226175 ] Apache Spark commented on SPARK-2192: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/3480 Examples Data Not in Binary Distribution Key: SPARK-2192 URL: https://issues.apache.org/jira/browse/SPARK-2192 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Pat McDonough The data used by examples is not packaged up with the binary distribution. The data subdirectory of Spark should make its way into the distribution somewhere so the examples can use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4621) For sort-based shuffle, caching recently finished shuffle indexes can reduce index file I/O
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-4621: Priority: Minor (was: Critical) For sort-based shuffle, caching recently finished shuffle indexes can reduce index file I/O - Key: SPARK-4621 URL: https://issues.apache.org/jira/browse/SPARK-4621 Project: Spark Issue Type: Improvement Components: Shuffle Reporter: Lianhui Wang Priority: Minor In IndexShuffleBlockManager, we can use an LRU cache to store recently finished shuffle indexes, which can reduce index file I/O. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4196) Streaming + checkpointing + saveAsNewAPIHadoopFiles = NotSerializableException for Hadoop Configuration
[ https://issues.apache.org/jira/browse/SPARK-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226192#comment-14226192 ] Sean Owen commented on SPARK-4196: -- Looks good. At least, this commit got me past this particular issue. Thanks! Streaming + checkpointing + saveAsNewAPIHadoopFiles = NotSerializableException for Hadoop Configuration --- Key: SPARK-4196 URL: https://issues.apache.org/jira/browse/SPARK-4196 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: Sean Owen Assignee: Tathagata Das Fix For: 1.2.0, 1.1.2 I am reasonably sure there is some issue here in Streaming and that I'm not missing something basic, but not 100%. I went ahead and posted it as a JIRA to track, since it's come up a few times before without resolution, and right now I can't get checkpointing to work at all. When Spark Streaming checkpointing is enabled, I see a NotSerializableException thrown for a Hadoop Configuration object, and it seems like it is not one from my user code. Before I post my particular instance, see http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408135046777-12202.p...@n3.nabble.com%3E for another occurrence. I was also on a customer site last week debugging an identical issue with checkpointing in a Scala-based program, and they also could not enable checkpointing without hitting exactly this error. The essence of my code is:
{code}
final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
JavaStreamingContextFactory streamingContextFactory = new JavaStreamingContextFactory() {
  @Override
  public JavaStreamingContext create() {
    return new JavaStreamingContext(sparkContext, new Duration(batchDurationMS));
  }
};
streamingContext = JavaStreamingContext.getOrCreate(
    checkpointDirString,
    sparkContext.hadoopConfiguration(),
    streamingContextFactory,
    false);
streamingContext.checkpoint(checkpointDirString);
{code}
It yields:
{code}
2014-10-31 14:29:00,211 ERROR OneForOneStrategy:66 org.apache.hadoop.conf.Configuration
- field (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9, name: conf$2, type: class org.apache.hadoop.conf.Configuration)
- object (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9, <function2>)
- field (class org.apache.spark.streaming.dstream.ForEachDStream, name: org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc, type: interface scala.Function2)
- object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream@cb8016a)
...
{code}
This looks like it's due to PairDStreamFunctions, as this saveFunc seems to be org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9:
{code}
def saveAsNewAPIHadoopFiles(
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    conf: Configuration = new Configuration
  ) {
  val saveFunc = (rdd: RDD[(K, V)], time: Time) => {
    val file = rddToFileName(prefix, suffix, time)
    rdd.saveAsNewAPIHadoopFile(file, keyClass, valueClass, outputFormatClass, conf)
  }
  self.foreachRDD(saveFunc)
}
{code}
Is that not a problem? But then I don't know how it would ever work in Spark. But then again, I don't see why this is an issue, and only when checkpointing is enabled. Long-shot, but I wonder if it is related to closure issues like https://issues.apache.org/jira/browse/SPARK-1866 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
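For context, the usual Spark idiom for capturing a Hadoop Configuration inside a closure is to wrap it, since Configuration is not java.io.Serializable; a minimal sketch of that idiom (illustrative, not the fix that was merged for this issue):
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SerializableWritable

val conf = new Configuration()
// SerializableWritable serializes the wrapped object through its Writable
// interface, so a closure that captures it no longer drags a
// non-serializable Configuration into the checkpoint.
val wrappedConf = new SerializableWritable(conf)
// ... inside the save function, use wrappedConf.value instead of conf.
{code}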
[jira] [Created] (SPARK-4625) Support Sort By in both DSL & SimpleSQLParser
Cheng Hao created SPARK-4625: Summary: Support Sort By in both DSL & SimpleSQLParser Key: SPARK-4625 URL: https://issues.apache.org/jira/browse/SPARK-4625 Project: Spark Issue Type: New Feature Components: SQL Reporter: Cheng Hao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4625) Support Sort By in both DSL & SimpleSQLParser
[ https://issues.apache.org/jira/browse/SPARK-4625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226280#comment-14226280 ] Apache Spark commented on SPARK-4625: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3481 Support Sort By in both DSL & SimpleSQLParser --- Key: SPARK-4625 URL: https://issues.apache.org/jira/browse/SPARK-4625 Project: Spark Issue Type: New Feature Components: SQL Reporter: Cheng Hao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4612) Configuration object gets created for every task even if no new file/jar is added
[ https://issues.apache.org/jira/browse/SPARK-4612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-4612. -- Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.2.0 (was: 1.3.0) Configuration object gets created for every task even if no new file/jar is added -- Key: SPARK-4612 URL: https://issues.apache.org/jira/browse/SPARK-4612 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.2.0 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L337 creates a configuration object for every task that is launched, even if there is no new dependent file/JAR to update. This is a heavyweight creation that should be avoided if there is no new file/JAR to update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
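A hypothetical sketch of the shape of the improvement (not the actual patch): defer constructing the Hadoop Configuration until a dependency has actually changed, since building one scans XML resources and is expensive:
{code:scala}
import scala.collection.mutable

def updateDependencies(
    newFiles: Map[String, Long],                  // name -> timestamp from the driver
    currentFiles: mutable.Map[String, Long]): Unit = {
  val changed = newFiles.filter { case (name, ts) =>
    currentFiles.get(name).forall(_ < ts)
  }
  if (changed.nonEmpty) {
    // Pay for the heavyweight Configuration only when something must be fetched.
    val hadoopConf = new org.apache.hadoop.conf.Configuration()
    changed.foreach { case (name, ts) =>
      // ... fetch `name` using hadoopConf ...
      currentFiles(name) = ts
    }
  }
}
{code}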
[jira] [Updated] (SPARK-4046) Incorrect Java example on site
[ https://issues.apache.org/jira/browse/SPARK-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brennon York updated SPARK-4046: Attachment: SPARK-4046.diff Pulled down the site at the [Apache repo|https://svn.apache.org/repos/asf/spark/site], made the change, and attached the .diff file to resolve the issue. Incorrect Java example on site -- Key: SPARK-4046 URL: https://issues.apache.org/jira/browse/SPARK-4046 Project: Spark Issue Type: Bug Components: Documentation, Java API Affects Versions: 1.1.0 Environment: Web Reporter: Ian Babrou Priority: Minor Attachments: SPARK-4046.diff https://spark.apache.org/examples.html The word count example for Java here is incorrect: it should be mapToPair instead of map. The correct example is here: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4604) Make MatrixFactorizationModel constructor public
[ https://issues.apache.org/jira/browse/SPARK-4604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4604. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3473 [https://github.com/apache/spark/pull/3473] Make MatrixFactorizationModel constructor public Key: SPARK-4604 URL: https://issues.apache.org/jira/browse/SPARK-4604 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.2.0 Make MatrixFactorizationModel public and add note about partitioning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4614) Slight API changes in Matrix and Matrices
[ https://issues.apache.org/jira/browse/SPARK-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226396#comment-14226396 ] Apache Spark commented on SPARK-4614: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3482 Slight API changes in Matrix and Matrices - Key: SPARK-4614 URL: https://issues.apache.org/jira/browse/SPARK-4614 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Before we have a full picture of the operators we want to add, it might be safer to hide `Matrix.transposeMultiply` in 1.2.0. Another update we want to change is `Matrix.randn` and `Matrix.rand`, both of which should take a Random implementation. Otherwise, it is very likely to produce inconsistent RDDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
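A sketch of the proposed signature change (illustrative, not the merged API): rand/randn take an explicit java.util.Random so callers control seeding and can hand consistent generators to different workers:
{code:scala}
import java.util.Random

// Column-major data for a dense matrix, filled from a caller-supplied RNG.
def rand(numRows: Int, numCols: Int, rng: Random): Array[Double] =
  Array.fill(numRows * numCols)(rng.nextDouble())

def randn(numRows: Int, numCols: Int, rng: Random): Array[Double] =
  Array.fill(numRows * numCols)(rng.nextGaussian())
{code}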
[jira] [Created] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
Victor Tso created SPARK-4626: - Summary: NoSuchElementException in CoarseGrainedSchedulerBackend Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Victor Tso Priority: Blocker
26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0
java.util.NoSuchElementException: key not found: 0
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
This came on the heels of a lot of lost executors with error messages like:
26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226466#comment-14226466 ] Victor Tso commented on SPARK-4626: --- Apologies if Blocker is the wrong level. We're seeing this in production so it has us jittery. NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Victor Tso Priority: Blocker 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Victor Tso updated SPARK-4626: -- Affects Version/s: (was: 1.1.0) 1.0.2 NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso Priority: Blocker 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Victor Tso updated SPARK-4626: -- Priority: Major (was: Blocker) NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Victor Tso updated SPARK-4626: -- Comment: was deleted (was: Apologies if Blocker is the wrong level. We're seeing this in production so it has us jittery.) NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226480#comment-14226480 ] Victor Tso commented on SPARK-4626: --- It looks like a race where the KillTask message comes in shortly after a RemoveExecutor message. If the executor has already been removed, KillTask should be a no-op: executorActor.get(executorId).foreach(_ ! KillTask(taskId, executorId, interruptThread)) NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
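A minimal sketch of that guard, with the actor bookkeeping reduced to a plain map and a function-valued send; these names follow the snippet above, not necessarily the actual Spark source:
{code:scala}
// Sketch of the proposed no-op guard; the map and the send function are
// stand-ins for the real actor bookkeeping in CoarseGrainedSchedulerBackend.
import scala.collection.mutable

case class KillTask(taskId: Long, executorId: String, interruptThread: Boolean)

val executorActor = mutable.HashMap[String, KillTask => Unit]()

def handleKillTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit = {
  // executorActor(executorId) throws NoSuchElementException once the executor
  // has been removed; .get(...).foreach(...) silently drops the message instead.
  executorActor.get(executorId).foreach { send =>
    send(KillTask(taskId, executorId, interruptThread))
  }
}
{code}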
[jira] [Commented] (SPARK-4584) 2x Performance regression for Spark-on-YARN
[ https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226506#comment-14226506 ] Marcelo Vanzin commented on SPARK-4584: --- [~sowen] that was going to be my suggestion. For 1.2, I'll just remove the security manager and declare "use System.exit() at your own risk" - the behavior should then be the same as 1.1. Post 1.2, we could add some new exception (e.g. {{SparkAppException}}) that users can throw if they want the runtime to exit with a specific error code. But even that I don't think is strictly necessary - just a nice-to-have. BTW, we've tried to extend the security manager so that all operations except for {{checkExit}} are no-ops, but even that doesn't help. 2x Performance regression for Spark-on-YARN --- Key: SPARK-4584 URL: https://issues.apache.org/jira/browse/SPARK-4584 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Nishkam Ravi Assignee: Marcelo Vanzin Priority: Blocker Significant performance regression observed for Spark-on-YARN (up to 2x) after the 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 from Oct 7th. The problem can be reproduced with JavaWordCount against a large enough input dataset in YARN cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
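An illustrative sketch of the security-manager approach described above, assuming a manager that intercepts only exit; this is a reconstruction of the attempt, not Spark's code:
{code:scala}
// A SecurityManager that traps System.exit() and treats every other check
// as a no-op. This is an assumption about the attempted fix, not Spark code.
class ExitTrappingSecurityManager extends SecurityManager {
  override def checkExit(status: Int): Unit =
    throw new SecurityException(s"System.exit($status) intercepted")
  // Even with all other checks reduced to no-ops, every permission check in
  // the JVM still routes through this manager, which is the overhead the
  // comment above refers to.
  override def checkPermission(perm: java.security.Permission): Unit = ()
}
// Installed via: System.setSecurityManager(new ExitTrappingSecurityManager())
{code}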
[jira] [Commented] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226513#comment-14226513 ] Victor Tso commented on SPARK-4626: --- https://github.com/apache/spark/pull/3483 NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226514#comment-14226514 ] Apache Spark commented on SPARK-4626: - User 'roxchkplusony' has created a pull request for this issue: https://github.com/apache/spark/pull/3483 NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226562#comment-14226562 ] Jacky Li commented on SPARK-4001: - Thanks for your suggestion, Daniel. Here is the current status. 1. Currently I have implemented Apriori and FP-growth by referring to YAFIM (http://pasa-bigdata.nju.edu.cn/people/ronggu/pub/YAFIM_ParLearning.pdf) and PFP (http://dl.acm.org/citation.cfm?id=1454027). For Apriori there are currently two versions implemented, one using a broadcast variable and another using a Cartesian join of two RDDs. I am testing them with the mushroom and webdocs open datasets (http://fimi.ua.ac.be/data/) to check their performance before deciding which one to contribute to MLlib. I have updated the code in the PR (https://github.com/apache/spark/pull/2847); you are welcome to check it and try it in your use case. 2. For the input part, the Apriori algorithm currently takes {{RDD\[Array\[String\]\]}} as the input dataset, without a basket_id or user_id. I am not sure whether it can easily fit your use case. Can you give more detail on how you want to use it in collaborative filtering contexts? Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent itemset mining in a transactional data set. It will be useful if the Apriori algorithm is added to MLlib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
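A hypothetical usage sketch of that input format (one transaction per record, items as strings); the {{Apriori.run}} entry point shown in the comment is an assumption, not the final MLlib API:
{code:scala}
// Transactions as RDD[Array[String]], matching the format described above.
import org.apache.spark.SparkContext

def runExample(sc: SparkContext): Unit = {
  val transactions = sc.parallelize(Seq(
    Array("bread", "milk"),
    Array("bread", "diapers", "beer"),
    Array("milk", "diapers", "beer")
  ))
  // Hypothetical call; the real entry point may differ:
  // val frequentItemsets = Apriori.run(transactions, minSupport = 0.5)
}
{code}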
[jira] [Commented] (SPARK-4584) 2x Performance regression for Spark-on-YARN
[ https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226637#comment-14226637 ] Apache Spark commented on SPARK-4584: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/3484 2x Performance regression for Spark-on-YARN --- Key: SPARK-4584 URL: https://issues.apache.org/jira/browse/SPARK-4584 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Nishkam Ravi Assignee: Marcelo Vanzin Priority: Blocker Significant performance regression observed for Spark-on-YARN (up to 2x) after the 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 from Oct 7th. The problem can be reproduced with JavaWordCount against a large enough input dataset in YARN cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4626: - Assignee: Victor Tso NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso Assignee: Victor Tso 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4627) Too many TripletFields
Joseph E. Gonzalez created SPARK-4627: - Summary: Too many TripletFields Key: SPARK-4627 URL: https://issues.apache.org/jira/browse/SPARK-4627 Project: Spark Issue Type: Bug Components: GraphX Reporter: Joseph E. Gonzalez Priority: Trivial The `TripletFields` class defines a set of constants for all possible configurations of the triplet fields. However, many are not useful and as a result the API is slightly confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4627) Too many TripletFields
[ https://issues.apache.org/jira/browse/SPARK-4627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph E. Gonzalez closed SPARK-4627. - Resolution: Fixed Fix Version/s: 1.2.0 Fixed in PR #3472. Too many TripletFields -- Key: SPARK-4627 URL: https://issues.apache.org/jira/browse/SPARK-4627 Project: Spark Issue Type: Bug Components: GraphX Reporter: Joseph E. Gonzalez Priority: Trivial Fix For: 1.2.0 The `TripletFields` class defines a set of constants for all possible configurations of the triplet fields. However, many are not useful and as a result the API is slightly confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4627) Too many TripletFields
[ https://issues.apache.org/jira/browse/SPARK-4627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226652#comment-14226652 ] Apache Spark commented on SPARK-4627: - User 'jegonzal' has created a pull request for this issue: https://github.com/apache/spark/pull/3472 Too many TripletFields -- Key: SPARK-4627 URL: https://issues.apache.org/jira/browse/SPARK-4627 Project: Spark Issue Type: Bug Components: GraphX Reporter: Joseph E. Gonzalez Priority: Trivial Fix For: 1.2.0 The `TripletFields` class defines a set of constants for all possible configurations of the triplet fields. However, many are not useful and as a result the API is slightly confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4628) Put all external projects behind a build flag
Patrick Wendell created SPARK-4628: -- Summary: Put all external projects behind a build flag Key: SPARK-4628 URL: https://issues.apache.org/jira/browse/SPARK-4628 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Priority: Blocker This is something we talked about doing for convenience, but I'm escalating this based on realizing today that some of our external projects depend on code that is not in Maven Central. I.e. if one of these dependencies is taken down (as happened recently with mqtt), all Spark builds will fail. The proposal here is simple: have a profile -Pexternal-projects that enables these. This can follow the exact pattern of -Pkinesis-asl, which was disabled by default due to a license issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
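For illustration, an invocation of the proposed profile, following the -Pkinesis-asl pattern; the profile name comes from this proposal, not a merged build file:
{code}
# Hypothetical invocation; enables the otherwise-skipped external modules.
mvn -Pexternal-projects -DskipTests clean package
{code}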
[jira] [Commented] (SPARK-4628) Put all external projects behind a build flag
[ https://issues.apache.org/jira/browse/SPARK-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226691#comment-14226691 ] Apache Spark commented on SPARK-4628: - User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/3485 Put all external projects behind a build flag - Key: SPARK-4628 URL: https://issues.apache.org/jira/browse/SPARK-4628 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Priority: Blocker This is something we talked about doing for convenience, but I'm escalating this based on realizing today that some of our external projects depend on code that is not in Maven Central. I.e. if one of these dependencies is taken down (as happened recently with mqtt), all Spark builds will fail. The proposal here is simple: have a profile -Pexternal-projects that enables these. This can follow the exact pattern of -Pkinesis-asl, which was disabled by default due to a license issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4614) Slight API changes in Matrix and Matrices
[ https://issues.apache.org/jira/browse/SPARK-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4614. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3482 [https://github.com/apache/spark/pull/3482] Slight API changes in Matrix and Matrices - Key: SPARK-4614 URL: https://issues.apache.org/jira/browse/SPARK-4614 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.2.0 Before we have a full picture of the operators we want to add, it might be safer to hide `Matrix.transposeMultiply` in 1.2.0. Another change we want to make is to `Matrix.randn` and `Matrix.rand`, both of which should take a Random implementation. Otherwise, they are very likely to produce inconsistent RDDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
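A minimal sketch of the motivation, assuming the post-change shape (an explicit java.util.Random argument); this is not the exact MLlib signature:
{code:scala}
// Passing the generator explicitly lets callers seed it, so regenerating
// the same matrix (e.g. inside a recomputed RDD partition) is deterministic.
import java.util.Random

def rand(numRows: Int, numCols: Int, rng: Random): Array[Double] =
  Array.fill(numRows * numCols)(rng.nextDouble())

// rand(2, 2, new Random(42L)).toSeq == rand(2, 2, new Random(42L)).toSeq  // true
{code}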
[jira] [Commented] (SPARK-2450) Provide link to YARN executor logs on UI
[ https://issues.apache.org/jira/browse/SPARK-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226706#comment-14226706 ] Apache Spark commented on SPARK-2450: - User 'ksakellis' has created a pull request for this issue: https://github.com/apache/spark/pull/3486 Provide link to YARN executor logs on UI Key: SPARK-2450 URL: https://issues.apache.org/jira/browse/SPARK-2450 Project: Spark Issue Type: Improvement Components: Web UI, YARN Affects Versions: 1.0.0 Reporter: Bill Havanki Assignee: Kostas Sakellis Priority: Minor When running under YARN, provide links to executor logs from the web UI to avoid the need to drill down through the YARN UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4376) Put external modules behind build profiles
[ https://issues.apache.org/jira/browse/SPARK-4376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-4376. --- Resolution: Duplicate Put external modules behind build profiles -- Key: SPARK-4376 URL: https://issues.apache.org/jira/browse/SPARK-4376 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Assignee: Sandy Ryza Priority: Blocker Several people have asked me whether, to speed up the build, we can put the external projects behind build flags similar to the kinesis-asl module. Since these aren't in the assembly, there isn't a great reason to build them by default. We can just modify our release script to build them, and build them when we run tests. This doesn't technically block Spark 1.2, but it is going to be looped into a separate fix that does block Spark 1.2, so I'm upgrading it to blocker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4628) Put all external projects behind a build flag
[ https://issues.apache.org/jira/browse/SPARK-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226722#comment-14226722 ] Sandy Ryza commented on SPARK-4628: --- This looks like a duplicate of SPARK-4376. Resolving that one as a duplicate because it has no PR. Put all external projects behind a build flag - Key: SPARK-4628 URL: https://issues.apache.org/jira/browse/SPARK-4628 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Priority: Blocker This is something we talked about doing for convenience, but I'm escalating this based on realizing today that some of our external projects depend on code that is not in Maven Central. I.e. if one of these dependencies is taken down (as happened recently with mqtt), all Spark builds will fail. The proposal here is simple: have a profile -Pexternal-projects that enables these. This can follow the exact pattern of -Pkinesis-asl, which was disabled by default due to a license issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226793#comment-14226793 ] Daniel Erenrich commented on SPARK-4001: First, a minor point: I'd suggest making it take an RDD of arrays of ints. Holding tons of strings around seems wasteful; the user can maintain a map from strings to ints. Or else, can we just make the array contain comparables? The main issue is whether we are trying to do frequent itemset mining or association rule construction. I argue the latter is the more common operation, and since the second requires the first, there's no great extra cost to doing both. I'm actually unfamiliar with what can be done with just the former and not the latter. If you equate baskets with users, the connection between association rules and collaborative filtering becomes very clear. I want to feed in someone's, say, movie viewing history and get recommendations of the form "you watched X and Y, so you'll really like Z" (where X, Y, Z is a frequent itemset). The API could be made to match: training takes all the things each user bought, in the format I described above, and prediction takes all the things a given person bought and applies as many rules as it can. If a user does care about the frequent itemsets, those would additionally be stored inside the model. The alternative here is to make a set of frequent itemset miners and then an association rule learner that takes their output, as sketched below. The only downside is that this suffers some performance loss (requiring an additional pass). I'll gladly write this version if we decide that's the way we should go. Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent itemset mining in a transactional data set. It will be useful if the Apriori algorithm is added to MLlib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
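A sketch of that two-stage alternative; every type and method name below is hypothetical:
{code:scala}
// Separate frequent-itemset mining from rule learning. The rule learner
// consuming the miner's output is the extra pass mentioned above.
import org.apache.spark.rdd.RDD

case class Itemset(items: Array[Int], support: Double)
case class Rule(antecedent: Array[Int], consequent: Array[Int], confidence: Double)

trait FrequentItemsetMiner {
  def mine(transactions: RDD[Array[Int]], minSupport: Double): Seq[Itemset]
}

trait AssociationRuleLearner {
  def learn(itemsets: Seq[Itemset], minConfidence: Double): Seq[Rule]
}
{code}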
[jira] [Created] (SPARK-4629) Spark SQL uses Hadoop Configuration in a thread-unsafe manner when writing Parquet files
Michael Allman created SPARK-4629: - Summary: Spark SQL uses Hadoop Configuration in a thread-unsafe manner when writing Parquet files Key: SPARK-4629 URL: https://issues.apache.org/jira/browse/SPARK-4629 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Michael Allman The method {{ParquetRelation.createEmpty}} mutates its given Hadoop {{Configuration}} instance to set the Parquet writer compression level (cf. https://github.com/apache/spark/blob/v1.1.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala#L149). This can lead to a {{ConcurrentModificationException}} when running concurrent jobs sharing a single {{SparkContext}} that involve saving Parquet files. Our fix was to simply remove the line in question and set the compression level in the Hadoop configuration before starting our jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
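A sketch of a thread-safe alternative, assuming Hadoop's standard Configuration copy constructor; the compression value is illustrative:
{code:scala}
// Copy rather than mutate the shared instance, so a concurrent job
// iterating over it cannot hit ConcurrentModificationException.
import org.apache.hadoop.conf.Configuration

def perJobConf(shared: Configuration): Configuration = {
  val conf = new Configuration(shared)
  conf.set("parquet.compression", "gzip")
  conf
}
{code}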
[jira] [Resolved] (SPARK-4583) GradientBoostedTrees error logging should use loss being minimized
[ https://issues.apache.org/jira/browse/SPARK-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4583. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3474 [https://github.com/apache/spark/pull/3474] GradientBoostedTrees error logging should use loss being minimized -- Key: SPARK-4583 URL: https://issues.apache.org/jira/browse/SPARK-4583 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Fix For: 1.2.0 Currently, the LogLoss used by GradientBoostedTrees has 2 issues: * the gradient (and therefore loss) does not match that used by Friedman (1999) * the error computation uses 0/1 accuracy, not log loss -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
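For reference, a hedged reconstruction of the loss and gradient in Friedman (1999) form for labels y in {-1, +1}, written to match the issue description rather than copied from the merged patch:
{code:scala}
import math.{exp, log}

// Log loss for a raw prediction F(x).
def loss(label: Double, prediction: Double): Double =
  2.0 * log(1.0 + exp(-2.0 * label * prediction))

// Derivative of the loss with respect to the prediction.
def gradient(label: Double, prediction: Double): Double =
  -4.0 * label / (1.0 + exp(2.0 * label * prediction))
{code}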
[jira] [Commented] (SPARK-2192) Examples Data Not in Binary Distribution
[ https://issues.apache.org/jira/browse/SPARK-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226868#comment-14226868 ] Pat McDonough commented on SPARK-2192: -- Thanks [~srowen]. And yes, Xiangrui confirmed he just generated the data. Examples Data Not in Binary Distribution Key: SPARK-2192 URL: https://issues.apache.org/jira/browse/SPARK-2192 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Pat McDonough The data used by examples is not packaged up with the binary distribution. The data subdirectory of Spark should make its way into the distribution somewhere so the examples can use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4630) Dynamically determine optimal number of partitions
Kostas Sakellis created SPARK-4630: -- Summary: Dynamically determine optimal number of partitions Key: SPARK-4630 URL: https://issues.apache.org/jira/browse/SPARK-4630 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions and the number of tasks - larger partitions, fewer tasks. For better performance, Spark has a sweet spot for how large the partitions executed by a task should be. If partitions are too small, then the user pays a disproportionate cost in scheduling overhead. If the partitions are too large, then task execution slows down due to GC pressure and spilling to disk. To increase the performance of jobs, users often hand-optimize the number (size) of partitions that the next stage gets. Factors that come into play are: incoming partition sizes from the previous stage, the number of available executors, and the available memory per executor (taking into account spark.shuffle.memoryFraction). Spark has access to this data and so should be able to automatically do the partition sizing for the user. This feature can be turned off/on with a configuration option. To make this happen, we propose modifying the DAGScheduler to take partition sizes into account upon stage completion. Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number of tasks to create. Since this change requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
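A back-of-envelope sketch of the sizing idea; the 128 MB target and the single-input signature are assumptions, not the proposed design:
{code:scala}
// Derive a task count from the previous stage's total output size and a
// target bytes-per-partition sweet spot.
def numPartitions(totalShuffleBytes: Long,
                  targetPartitionBytes: Long = 128L * 1024 * 1024): Int =
  math.max(1, math.ceil(totalShuffleBytes.toDouble / targetPartitionBytes).toInt)

// e.g. numPartitions(100L * 1024 * 1024 * 1024) == 800 tasks for 100 GB
{code}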
[jira] [Updated] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4630: -- Assignee: Kostas Sakellis Dynamically determine optimal number of partitions -- Key: SPARK-4630 URL: https://issues.apache.org/jira/browse/SPARK-4630 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions and the number of tasks - larger partitions, fewer tasks. For better performance, Spark has a sweet spot for how large the partitions executed by a task should be. If partitions are too small, then the user pays a disproportionate cost in scheduling overhead. If the partitions are too large, then task execution slows down due to GC pressure and spilling to disk. To increase the performance of jobs, users often hand-optimize the number (size) of partitions that the next stage gets. Factors that come into play are: incoming partition sizes from the previous stage, the number of available executors, and the available memory per executor (taking into account spark.shuffle.memoryFraction). Spark has access to this data and so should be able to automatically do the partition sizing for the user. This feature can be turned off/on with a configuration option. To make this happen, we propose modifying the DAGScheduler to take partition sizes into account upon stage completion. Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number of tasks to create. Since this change requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958 ] Aaron Staple commented on SPARK-1503: - [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking. I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you’ve heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226969#comment-14226969 ] Sandy Ryza commented on SPARK-4452: --- Thinking about the current change a little more, an issue is that it will spill all the in-memory data to disk in situations where this is probably overkill. E.g. consider the typical situation of shuffle data slightly exceeding memory. We end up spilling the entire data structure if a downstream data structure needs even a small amount of memory. I think that your proposed change 2 is probably worthwhile. Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Tianshuo Deng Assignee: Tianshuo Deng Priority: Critical When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. Currently, ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator) in this case. Here is an example: Due to the usage of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the same thread, the ShuffleMemoryManager could refuse to allocate more memory to it, since the memory is already given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object that holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant could be evicted/spilled. Previously the spillable objects triggered spilling by themselves, so one might not trigger spilling even if another object in the same thread needed more memory. After this change, the ShuffleMemoryManager can trigger the spilling of an object if it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager. I have already made change 3 and have a prototype of changes 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
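A simplified sketch of what proposed change 2 could look like; the interfaces below are hypothetical reductions of the real classes:
{code:scala}
// The manager, not the consumer, decides when a holder spills.
import scala.collection.mutable

trait Spillable {
  def memoryUsed: Long
  def spill(): Unit // release memory by writing contents to disk
}

class ShuffleMemoryManager(maxMemory: Long) {
  private val holders = mutable.Set[Spillable]()
  private var used = 0L

  def tryAcquire(requester: Spillable, bytes: Long): Boolean = synchronized {
    holders += requester
    // Evict other holders until the request fits, instead of waiting for
    // them to decide to spill on their own.
    val it = holders.iterator
    while (used + bytes > maxMemory && it.hasNext) {
      val h = it.next()
      if (h ne requester) {
        used -= h.memoryUsed
        h.spill()
      }
    }
    if (used + bytes <= maxMemory) { used += bytes; true } else false
  }
}
{code}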
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958 ] Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:09 PM: [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking (for our objective functions in particular). I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you’ve heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. was (Author: staple): [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. 
For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking. I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you’ve heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958 ] Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:10 PM: [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning applications. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking (for our objective functions in particular). I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you’ve heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. was (Author: staple): [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. 
For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking (for our objective functions in particular). I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you’ve heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958 ] Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:14 PM: [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning applications. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking (for our objective functions in particular). I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you happen to have heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. was (Author: staple): [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning applications. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. 
One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking (for our objective functions in particular). I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you’ve heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227023#comment-14227023 ] Reza Zadeh commented on SPARK-1503: --- Thanks Aaron. From an implementation perspective, it's probably easier to implement a constant step size first. From there you can see if there is any finicky behavior and compare to the unaccelerated proximal gradient already in Spark. If it works well enough, we should commit the first PR without backtracking and then experiment with backtracking; otherwise, if you see strange behavior, you can decide whether backtracking would solve it. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
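For concreteness, a constant-step-size accelerated (Nesterov/FISTA-style) iteration in dense-array form; a sketch of the standard update, not MLlib code:
{code:scala}
// Minimize f given its gradient; `step` would come from a Lipschitz
// estimate in the absence of backtracking, as discussed above.
def accelerate(grad: Array[Double] => Array[Double],
               x0: Array[Double], step: Double, iters: Int): Array[Double] = {
  var x = x0.clone
  var y = x0.clone
  var t = 1.0
  for (_ <- 1 to iters) {
    val g = grad(y)
    val xNext = Array.tabulate(y.length)(i => y(i) - step * g(i))
    val tNext = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
    // Momentum step: extrapolate past the plain gradient update.
    y = Array.tabulate(xNext.length) { i =>
      xNext(i) + ((t - 1.0) / tNext) * (xNext(i) - x(i))
    }
    x = xNext
    t = tNext
  }
  x
}
{code}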
[jira] [Resolved] (SPARK-4046) Incorrect Java example on site
[ https://issues.apache.org/jira/browse/SPARK-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4046. Resolution: Fixed Assignee: Reynold Xin Incorrect Java example on site -- Key: SPARK-4046 URL: https://issues.apache.org/jira/browse/SPARK-4046 Project: Spark Issue Type: Bug Components: Documentation, Java API Affects Versions: 1.1.0 Environment: Web Reporter: Ian Babrou Assignee: Reynold Xin Priority: Minor Attachments: SPARK-4046.diff https://spark.apache.org/examples.html Here the word count example for Java is incorrect. It should be mapToPair instead of map. The correct example is here: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4046) Incorrect Java example on site
[ https://issues.apache.org/jira/browse/SPARK-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227053#comment-14227053 ] Reynold Xin commented on SPARK-4046: Thanks - I pushed the change and updated the website. Incorrect Java example on site -- Key: SPARK-4046 URL: https://issues.apache.org/jira/browse/SPARK-4046 Project: Spark Issue Type: Bug Components: Documentation, Java API Affects Versions: 1.1.0 Environment: Web Reporter: Ian Babrou Assignee: Reynold Xin Priority: Minor Attachments: SPARK-4046.diff https://spark.apache.org/examples.html Here the word count example for Java is incorrect. It should be mapToPair instead of map. The correct example is here: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4631) Add real unit test for MQTT
Tathagata Das created SPARK-4631: Summary: Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that MQTTUtils is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
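A rough shape such a test could take, assuming a locally reachable broker URL; a real unit test would also need to start an embedded MQTT broker and assert on the received records:
{code:scala}
// Publish one message through Paho and consume it via MQTTUtils.
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.mqtt.MQTTUtils
import org.eclipse.paho.client.mqttv3.MqttClient

def publishAndReceive(ssc: StreamingContext, brokerUrl: String, topic: String): Unit = {
  val stream = MQTTUtils.createStream(ssc, brokerUrl, topic)
  stream.foreachRDD(rdd => println(rdd.collect().mkString(",")))
  val client = new MqttClient(brokerUrl, MqttClient.generateClientId())
  client.connect()
  client.publish(topic, "hello".getBytes("UTF-8"), 1, false)
  client.disconnect()
}
{code}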
[jira] [Resolved] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3628. -- Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.1.2 (was: 0.9.3, 1.0.3, 1.1.2, 1.2.1) Don't apply accumulator updates multiple times for tasks in result stages - Key: SPARK-3628 URL: https://issues.apache.org/jira/browse/SPARK-3628 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Nan Zhu Priority: Blocker Fix For: 1.2.0 In previous versions of Spark, accumulator updates only got applied once for accumulators that are only used in actions (i.e. result stages), letting you use them to deterministically compute a result. Unfortunately, this got broken in some recent refactorings. This is related to https://issues.apache.org/jira/browse/SPARK-732, but that issue is about applying the same semantics to intermediate stages too, which is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227077#comment-14227077 ] Matei Zaharia commented on SPARK-3628: -- FYI I merged this into 1.2.0, since the patch is now quite a bit smaller. We should decide whether we want to backport it to branch-1.1, so I'll leave it open for that reason. I don't think there's much point backporting it further because the issue is somewhat rare, but we can do it if people ask for it. Don't apply accumulator updates multiple times for tasks in result stages - Key: SPARK-3628 URL: https://issues.apache.org/jira/browse/SPARK-3628 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Nan Zhu Priority: Blocker Fix For: 1.2.0 In previous versions of Spark, accumulator updates only got applied once for accumulators that are only used in actions (i.e. result stages), letting you use them to deterministically compute a result. Unfortunately, this got broken in some recent refactorings. This is related to https://issues.apache.org/jira/browse/SPARK-732, but that issue is about applying the same semantics to intermediate stages too, which is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
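Since the guarantee at stake here is easy to misread, a minimal Scala sketch of the result-stage semantics being restored; it assumes an existing SparkContext {{sc}} and is illustrative, not code from the patch:
{code}
val acc = sc.accumulator(0)
val data = sc.parallelize(1 to 100)
// foreach is an action, so its tasks form a result stage. With the fix,
// a task that is re-run after a failure applies its accumulator update
// only once, so the final value is deterministic.
data.foreach(x => acc += 1)
assert(acc.value == 100)
{code}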
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227076#comment-14227076 ] Tathagata Das commented on SPARK-4631: -- [~prabeeshk] Could you please take a crack at this? This is important because otherwise it is becoming impossible for us to maintain the MQTT functionality while the MQTT maintainers are pulling down the older version 0.4.0 (which we currently depend on). We cannot upgrade to 1.0+ until we have a unit test that actually tests whether the stream works. Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227091#comment-14227091 ] Tathagata Das commented on SPARK-4632: -- This is a less scary alternative to SPARK-4628. Upgrade MQTT dependency to use latest mqtt-client - Key: SPARK-4632 URL: https://issues.apache.org/jira/browse/SPARK-4632 Project: Spark Issue Type: Test Components: Streaming Affects Versions: 1.0.2, 1.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking Spark build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227108#comment-14227108 ] Matei Zaharia commented on SPARK-732: - As discussed on https://github.com/apache/spark/pull/2524 this is pretty hard to provide good semantics for in the general case (accumulator updates inside non-result stages), for the following reasons: - An RDD may be computed as part of multiple stages. For example, if you update an accumulator inside a MappedRDD and then shuffle it, that might be one stage. But if you then call map() again on the MappedRDD, and shuffle the result of that, you get a second stage where that map is pipelined. Do you want to count this accumulator update twice or not? - Entire stages may be resubmitted if shuffle files are deleted by the periodic cleaner or are lost due to a node failure, so anything that tracks RDDs would need to do so for long periods of time (as long as the RDD is referenceable in the user program), which would be pretty complicated to implement. So I'm going to mark this as won't fix for now, except for the part for result stages done in SPARK-3628. Recomputation of RDDs may result in duplicated accumulator updates -- Key: SPARK-732 URL: https://issues.apache.org/jira/browse/SPARK-732 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1, 0.8.2, 0.9.0, 1.0.1, 1.1.0 Reporter: Josh Rosen Assignee: Nan Zhu Priority: Blocker Currently, Spark doesn't guard against duplicated updates to the same accumulator due to recomputations of an RDD. For example: {code} val acc = sc.accumulator(0) data.map(x => { acc += 1; f(x) }) data.count() // acc should equal data.count() here data.foreach{...} // Now, acc = 2 * data.count() because the map() was recomputed. {code} I think that this behavior is incorrect, especially because this behavior allows the addition or removal of a cache() call to affect the outcome of a computation. There's an old TODO to fix this duplicate update issue in the [DAGScheduler code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494]. I haven't tested whether recomputation due to blocks being dropped from the cache can trigger duplicate accumulator updates. Hypothetically someone could be relying on the current behavior to implement performance counters that track the actual number of computations performed (including recomputations). To be safe, we could add an explicit warning in the release notes that documents the change in behavior when we fix this. Ignoring duplicate updates shouldn't be too hard, but there are a few subtleties. Currently, we allow accumulators to be used in multiple transformations, so we'd need to detect duplicate updates at the per-transformation level. I haven't dug too deeply into the scheduler internals, but we might also run into problems where pipelining causes what is logically one set of accumulator updates to show up in two different tasks (e.g. rdd.map { x => accum += x; ... } and rdd.map { x => accum += x; ... }.count() may cause what's logically the same accumulator update to be applied from two different contexts, complicating the detection of duplicate updates). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
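To make the first bullet in the comment above concrete, a small Scala sketch of the pipelining ambiguity, assuming {{rdd}} is some existing RDD; the names and shapes are illustrative only:
{code}
val acc = sc.accumulator(0)
val pairs = rdd.map { x => acc += 1; (x, 1) }
// Job 1: the map is pipelined into the stage that writes shuffle output.
pairs.reduceByKey(_ + _).count()
// Job 2: a new shuffle over the same RDD re-runs the map, pipelined into
// a different stage. Should each element be counted once or twice?
pairs.map(identity).reduceByKey(_ + _).count()
{code}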
[jira] [Reopened] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reopened SPARK-3628: -- Don't apply accumulator updates multiple times for tasks in result stages - Key: SPARK-3628 URL: https://issues.apache.org/jira/browse/SPARK-3628 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Nan Zhu Priority: Blocker Fix For: 1.2.0 In previous versions of Spark, accumulator updates only got applied once for accumulators that are only used in actions (i.e. result stages), letting you use them to deterministically compute a result. Unfortunately, this got broken in some recent refactorings. This is related to https://issues.apache.org/jira/browse/SPARK-732, but that issue is about applying the same semantics to intermediate stages too, which is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227121#comment-14227121 ] Tathagata Das commented on SPARK-4631: -- I am optimistically going ahead and upgrading the MQTT client version in SPARK-4632. If things break, we will fix it in a patch release. Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4632: - Issue Type: Improvement (was: Test) Upgrade MQTT dependency to use latest mqtt-client - Key: SPARK-4632 URL: https://issues.apache.org/jira/browse/SPARK-4632 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2, 1.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking Spark build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4633) Support gzip in spark.io.compression.codec
Takeshi Yamamuro created SPARK-4633: --- Summary: Support gzip in spark.io.compression.codec Key: SPARK-4633 URL: https://issues.apache.org/jira/browse/SPARK-4633 Project: Spark Issue Type: Improvement Components: Input/Output Reporter: Takeshi Yamamuro Priority: Trivial gzip is widely used in other frameworks such as Hadoop MapReduce and Tez, and I also think that gzip is more stable than other codecs in terms of both performance and space overheads. I have one open question: current Spark configurations have a block size option for each codec (spark.io.compression.[gzip|lz4|snappy].block.size). As the number of codecs increases, the configurations gain more options, and I think that is somewhat complicated for non-expert users. To mitigate this, my thought is as follows: the three configurations are replaced with a single option for block size (spark.io.compression.block.size). Then, 'Meaning' in the configuration table would describe: This option affects gzip, lz4, and snappy. Block size (in bytes) used in compression, in the case when these compression codecs are used. Lowering the block size will also lower shuffle memory usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
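As a sketch of what such a codec could look like against Spark's CompressionCodec trait: the class name, the constructor shape, and the unified block-size option are assumptions modeled on the existing built-in codecs, not the actual pull request.
{code}
import java.io.{InputStream, OutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Hypothetical gzip codec mirroring the shape of Spark's built-in codecs.
class GzipCompressionCodec(conf: SparkConf) extends CompressionCodec {
  // Illustrative use of the proposed single block-size option;
  // 32 KB is an arbitrary default for the sketch.
  private val blockSize =
    conf.getInt("spark.io.compression.block.size", 32 * 1024)

  override def compressedOutputStream(s: OutputStream): OutputStream =
    new GZIPOutputStream(s, blockSize)

  override def compressedInputStream(s: InputStream): InputStream =
    new GZIPInputStream(s, blockSize)
}
{code}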
[jira] [Commented] (SPARK-4633) Support gzip in spark.io.compression.codec
[ https://issues.apache.org/jira/browse/SPARK-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227133#comment-14227133 ] Apache Spark commented on SPARK-4633: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/3488 Support gzip in spark.io.compression.codec -- Key: SPARK-4633 URL: https://issues.apache.org/jira/browse/SPARK-4633 Project: Spark Issue Type: Improvement Components: Input/Output Reporter: Takeshi Yamamuro Priority: Trivial gzip is widely used in other frameworks such as Hadoop MapReduce and Tez, and I also think that gzip is more stable than other codecs in terms of both performance and space overheads. I have one open question: current Spark configurations have a block size option for each codec (spark.io.compression.[gzip|lz4|snappy].block.size). As the number of codecs increases, the configurations gain more options, and I think that is somewhat complicated for non-expert users. To mitigate this, my thought is as follows: the three configurations are replaced with a single option for block size (spark.io.compression.block.size). Then, 'Meaning' in the configuration table would describe: This option affects gzip, lz4, and snappy. Block size (in bytes) used in compression, in the case when these compression codecs are used. Lowering the block size will also lower shuffle memory usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4634) Enable metrics for each application to be gathered in one node
Masayoshi TSUZUKI created SPARK-4634: Summary: Enable metrics for each application to be gathered in one node Key: SPARK-4634 URL: https://issues.apache.org/jira/browse/SPARK-4634 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Metrics output is now like this: {noformat} - app_1.driver.jvm.somevalue - app_1.driver.jvm.somevalue - ... - app_2.driver.jvm.somevalue - app_2.driver.jvm.somevalue - ... {noformat} In current Spark, application names sit at the top level, but we should be able to gather them under some top-level node. For example, consider Graphite. When we use Graphite, the application names are listed as top-level nodes. Graphite can also collect OS metrics, and those can be gathered under a single node, but the current Spark metrics cannot. So, with the current Spark, the tree structure of metrics shown in the Graphite web UI is like this. {noformat} - os - os.node1.somevalue - os.node2.somevalue - ... - app_1 - app_1.driver.jvm.somevalue - app_1.driver.jvm.somevalue - ... - app_2 - ... - app_3 - ... {noformat} We should be able to add a top-level name before the application name (the top-level name might be the cluster name, for instance). If we make the name configurable via *.conf, it would also be convenient when two different Spark clusters sink metrics to the same Graphite server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
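A hedged sketch of what cluster-level grouping could look like in conf/metrics.properties; GraphiteSink is a real sink class, but whether its prefix option fully covers this use case in the affected versions is an assumption:
{noformat}
# Illustrative configuration only, not from a patch.
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
# Hypothetical cluster-level prefix so applications nest under one node:
*.sink.graphite.prefix=cluster_1
{noformat}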
[jira] [Commented] (SPARK-4634) Enable metrics for each application to be gathered in one node
[ https://issues.apache.org/jira/browse/SPARK-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227141#comment-14227141 ] Apache Spark commented on SPARK-4634: - User 'tsudukim' has created a pull request for this issue: https://github.com/apache/spark/pull/3489 Enable metrics for each application to be gathered in one node -- Key: SPARK-4634 URL: https://issues.apache.org/jira/browse/SPARK-4634 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Metrics output is now like this: {noformat} - app_1.driver.jvm.somevalue - app_1.driver.jvm.somevalue - ... - app_2.driver.jvm.somevalue - app_2.driver.jvm.somevalue - ... {noformat} In current Spark, application names sit at the top level, but we should be able to gather them under some top-level node. For example, consider Graphite. When we use Graphite, the application names are listed as top-level nodes. Graphite can also collect OS metrics, and those can be gathered under a single node, but the current Spark metrics cannot. So, with the current Spark, the tree structure of metrics shown in the Graphite web UI is like this. {noformat} - os - os.node1.somevalue - os.node2.somevalue - ... - app_1 - app_1.driver.jvm.somevalue - app_1.driver.jvm.somevalue - ... - app_2 - ... - app_3 - ... {noformat} We should be able to add a top-level name before the application name (the top-level name might be the cluster name, for instance). If we make the name configurable via *.conf, it would also be convenient when two different Spark clusters sink metrics to the same Graphite server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3704) Short (TINYINT) incorrectly handled in thrift JDBC/ODBC server
[ https://issues.apache.org/jira/browse/SPARK-3704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3704: Summary: Short (TINYINT) incorrectly handled in thrift JDBC/ODBC server (was: the types not match adding value form spark row to hive row in SparkSQLOperationManager) Short (TINYINT) incorrectly handled in thrift JDBC/ODBC server -- Key: SPARK-3704 URL: https://issues.apache.org/jira/browse/SPARK-3704 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.1.1, 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227266#comment-14227266 ] Jeremy Freeman commented on SPARK-3995: --- Agree with [~mengxr], that's a better strategy. [PYSPARK] PySpark's sample methods do not work with NumPy 1.9 - Key: SPARK-3995 URL: https://issues.apache.org/jira/browse/SPARK-3995 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Jeremy Freeman Assignee: Jeremy Freeman Priority: Critical Fix For: 1.2.0 There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code:python} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, line 127, in dump_stream for obj in iterator: File /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, line 185, in _batched for item in iterator: File /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, line 116, in func if self.getUniformSample(split) <= self._fraction: File /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, line 58, in getUniformSample self.initRandomGenerator(split) File /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File mtrand.pyx, line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397) File mtrand.pyx, line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use: {code:python} self._seed = seed if seed is not None else random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which effectively breaks sampling operations in PySpark (unless the seed is set manually). I am putting a PR together now (the fix is very simple!). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227280#comment-14227280 ] Lianhui Wang commented on SPARK-4630: - I am interested in this, and we can learn some methods from http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/. There is also a difficult situation. Example: s0 and s1's parent is s2; s2 and s3's parent is s4. Because s3's map output depends on s4's partitioning, s4's partition count cannot be determined when s3 is about to run, because s2 is not finished. [~kostas] can you share some ideas about this situation? Thanks. Dynamically determine optimal number of partitions -- Key: SPARK-4630 URL: https://issues.apache.org/jira/browse/SPARK-4630 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions and the number of tasks - larger partitions, fewer tasks. For better performance, Spark has a sweet spot for how large the partitions executed by each task should be. If partitions are too small, then the user pays a disproportionate cost in scheduling overhead. If the partitions are too large, then task execution slows down due to gc pressure and spilling to disk. To increase performance of jobs, users often hand-optimize the number (size) of partitions that the next stage gets. Factors that come into play are: incoming partition sizes from the previous stage, the number of available executors, and the available memory per executor (taking into account spark.shuffle.memoryFraction). Spark has access to this data and so should be able to automatically do the partition sizing for the user. This feature can be turned off/on with a configuration option. To make this happen, we propose modifying the DAGScheduler to take into account partition sizes upon stage completion. Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number of tasks to create. Since this change requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227280#comment-14227280 ] Lianhui Wang edited comment on SPARK-4630 at 11/27/14 5:34 AM: --- I am interested in this, and we can learn some methods from http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/. There is also a difficult situation. Example: s0 and s1's parent is s2; s2 and s3's parent is s4. Because s3's map output depends on s4's partitioning, s4's partition count cannot be determined when s3 is about to run, because s2 is not finished. [~kostas] can you share some ideas about this situation? Thanks. was (Author: lianhuiwang): i am interesting in this. and we can learn some methon from http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/. and there is a difficult situation.example: s0 and s1 's parent is s2. s2 and s3 's parent is s4. because s3's mapoutput is dependent on s4's partition, when s3 will run s4's partition cannot be examined because s2 is not finished. [~kostas] can you share some ideas about this situation? thanks. Dynamically determine optimal number of partitions -- Key: SPARK-4630 URL: https://issues.apache.org/jira/browse/SPARK-4630 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions and the number of tasks - larger partitions, fewer tasks. For better performance, Spark has a sweet spot for how large the partitions executed by each task should be. If partitions are too small, then the user pays a disproportionate cost in scheduling overhead. If the partitions are too large, then task execution slows down due to gc pressure and spilling to disk. To increase performance of jobs, users often hand-optimize the number (size) of partitions that the next stage gets. Factors that come into play are: incoming partition sizes from the previous stage, the number of available executors, and the available memory per executor (taking into account spark.shuffle.memoryFraction). Spark has access to this data and so should be able to automatically do the partition sizing for the user. This feature can be turned off/on with a configuration option. To make this happen, we propose modifying the DAGScheduler to take into account partition sizes upon stage completion. Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number of tasks to create. Since this change requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
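As a concrete illustration of the sizing idea under discussion, a minimal Scala sketch; the helper, its parameters, and the 64 MB target are assumptions for illustration, not from a design doc:
{code}
// Choose the number of tasks for the next stage from the total size of
// the previous stage's map output.
def targetNumPartitions(
    totalShuffleBytes: Long,
    targetPartitionBytes: Long = 64L * 1024 * 1024): Int =
  // At least one partition; round up so no partition exceeds the target.
  math.max(1, math.ceil(totalShuffleBytes.toDouble / targetPartitionBytes).toInt)

// e.g. 10 GB of map output with a 64 MB target gives 160 tasks.
{code}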
[jira] [Created] (SPARK-4635) Delete the val that is never used in HashOuterJoin.
DoingDone9 created SPARK-4635: - Summary: Delete the val that is never used in HashOuterJoin. Key: SPARK-4635 URL: https://issues.apache.org/jira/browse/SPARK-4635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: DoingDone9 Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4635) Delete the val that is never used in HashOuterJoin.
[ https://issues.apache.org/jira/browse/SPARK-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-4635: -- Description: The val boundCondition is created in execute(), but it is never used in execute(); Delete the val that is never used in HashOuterJoin. Key: SPARK-4635 URL: https://issues.apache.org/jira/browse/SPARK-4635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: DoingDone9 Priority: Minor The val boundCondition is created in execute(), but it is never used in execute(); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227309#comment-14227309 ] Prabeesh K commented on SPARK-4631: --- [~tdas] Please update to new [resolver|https://repo.eclipse.org/content/repositories/paho/] from the [older resolver | https://repo.eclipse.org/content/repositories/paho-releases/] Also update the external/mqtt/pom.xml with the following: <groupId>org.eclipse.paho</groupId> <artifactId>mqtt-client</artifactId> <version>0.4.1-SNAPSHOT</version> Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4635) Delete the val that is never used in execute() of HashOuterJoin.
[ https://issues.apache.org/jira/browse/SPARK-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-4635: -- Summary: Delete the val that is never used in execute() of HashOuterJoin. (was: Delete the val that is never used in HashOuterJoin.) Delete the val that is never used in execute() of HashOuterJoin. -- Key: SPARK-4635 URL: https://issues.apache.org/jira/browse/SPARK-4635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: DoingDone9 Priority: Minor The val boundCondition is created in execute(), but it is never used in execute(); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4635) Delete the val that is never used in execute() of HashOuterJoin.
[ https://issues.apache.org/jira/browse/SPARK-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-4635: -- Description: The val boundCondition is created in execute(), but it is never used in execute(). (was: The val boundCondition is created in execute(), but it is never used in execute();) Delete the val that is never used in execute() of HashOuterJoin. -- Key: SPARK-4635 URL: https://issues.apache.org/jira/browse/SPARK-4635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: DoingDone9 Priority: Minor The val boundCondition is created in execute(), but it is never used in execute(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4170) Closure problems when running Scala app that extends App
[ https://issues.apache.org/jira/browse/SPARK-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227319#comment-14227319 ] Brennon York commented on SPARK-4170: - [~srowen] this bug seems to be an issue with the way Scala has been defined and the differences between what happens at runtime versus compile time with respect to the way {code}App{code} leverages the [{code}delayedInit{code} function|http://www.scala-lang.org/api/2.11.1/index.html#scala.App]. I tried to replicate the issue on my local machine under both compile time and runtime with only the latter producing the issue (as expected through the Scala documentation). The former was tested by creating a simple application, compiled with sbt, and executed while the latter was setup within the {code}spark-shell{code} REPL. I'm wondering if we can't close this issue and just provide a bit of documentation somewhere to reference that, when building even simple Spark apps, extending the {code}App{code} interface will result in delayed initialization and, likely, set null values within that closure. Thoughts? Closure problems when running Scala app that extends App -- Key: SPARK-4170 URL: https://issues.apache.org/jira/browse/SPARK-4170 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sean Owen Priority: Minor Michael Albert noted this problem on the mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html): {code} object DemoBug extends App { val conf = new SparkConf() val sc = new SparkContext(conf) val rdd = sc.parallelize(List("A", "B", "C", "D")) val str1 = "A" val rslt1 = rdd.filter(x => { x != "A" }).count val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2) } {code} This produces the output: {code} DemoBug: rslt1 = 3 rslt2 = 0 {code} If instead there is a proper main(), it works as expected. I also this week noticed that in a program which extends App, some values were inexplicably null in a closure. When changing to use main(), it was fine. I assume there is a problem with variables not being added to the closure when main() doesn't appear in the standard way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4170) Closure problems when running Scala app that extends App
[ https://issues.apache.org/jira/browse/SPARK-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227319#comment-14227319 ] Brennon York edited comment on SPARK-4170 at 11/27/14 6:48 AM: --- [~srowen] this bug seems to be an issue with the way Scala has been defined and the differences between what happens at runtime versus compile time with respect to the way App leverages the [delayedInit function|http://www.scala-lang.org/api/2.11.1/index.html#scala.App]. I tried to replicate the issue on my local machine under both compile time and runtime with only the latter producing the issue (as expected through the Scala documentation). The former was tested by creating a simple application, compiled with sbt, and executed while the latter was setup within the spark-shell REPL. I'm wondering if we can't close this issue and just provide a bit of documentation somewhere to reference that, when building even simple Spark apps, extending the App interface will result in delayed initialization and, likely, set null values within that closure. Thoughts? was (Author: boyork): [~srowen] this bug seems to be an issue with the way Scala has been defined and the differences between what happens at runtime versus compile time with respect to the way {code}App{code} leverages the [{code}delayedInit{code} function|http://www.scala-lang.org/api/2.11.1/index.html#scala.App]. I tried to replicate the issue on my local machine under both compile time and runtime with only the latter producing the issue (as expected through the Scala documentation). The former was tested by creating a simple application, compiled with sbt, and executed while the latter was setup within the {code}spark-shell{code} REPL. I'm wondering if we can't close this issue and just provide a bit of documentation somewhere to reference that, when building even simple Spark apps, extending the {code}App{code} interface will result in delayed initialization and, likely, set null values within that closure. Thoughts? Closure problems when running Scala app that extends App -- Key: SPARK-4170 URL: https://issues.apache.org/jira/browse/SPARK-4170 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sean Owen Priority: Minor Michael Albert noted this problem on the mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html): {code} object DemoBug extends App { val conf = new SparkConf() val sc = new SparkContext(conf) val rdd = sc.parallelize(List("A", "B", "C", "D")) val str1 = "A" val rslt1 = rdd.filter(x => { x != "A" }).count val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2) } {code} This produces the output: {code} DemoBug: rslt1 = 3 rslt2 = 0 {code} If instead there is a proper main(), it works as expected. I also this week noticed that in a program which extends App, some values were inexplicably null in a closure. When changing to use main(), it was fine. I assume there is a problem with variables not being added to the closure when main() doesn't appear in the standard way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
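For reference, the working shape the description alludes to (a proper main()), minimally adapted in Scala from the DemoBug snippet; the object name is illustrative:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object DemoFixed {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List("A", "B", "C", "D"))
    val str1 = "A"
    // With main(), str1 is captured correctly, so this counts 3 as expected.
    val rslt2 = rdd.filter(x => str1 != null && x != "A").count()
    println("DemoFixed: rslt2 = " + rslt2)
  }
}
{code}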
[jira] [Commented] (SPARK-4635) Delete the val that is never used in execute() of HashOuterJoin.
[ https://issues.apache.org/jira/browse/SPARK-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227321#comment-14227321 ] Apache Spark commented on SPARK-4635: - User 'DoingDone9' has created a pull request for this issue: https://github.com/apache/spark/pull/3491 Delete the val that is never used in execute() of HashOuterJoin. -- Key: SPARK-4635 URL: https://issues.apache.org/jira/browse/SPARK-4635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: DoingDone9 Priority: Minor The val boundCondition is created in execute(), but it is never used in execute(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227326#comment-14227326 ] Apache Spark commented on SPARK-4632: - User 'prabeesh' has created a pull request for this issue: https://github.com/apache/spark/pull/3494 Upgrade MQTT dependency to use latest mqtt-client - Key: SPARK-4632 URL: https://issues.apache.org/jira/browse/SPARK-4632 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2, 1.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking Spark build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227325#comment-14227325 ] Apache Spark commented on SPARK-4632: - User 'prabeesh' has created a pull request for this issue: https://github.com/apache/spark/pull/3492 Upgrade MQTT dependency to use latest mqtt-client - Key: SPARK-4632 URL: https://issues.apache.org/jira/browse/SPARK-4632 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2, 1.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking Spark build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227324#comment-14227324 ] Apache Spark commented on SPARK-4632: - User 'prabeesh' has created a pull request for this issue: https://github.com/apache/spark/pull/3493 Upgrade MQTT dependency to use latest mqtt-client - Key: SPARK-4632 URL: https://issues.apache.org/jira/browse/SPARK-4632 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2, 1.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking Spark build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227309#comment-14227309 ] Prabeesh K edited comment on SPARK-4631 at 11/27/14 6:54 AM: - [~tdas] Please update to new [resolver|https://repo.eclipse.org/content/repositories/paho/] from the [older resolver | https://repo.eclipse.org/content/repositories/paho-releases/] Also update the external/mqtt/pom.xml with the following: <groupId>org.eclipse.paho</groupId> <artifactId>mqtt-client</artifactId> <version>0.4.1-SNAPSHOT</version> Refer to SPARK-4632 was (Author: prabeeshk): [~tdas] Please update to new [resolver|https://repo.eclipse.org/content/repositories/paho/] from the [older resolver | https://repo.eclipse.org/content/repositories/paho-releases/] Also update the external/mqtt/pom.xml with the following: <groupId>org.eclipse.paho</groupId> <artifactId>mqtt-client</artifactId> <version>0.4.1-SNAPSHOT</version> Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227337#comment-14227337 ] Reynold Xin commented on SPARK-4631: How popular is MQTT? Do we really need this support in Spark natively? Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227337#comment-14227337 ] Reynold Xin edited comment on SPARK-4631 at 11/27/14 7:11 AM: -- How popular is MQTT? Do we really need this support in Spark natively? I think we should reconsider the dependency on it if it is not widely popular and it is causing us trouble supporting it and causing our users trouble due to build issues. was (Author: rxin): How popular is MQTT? Do we really need this support in Spark natively? Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4599) add hive profile to root pom
[ https://issues.apache.org/jira/browse/SPARK-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang closed SPARK-4599. -- Resolution: Won't Fix add hive profile to root pom Key: SPARK-4599 URL: https://issues.apache.org/jira/browse/SPARK-4599 Project: Spark Issue Type: Improvement Components: Build Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4628) Put all external projects behind a build flag
[ https://issues.apache.org/jira/browse/SPARK-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227339#comment-14227339 ] Reynold Xin commented on SPARK-4628: Shall we also not depend on artifacts that are not in maven central? It should be a high bar for Spark to depend on some library, and if they can't make it to maven central, it doesn't inspire a lot of confidence. Of course, this can only be a general rule and there can also be exceptions. Put all external projects behind a build flag - Key: SPARK-4628 URL: https://issues.apache.org/jira/browse/SPARK-4628 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Priority: Blocker This is something we talked about doing for convenience, but I'm escalating this based on realizing today that some of our external projects depend on code that is not in maven central. I.e. if one of these dependencies is taken down (as happened recently with mqtt), all Spark builds will fail. The proposal here is simple: have a profile -Pexternal-projects that enables these. This can follow the exact pattern of -Pkinesis-asl which was disabled by default due to a license issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
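A hedged sketch of the proposed profile in the root pom.xml, following the kinesis-asl pattern; the module list is illustrative and would need to match the actual external/ directories:
{code:xml}
<!-- Illustrative profile only; not the actual patch. -->
<profile>
  <id>external-projects</id>
  <modules>
    <module>external/mqtt</module>
    <module>external/twitter</module>
    <module>external/zeromq</module>
  </modules>
</profile>
{code}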
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227352#comment-14227352 ] Yu Ishikawa commented on SPARK-2429: Hi [~yangjunpro], {quote} 1. In your implementation, in each divisive step there will be a copy operation to distribute the data nodes in the parent cluster tree to the split children cluster trees; when the document size is large, I think this copy cost is non-negligible, right? {quote} Exactly. The cached memory is twice the size of the original data. For example, if the data size is 10 GB, the Spark cluster always holds 20 GB of cached RDDs throughout the algorithm. The reason I cache the data nodes at each dividing step is elapsed time. That is, this algorithm is very slow without caching the data nodes. {quote} 2. In your test code, the cluster size is not quite large (only about 100). Have you ever tested it with a big cluster size and a big document corpus? e.g., 1 clusters with 200 documents. What is the performance behavior facing this kind of use case? {quote} The test code deals with small data, as you said. I think the data size in unit tests should not be large, in order to reduce the test time. Of course, I am willing to adapt this algorithm to whatever large input data and cluster counts we want. Although I have never checked the performance of this implementation with large cluster counts, such as 1, the elapsed time could be long. I will check the performance under that condition. Or, if possible, could you check the performance? Thanks. Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean, such as negative dot or cosine, is necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
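For readers following the copy-cost discussion above, a rough Scala sketch of top-down divisive clustering via repeated k=2 KMeans; the helper and its parameters are illustrative, not the implementation under review, and the cache() calls mark where the doubled memory footprint comes from:
{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical sketch: repeatedly split the largest cluster in two.
def bisect(data: RDD[Vector], targetClusters: Int): Seq[RDD[Vector]] = {
  var clusters: Seq[RDD[Vector]] = Seq(data)
  while (clusters.size < targetClusters) {
    val largest = clusters.maxBy(_.count())    // action per iteration
    val model = KMeans.train(largest, 2, 20)   // k = 2, 20 iterations
    // Filtering into two child RDDs and caching them roughly doubles
    // the cached footprint relative to the parent's data.
    val split = Seq(0, 1).map { i =>
      largest.filter(v => model.predict(v) == i).cache()
    }
    clusters = clusters.filterNot(_ eq largest) ++ split
  }
  clusters
}
{code}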
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227356#comment-14227356 ] Sean Owen commented on SPARK-4631: -- (FWIW I had never heard of MQTT before Spark. This does feel like something below the bar for inclusion in Spark and could be a separate project. Consider deprecating it?) Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org