[jira] [Resolved] (SPARK-8793) error/warning with pyspark WholeTextFiles.first
[ https://issues.apache.org/jira/browse/SPARK-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Diana Carroll resolved SPARK-8793.
----------------------------------
    Resolution: Not A Problem

This is no longer occurring.

> error/warning with pyspark WholeTextFiles.first
> ------------------------------------------------
>
>                 Key: SPARK-8793
>                 URL: https://issues.apache.org/jira/browse/SPARK-8793
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.3.0
>            Reporter: Diana Carroll
>            Priority: Minor
>         Attachments: wholefilesbug.txt
>
> In Spark 1.3.0 python, calling first() on sc.wholeTextFiles is not working correctly in pyspark. It works fine in Scala.
> I created a directory with two tiny, simple text files.
> this works:
> {code}sc.wholeTextFiles("testdata").collect(){code}
> this doesn't:
> {code}sc.wholeTextFiles("testdata").first(){code}
> The main error message is:
> {code}
> 15/07/02 08:01:38 ERROR executor.Executor: Exception in task 0.0 in stage 12.0 (TID 12)
> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
>     process()
>   File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream
>     vs = list(itertools.islice(iterator, batch))
>   File "/usr/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft
>     while taken < left:
> ImportError: No module named iter
> {code}
> I will attach the full stack trace to the JIRA.
> I'm using CentOS 6.6 with CDH 5.4.3 (Spark 1.3.0). Tested in both Python 2.6 and 2.7, same results.
[jira] [Created] (SPARK-9821) pyspark reduceByKey should allow a custom partitioner
Diana Carroll created SPARK-9821:
------------------------------------
             Summary: pyspark reduceByKey should allow a custom partitioner
                 Key: SPARK-9821
                 URL: https://issues.apache.org/jira/browse/SPARK-9821
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.3.0
            Reporter: Diana Carroll

In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combineByKey), but as far as I can tell from the PySpark API, there's no way to do the same in Python.

Here's an example of my code in Scala:
{code}weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(), _+_){code}

But I can't figure out how to do the same in Python. The closest I can get is to call partitionBy before reduceByKey, like so:
{code}weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3, hash_filetype).reduceByKey(lambda v1, v2: v1 + v2).collect(){code}

But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.
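For context, a runnable sketch of the PySpark workaround the issue describes. The reporter's getFileType and hash_filetype helpers are not defined in the issue, so the stand-ins below (get_file_type, hash_filetype) and the "weblogs" input path are illustrative assumptions only.

{code}
# Sketch of the PySpark workaround described in the issue: partition explicitly,
# then reduce. Helper functions and input path are illustrative assumptions.
from pyspark import SparkContext

def get_file_type(line):
    # Hypothetical stand-in for the reporter's getFileType(): derive a "file type"
    # key from a web log line, e.g. the extension of the requested URL.
    parts = line.split()
    url = parts[6] if len(parts) > 6 else ""
    return url.rsplit(".", 1)[-1] if "." in url else "other"

def hash_filetype(key):
    # Hypothetical partitioning function: map each file-type key to a partition index.
    return hash(key)

if __name__ == "__main__":
    sc = SparkContext(appName="ReduceByKeyPartitionerWorkaround")
    weblogs = sc.textFile("weblogs")  # input path assumed

    # The workaround from the issue: partitionBy() first, then reduceByKey().
    # As the reporter notes, this shuffles the data twice; a reduceByKey that
    # accepted a partitioner (as in Scala) would need only one shuffle.
    counts = (weblogs
              .map(lambda s: (get_file_type(s), 1))
              .partitionBy(3, hash_filetype)
              .reduceByKey(lambda v1, v2: v1 + v2)
              .collect())
    print(counts)
{code}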
[jira] [Created] (SPARK-8793) error/warning with pyspark WholeTextFiles.first
Diana Carroll created SPARK-8793:
------------------------------------
             Summary: error/warning with pyspark WholeTextFiles.first
                 Key: SPARK-8793
                 URL: https://issues.apache.org/jira/browse/SPARK-8793
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.3.0
            Reporter: Diana Carroll
            Priority: Minor
         Attachments: wholefilesbug.txt

In Spark 1.3.0 python, calling first() on sc.wholeTextFiles is not working correctly in pyspark. It works fine in Scala.
I created a directory with two tiny, simple text files.
this works:
{code}sc.wholeTextFiles("testdata").collect(){code}
this doesn't:
{code}sc.wholeTextFiles("testdata").first(){code}
The main error message is:
{code}
15/07/02 08:01:38 ERROR executor.Executor: Exception in task 0.0 in stage 12.0 (TID 12)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft
    while taken < left:
ImportError: No module named iter
{code}
I will attach the full stack trace to the JIRA.
I'm using CentOS 6.6 with CDH 5.4.3 (Spark 1.3.0). Tested in both Python 2.6 and 2.7, same results.
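A self-contained reproduction sketch of the steps described above. The directory name, file names, and file contents are illustrative assumptions; the issue only says "two tiny, simple text files".

{code}
# Reproduction sketch for the report above. The testdata directory and its
# contents are illustrative assumptions.
import os
from pyspark import SparkContext

if not os.path.isdir("testdata"):
    os.makedirs("testdata")
for name, text in [("a.txt", "hello"), ("b.txt", "world")]:
    with open(os.path.join("testdata", name), "w") as f:
        f.write(text)

sc = SparkContext(appName="WholeTextFilesFirstRepro")

# Reported to work: materialize every (filename, contents) pair.
print(sc.wholeTextFiles("testdata").collect())

# Reported to fail on Spark 1.3.0 with "ImportError: No module named iter".
# first() is implemented via take(), which is the code path visible in the
# attached traceback (takeUpToNumLeft in rdd.py).
print(sc.wholeTextFiles("testdata").first())
{code}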
[jira] [Updated] (SPARK-8793) error/warning with pyspark WholeTextFiles.first
[ https://issues.apache.org/jira/browse/SPARK-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Diana Carroll updated SPARK-8793:
---------------------------------
    Attachment: wholefilesbug.txt

error/warning with pyspark WholeTextFiles.first
------------------------------------------------

                Key: SPARK-8793
                URL: https://issues.apache.org/jira/browse/SPARK-8793
            Project: Spark
         Issue Type: Bug
         Components: PySpark
   Affects Versions: 1.3.0
           Reporter: Diana Carroll
           Priority: Minor
        Attachments: wholefilesbug.txt

In Spark 1.3.0 python, calling first() on sc.wholeTextFiles is not working correctly in pyspark. It works fine in Scala.
I created a directory with two tiny, simple text files.
this works:
{code}sc.wholeTextFiles("testdata").collect(){code}
this doesn't:
{code}sc.wholeTextFiles("testdata").first(){code}
The main error message is:
{code}
15/07/02 08:01:38 ERROR executor.Executor: Exception in task 0.0 in stage 12.0 (TID 12)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft
    while taken < left:
ImportError: No module named iter
{code}
I will attach the full stack trace to the JIRA.
I'm using CentOS 6.6 with CDH 5.4.3 (Spark 1.3.0). Tested in both Python 2.6 and 2.7, same results.
[jira] [Closed] (SPARK-8795) pySpark wholeTextFiles error when mapping string
[ https://issues.apache.org/jira/browse/SPARK-8795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Diana Carroll closed SPARK-8795.
--------------------------------
    Resolution: Not A Problem

User configuration problem: the error resulted from mismatched versions of Python on the driver and executors.

pySpark wholeTextFiles error when mapping string
-------------------------------------------------

                Key: SPARK-8795
                URL: https://issues.apache.org/jira/browse/SPARK-8795
            Project: Spark
         Issue Type: Bug
         Components: PySpark
        Environment: CentOS 6.6, Python 2.7, CDH 5.4.1
           Reporter: Diana Carroll
        Attachments: wholefilesbug.txt

I created a test directory with two tiny text files.
This call works:
{code}sc.wholeTextFiles("testdata").map(lambda (fname,x): len(x)).collect(){code}
This call does not:
{code}sc.wholeTextFiles("testdata").map(lambda (fname,x): x.islower()).collect(){code}
In fact, any attempt to call any string method on x, or pass x to any function requiring a string, fails the same way.
The main error is:
{code}
  File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<ipython-input-107-5192d18d0e4c>", line 1, in <lambda>
TypeError: 'bool' object is not callable
{code}
Will attach full log.
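Since the issue was closed as a driver/executor Python mismatch, here is a sketch of one way to pin the executors to a specific interpreter. The interpreter path is an assumption and must exist on every node; PYSPARK_PYTHON is also commonly set cluster-wide in spark-env.sh instead of in the driver script.

{code}
# Sketch: pin the executors' Python so it matches the driver's interpreter,
# the kind of mismatch this issue was closed as. The path is an assumption.
import os
from pyspark import SparkConf, SparkContext

# Must be set before the SparkContext is created so it is propagated to the workers.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"

conf = SparkConf().setAppName("WholeTextFilesStringMethods")
sc = SparkContext(conf=conf)

# With matching interpreters, mapping string methods over the file contents works.
flags = sc.wholeTextFiles("testdata").map(lambda kv: kv[1].islower()).collect()
print(flags)
{code}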
[jira] [Updated] (SPARK-8795) pySpark wholeTextFiles error when mapping string
[ https://issues.apache.org/jira/browse/SPARK-8795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Diana Carroll updated SPARK-8795:
---------------------------------
    Attachment: wholefilesbug.txt

pySpark wholeTextFiles error when mapping string
-------------------------------------------------

                Key: SPARK-8795
                URL: https://issues.apache.org/jira/browse/SPARK-8795
            Project: Spark
         Issue Type: Bug
         Components: PySpark
        Environment: CentOS 6.6, Python 2.7, CDH 5.4.1
           Reporter: Diana Carroll
        Attachments: wholefilesbug.txt

I created a test directory with two tiny text files.
This call works:
{code}sc.wholeTextFiles("testdata").map(lambda (fname,x): len(x)).collect(){code}
This call does not:
{code}sc.wholeTextFiles("testdata").map(lambda (fname,x): x.islower()).collect(){code}
In fact, any attempt to call any string method on x, or pass x to any function requiring a string, fails the same way.
The main error is:
{code}
  File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<ipython-input-107-5192d18d0e4c>", line 1, in <lambda>
TypeError: 'bool' object is not callable
{code}
Will attach full log.
[jira] [Created] (SPARK-8795) pySpark wholeTextFiles error when mapping string
Diana Carroll created SPARK-8795:
------------------------------------
             Summary: pySpark wholeTextFiles error when mapping string
                 Key: SPARK-8795
                 URL: https://issues.apache.org/jira/browse/SPARK-8795
             Project: Spark
          Issue Type: Bug
          Components: PySpark
         Environment: CentOS 6.6, Python 2.7, CDH 5.4.1
            Reporter: Diana Carroll

I created a test directory with two tiny text files.
This call works:
{code}sc.wholeTextFiles("testdata").map(lambda (fname,x): len(x)).collect(){code}
This call does not:
{code}sc.wholeTextFiles("testdata").map(lambda (fname,x): x.islower()).collect(){code}
In fact, any attempt to call any string method on x, or pass x to any function requiring a string, fails the same way.
The main error is:
{code}
  File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<ipython-input-107-5192d18d0e4c>", line 1, in <lambda>
TypeError: 'bool' object is not callable
{code}
Will attach full log.
[jira] [Created] (SPARK-7607) Spark SQL prog guide code error
Diana Carroll created SPARK-7607:
------------------------------------
             Summary: Spark SQL prog guide code error
                 Key: SPARK-7607
                 URL: https://issues.apache.org/jira/browse/SPARK-7607
             Project: Spark
          Issue Type: Bug
          Components: Documentation
    Affects Versions: 1.3.1
            Reporter: Diana Carroll
            Priority: Minor

In the DataFrame Operations section, there's a bug in the filter example for all three languages. The comments say we want to filter for people over age 21, but the code is
{code}
df.filter(df("name") > 21).show()
{code}
It's filtering for people whose NAME is over 21.
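For reference, a sketch of the intended call in PySpark, assuming a DataFrame with "name" and "age" columns like the guide's people example; the sample rows below are illustrative.

{code}
# Sketch of the intended example: filter on the age column, not the name column.
# The DataFrame construction and sample rows are illustrative assumptions.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="FilterExampleFix")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(
    [("Michael", 29), ("Andy", 30), ("Justin", 19)], ["name", "age"])

# Buggy form from the guide: compares a string column against 21.
# df.filter(df["name"] > 21).show()

# Intended form: people over age 21.
df.filter(df["age"] > 21).show()
{code}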
[jira] [Closed] (SPARK-7607) Spark SQL prog guide code error
[ https://issues.apache.org/jira/browse/SPARK-7607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Diana Carroll closed SPARK-7607.
--------------------------------
    Resolution: Duplicate

Woops, duplicate of SPARK-6383 (already fixed).

Spark SQL prog guide code error
-------------------------------

                Key: SPARK-7607
                URL: https://issues.apache.org/jira/browse/SPARK-7607
            Project: Spark
         Issue Type: Bug
         Components: Documentation
   Affects Versions: 1.3.1
           Reporter: Diana Carroll
           Priority: Minor

In the DataFrame Operations section, there's a bug in the filter example for all three languages. The comments say we want to filter for people over age 21, but the code is
{code}
df.filter(df("name") > 21).show()
{code}
It's filtering for people whose NAME is over 21.
[jira] [Created] (SPARK-2527) incorrect persistence level shown in Spark UI after repersisting
Diana Carroll created SPARK-2527:
------------------------------------
             Summary: incorrect persistence level shown in Spark UI after repersisting
                 Key: SPARK-2527
                 URL: https://issues.apache.org/jira/browse/SPARK-2527
             Project: Spark
          Issue Type: Bug
          Components: Web UI
    Affects Versions: 1.0.0
            Reporter: Diana Carroll

If I persist an RDD at one level, unpersist it, then repersist it at another level, the UI will continue to show the RDD at the first level...but correctly show individual partitions at the second level.
{code}
import org.apache.spark.api.java.StorageLevels._

val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}
After the first call to persist and count, the Spark App web UI shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0    Disk Serialized 1x Replicated

After the second call, it shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0    Memory Deserialized 1x Replicated
[jira] [Updated] (SPARK-2527) incorrect persistence level shown in Spark UI after repersisting
[ https://issues.apache.org/jira/browse/SPARK-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Diana Carroll updated SPARK-2527:
---------------------------------
    Description:

If I persist an RDD at one level, unpersist it, then repersist it at another level, the UI will continue to show the RDD at the first level...but correctly show individual partitions at the second level.
{code}
import org.apache.spark.api.java.StorageLevels
import org.apache.spark.api.java.StorageLevels._

val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}
After the first call to persist and count, the Spark App web UI shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0    Disk Serialized 1x Replicated

After the second call, it shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0    Memory Deserialized 1x Replicated

  was:

If I persist an RDD at one level, unpersist it, then repersist it at another level, the UI will continue to show the RDD at the first level...but correctly show individual partitions at the second level.
{code}
import org.apache.spark.api.java.StorageLevels._

val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}
After the first call to persist and count, the Spark App web UI shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0    Disk Serialized 1x Replicated

After the second call, it shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0    Memory Deserialized 1x Replicated

incorrect persistence level shown in Spark UI after repersisting
-----------------------------------------------------------------

                Key: SPARK-2527
                URL: https://issues.apache.org/jira/browse/SPARK-2527
            Project: Spark
         Issue Type: Bug
         Components: Web UI
   Affects Versions: 1.0.0
           Reporter: Diana Carroll
        Attachments: persistbug1.png, persistbug2.png

If I persist an RDD at one level, unpersist it, then repersist it at another level, the UI will continue to show the RDD at the first level...but correctly show individual partitions at the second level. (See the updated description above for the full reproduction and observed UI output.)
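As a side note, the storage level can also be checked from code rather than the Web UI, which is useful for confirming that only the UI display is stale. A sketch in PySpark (the issue's own reproduction is in Scala); the getStorageLevel() call is the standard RDD API.

{code}
# Sketch: verify an RDD's effective storage level programmatically after
# unpersisting and repersisting, independent of what the Web UI displays.
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="RepersistLevelCheck")
test1 = sc.parallelize([1, 2, 3])

test1.persist(StorageLevel.DISK_ONLY)
test1.count()
print(test1.getStorageLevel())  # disk-only level after the first persist

test1.unpersist()
test1.persist(StorageLevel.MEMORY_ONLY)
test1.count()
print(test1.getStorageLevel())  # should reflect the new level even if the UI does not
{code}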
[jira] [Created] (SPARK-2334) Attribute Error calling PipelinedRDD.id() in pyspark
Diana Carroll created SPARK-2334:
------------------------------------
             Summary: Attribute Error calling PipelinedRDD.id() in pyspark
                 Key: SPARK-2334
                 URL: https://issues.apache.org/jira/browse/SPARK-2334
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.0.0
            Reporter: Diana Carroll

Calling the id() function of a PipelinedRDD causes an error in PySpark. (Works fine in Scala.)
The second id() call here fails, the first works:
{code}
r1 = sc.parallelize([1,2,3])
r1.id()
r2 = r1.map(lambda i: i+1)
r2.id()
{code}
Error:
{code}
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-a0cf66fcf645> in <module>()
----> 1 r2.id()

/usr/lib/spark/python/pyspark/rdd.py in id(self)
    180         A unique ID for this RDD (within its SparkContext).
    181
--> 182         return self._id
    183
    184     def __repr__(self):

AttributeError: 'PipelinedRDD' object has no attribute '_id'
{code}
[jira] [Reopened] (SPARK-2003) SparkContext(SparkConf) doesn't work in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Diana Carroll reopened SPARK-2003:
----------------------------------

SparkContext(SparkConf) doesn't work in pyspark
-----------------------------------------------

                Key: SPARK-2003
                URL: https://issues.apache.org/jira/browse/SPARK-2003
            Project: Spark
         Issue Type: Bug
         Components: Documentation, PySpark
   Affects Versions: 1.0.0
           Reporter: Diana Carroll
            Fix For: 1.0.1, 1.1.0

Using SparkConf with SparkContext as described in the Programming Guide does NOT work in Python:
conf = SparkConf.setAppName("blah")
sc = SparkContext(conf)
When I tried I got
AttributeError: 'SparkConf' object has no attribute '_get_object_id'
[This equivalent code in Scala works fine:
val conf = new SparkConf().setAppName("blah")
val sc = new SparkContext(conf)]
I think this is because there's no equivalent for the Scala constructor SparkContext(SparkConf).
Workaround: If I explicitly set the conf parameter in the python call, it does work:
sconf = SparkConf.setAppName("blah")
sc = SparkContext(conf=sconf)
[jira] [Commented] (SPARK-2003) SparkContext(SparkConf) doesn't work in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049084#comment-14049084 ]

Diana Carroll commented on SPARK-2003:
--------------------------------------

Actually you can't create a SparkContext in the repl...it's already created for you. Therefore this bug is moot in the repl, and only applies to a standalone pyspark program.

As far as I know, in Scala the officially recommended way to programmatically configure a SparkContext is new SparkContext(sparkConf). Therefore the same should be true for PySpark.

On Tue, Jul 1, 2014 at 12:40 PM, Matthew Farrellee (JIRA) j...@apache.org

SparkContext(SparkConf) doesn't work in pyspark
-----------------------------------------------

                Key: SPARK-2003
                URL: https://issues.apache.org/jira/browse/SPARK-2003
            Project: Spark
         Issue Type: Bug
         Components: Documentation, PySpark
   Affects Versions: 1.0.0
           Reporter: Diana Carroll
            Fix For: 1.0.1, 1.1.0

Using SparkConf with SparkContext as described in the Programming Guide does NOT work in Python:
conf = SparkConf.setAppName("blah")
sc = SparkContext(conf)
When I tried I got
AttributeError: 'SparkConf' object has no attribute '_get_object_id'
[This equivalent code in Scala works fine:
val conf = new SparkConf().setAppName("blah")
val sc = new SparkContext(conf)]
I think this is because there's no equivalent for the Scala constructor SparkContext(SparkConf).
Workaround: If I explicitly set the conf parameter in the python call, it does work:
sconf = SparkConf.setAppName("blah")
sc = SparkContext(conf=sconf)
[jira] [Commented] (SPARK-2003) SparkContext(SparkConf) doesn't work in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047710#comment-14047710 ]

Diana Carroll commented on SPARK-2003:
--------------------------------------

I do not agree this is a doc bug. This is legit code that should work. The python implementation of the SparkContext constructor should accept a SparkConf object just like the Scala implementation does.
{code}
from pyspark import SparkContext
from pyspark import SparkConf

if __name__ == "__main__":
    sconf = SparkConf().setAppName("My Spark App")
    sc = SparkContext(sconf)
    print "Hello world"
{code}
(setting master is not required in Spark 1.0+, as it can be set using spark-submit)

SparkContext(SparkConf) doesn't work in pyspark
-----------------------------------------------

                Key: SPARK-2003
                URL: https://issues.apache.org/jira/browse/SPARK-2003
            Project: Spark
         Issue Type: Bug
         Components: Documentation, PySpark
   Affects Versions: 1.0.0
           Reporter: Diana Carroll
            Fix For: 1.0.1, 1.1.0

Using SparkConf with SparkContext as described in the Programming Guide does NOT work in Python:
conf = SparkConf.setAppName("blah")
sc = SparkContext(conf)
When I tried I got
AttributeError: 'SparkConf' object has no attribute '_get_object_id'
[This equivalent code in Scala works fine:
val conf = new SparkConf().setAppName("blah")
val sc = new SparkContext(conf)]
I think this is because there's no equivalent for the Scala constructor SparkContext(SparkConf).
Workaround: If I explicitly set the conf parameter in the python call, it does work:
sconf = SparkConf.setAppName("blah")
sc = SparkContext(conf=sconf)
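For completeness, a sketch of the working form of the same program using the workaround the issue itself gives: pass the SparkConf via the conf= keyword rather than positionally.

{code}
# Sketch of the workaround noted in the issue: pass the SparkConf with conf=,
# which works even though the positional SparkContext(sconf) form reported above fails.
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    sconf = SparkConf().setAppName("My Spark App")
    sc = SparkContext(conf=sconf)
    print("Hello world")
{code}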
[jira] [Commented] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends
[ https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985880#comment-13985880 ]

Diana Carroll commented on SPARK-823:
-------------------------------------

Yes, please clarify the documentation, I just ran into this. The Configuration guide (http://spark.apache.org/docs/latest/configuration.html) says the default is 8.

In testing this on Standalone Spark, there actually is no default value for the variable:
sc.getConf.contains("spark.default.parallelism")
res1: Boolean = false

It looks like if the variable is not set, then the default behavior is decided in code, e.g. Partitioner.scala:
{code}
if (rdd.context.conf.contains("spark.default.parallelism")) {
  new HashPartitioner(rdd.context.defaultParallelism)
} else {
  new HashPartitioner(bySize.head.partitions.size)
}
{code}

spark.default.parallelism's default is inconsistent across scheduler backends
------------------------------------------------------------------------------

                Key: SPARK-823
                URL: https://issues.apache.org/jira/browse/SPARK-823
            Project: Spark
         Issue Type: Bug
         Components: Documentation, Spark Core
   Affects Versions: 0.8.0, 0.7.3
           Reporter: Josh Rosen
           Priority: Minor

The [0.7.3 configuration guide|http://spark-project.org/docs/latest/configuration.html] says that {{spark.default.parallelism}}'s default is 8, but the default is actually max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos scheduler, and {{threads}} for the local scheduler:
https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
Should this be clarified in the documentation? Should the Mesos scheduler backend's default be revised?
[jira] [Commented] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends
[ https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985934#comment-13985934 ]

Diana Carroll commented on SPARK-823:
-------------------------------------

Okay, this is definitely more than a documentation bug, because PySpark and Scala work differently if spark.default.parallelism isn't set by the user. I'm testing using wordcount.

Pyspark: reduceByKey will use the value of sc.defaultParallelism. That value is set to the number of threads when running locally. On my Spark Standalone cluster, which has a single node with a single core, the value is 2. If I set spark.default.parallelism, it will set sc.defaultParallelism to that value and use that.

Scala: reduceByKey will use the number of partitions in my file/map stage and ignore the value of sc.defaultParallelism. sc.defaultParallelism is set by the same logic as pyspark (number of threads for local, 2 for my microcluster), it is just ignored.

I'm not sure which approach is correct. Scala works as described here: http://spark.apache.org/docs/latest/tuning.html
{quote}
Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc), and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.
{quote}

spark.default.parallelism's default is inconsistent across scheduler backends
------------------------------------------------------------------------------

                Key: SPARK-823
                URL: https://issues.apache.org/jira/browse/SPARK-823
            Project: Spark
         Issue Type: Bug
         Components: Documentation, Spark Core
   Affects Versions: 0.8.0, 0.7.3
           Reporter: Josh Rosen
           Priority: Minor

The [0.7.3 configuration guide|http://spark-project.org/docs/latest/configuration.html] says that {{spark.default.parallelism}}'s default is 8, but the default is actually max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos scheduler, and {{threads}} for the local scheduler:
https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
Should this be clarified in the documentation? Should the Mesos scheduler backend's default be revised?
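A sketch illustrating the behavior described in the comment above: when spark.default.parallelism is set explicitly, PySpark's sc.defaultParallelism picks it up and reduceByKey follows it; when it is not set, the effective default depends on the scheduler backend. The input path and the chosen value of 6 are illustrative assumptions.

{code}
# Sketch: set spark.default.parallelism explicitly and observe how it flows
# through sc.defaultParallelism into a reduceByKey, per the comment above.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("DefaultParallelismCheck")
        .set("spark.default.parallelism", "6"))
sc = SparkContext(conf=conf)

print(sc.defaultParallelism)  # 6 when the property is set explicitly

counts = (sc.textFile("weblogs")  # input path assumed
            .map(lambda line: (line.split(" ")[0], 1))
            .reduceByKey(lambda a, b: a + b))
# Per the comment, PySpark's reduceByKey follows sc.defaultParallelism here,
# while Scala would reuse the number of partitions of the parent RDD.
print(counts.getNumPartitions())
{code}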
[jira] [Updated] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends
[ https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Diana Carroll updated SPARK-823:
--------------------------------
    Component/s: PySpark

spark.default.parallelism's default is inconsistent across scheduler backends
------------------------------------------------------------------------------

                Key: SPARK-823
                URL: https://issues.apache.org/jira/browse/SPARK-823
            Project: Spark
         Issue Type: Bug
         Components: Documentation, PySpark, Spark Core
   Affects Versions: 0.8.0, 0.7.3, 0.9.1
           Reporter: Josh Rosen
           Priority: Minor

The [0.7.3 configuration guide|http://spark-project.org/docs/latest/configuration.html] says that {{spark.default.parallelism}}'s default is 8, but the default is actually max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos scheduler, and {{threads}} for the local scheduler:
https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
Should this be clarified in the documentation? Should the Mesos scheduler backend's default be revised?
[jira] [Updated] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends
[ https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Diana Carroll updated SPARK-823:
--------------------------------
    Affects Version/s: 0.9.1

spark.default.parallelism's default is inconsistent across scheduler backends
------------------------------------------------------------------------------

                Key: SPARK-823
                URL: https://issues.apache.org/jira/browse/SPARK-823
            Project: Spark
         Issue Type: Bug
         Components: Documentation, PySpark, Spark Core
   Affects Versions: 0.8.0, 0.7.3, 0.9.1
           Reporter: Josh Rosen
           Priority: Minor

The [0.7.3 configuration guide|http://spark-project.org/docs/latest/configuration.html] says that {{spark.default.parallelism}}'s default is 8, but the default is actually max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos scheduler, and {{threads}} for the local scheduler:
https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
Should this be clarified in the documentation? Should the Mesos scheduler backend's default be revised?
[jira] [Created] (SPARK-1666) document examples
Diana Carroll created SPARK-1666:
------------------------------------
             Summary: document examples
                 Key: SPARK-1666
                 URL: https://issues.apache.org/jira/browse/SPARK-1666
             Project: Spark
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 0.9.1
            Reporter: Diana Carroll

It would be great if there were some guidance about what the example code shipped with Spark (under $SPARKHOME/examples and $SPARKHOME/python/examples) does and how to run it. Perhaps a comment block at the beginning explaining what the code accomplishes and what parameters it takes.

Also, if there are sample datasets on which the example is designed to run, please point to those. (As an example, look at kmeans.py, which takes a file argument, but has no hint about what sort of data is in the file or what format the data should be in.)
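To illustrate the kind of header block the issue is asking for, here is a sketch using kmeans.py, the example it names. The usage line, data format, and sample-data location shown are illustrative assumptions, not the actual documented behavior of the shipped example.

{code}
# Sketch of a header comment block for an example script such as kmeans.py.
# All specifics below (usage, format, sample-data path) are assumptions for illustration.
"""
K-means clustering example.

What it does:
    Reads a whitespace-delimited text file of numeric vectors (one vector per
    line, e.g. "0.0 0.0 0.0") and clusters them into k groups.

Usage (assumed):
    spark-submit kmeans.py <file> <k> <convergence_delta>

Sample data:
    A small dataset in the expected format is assumed to ship with the Spark
    distribution; point readers to its location here.
"""
{code}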