[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218610#comment-15218610 ]

Juliet Hougland commented on SPARK-13587:
-----------------------------------------

Being able to ship around pex files like we do .py and .egg files sounds very reasonable from a delineation-of-responsibilities perspective. I like the idea and would support a change like that. A question/edge case worth working out is how pex files relate to compiled C libs that python libs may need to link to. I don't know much about pex, but my initial assessment is that it shouldn't be a huge problem. I like this solution.

> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party python packages in
> pyspark.
> * One way is to use --py-files (suitable for a simple dependency, but not
> suitable for a complicated dependency, especially one with transitive
> dependencies)
> * Another way is to install packages manually on each node (time wasting,
> and not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations. One is native
> virtualenv, another is through conda. This jira is trying to bring these 2
> tools to the distributed environment.
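For concreteness, here is a hedged sketch of the pex workflow discussed in the comment above; the file names are hypothetical, and pointing PYSPARK_PYTHON at a pex assumes the pex was built for the workers' platform and Python version:

{code}
# Build a self-contained pex from a requirements file (hypothetical names).
pex -r requirements.txt -o deps.pex

# Ship the pex to executors alongside the job and use it as the worker
# interpreter -- a pex invoked with a script runs it in the pex environment.
PYSPARK_PYTHON=./deps.pex $SPARK_HOME/bin/spark-submit \
  --master yarn-client \
  --files deps.pex \
  my_job.py
{code}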
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212191#comment-15212191 ]

Juliet Hougland commented on SPARK-13587:
-----------------------------------------

I really do think spark and pyspark need to stay out of the business of installing anything for people. A generic executable is relatively neutral as to what exactly that executable does, which is good. Spark's scope should be computation/execution, not environment setup and teardown.

Have you considered using NFS or Amazon EFS to allow users to create and manage their own envs, and then mounting those on worker/executor nodes? This is an elegant solution we have seen deployed successfully (many experienced people at Cloudera, like Guru M and Tristan Z, recommend it). Given the description of your problem, I believe it should suit your needs.

[~vanzin] has suggested that "one alternative to shared mounts is to store the thing in HDFS and use something like --files / --archives in Spark. The distribution to new containers is handled by YARN, and Spark just would need some adjustments to find the right executable inside those archives."
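A hedged sketch of that --archives alternative; the archive name, HDFS path, and the '#env' alias convention are assumptions for illustration, not a documented recipe:

{code}
# Store a pre-built python env in HDFS; YARN localizes it next to each
# container under the alias after '#', and PYSPARK_PYTHON points into the
# unpacked directory on the executors.
$SPARK_HOME/bin/spark-submit \
  --master yarn-client \
  --archives hdfs:///user/juliet/envs/myenv.zip#env \
  --conf spark.executorEnv.PYSPARK_PYTHON=./env/bin/python \
  my_job.py
{code}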
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179253#comment-15179253 ]

Juliet Hougland commented on SPARK-13587:
-----------------------------------------

That is wonderful. Let me know if you'd like me to help work on it. It has been dangling on my mental todo list for a while.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179250#comment-15179250 ]

Juliet Hougland commented on SPARK-13587:
-----------------------------------------

I made a comment related to this below. TLDR: I think the suggested --py-env option could be encompassed by an already-needed --pyspark_python sort of option.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179247#comment-15179247 ]

Juliet Hougland commented on SPARK-13587:
-----------------------------------------

Currently, the way users specify the workers' python interpreter is through the PYSPARK_PYTHON env variable. It would be beneficial to users to allow that path to be specified by a cli flag. That is a current rough edge of using already-installed envs on a cluster. If this were added as a cli flag, I could see valid options being a path to a python interpreter, 'venv' (temp virtualenv), or 'conda' (temp conda env), with the latter two requiring a second flag to specify the requirements file. I think it helps prevent an explosion of flags for spark-submit while handling a very important and often-changed parameter for a job. What do you think of this?
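A hedged sketch of the proposed interface; these flag names and values are the proposal under discussion (the second flag name is invented for illustration), not an existing spark-submit option:

{code}
# Point workers at an interpreter already installed on every node:
spark-submit --pyspark_python /opt/anaconda2/bin/python my_job.py

# Or request a temporary environment, with a second (hypothetical) flag
# naming the requirements file to build it from:
spark-submit --pyspark_python venv  --py-requirements requirements.txt my_job.py
spark-submit --pyspark_python conda --py-requirements requirements.txt my_job.py
{code}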
[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178536#comment-15178536 ]

Juliet Hougland edited comment on SPARK-13587 at 3/3/16 9:21 PM:
-----------------------------------------------------------------

If pyspark allows users to create virtual environments, users will also want and need other features of python environment management on a cluster. I think this change would broaden the scope of PySpark to include python package management on a cluster. I do not think that spark should be in the business of creating python environments. I think the support load in terms of feature requests, mailing list traffic, etc. would be very large. This feature would begin to solve a problem, but would also put us on the hook for many more.

I agree with the general intention of this JIRA -- make it easier to manage and interact with complex python environments on a cluster. Perhaps there are other ways to accomplish this without broadening scope and functionality as much. For example, checking a requirements file against an environment before execution.

Edit: I see now that you are proposing a short-lived virtualenv. My objections about the broadening of scope still apply. I generally do not agree with suggestions that tightly tie us (and users) to a specific method of pyenv management. The loose coupling of python envs on a cluster to pyspark (via a path to an interpreter) is a positive feature. I would much rather add --pyspark_python to the cli tool (and deprecate the env var) than add a ton of logic to create environments for users.
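A minimal, hypothetical sketch of what "checking a requirements file against an environment before execution" could look like, using pkg_resources; the function name and the fail-fast behavior are assumptions, not a proposed API:

{code}
# Pre-flight check: verify a requirements file is satisfied by the current
# interpreter before submitting any work, instead of failing mid-job on a
# worker.
import sys
import pkg_resources

def check_requirements(path):
    with open(path) as f:
        # Skip blank lines and comments; treat the rest as requirement specs.
        specs = [line.strip() for line in f
                 if line.strip() and not line.startswith('#')]
    for spec in specs:
        try:
            pkg_resources.require(spec)
        except (pkg_resources.DistributionNotFound,
                pkg_resources.VersionConflict) as e:
            sys.exit("Environment does not satisfy %r from %s: %s"
                     % (spec, path, e))

check_requirements('requirements.txt')
{code}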
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178536#comment-15178536 ]

Juliet Hougland commented on SPARK-13587:
-----------------------------------------

If pyspark allows users to create virtual environments, users will also want and need other features of python environment management on a cluster. I think this change would broaden the scope of PySpark to include python package management on a cluster. I do not think that spark should be in the business of creating python environments. I think the support load in terms of feature requests, mailing list traffic, etc. would be very large. This feature would begin to solve a problem, but would also put us on the hook for many more.

I agree with the general intention of this JIRA -- make it easier to manage and interact with complex python environments on a cluster. Perhaps there are other ways to accomplish this without broadening scope and functionality as much. For example, checking a requirements file against an environment before execution.
[jira] [Created] (SPARK-13303) Spark fails with pandas import error when pandas is not explicitly imported by user
Juliet Hougland created SPARK-13303:
---------------------------------------

             Summary: Spark fails with pandas import error when pandas is not explicitly imported by user
                 Key: SPARK-13303
                 URL: https://issues.apache.org/jira/browse/SPARK-13303
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.6.0
         Environment: The python installation used by the driver (edge node) has pandas installed on it, while the python runtimes used on the data nodes do not have pandas installed. Pandas is never explicitly imported by pi.py.
            Reporter: Juliet Hougland

Running `spark-submit pi.py` results in:

  File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
ImportError: No module named pandas.algos

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

This is unexpected, and it is hard for users to unravel why they see this error, as they themselves have not explicitly done anything with pandas.
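A minimal, hypothetical illustration of this failure mode (not the actual pi.py code path): a pickle produced where pandas is installed cannot be loaded where it is absent, even though the loading code never mentions pandas:

{code}
# On the driver (pandas installed), pickling any pandas-backed object embeds
# a reference to the pandas module in the payload.
import pickle
import pandas as pd

payload = pickle.dumps(pd.Series([1, 2, 3]))

# On a worker without pandas, this raises "ImportError: No module named
# pandas..." from inside pickle.loads -- with no pandas import anywhere in
# the consuming code.
obj = pickle.loads(payload)
{code}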
[jira] [Commented] (SPARK-4073) Parquet+Snappy can cause significant off-heap memory usage
[ https://issues.apache.org/jira/browse/SPARK-4073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15111607#comment-15111607 ]

Juliet Hougland commented on SPARK-4073:
----------------------------------------

For those playing along at home -- the solution for me was to set spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=4G so that worker JVMs had a much, much higher off-heap memory limit.

> Parquet+Snappy can cause significant off-heap memory usage
> ----------------------------------------------------------
>
>                 Key: SPARK-4073
>                 URL: https://issues.apache.org/jira/browse/SPARK-4073
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Patrick Wendell
>            Priority: Critical
>
> The parquet snappy codec allocates off-heap buffers for decompression [1]. In
> one case the observed size of these buffers was high enough to add several
> GB of data to the overall virtual memory usage of the Spark executor process.
> I don't understand enough about our use of Snappy to fully grok how much data
> we would _expect_ to be present in these buffers at any given time, but I can
> say a few things.
> 1. The dataset had individual rows that were fairly large, e.g. megabytes.
> 2. Direct buffers are not cleaned up until GC events, and overall there was
> not much heap contention. So maybe they just weren't being cleaned.
> I opened PARQUET-118 to see if they can provide an option to use on-heap
> buffers for decompression. In the mean time, we could consider changing the
> default back to gzip, or we could do nothing (not sure how many other users
> will hit this).
> [1] https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28
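For reference, a hedged example of passing that setting at submit time; the job name is a placeholder, and 4G is the value that worked in this case, not a general recommendation:

{code}
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=4G" \
  my_job.py
{code}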
[jira] [Commented] (SPARK-4073) Parquet+Snappy can cause significant off-heap memory usage
[ https://issues.apache.org/jira/browse/SPARK-4073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1548#comment-1548 ]

Juliet Hougland commented on SPARK-4073:
----------------------------------------

I have run in to a related problem. I am reading a snappy-compressed parquet file via pyspark and get the following OOM error:

16/01/21 11:43:10 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout writer for /opt/anaconda2/bin/python,5,main]
java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:658)
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
        at parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:102)
        at parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:46)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:204)
        at parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.<init>(PlainValuesDictionary.java:89)
        at parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.<init>(PlainValuesDictionary.java:72)
        at parquet.column.Encoding$1.initDictionary(Encoding.java:89)
        at parquet.column.Encoding$4.initDictionary(Encoding.java:148)
        at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:337)
        at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:66)
        at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:61)
        at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:270)
        at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
        at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
        at parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
        at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
        at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
        at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:205)
        at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:405)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628947#comment-14628947 ]

Juliet Hougland commented on SPARK-8646:
----------------------------------------

The failure happens at the point that I need to write out a file on the cluster and pyspark facilities need to be available to executors, not just the driver program. I can parse args and start a spark context fine; it fails at the point that I call sc.saveAsTextFile. Relevant lines:

{panel}
def analyze(data_io):
    sc = data_io.sc()
    sc.addPyFile("file:/home/juliet/src/out-of-stock/outofstock/GeometricModel.py")
    keyed_ts_rdd = to_keyed_ts(sc.textFile(data_io.input_path)).cache()
    # Compute days between sales on a store-item basis
    keyed_days_btwn_sales = keyed_ts_rdd.mapValues(days_between_sales).cache()
    # Identify days with sales numbers that are outliers, using Tukey's criterion
    keyed_outliers = keyed_ts_rdd.mapValues(flag_outliers)
    to_csv_lines(keyed_outliers).saveAsTextFile(data_io.sales_outliers_path)  # Point of failure
    # Other stuff ...

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Analyze store-item sales history for anomalies.')
    parser.add_argument('input_path')
    parser.add_argument('output_dir')
    parser.add_argument('mode')
    args = parser.parse_args()
    dataIO = DataIO(args.input_path, args.output_dir, mode=args.mode)
    analyze(dataIO)
{panel}

This runs fine on Spark 1.3, and produces reasonable results that get written to files in hdfs. I'm pretty confident that my use of argparse and other logic in my code works fine.

PySpark does not run on YARN
----------------------------

                 Key: SPARK-8646
                 URL: https://issues.apache.org/jira/browse/SPARK-8646
             Project: Spark
          Issue Type: Bug
          Components: PySpark, YARN
    Affects Versions: 1.4.0
         Environment: SPARK_HOME=local/path/to/spark1.4install/dir
                      also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib
                      Spark apps are submitted with the command:
                      $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
                      data_transform contains a main method, and the rest of the args are parsed in my own code.
            Reporter: Juliet Hougland
         Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log

Running pyspark jobs results in a "no module named pyspark" error when run in yarn-client mode in spark 1.4. [I believe this JIRA represents the change that introduced this error.|https://issues.apache.org/jira/browse/SPARK-6869] This does not represent a binary-compatible change to spark. Scripts that worked on previous spark versions (ie commands that use spark-submit) should continue to work without modification between minor versions.
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628971#comment-14628971 ]

Juliet Hougland commented on SPARK-8646:
----------------------------------------

Yea, it works fine if I add that arg. There are two reasons I think this should be fixed in Spark, despite there being a work around. First, I think API compatibility should include scripts that
[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628947#comment-14628947 ]

Juliet Hougland edited comment on SPARK-8646 at 7/15/15 11:46 PM:
------------------------------------------------------------------

The failure happens at the point that I need to write out a file on the cluster and pyspark facilities need to be available to executors, not just the driver program. I can parse args and start a spark context fine; it fails at the point that I call sc.saveAsTextFile. Relevant lines:

{panel}
def analyze(data_io):
    sc = data_io.sc()
    sc.addPyFile("file:/home/juliet/src/out-of-stock/outofstock/GeometricModel.py")
    keyed_ts_rdd = to_keyed_ts(sc.textFile(data_io.input_path)).cache()
    keyed_days_btwn_sales = keyed_ts_rdd.mapValues(days_between_sales).cache()
    keyed_outliers = keyed_ts_rdd.mapValues(flag_outliers)
    to_csv_lines(keyed_outliers).saveAsTextFile(data_io.sales_outliers_path)  # Point of failure
    # Other stuff ...

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Analyze store-item sales history for anomalies.')
    parser.add_argument('input_path')
    parser.add_argument('output_dir')
    parser.add_argument('mode')
    args = parser.parse_args()
    dataIO = DataIO(args.input_path, args.output_dir, mode=args.mode)
    analyze(dataIO)
{panel}

This runs fine on Spark 1.3, and produces reasonable results that get written to files in hdfs. I'm pretty confident that my use of argparse and other logic in my code works fine.

(Note: edited because of strange jira formatting)
[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628971#comment-14628971 ]

Juliet Hougland edited comment on SPARK-8646 at 7/16/15 12:03 AM:
------------------------------------------------------------------

Yea, it works fine if I add that arg. There are two reasons I think this should be fixed in Spark, despite there being a work around. First, I think API compatibility should/does include scripts. Second, if Spark provides the ability to set the master via code, it should be respected and actually work. Otherwise, the option that doesn't work (setting master via code) should not be available at all.
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625245#comment-14625245 ]

Juliet Hougland commented on SPARK-8646:
----------------------------------------

[~lianhuiwang] In $SPARK_HOME/conf I only have the spark-defaults.conf.template file, not a non-template version. I also do not set the spark master to local programmatically.

[~vanzin] The command logged to stderr is:

Spark Command: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/bin/java -cp /home/juliet/bin/spark-1.4.0-bin-hadoop2.6/conf/:/home/juliet/bin/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar:/home/juliet/bin/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/juliet/bin/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/juliet/bin/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/etc/hadoop/conf/ -Xms512m -Xmx512m -XX:MaxPermSize=128m org.apache.spark.deploy.SparkSubmit --verbose outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex7/ yarn-client

(Sorry for the way the classpath gets chopped up between lines.) yarn-client is getting passed as an argument to my code, but because I am not specifying the master via the cli --master flag or via spark-defaults.conf, it does not affect how the job initially starts up.
[jira] [Updated] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juliet Hougland updated SPARK-8646:
------------------------------------
    Attachment: executor.log
[jira] [Updated] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juliet Hougland updated SPARK-8646:
------------------------------------
    Attachment: spark1.4-verbose.log
                verbose-executor.log
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623008#comment-14623008 ]

Juliet Hougland commented on SPARK-8646:
----------------------------------------

[~lianhuiwang] I just uploaded the log files from using --verbose. I think I may have important clues as to where the problem lies. Instead of using '--master yarn-client' as part of my spark-submit command, I parse my own cli arg in my main class to get the spark master and initialize a configuration with it. If I add --master yarn-client in addition to my normal master specification, the job runs fine.

The following command works in Spark 1.3 but not in 1.4:

$SPARK_HOME/bin/spark-submit --verbose outofstock/data_transform.py \
  hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/ yarn-client

If I add the --master yarn-client parameter to the command, it works. Specifically:

$SPARK_HOME/bin/spark-submit --verbose --master yarn-client outofstock/data_transform.py \
  hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/ yarn-client
[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623008#comment-14623008 ]

Juliet Hougland edited comment on SPARK-8646 at 7/10/15 10:40 PM:
------------------------------------------------------------------

[~lianhuiwang] I just uploaded the log files from using the verbose flag. I think I may have important clues as to where the problem lies. Instead of using '--master yarn-client' as part of my spark-submit command, I parse my own cli arg in my main class to get the spark master and initialize a configuration with it. If I add --master yarn-client in addition to my normal master specification, the job runs fine.

The following command works in Spark 1.3 but not in 1.4:

$SPARK_HOME/bin/spark-submit --verbose outofstock/data_transform.py \
  hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/ yarn-client

If I add the --master yarn-client parameter to the command, it works. Specifically:

$SPARK_HOME/bin/spark-submit --verbose --master yarn-client outofstock/data_transform.py \
  hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/ yarn-client
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614856#comment-14614856 ]

Juliet Hougland commented on SPARK-8646:
----------------------------------------

[~sowen] The pandas error came when I tried to run the pi job -- which doesn't import pandas at all. The only imports in $SPARK_1.4_HOME/examples/src/main/python/pi.py are as follows:

import sys
from random import random
from operator import add
from pyspark import SparkContext

PySpark itself doesn't require pandas (if it does, that should be documented), so having the pi job (which doesn't require pandas) fail with a pandas-not-found error is wrong; at no point should the pi job or pyspark itself require pandas. The pandas error is very, very weird, but not obviously directly related to this ticket. The problem I reported here has to do with pyspark itself not being shipped, or perhaps not available to the worker nodes, when I run a pyspark app from spark 1.4 using YARN.
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615809#comment-14615809 ]

Juliet Hougland commented on SPARK-8646:
----------------------------------------

[~davies] Please look at the logs I have attached. The pandas.algo import error only appears in the pi-test.log file. I ran pi-test as a method to help debug this problem at the request of [~vanzin]. If you look at the three other log files (with env differences in the file names), those are from running my out-of-stock job. That job does have quite a few dependencies, but I make sure those are available to the driver and workers. The real (first) issue this ticket is about is that pyspark isn't available on worker nodes. The same command I can use to run my job on spark 1.3 does not work with spark 1.4.
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604014#comment-14604014 ]

Juliet Hougland commented on SPARK-8646:
----------------------------------------

When I configure spark to use my virtualenv that is on every node of the cluster and includes pandas, the pi job works fine. This makes sense to me, because in the job that I have that fails, a spark context can be created without a module import error. The part that doesn't make sense to me is why pandas.algo would be needed at all. Looking at the code for the pi job, it is not part of any import declared in the file. This is orthogonal to the point of this ticket, but is very, very strange to me. The module import error that is the core of this JIRA occurs when I need to write out results of a computation (ie calling sc.saveAsTextFile), which requires the pyspark module to be available on the worker nodes.
[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603263#comment-14603263 ]

Juliet Hougland edited comment on SPARK-8646 at 6/26/15 5:35 PM:
-----------------------------------------------------------------

Results from pi-test are uploaded in the attachment pi-test.log. Still a missing-module error; this time it is pandas.algo.
[jira] [Updated] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juliet Hougland updated SPARK-8646:
------------------------------------
    Attachment: pi-test.log

Results from pi-test.log
[jira] [Updated] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juliet Hougland updated SPARK-8646:
------------------------------------
    Attachment: spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log

I ran the same line that gave me errors last time, with HADOOP_CONF_DIR=/etc/hadoop/conf prepended. The command I used was:

HADOOP_CONF_DIR=/etc/hadoop/conf $SPARK_HOME/bin/spark-submit outofstock/data_transform.py \
  hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex1/ yarn-client \
  2> spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log

I've attached the output from that; it appears to be the same to me.
[jira] [Updated] (SPARK-8642) Ungraceful failure when yarn client is not configured.
[ https://issues.apache.org/jira/browse/SPARK-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliet Hougland updated SPARK-8642: --- Attachment: yarnretries.log Log file from a spark job that failed because of misconfiguration. Counting the lines that record 9 retries gives: cat yarnretries.log | grep 'Already tried 9 time(s);' | wc -l, which returns 31. Ungraceful failure when yarn client is not configured. -- Key: SPARK-8642 URL: https://issues.apache.org/jira/browse/SPARK-8642 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0, 1.3.1 Reporter: Juliet Hougland Priority: Minor Attachments: yarnretries.log When HADOOP_CONF_DIR is not configured (i.e. yarn-site.xml is not available) the yarn client will still try to submit an application, but no connection to the resource manager can be established. The client tries to connect 10 times (the configured max retry count), and then repeats that cycle 30 more times. This takes about 5 minutes before an error, caused by a connect exception, is recorded for spark context initialization. I would expect that after the first 10 tries fail, the initialization of the spark context should fail too. At least that is what I would think given the logs. An earlier failure would be ideal/preferred. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8642) Ungraceful failure when yarn client is not configured.
Juliet Hougland created SPARK-8642: -- Summary: Ungraceful failure when yarn client is not configured. Key: SPARK-8642 URL: https://issues.apache.org/jira/browse/SPARK-8642 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.1, 1.3.0 Reporter: Juliet Hougland Priority: Minor When HADOOP_CONF_DIR is not configured (i.e. yarn-site.xml is not available) the yarn client will still try to submit an application, but no connection to the resource manager can be established. The client tries to connect 10 times (the configured max retry count), and then repeats that cycle 30 more times. This takes about 5 minutes before an error, caused by a connect exception, is recorded for spark context initialization. I would expect that after the first 10 tries fail, the initialization of the spark context should fail too. At least that is what I would think given the logs. An earlier failure would be ideal/preferred. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
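For context on the retry cadence described in SPARK-8642: the 10-try cycles in the attached log come from the Hadoop client's own retry settings rather than from Spark. A minimal sketch of how those knobs could be tightened so the failure surfaces sooner; the property names are Hadoop client configuration keys as I understand them (worth verifying against the cluster's Hadoop version), and the values are purely illustrative:

    import org.apache.hadoop.conf.Configuration

    // Sketch: tighten the Hadoop client settings behind the repeated
    // "Already tried N time(s)" log lines. Values are illustrative only.
    val hadoopConf = new Configuration()
    hadoopConf.setInt("ipc.client.connect.max.retries", 3)              // attempts per connection (default 10)
    hadoopConf.set("yarn.resourcemanager.connect.max-wait.ms", "30000") // total time spent retrying the RM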
[jira] [Updated] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliet Hougland updated SPARK-8646: --- Attachment: spark1.4-SPARK_HOME-set-PYTHONPATH-set.log spark1.4-SPARK_HOME-set.log The logs from the failures when only SPARK_HOME is set and when PYTHONPATH is also set. PySpark does not run on YARN Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set.log Running pyspark jobs results in a "no module named pyspark" error when run in yarn-client mode in spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This is not a binary compatible change to spark. Scripts that worked on previous spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8646) PySpark does not run on YARN
Juliet Hougland created SPARK-8646: -- Summary: PySpark does not run on YARN Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Running pyspark jobs results in a "no module named pyspark" error when run in yarn-client mode in spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This is not a binary compatible change to spark. Scripts that worked on previous spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8612) Yarn application status is misreported for failed PySpark apps.
Juliet Hougland created SPARK-8612: -- Summary: Yarn application status is misreported for failed PySpark apps. Key: SPARK-8612 URL: https://issues.apache.org/jira/browse/SPARK-8612 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0, 1.3.1, 1.3.0 Environment: PySpark job run in yarn-client mode on CDH 5.4.2 Reporter: Juliet Hougland Priority: Minor When a PySpark job fails, YARN records and reports its status as successful. Hari Shreedharan pointed out to me that [the ApplicationMaster records app success when System.exit is called. | https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L124] PySpark always [exits by calling os._exit. | https://github.com/apache/spark/blob/master/python/pyspark/daemon.py#L169] Because of this, every PySpark application run on yarn is marked as completed successfully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
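To make the mechanism in SPARK-8612 concrete: status recording of this kind hangs off a JVM shutdown hook, which System.exit runs but Runtime.halt (the JVM analogue of Python's os._exit) skips. A self-contained sketch; the object name and messages are illustrative, not Spark's actual code:

    object ExitDemo {
      def main(args: Array[String]): Unit = {
        // A hook playing the role of the ApplicationMaster's status recording:
        sys.addShutdownHook { println("shutdown hook ran: final status recorded here") }
        Runtime.getRuntime.halt(1) // skips the hook, like os._exit in Python
        // System.exit(1)          // would run the hook before the JVM exits
      }
    }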
[jira] [Created] (SPARK-7194) Vectors factory method for sparse vectors should accept the output of zipWithIndex
Juliet Hougland created SPARK-7194: -- Summary: Vectors factory method for sparse vectors should accept the output of zipWithIndex Key: SPARK-7194 URL: https://issues.apache.org/jira/browse/SPARK-7194 Project: Spark Issue Type: Improvement Reporter: Juliet Hougland Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0, ...). If we want to transform this into an RDD of sparse vectors, we currently have to:
arr_doubles.map { array =>
  val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
  Vectors.sparse(array.length, indexElem)
}
Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7194) Vectors factory method for sparse vectors should accept the output of zipWithIndex
[ https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliet Hougland updated SPARK-7194: --- Description: Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0, ...). If we want to transform this into an RDD of sparse vectors, we currently have to:
arr_doubles.map { array =>
  val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
  Vectors.sparse(array.length, indexElem)
}
Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. was: Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0, ...). If we want to transform this into an RDD of sparse vectors, we currently have to:
arr_doubles.map { array =>
  val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
  Vectors.sparse(array.length, indexElem)
}
Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. Vectors factory method for sparse vectors should accept the output of zipWithIndex -- Key: SPARK-7194 URL: https://issues.apache.org/jira/browse/SPARK-7194 Project: Spark Issue Type: Improvement Reporter: Juliet Hougland Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0, ...). If we want to transform this into an RDD of sparse vectors, we currently have to:
arr_doubles.map { array =>
  val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
  Vectors.sparse(array.length, indexElem)
}
Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
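A sketch of what the factory method requested in SPARK-7194 could look like. Only Vectors.sparse below is existing MLlib API; sparseFromZipWithIndex is a hypothetical name chosen for illustration:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Hypothetical factory method: accepts the (value, index) pairs that
    // zipWithIndex produces, doing the flip and the zero-filtering internally.
    def sparseFromZipWithIndex(size: Int, zipped: Seq[(Double, Int)]): Vector =
      Vectors.sparse(size, zipped.collect { case (value, index) if value != 0.0 => (index, value) })

    // Usage: no manual map step to swap the tuple elements.
    val arr = Array(0.0, 0.0, 3.2, 0.0)
    val sv = sparseFromZipWithIndex(arr.length, arr.zipWithIndex)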
[jira] [Created] (SPARK-6938) Add informative error messages to require statements.
Juliet Hougland created SPARK-6938: -- Summary: Add informative error messages to require statements. Key: SPARK-6938 URL: https://issues.apache.org/jira/browse/SPARK-6938 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Juliet Hougland Priority: Trivial In the Vectors class there are multiple require statements that do not return any message if the requirement fails. These should instead provide an informative error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
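As a concrete example of the change SPARK-6938 asks for: it is just a move from the one-argument to the two-argument form of require. The surrounding function here is invented for illustration, not actual MLlib code:

    // require(indices.length == values.length) fails with a bare
    // IllegalArgumentException; the two-argument form says what went wrong.
    def makeSparse(size: Int, indices: Array[Int], values: Array[Double]): Unit = {
      require(indices.length == values.length,
        s"indices and values must have the same length: got ${indices.length} and ${values.length}")
      // ... size, indices, and values would be used to build the vector ...
    }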
[jira] [Closed] (SPARK-5459) The reference to combineByKey in the programming guide should be replaced by aggregateByKey
[ https://issues.apache.org/jira/browse/SPARK-5459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliet Hougland closed SPARK-5459. -- Resolution: Duplicate The reference to combineByKey in the programming guide should be replaced by aggregateByKey --- Key: SPARK-5459 URL: https://issues.apache.org/jira/browse/SPARK-5459 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: Juliet Hougland Priority: Minor The spark programming guide references combineByKey in a note about groupByKey in the transformations section. This should be replaced with a reference to aggregateByKey. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5459) The reference to combineByKey in the programming guide should be replaced by aggregateByKey
Juliet Hougland created SPARK-5459: -- Summary: The reference to combineByKey in the programming guide should be replaced by aggregateByKey Key: SPARK-5459 URL: https://issues.apache.org/jira/browse/SPARK-5459 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: Juliet Hougland Priority: Minor The spark programming guide references combineByKey in a note about groupByKey in the transformations section. This should be replaced with a reference to aggregateByKey. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
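For readers following the guide, a minimal example of the suggested alternative; the data is invented and the snippet assumes running against a local SparkContext:

    import org.apache.spark.{SparkConf, SparkContext}

    // aggregateByKey folds values per key without materializing each group,
    // which is why the guide's groupByKey note should point to it.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("aggregateByKey-example"))
    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)))
    val sums = pairs.aggregateByKey(0.0)(_ + _, _ + _)
    println(sums.collect().toSeq) // e.g. List((a,3.0), (b,6.0)); ordering not guaranteed
    sc.stop()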
[jira] [Created] (SPARK-5442) Docs claim users must explicitly depend on a hadoop client, but it is not actually required.
Juliet Hougland created SPARK-5442: -- Summary: Docs claim users must explicitly depend on a hadoop client, but it is not actually required. Key: SPARK-5442 URL: https://issues.apache.org/jira/browse/SPARK-5442 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: Juliet Hougland Priority: Trivial In the "Linking with Spark" section, the docs claim that users need to explicitly depend on a hadoop client in order to interact with HDFS. This is not true, and the claim should be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183301#comment-14183301 ] Juliet Hougland commented on SPARK-3369: The guarantee of semantic versioning is that all releases within a major version will be binary compatible. If we were to change the method to return a different type, we could no longer run programs written against previous versions on the current version, which would mean the version this change appears in would have to be called Spark 2.0. The general rule is that you can add to an API and remain compatible, but you cannot remove from it. I agree that expanding that API with methods that accept a bunch of FlatMapFunction2s would be ugly. I think the upside is that it is incredibly transparent to end users. I like that it allows an explicit deprecation and suggests an immediate alternative. Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen Priority: Critical Labels: breaking_change Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency though because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunction}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
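A sketch of the first workaround mentioned above, the "hacky IteratorIterable": an Iterable that hands back the same underlying Iterator every time, which is only safe under the promise that Spark calls iterator() exactly once. The class name comes from the comment; the implementation here is illustrative:

    import java.util.{Iterator => JIterator}

    // Single-use Iterable: every call to iterator() returns the same
    // underlying Iterator, so it supports exactly one traversal.
    class IteratorIterable[T](underlying: JIterator[T]) extends java.lang.Iterable[T] {
      override def iterator(): JIterator[T] = underlying
    }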