[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628947#comment-14628947 ]

Juliet Hougland commented on SPARK-8646:
----------------------------------------

The failure happens at the point where I need to write out a file on the cluster, 
and the pyspark facilities need to be available to the executors, not just the 
driver program. I can parse args and start a SparkContext fine; it fails at the 
point where I call saveAsTextFile. Relevant lines:

{panel}
import argparse

def analyze(data_io):
    sc = data_io.sc()
    sc.addPyFile("file:/home/juliet/src/out-of-stock/outofstock/GeometricModel.py")
    keyed_ts_rdd = to_keyed_ts(sc.textFile(data_io.input_path)).cache()

    # Compute days between sales on a store-item basis
    keyed_days_btwn_sales = keyed_ts_rdd.mapValues(days_between_sales).cache()

    # Identify days with sales numbers that are outliers, using Tukey's criterion
    keyed_outliers = keyed_ts_rdd.mapValues(flag_outliers)
    to_csv_lines(keyed_outliers).saveAsTextFile(data_io.sales_outliers_path)  # Point of failure
    # <Other Stuff>

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Analyze store-item sales history for anomalies.')
    parser.add_argument('input_path')
    parser.add_argument('output_dir')
    parser.add_argument('mode')
    args = parser.parse_args()

    dataIO = DataIO(args.input_path, args.output_dir, mode=args.mode)
    analyze(dataIO)
{panel}

This runs fine on Spark 1.3 and produces reasonable results that get written to 
files in HDFS. I'm pretty confident that my use of argparse and the other logic 
in my code works fine; a minimal reproduction that leaves my application logic 
out of the picture is sketched below.
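
As a sketch only (the app name and output path here are illustrative, not taken from 
my actual job), something this small should be enough to reproduce the problem, since 
the ImportError only surfaces once an action forces Python code to run on the executors:

{panel}
# Minimal repro sketch -- hypothetical file name/path, submitted the same way
# as the real job, e.g.: $SPARK_HOME/bin/spark-submit repro.py
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("pyspark-yarn-repro").setMaster("yarn-client")
    sc = SparkContext(conf=conf)

    # Driver-side setup succeeds; the failure only appears when an action
    # ships the Python closures to the executors.
    counts = (sc.parallelize(range(100), 4)
                .map(lambda x: (x % 10, 1))
                .reduceByKey(lambda a, b: a + b))

    # On Spark 1.4 + YARN this is where "no module named pyspark" shows up
    # in the executor logs; on 1.3 the same script completes normally.
    counts.saveAsTextFile("hdfs:///user/juliet/ex/repro-output")

    sc.stop()
{panel}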

> PySpark does not run on YARN
> ----------------------------
>
>                 Key: SPARK-8646
>                 URL: https://issues.apache.org/jira/browse/SPARK-8646
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, YARN
>    Affects Versions: 1.4.0
>         Environment: SPARK_HOME=local/path/to/spark1.4install/dir
> also with
> SPARK_HOME=local/path/to/spark1.4install/dir
> PYTHONPATH=$SPARK_HOME/python/lib
> Spark apps are submitted with the command:
> $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
> hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
> data_transform contains a main method, and the rest of the args are parsed in 
> my own code.
>            Reporter: Juliet Hougland
>         Attachments: executor.log, pi-test.log, 
> spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
> spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
> spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log
>
>
> Running pyspark jobs results in a "no module named pyspark" error when run in 
> yarn-client mode in Spark 1.4.
> [I believe this JIRA represents the change that introduced this error.|https://issues.apache.org/jira/browse/SPARK-6869]
> This is not a binary-compatible change to Spark. Scripts that worked on 
> previous Spark versions (i.e. commands that use spark-submit) should continue 
> to work without modification between minor versions.


