[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-30 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218610#comment-15218610
 ] 

Juliet Hougland commented on SPARK-13587:
-

Being able to ship around pex files like we do .py and .egg files sounds very 
reasonable from a delineation-of-responsibilities perspective.

I like the idea and would support a change like that. A question/edge case 
worth working out is how pex files relate to compiled C libraries that Python 
libraries may need to link against. I don't know much about pex, but my initial 
assessment is that it shouldn't be a huge problem. I like this solution.
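As a rough sketch of what that could look like (the package names, paths, and 
my_job.py below are illustrative, not from this thread, and this assumes the 
pex is built for the workers' platform and Python version):

# Build a self-contained pex with the job's dependencies (illustrative names).
pex pandas numpy -o deps.pex

# Ship it alongside the job and point the workers at it. PYSPARK_PYTHON is an
# existing Spark setting; treating a pex as the worker interpreter is the part
# being proposed here, not something spark-submit supports out of the box.
PYSPARK_PYTHON=./deps.pex \
  spark-submit --master yarn-client --files deps.pex my_job.py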

> Support virtualenv in PySpark
> -----------------------------
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy to switch between environments)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is about bringing these 2 
> tools to a distributed environment.






[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-25 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212191#comment-15212191
 ] 

Juliet Hougland commented on SPARK-13587:
-

I really do think Spark and PySpark need to stay out of the business of 
installing anything for people. A generic executable is relatively neutral as 
to what exactly that executable does, which is good. Spark's scope should be 
computation/execution, not environment setup and teardown.

Have you considered using NFS or Amazon EFS to allow users to create and manage 
their own envs and then mounting those on worker/executor nodes? This is an 
elegant solution that we have seen deployed successfully (many experienced 
people at Cloudera, like Guru M and Tristan Z, recommend it). Given the 
description of your problem, I believe it should suit your needs.

As [~vanzin] suggested, "one alternative to shared mounts is to store the thing 
in HDFS and use something like --files / --archives in Spark. The distribution 
to new containers is handled by YARN, and Spark just would need some 
adjustments to find the right executable inside those archives."
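A rough sketch of that --archives variant, assuming a relocatable environment 
(e.g. a conda-style env) packaged ahead of time; the archive name, HDFS path, 
"#environment" alias, and my_job.py are illustrative, and Spark would still 
need the adjustments [~vanzin] mentions to make this seamless:

# Package a relocatable Python environment on the edge node (illustrative names).
tar -czf analytics_env.tar.gz -C /opt/envs/analytics .
hdfs dfs -put analytics_env.tar.gz /user/juliet/

# Ship the archive with the job; YARN localizes and unpacks it next to each
# executor under the alias "environment", and PYSPARK_PYTHON points inside it.
PYSPARK_PYTHON=./environment/bin/python \
  spark-submit --master yarn-client \
    --archives hdfs:///user/juliet/analytics_env.tar.gz#environment \
    my_job.py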







[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179253#comment-15179253
 ] 

Juliet Hougland commented on SPARK-13587:
-

That is wonderful. Let me know if you'd like me to help work on it. It has been 
dangling on my mental todo list for a while.







[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179250#comment-15179250
 ] 

Juliet Hougland commented on SPARK-13587:
-

I made a comment related to this below. TL;DR: I think the suggested --py-env 
option could be encompassed by an already-needed --pyspark_python sort of option.







[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179247#comment-15179247
 ] 

Juliet Hougland commented on SPARK-13587:
-

Currently, the way users specify the workers' Python interpreter is through the 
PYSPARK_PYTHON env variable. It would be beneficial to users to allow that path 
to be specified by a CLI flag; that is a current rough edge of using 
already-installed envs on a cluster.

If this were added as a CLI flag, I could see valid options being a path to an 
interpreter ('pyspark/python/path'), 'venv' (temp virtualenv), and 'conda' (temp 
conda env), with the latter two requiring a second flag to specify the 
requirements file. I think it helps prevent an explosion of flags for 
spark-submit while helping handle a very important and often-changed parameter 
for a job. What do you think of this?
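A sketch of how that might look on the command line; PYSPARK_PYTHON is the 
existing mechanism, while --pyspark-python and --py-requirements below are 
hypothetical flags for the sake of this discussion, not options that 
spark-submit actually has (paths and my_job.py are illustrative):

# Today: select the workers' interpreter via an environment variable.
PYSPARK_PYTHON=/opt/envs/analytics/bin/python \
  spark-submit --master yarn-client my_job.py

# Hypothetical flag-based equivalents of the proposal above:
# spark-submit --pyspark-python /opt/envs/analytics/bin/python my_job.py
# spark-submit --pyspark-python venv  --py-requirements requirements.txt my_job.py
# spark-submit --pyspark-python conda --py-requirements requirements.txt my_job.py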







[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178536#comment-15178536
 ] 

Juliet Hougland edited comment on SPARK-13587 at 3/3/16 9:21 PM:
-

If pyspark allows users to create virtual environments, users will also want 
and need other features of python environment management on a cluster. I think 
this change would broaden the scope of PySpark to include python package 
management on a cluster. I do not think that spark should be in the business of 
creating python environments. I think the support load in terms of feature 
requests, mailing list traffic, etc would be very large. This feature would 
begin to solve a problem, but would also put us on the hook for many more. 

I agree with the general intention of this JIRA -- make it easier to manage and 
interact with complex python environments on a cluster. Perhaps there are other 
ways to accomplish this without broadening scope and functionality as much. For 
example, checking a requirements file against an environment before execution.

Edit: I see now that you are proposing a short-lived virtualenv. My objections 
about the broadening of scope still apply. I generally do not agree with 
suggestions that tightly tie us (and users) to a specific method of pyenv 
management. The loose coupling of Python envs on a cluster to PySpark (via a 
path to an interpreter) is a positive feature. I would much rather add 
--pyspark_python to the CLI tool (and deprecate the env var) than add a ton of 
logic to create environments for users.









[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178536#comment-15178536
 ] 

Juliet Hougland commented on SPARK-13587:
-

If pyspark allows users to create virtual environments, users will also want 
and need other features of python environment management on a cluster. I think 
this change would broaden the scope of PySpark to include python package 
management on a cluster. I do not think that spark should be in the business of 
creating python environments. I think the support load in terms of feature 
requests, mailing list traffic, etc would be very large. This feature would 
begin to solve a problem, but would also put us on the hook for many more. 

I agree with the general intention of this JIRA -- make it easier to manage and 
interact with complex python environments on a cluster. Perhaps there are other 
ways to accomplish this without broadening scope and functionality as much. For 
example, checking a requirements file against an environment before execution.







[jira] [Created] (SPARK-13303) Spark fails with pandas import error when pandas is not explicitly imported by user

2016-02-12 Thread Juliet Hougland (JIRA)
Juliet Hougland created SPARK-13303:
---

 Summary: Spark fails with pandas import error when pandas is not 
explicitly imported by user
 Key: SPARK-13303
 URL: https://issues.apache.org/jira/browse/SPARK-13303
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.0
 Environment: The Python installation used by the driver (edge node) 
has pandas installed, while the Python runtimes used on the data nodes do not 
have pandas installed. Pandas is never explicitly imported by pi.py.
Reporter: Juliet Hougland


Running `spark-submit pi.py` results in:

  File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
ImportError: No module named pandas.algos

at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


This is unexpected, and it is hard for users to unravel why they see this error, 
as they themselves have not explicitly done anything with pandas.






[jira] [Commented] (SPARK-4073) Parquet+Snappy can cause significant off-heap memory usage

2016-01-21 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111607#comment-15111607
 ] 

Juliet Hougland commented on SPARK-4073:


For those playing along at home -- the solution for me was to set 
spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=4G so that worker JVMs 
had a much, much higher off-heap memory limit.
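For example, that can be passed at submit time via --conf (the 4G value is just 
what worked in this case and would need tuning per workload; my_job.py stands 
in for the actual application):

spark-submit --master yarn-client \
  --conf "spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=4G" \
  my_job.py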

> Parquet+Snappy can cause significant off-heap memory usage
> --
>
> Key: SPARK-4073
> URL: https://issues.apache.org/jira/browse/SPARK-4073
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Patrick Wendell
>Priority: Critical
>
> The parquet snappy codec allocates off-heap buffers for decompression[1]. In 
> one case the observed size of these buffers was high enough to add several 
> GB of data to the overall virtual memory usage of the Spark executor process. 
> I don't understand enough about our use of Snappy to fully grok how much data 
> we would _expect_ to be present in these buffers at any given time, but I can 
> say a few things.
> 1. The dataset had individual rows that were fairly large, e.g. megabytes.
> 2. Direct buffers are not cleaned up until GC events, and overall there was 
> not much heap contention. So maybe they just weren't being cleaned.
> I opened PARQUET-118 to see if they can provide an option to use on-heap 
> buffers for decompression. In the mean time, we could consider changing the 
> default back to gzip, or we could do nothing (not sure how many other users 
> will hit this).
> [1] 
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28






[jira] [Commented] (SPARK-4073) Parquet+Snappy can cause significant off-heap memory usage

2016-01-21 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1548#comment-1548
 ] 

Juliet Hougland commented on SPARK-4073:


I have run into a related problem. I am reading a snappy-compressed parquet 
file via pyspark and get the following OOM error:

16/01/21 11:43:10 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout writer for /opt/anaconda2/bin/python,5,main]
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:102)
at parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:46)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:204)
at parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.<init>(PlainValuesDictionary.java:89)
at parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.<init>(PlainValuesDictionary.java:72)
at parquet.column.Encoding$1.initDictionary(Encoding.java:89)
at parquet.column.Encoding$4.initDictionary(Encoding.java:148)
at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:337)
at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:66)
at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:61)
at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:270)
at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
at parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:205)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)




[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-15 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628947#comment-14628947
 ] 

Juliet Hougland commented on SPARK-8646:


The failure happens at the point that I need to write out a file on the cluster 
and pyspark facilities need to be available to executors, not just the driver 
program. I can parse args and start a spark context fine; it fails at the point 
where I call sc.saveAsTextFile. Relevant lines:

{panel}
def analyze(data_io):
    sc = data_io.sc()
    sc.addPyFile("file:/home/juliet/src/out-of-stock/outofstock/GeometricModel.py")
    keyed_ts_rdd = to_keyed_ts(sc.textFile(data_io.input_path)).cache()

    # Compute days between sales on a store-item basis
    keyed_days_btwn_sales = keyed_ts_rdd.mapValues(days_between_sales).cache()

    # Identify days with sales numbers that are outliers, using Tukey's criterion
    keyed_outliers = keyed_ts_rdd.mapValues(flag_outliers)
    to_csv_lines(keyed_outliers).saveAsTextFile(data_io.sales_outliers_path)  # Point of failure
    # ... other stuff ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Analyze store-item sales history for anomalies.')
    parser.add_argument('input_path')
    parser.add_argument('output_dir')
    parser.add_argument('mode')
    args = parser.parse_args()

    dataIO = DataIO(args.input_path, args.output_dir, mode=args.mode)
    analyze(dataIO)
{panel}

This runs fine on Spark 1.3, and produces reasonable results that get written 
to files in HDFS. I'm pretty confident that my use of argparse and other logic 
in my code works fine.

 PySpark does not run on YARN
 ----------------------------

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: executor.log, pi-test.log, 
 spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log


 Running pyspark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This does not represent a binary compatible change to spark. Scripts that 
 worked on previous spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.






[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-15 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628971#comment-14628971
 ] 

Juliet Hougland commented on SPARK-8646:


Yea, it works fine if I add that arg. There are two reasons I think this should 
be fixed in Spark, despite there being a work around. First, I think API 
compatibility should include scripts that 







[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN

2015-07-15 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628947#comment-14628947
 ] 

Juliet Hougland edited comment on SPARK-8646 at 7/15/15 11:46 PM:
--

The failure happens at the point that I need to write out a file on the cluster 
and pyspark facilities need to be available to executors, not just the driver 
program. I can parse args and start a spark context fine; it fails at the point 
where I call sc.saveAsTextFile. Relevant lines:

{panel}
def analyze(data_io):
    sc = data_io.sc()
    sc.addPyFile("file:/home/juliet/src/out-of-stock/outofstock/GeometricModel.py")
    keyed_ts_rdd = to_keyed_ts(sc.textFile(data_io.input_path)).cache()
    keyed_days_btwn_sales = keyed_ts_rdd.mapValues(days_between_sales).cache()
    keyed_outliers = keyed_ts_rdd.mapValues(flag_outliers)
    to_csv_lines(keyed_outliers).saveAsTextFile(data_io.sales_outliers_path)  # Point of failure
    # ... other stuff ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Analyze store-item sales history for anomalies.')
    parser.add_argument('input_path')
    parser.add_argument('output_dir')
    parser.add_argument('mode')
    args = parser.parse_args()

    dataIO = DataIO(args.input_path, args.output_dir, mode=args.mode)
    analyze(dataIO)
{panel}

This runs fine on Spark 1.3, and produces reasonable results that get written 
to files in HDFS. I'm pretty confident that my use of argparse and other logic 
in my code works fine. (Note: edited because of strange JIRA formatting.)









[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN

2015-07-15 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628971#comment-14628971
 ] 

Juliet Hougland edited comment on SPARK-8646 at 7/16/15 12:03 AM:
--

Yea, it works fine if I add that arg. There are two reasons I think this should 
be fixed in Spark, despite there being a workaround. First, I think API 
compatibility should/does include scripts. Second, if Spark provides the 
ability to set the master via code, it should be respected and actually work. 
Otherwise, the option that doesn't work (setting the master via code) should 
not be available at all.









[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-13 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625245#comment-14625245
 ] 

Juliet Hougland commented on SPARK-8646:


[~lianhuiwang] in $SPARK_HOME/conf I only have the spark-defaults.conf.template 
file, not a non-template version. I also do not set the spark master to local 
programmatically.

[~vanzin] The command logged to stderr is:

Spark Command: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/bin/java -cp 
/home/juliet/bin/spark-1.4.0-bin-hadoop2.6/conf/:/home/juliet/bin/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar:/home/juliet/bin/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/juliet/bin/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/juliet/bin/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/etc/hadoop/conf/
 -Xms512m -Xmx512m -XX:MaxPermSize=128m org.apache.spark.deploy.SparkSubmit --verbose 
outofstock/data_transform.py
hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex7/ yarn-client

(Sorry for the way the classpath gets chopped up between lines.) yarn-client is 
getting passed as an argument to my code, but because I am not specifying the 
master via the CLI --master flag or via spark-defaults.conf, it does not affect 
how the job initially starts up.
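For reference, a sketch of the two standard ways to pin the master so it is not 
just a positional argument to the user code (spark.master is the standard 
property name; the paths mirror the command above):

# Pass it explicitly on the command line:
$SPARK_HOME/bin/spark-submit --master yarn-client outofstock/data_transform.py \
  hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex7/ yarn-client

# ...or set it once in $SPARK_HOME/conf/spark-defaults.conf:
# spark.master    yarn-client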







[jira] [Updated] (SPARK-8646) PySpark does not run on YARN

2015-07-10 Thread Juliet Hougland (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliet Hougland updated SPARK-8646:
---
Attachment: executor.log







[jira] [Updated] (SPARK-8646) PySpark does not run on YARN

2015-07-10 Thread Juliet Hougland (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliet Hougland updated SPARK-8646:
---
Attachment: spark1.4-verbose.log
verbose-executor.log







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-10 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623008#comment-14623008
 ] 

Juliet Hougland commented on SPARK-8646:


[~lianhuiwang] I just uploaded the log files from using --verbose. I think I 
may have important clues as to where the problem lies. Instead of using 
'--master yarn-client' as part of my spark-submit command, I parse my own cli 
arg in my main class to get the spark master and initialize a configuration 
with it. If I add --master yarn-client in addition to my normal master 
specification, the job runs fine.

The following command works in Spark 1.3 but not in 1.4:
$SPARK_HOME/bin/spark-submit --verbose outofstock/data_transform.py \
hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/ yarn-client

If I add the --master yarn-client parameter to the command it works. 
Specifically:
$SPARK_HOME/bin/spark-submit --verbose --master yarn-client 
outofstock/data_transform.py \
hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/ yarn-client







[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN

2015-07-10 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623008#comment-14623008
 ] 

Juliet Hougland edited comment on SPARK-8646 at 7/10/15 10:40 PM:
--

[~lianhuiwang] I just uploaded the log files from using the verbose flag. I 
think I may have important clues as to where the problem lies. Instead of using 
'--master yarn-client' as part of my spark-submit command, I parse my own cli 
arg in my main class to get the spark master and initialize a configuration 
with it. If I add --master yarn-client in addition to my normal master 
specification, the job runs fine.

The following command works in Spark 1.3 but not in 1.4:
$SPARK_HOME/bin/spark-submit --verbose outofstock/data_transform.py \
hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/ yarn-client

If I add the --master yarn-client parameter to the command it works. 
Specifically:
$SPARK_HOME/bin/spark-submit --verbose --master yarn-client 
outofstock/data_transform.py \
hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/ yarn-client









[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-06 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614856#comment-14614856
 ] 

Juliet Hougland commented on SPARK-8646:


[~sowen] The pandas error came when I tried to run the pi job -- which doesn't 
import pandas at all. The only imports in 
$SPARK_1.4_HOME/examples/src/main/python/pi.py are as follows:

import sys
from random import random
from operator import add
from pyspark import SparkContext


PySpark itself doesn't require pandas (if it does, that should be documented), 
so having the pi job (which doesn't require pandas) fail with a pandas-not-found 
error is wrong, because at no point should the pi job or pyspark itself require 
pandas. The pandas error is very, very weird but not obviously directly related 
to this ticket. The problem I reported here has to do with pyspark itself not 
being shipped, or perhaps not being available to the worker nodes, when I run a 
pyspark app from spark 1.4 using YARN.







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-06 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615809#comment-14615809
 ] 

Juliet Hougland commented on SPARK-8646:


[~davies] Please look at the logs I have attached. The pandas.algos import error 
only appears in the pi-test.log file. I ran the pi test as a way to help debug 
this problem at the request of [~vanzin]. If you look at the three other log 
files (with env differences in the file names), those are from running my 
out-of-stock job. That job does have quite a few dependencies, but I make sure 
those are available to the driver and workers.

The real (first) issue that this ticket relates to is that pyspark isn't 
available on worker nodes. The same command I can use to run my job on spark 
1.3 does not work with spark 1.4.







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-06-27 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604014#comment-14604014
 ] 

Juliet Hougland commented on SPARK-8646:


When I configure spark to use my virtualenv that is on every node of the 
cluster and includes pandas, the pi job works fine. This makes sense to me 
because in the job that I have that fails, a spark context can be created 
without a module import error. The part that doesn't make sense to me is why 
pandas.algos would be needed at all. Looking at the code for the pi job, it is 
not part of any import that is declared in the file. This is orthogonal to the 
point of this ticket, but is very, very strange to me.

The module import error that is the core of this JIRA occurs when I need to 
write out the results of a computation (i.e. calling sc.saveAsTextFile), which 
requires the pyspark module to be available on the worker nodes.







[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN

2015-06-26 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603263#comment-14603263
 ] 

Juliet Hougland edited comment on SPARK-8646 at 6/26/15 5:35 PM:
-

Results from the pi test are uploaded in the attachment pi-test.log. Still a 
missing-module error; this time it is pandas.algos.









[jira] [Updated] (SPARK-8646) PySpark does not run on YARN

2015-06-26 Thread Juliet Hougland (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliet Hougland updated SPARK-8646:
---
Attachment: pi-test.log

Results from pi-test.log

 PySpark does not run on YARN
 

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set.log


 Running PySpark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.|https://issues.apache.org/jira/browse/SPARK-6869]
 This does not represent a binary compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.






[jira] [Updated] (SPARK-8646) PySpark does not run on YARN

2015-06-26 Thread Juliet Hougland (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliet Hougland updated SPARK-8646:
---
Attachment: spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log

I ran the same line that gave me errors last time with 
HADOOP_CONF_DIR=/etc/hadoop/conf prepended. The command I used was:

HADOOP_CONF_DIR=/etc/hadoop/conf $SPARK_HOME/bin/spark-submit 
outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS 
hdfs:/user/juliet/ex1/ yarn-client 2> 
spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log

I've attached the output from that; it appears to be the same to me.

 PySpark does not run on YARN
 

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log


 Running PySpark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.|https://issues.apache.org/jira/browse/SPARK-6869]
 This does not represent a binary compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.






[jira] [Updated] (SPARK-8642) Ungraceful failure when yarn client is not configured.

2015-06-25 Thread Juliet Hougland (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliet Hougland updated SPARK-8642:
---
Attachment: yarnretries.log

Log file from a Spark job that failed because of YARN misconfiguration. 
Counting the lines that show 9 retries gives:
cat yarnretries.log | grep 'Already tried 9 time(s);' | wc -l
31


 Ungraceful failure when yarn client is not configured.
 --

 Key: SPARK-8642
 URL: https://issues.apache.org/jira/browse/SPARK-8642
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0, 1.3.1
Reporter: Juliet Hougland
Priority: Minor
 Attachments: yarnretries.log


 When HADOOP_CONF_DIR is not configured (i.e. yarn-site.xml is not available), 
 the YARN client will still try to submit an application. No connection to the 
 resource manager can be established. The client will try to connect 10 times 
 (with a max retry of ten), and then repeat that cycle about 30 more times. 
 This takes about 5 minutes before an error is recorded for Spark context 
 initialization, caused by a connect exception. I would expect that after the 
 first 10 tries fail, the initialization of the Spark context should fail too. 
 At least that is what I would think given the logs. An earlier failure would 
 be ideal/preferred.






[jira] [Created] (SPARK-8642) Ungraceful failure when yarn client is not configured.

2015-06-25 Thread Juliet Hougland (JIRA)
Juliet Hougland created SPARK-8642:
--

 Summary: Ungraceful failure when yarn client is not configured.
 Key: SPARK-8642
 URL: https://issues.apache.org/jira/browse/SPARK-8642
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.1, 1.3.0
Reporter: Juliet Hougland
Priority: Minor


When HADOOP_CONF_DIR is not configured (i.e. yarn-site.xml is not available), the 
YARN client will still try to submit an application. No connection to the resource 
manager can be established. The client will try to connect 10 times (with a max 
retry of ten), and then repeat that cycle about 30 more times. This takes about 5 
minutes before an error is recorded for Spark context initialization, which is 
caused by a connect exception. I would expect that after the first 10 tries fail, 
the initialization of the Spark context should fail too. At least that is what I 
would think given the logs. An earlier failure would be ideal/preferred.
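
As a sketch of the earlier failure described above (my own illustration, not existing 
Spark code; the helper name is made up), a submission script could check for a usable 
YARN configuration before ever creating a context:

import os
import sys

def require_yarn_conf():
    # Fail fast if no YARN configuration can be located, instead of letting the
    # IPC client retry against an unreachable ResourceManager for minutes.
    conf_dir = os.environ.get("HADOOP_CONF_DIR") or os.environ.get("YARN_CONF_DIR")
    if not conf_dir:
        sys.exit("HADOOP_CONF_DIR/YARN_CONF_DIR is not set; cannot locate yarn-site.xml")
    if not os.path.isfile(os.path.join(conf_dir, "yarn-site.xml")):
        sys.exit("yarn-site.xml not found under %s" % conf_dir)

require_yarn_conf()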






[jira] [Updated] (SPARK-8646) PySpark does not run on YARN

2015-06-25 Thread Juliet Hougland (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliet Hougland updated SPARK-8646:
---
Attachment: spark1.4-SPARK_HOME-set-PYTHONPATH-set.log
spark1.4-SPARK_HOME-set.log

The logs from failures when only SPARK_HOME is set and when PYTHONPATH is also 
set.

 PySpark does not run on YARN
 

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set.log


 Running PySpark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.|https://issues.apache.org/jira/browse/SPARK-6869]
 This does not represent a binary compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.






[jira] [Created] (SPARK-8646) PySpark does not run on YARN

2015-06-25 Thread Juliet Hougland (JIRA)
Juliet Hougland created SPARK-8646:
--

 Summary: PySpark does not run on YARN
 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir

also with
SPARK_HOME=local/path/to/spark1.4install/dir
PYTHONPATH=$SPARK_HOME/python/lib

Spark apps are submitted with the command:
$SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client

data_transform contains a main method, and the rest of the args are parsed in 
my own code.


Reporter: Juliet Hougland


Running PySpark jobs results in a "no module named pyspark" error when run in 
yarn-client mode in Spark 1.4.

[I believe this JIRA represents the change that introduced this error.|https://issues.apache.org/jira/browse/SPARK-6869]

This does not represent a binary compatible change to Spark. Scripts that 
worked on previous Spark versions (i.e. commands that use spark-submit) should 
continue to work without modification between minor versions.






[jira] [Created] (SPARK-8612) Yarn application status is misreported for failed PySpark apps.

2015-06-24 Thread Juliet Hougland (JIRA)
Juliet Hougland created SPARK-8612:
--

 Summary: Yarn application status is misreported for failed PySpark 
apps.
 Key: SPARK-8612
 URL: https://issues.apache.org/jira/browse/SPARK-8612
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0, 1.3.1, 1.3.0
 Environment: PySpark job run in yarn-client mode on CDH 5.4.2
Reporter: Juliet Hougland
Priority: Minor


When a PySpark job fails, YARN records and reports its status as successful. 
Hari Shreedharan pointed out to me that [the ApplicationMaster records app 
success when System.exit is called.| 
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L124]
 PySpark always [exits by calling os._exit.| 
https://github.com/apache/spark/blob/master/python/pyspark/daemon.py#L169] 
Because of this, every PySpark application run on YARN is marked as having 
completed successfully.
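
A small stand-alone Python illustration of the difference (not Spark code): os._exit 
terminates the interpreter immediately, so exit handlers and anything watching for a 
SystemExit never see the failure, while sys.exit raises SystemExit and lets them run:

import atexit
import os
import sys

atexit.register(lambda: sys.stderr.write("atexit handler ran\n"))

if "--hard" in sys.argv:
    os._exit(1)   # immediate termination: the atexit handler above is skipped
else:
    sys.exit(1)   # raises SystemExit: handlers and finally blocks still run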






[jira] [Created] (SPARK-7194) Vectors factors method for sparse vectors should accept the output of zipWithIndex

2015-04-28 Thread Juliet Hougland (JIRA)
Juliet Hougland created SPARK-7194:
--

 Summary: Vectors factors method for sparse vectors should accept 
the output of zipWithIndex
 Key: SPARK-7194
 URL: https://issues.apache.org/jira/browse/SPARK-7194
 Project: Spark
  Issue Type: Improvement
Reporter: Juliet Hougland


Let's say we have an RDD of Array[Double] where zero values are explicitly 
recorded, i.e. (0.0, 0.0, 3.2, 0.0, ...). If we want to transform this into an RDD 
of sparse vectors, we currently have to:

arr_doubles.map { array =>
  val indexElem: Seq[(Int, Double)] =
    array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
  Vectors.sparse(array.length, indexElem)
}

Notice that there is a map step at the end of the chain to switch the order of the 
index and the element value after .zipWithIndex. There should be a factory method on 
the Vectors class that lets you avoid this flipping of tuple elements when using 
zipWithIndex.






[jira] [Updated] (SPARK-7194) Vectors factors method for sparse vectors should accept the output of zipWithIndex

2015-04-28 Thread Juliet Hougland (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliet Hougland updated SPARK-7194:
---
Description: 
Let's say we have an RDD of Array[Double] where zero values are explicitly 
recorded, i.e. (0.0, 0.0, 3.2, 0.0, ...). If we want to transform this into an RDD 
of sparse vectors, we currently have to:

arr_doubles.map { array =>
  val indexElem: Seq[(Int, Double)] =
    array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))

  Vectors.sparse(array.length, indexElem)
}

Notice that there is a map step at the end of the chain to switch the order of the 
index and the element value after .zipWithIndex. There should be a factory method on 
the Vectors class that lets you avoid this flipping of tuple elements when using 
zipWithIndex.

  was:
Let's say we have an RDD of Array[Double] where zero values are explicitly 
recorded, i.e. (0.0, 0.0, 3.2, 0.0, ...). If we want to transform this into an RDD 
of sparse vectors, we currently have to:

arr_doubles.map { array =>
  val indexElem: Seq[(Int, Double)] =
    array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
  Vectors.sparse(array.length, indexElem)
}

Notice that there is a map step at the end of the chain to switch the order of the 
index and the element value after .zipWithIndex. There should be a factory method on 
the Vectors class that lets you avoid this flipping of tuple elements when using 
zipWithIndex.


 Vectors factors method for sparse vectors should accept the output of 
 zipWithIndex
 --

 Key: SPARK-7194
 URL: https://issues.apache.org/jira/browse/SPARK-7194
 Project: Spark
  Issue Type: Improvement
Reporter: Juliet Hougland

 Let's say we have an RDD of Array[Double] where zero values are explicitly 
 recorded, i.e. (0.0, 0.0, 3.2, 0.0, ...). If we want to transform this into an RDD 
 of sparse vectors, we currently have to:
 arr_doubles.map { array =>
   val indexElem: Seq[(Int, Double)] =
     array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
   Vectors.sparse(array.length, indexElem)
 }
 Notice that there is a map step at the end of the chain to switch the order of the 
 index and the element value after .zipWithIndex. There should be a factory method 
 on the Vectors class that lets you avoid this flipping of tuple elements 
 when using zipWithIndex.






[jira] [Created] (SPARK-6938) Add informative error messages to require statements.

2015-04-15 Thread Juliet Hougland (JIRA)
Juliet Hougland created SPARK-6938:
--

 Summary: Add informative error messages to require statements.
 Key: SPARK-6938
 URL: https://issues.apache.org/jira/browse/SPARK-6938
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Juliet Hougland
Priority: Trivial


In the Vectors class there are multiple require statements that do not include 
any message when the requirement fails. These should instead provide an 
informative error message.






[jira] [Closed] (SPARK-5459) The reference of combineByKey in the programming guide should be replaced by aggregateByKey

2015-01-28 Thread Juliet Hougland (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliet Hougland closed SPARK-5459.
--
Resolution: Duplicate

 The reference of combineByKey in the programming guide should be replaced 
 by aggregateByKey
 ---

 Key: SPARK-5459
 URL: https://issues.apache.org/jira/browse/SPARK-5459
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Juliet Hougland
Priority: Minor

 The Spark programming guide references combineByKey in a note about 
 groupByKey in the transformations section. This should be replaced with a 
 reference to aggregateByKey.






[jira] [Created] (SPARK-5459) The reference of combineByKey in the programming guide should be replaced by aggregateByKey

2015-01-28 Thread Juliet Hougland (JIRA)
Juliet Hougland created SPARK-5459:
--

 Summary: The reference of combineByKey in the programming guide 
should be replaced by aggregateByKey
 Key: SPARK-5459
 URL: https://issues.apache.org/jira/browse/SPARK-5459
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Juliet Hougland
Priority: Minor


The Spark programming guide references combineByKey in a note about groupByKey 
in the transformations section. This should be replaced with a reference to 
aggregateByKey.
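
For context, a hedged PySpark illustration of the suggested alternative (it assumes an 
existing SparkContext named sc; the sample data is made up): aggregateByKey combines 
values map-side rather than shuffling every value per key the way groupByKey does:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# Per-key sum without materializing the full group for each key.
sums = pairs.aggregateByKey(0,
                            lambda acc, v: acc + v,  # merge a value into the partition-local accumulator
                            lambda a, b: a + b)      # merge accumulators across partitions
print(sorted(sums.collect()))  # [('a', 4), ('b', 6)]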






[jira] [Created] (SPARK-5442) Docs claim users must explicitly depend on a hadoop client, but it is not actually required.

2015-01-27 Thread Juliet Hougland (JIRA)
Juliet Hougland created SPARK-5442:
--

 Summary: Docs claim users must explicitly depend on a hadoop 
client, but it is not actually required.
 Key: SPARK-5442
 URL: https://issues.apache.org/jira/browse/SPARK-5442
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Juliet Hougland
Priority: Trivial


In the "Linking with Spark" section, the docs claim that users need to 
explicitly depend on a Hadoop client in order to interact with HDFS. This is 
not true, and the claim should be removed.






[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator

2014-10-24 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14183301#comment-14183301
 ] 

Juliet Hougland commented on SPARK-3369:


The guarantee of semantic versioning is that all releases within a major version are 
binary compatible. If we were to change the method to return a different type, we 
could no longer run programs written against previous versions on the current 
version, which would require calling the version this change appears in Spark 2.0. 
The general rule is that you can add to an API and remain compatible, but you cannot 
remove.

I agree that expanding that API with methods that accept a bunch of 
FlatMapFunction2s would be ugly. I think the upside is that it is incredibly 
transparent to end users. I like that it allows an explicit deprecation and 
suggests an immediate alternative.

 Java mapPartitions Iterator->Iterable is inconsistent with Scala's 
 Iterator->Iterator
 -

 Key: SPARK-3369
 URL: https://issues.apache.org/jira/browse/SPARK-3369
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Affects Versions: 1.0.2
Reporter: Sean Owen
Priority: Critical
  Labels: breaking_change
 Attachments: FlatMapIterator.patch


 {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
 {{Iterator}} to an {{Iterator}}: 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
 In the Java RDD API, the equivalent is a FlatMapFunction, which operates on 
 an {{Iterator}} but is required to return an {{Iterable}}, which is a 
 stronger condition and appears inconsistent. It's a problematic inconsistency, 
 though, because this seems to require copying all of the input into memory in 
 order to create an object that can be iterated many times, since the input 
 does not afford this itself.
 Similarly for other {{mapPartitions*}} methods and other 
 {{*FlatMapFunction}}s in Java.
 (Is there a reason for this difference that I'm overlooking?)
 If I'm right that this was inadvertent inconsistency, then the big issue here 
 is that of course this is part of a public API. Workarounds I can think of:
 Promise that Spark will only call {{iterator()}} once, so implementors can 
 use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
 Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
 desired signature, and deprecate existing ones.


