[jira] [Commented] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2016-10-25 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606484#comment-15606484
 ] 

Nicholas Chammas commented on SPARK-18084:
--

cc [~marmbrus] - Dunno if this is actually a bug or just an unsupported or 
inappropriate use case.
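
One possible workaround while this stays open (an illustrative sketch, not an official fix): promote the nested field to a top-level column and partition on that. Parquet is used here because the text sink expects exactly one remaining string column.

{code}
from pyspark.sql.functions import col

# Hypothetical workaround sketch: copy the nested field a.b into a top-level
# column 'b' and partition on it; df and '/tmp/test' come from the repro below.
flat = df.withColumn('b', col('a.b'))
flat.write.partitionBy('b').parquet('/tmp/test')
{code}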

> write.partitionBy() does not recognize nested columns that select() can access
> --
>
> Key: SPARK-18084
> URL: https://issues.apache.org/jira/browse/SPARK-18084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a simple repro in the PySpark shell:
> {code}
> from pyspark.sql import Row
> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> df = spark.createDataFrame(rdd)
> df.printSchema()
> df.select('a.b').show()  # works
> df.write.partitionBy('a.b').text('/tmp/test')  # doesn't work
> {code}
> Here's what I see when I run this:
> {code}
> >>> from pyspark.sql import Row
> >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> >>> df = spark.createDataFrame(rdd)
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  |    |-- b: long (nullable = true)
> >>> df.show()
> +---+
> |  a|
> +---+
> |[5]|
> +---+
> >>> df.select('a.b').show()
> +---+
> |  b|
> +---+
> |  5|
> +---+
> >>> df.write.partitionBy('a.b').text('/tmp/test')
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o233.text.
> : org.apache.spark.sql.AnalysisException: Partition column a.b not found in 
> schema 
> StructType(StructField(a,StructType(StructField(b,LongType,true)),true));
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> 

[jira] [Updated] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2016-10-24 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-18084:
-
Issue Type: Bug  (was: Improvement)

> write.partitionBy() does not recognize nested columns that select() can access
> --
>
> Key: SPARK-18084
> URL: https://issues.apache.org/jira/browse/SPARK-18084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a simple repro in the PySpark shell:
> {code}
> from pyspark.sql import Row
> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> df = spark.createDataFrame(rdd)
> df.printSchema()
> df.select('a.b').show()  # works
> df.write.partitionBy('a.b').text('/tmp/test')  # doesn't work
> {code}
> Here's what I see when I run this:
> {code}
> >>> from pyspark.sql import Row
> >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> >>> df = spark.createDataFrame(rdd)
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  |    |-- b: long (nullable = true)
> >>> df.show()
> +---+
> |  a|
> +---+
> |[5]|
> +---+
> >>> df.select('a.b').show()
> +---+
> |  b|
> +---+
> |  5|
> +---+
> >>> df.write.partitionBy('a.b').text('/tmp/test')
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o233.text.
> : org.apache.spark.sql.AnalysisException: Partition column a.b not found in 
> schema 
> StructType(StructField(a,StructType(StructField(b,LongType,true)),true));
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/readwriter.py",
>  line 656, in text
> self._jwrite.text(path)
>   File 
> 

[jira] [Created] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2016-10-24 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-18084:


 Summary: write.partitionBy() does not recognize nested columns 
that select() can access
 Key: SPARK-18084
 URL: https://issues.apache.org/jira/browse/SPARK-18084
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
Reporter: Nicholas Chammas
Priority: Minor


Here's a simple repro in the PySpark shell:

{code}
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
df = spark.createDataFrame(rdd)
df.printSchema()
df.select('a.b').show()  # works
df.write.partitionBy('a.b').text('/tmp/test')  # doesn't work
{code}

Here's what I see when I run this:

{code}
>>> from pyspark.sql import Row
>>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
>>> df = spark.createDataFrame(rdd)
>>> df.printSchema()
root
 |-- a: struct (nullable = true)
 |    |-- b: long (nullable = true)

>>> df.show()
+---+
|  a|
+---+
|[5]|
+---+

>>> df.select('a.b').show()
+---+
|  b|
+---+
|  5|
+---+

>>> df.write.partitionBy('a.b').text('/tmp/test')
Traceback (most recent call last):
  File 
"/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", 
line 63, in deco
return f(*a, **kw)
  File 
"/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o233.text.
: org.apache.spark.sql.AnalysisException: Partition column a.b not found in 
schema StructType(StructField(a,StructType(StructField(b,LongType,true)),true));
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367)
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366)
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349)
at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in 
  File 
"/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/readwriter.py",
 line 656, in text
self._jwrite.text(path)
  File 
"/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py",
 line 1133, in __call__
  File 
"/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", 
line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Partition column a.b not 

[jira] [Commented] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads

2016-10-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603211#comment-15603211
 ] 

Nicholas Chammas commented on SPARK-12757:
--

Just to link back, [~josephkb] is reporting that [this GraphFrames 
issue|https://github.com/graphframes/graphframes/issues/116] may be related to 
the work done here.

> Use reference counting to prevent blocks from being evicted during reads
> 
>
> Key: SPARK-12757
> URL: https://issues.apache.org/jira/browse/SPARK-12757
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> As a pre-requisite to off-heap caching of blocks, we need a mechanism to 
> prevent pages / blocks from being evicted while they are being read. With 
> on-heap objects, evicting a block while it is being read merely leads to 
> memory-accounting problems (because we assume that an evicted block is a 
> candidate for garbage-collection, which will not be true during a read), but 
> with off-heap memory this will lead to either data corruption or segmentation 
> faults.
> To address this, we should add a reference-counting mechanism to track which 
> blocks/pages are being read in order to prevent them from being evicted 
> prematurely. I propose to do this in two phases: first, add a safe, 
> conservative approach in which all BlockManager.get*() calls implicitly 
> increment the reference count of blocks and where tasks' references are 
> automatically freed upon task completion. This will be correct but may have 
> adverse performance impacts because it will prevent legitimate block 
> evictions. In phase two, we should incrementally add release() calls in order 
> to fix the eviction of unreferenced blocks. The latter change may need to 
> touch many different components, which is why I propose to do it separately 
> in order to make the changes easier to reason about and review.
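
As a purely illustrative Python sketch of the mechanism described above (the real implementation is in Spark's Scala BlockManager; the names and API here are made up):

{code}
class ToyBlockManager:
    def __init__(self):
        self._blocks = {}  # block_id -> data
        self._pins = {}    # block_id -> count of in-flight readers

    def get(self, block_id):
        # Phase one: every get() implicitly increments the reference count.
        self._pins[block_id] = self._pins.get(block_id, 0) + 1
        return self._blocks[block_id]

    def release(self, block_id):
        # Phase two: callers explicitly release once their read completes.
        self._pins[block_id] -= 1

    def evict(self, block_id):
        if self._pins.get(block_id, 0) > 0:
            return False  # block still being read; refuse eviction
        self._blocks.pop(block_id, None)
        return True
{code}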






[jira] [Closed] (SPARK-17976) Global options to spark-submit should not be position-sensitive

2016-10-17 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas closed SPARK-17976.

Resolution: Not A Problem

Ah, makes perfect sense. Would have realized that myself if I had held off on 
reporting this for just a day or so. Apologies.

> Global options to spark-submit should not be position-sensitive
> ---
>
> Key: SPARK-17976
> URL: https://issues.apache.org/jira/browse/SPARK-17976
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It is maddening that this does what you expect:
> {code}
> spark-submit --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 \
> file.py 
> {code}
> whereas this doesn't because {{--packages}} is totally ignored:
> {code}
> spark-submit file.py \
> --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11
> {code}
> Ideally, global options should be valid no matter where they are specified.
> If that's too much work, then I think at the very least {{spark-submit}} 
> should display a warning that some input is being ignored. (Ideally, it 
> should error out, but that's probably not possible for 
> backwards-compatibility reasons at this point.)






[jira] [Created] (SPARK-17976) Global options to spark-submit should not be position-sensitive

2016-10-17 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-17976:


 Summary: Global options to spark-submit should not be 
position-sensitive
 Key: SPARK-17976
 URL: https://issues.apache.org/jira/browse/SPARK-17976
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 2.0.1, 2.0.0
Reporter: Nicholas Chammas
Priority: Minor


It is maddening that this does what you expect:

{code}
spark-submit --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 \
file.py 
{code}

whereas this doesn't because {{--packages}} is totally ignored:

{code}
spark-submit file.py \
--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11
{code}

Ideally, global options should be valid no matter where they are specified.

If that's too much work, then I think at the very least {{spark-submit}} should 
display a warning that some input is being ignored. (Ideally, it should error 
out, but that's probably not possible for backwards-compatibility reasons at 
this point.)






[jira] [Comment Edited] (SPARK-14742) Redirect spark-ec2 doc to new location

2016-08-31 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15452150#comment-15452150
 ] 

Nicholas Chammas edited comment on SPARK-14742 at 8/31/16 12:50 PM:


Sounds good to me. It's working now.


was (Author: nchammas):
Sounds good to me.

> Redirect spark-ec2 doc to new location
> --
>
> Key: SPARK-14742
> URL: https://issues.apache.org/jira/browse/SPARK-14742
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, EC2
>Reporter: Nicholas Chammas
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.0.0
>
>
> See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453
> We need to redirect this page
> http://spark.apache.org/docs/latest/ec2-scripts.html
> to this page
> https://github.com/amplab/spark-ec2#readme






[jira] [Commented] (SPARK-14742) Redirect spark-ec2 doc to new location

2016-08-31 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15452150#comment-15452150
 ] 

Nicholas Chammas commented on SPARK-14742:
--

Sounds good to me.

> Redirect spark-ec2 doc to new location
> --
>
> Key: SPARK-14742
> URL: https://issues.apache.org/jira/browse/SPARK-14742
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, EC2
>Reporter: Nicholas Chammas
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.0.0
>
>
> See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453
> We need to redirect this page
> http://spark.apache.org/docs/latest/ec2-scripts.html
> to this page
> https://github.com/amplab/spark-ec2#readme






[jira] [Commented] (SPARK-14742) Redirect spark-ec2 doc to new location

2016-08-30 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450626#comment-15450626
 ] 

Nicholas Chammas commented on SPARK-14742:
--

{quote}
Otherwise the only way to get to this link is if you have it bookmarked.
{quote}

Or any page that has linked to it in the past. All those links are now broken. 
That's my main concern.

> Redirect spark-ec2 doc to new location
> --
>
> Key: SPARK-14742
> URL: https://issues.apache.org/jira/browse/SPARK-14742
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, EC2
>Reporter: Nicholas Chammas
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.0.0
>
>
> See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453
> We need to redirect this page
> http://spark.apache.org/docs/latest/ec2-scripts.html
> to this page
> https://github.com/amplab/spark-ec2#readme






[jira] [Commented] (SPARK-14742) Redirect spark-ec2 doc to new location

2016-08-30 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450602#comment-15450602
 ] 

Nicholas Chammas commented on SPARK-14742:
--

http://spark.apache.org/docs/latest/ec2-scripts.html

I am seeing that this URL is not redirecting to the new spark-ec2 location on 
GitHub. [~srowen] - Can you fix that? 

I can see that we have some kind of redirect setup, but I guess it's not 
working.

https://github.com/apache/spark/blob/master/docs/ec2-scripts.md

> Redirect spark-ec2 doc to new location
> --
>
> Key: SPARK-14742
> URL: https://issues.apache.org/jira/browse/SPARK-14742
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, EC2
>Reporter: Nicholas Chammas
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.0.0
>
>
> See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453
> We need to redirect this page
> http://spark.apache.org/docs/latest/ec2-scripts.html
> to this page
> https://github.com/amplab/spark-ec2#readme






[jira] [Updated] (SPARK-17220) Upgrade Py4J to 0.10.3

2016-08-26 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-17220:
-
Component/s: PySpark

> Upgrade Py4J to 0.10.3
> --
>
> Key: SPARK-17220
> URL: https://issues.apache.org/jira/browse/SPARK-17220
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Weiqing Yang
>Priority: Minor
>
> Py4J 0.10.3 has landed. It includes some important bug fixes. For example:
> Both sides: fixed memory leak issue with ClientServer and potential deadlock 
> issue by creating a memory leak test suite. (Py4J 0.10.2)
> Both sides: added more memory leak tests and fixed a potential memory leak 
> related to listeners. (Py4J 0.10.3)
> So it's time to upgrade py4j from 0.10.1 to 0.10.3. The changelog is 
> available at https://www.py4j.org/changelog.html 






[jira] [Commented] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame

2016-08-25 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437616#comment-15437616
 ] 

Nicholas Chammas commented on SPARK-14241:
--

[~marmbrus] - Would it be tough to make this function deterministic, or somehow 
"stable"? The linked Stack Overflow question shows some pretty surprising 
behavior from an end-user perspective.

If this would be tough to change, what are some alternatives you would 
recommend?

Do you think, for example, it would be possible to make a window function that 
_is_ deterministic and does effectively the same thing? Maybe something like 
{{row_number()}}, except the {{WindowSpec}} would not need to specify any 
partitioning or ordering. (Required ordering would be the main downside of 
using {{row_number()}} instead of {{monotonically_increasing_id()}}.)
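
For reference, here is a minimal sketch of the {{row_number()}} alternative mentioned above. It assumes a DataFrame {{df}} with a column ({{'some_unique_key'}}, a made-up name) that totally orders the rows:

{code}
from pyspark.sql import Window
from pyspark.sql import functions as F

# Stable, deterministic IDs at the cost of requiring an explicit ordering.
w = Window.orderBy('some_unique_key')
df_with_id = df.withColumn('row_id', F.row_number().over(w))
{code}

Note that an unpartitioned window pulls all rows through a single partition, so this trades scalability for determinism.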

> Output of monotonically_increasing_id lacks stable relation with rows of 
> DataFrame
> --
>
> Key: SPARK-14241
> URL: https://issues.apache.org/jira/browse/SPARK-14241
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Paul Shearer
>
> If you use monotonically_increasing_id() to append a column of IDs to a 
> DataFrame, the IDs do not have a stable, deterministic relationship to the 
> rows they are appended to. A given ID value can land on different rows 
> depending on what happens in the task graph:
> http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321
> From a user perspective this behavior is very unexpected, and many things one 
> would normally like to do with an ID column are in fact only possible under 
> very narrow circumstances. The function should either be made deterministic, 
> or there should be a prominent warning note in the API docs regarding its 
> behavior.






[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428988#comment-15428988
 ] 

Nicholas Chammas commented on SPARK-17025:
--

{quote}
We'd need to figure out a good design for this, especially since it will 
require a Python process to be started up for what might otherwise be pure 
Scala applications.
{quote}

Ah, I guess that's because a Transformer using some Python code may be persisted and 
then loaded into a Scala application, right? Sounds hairy. 

Anyway, thanks for chiming in Joseph. I'll watch the linked issue.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).






[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788
 ] 

Nicholas Chammas edited comment on SPARK-17025 at 8/11/16 7:33 PM:
---

cc [~josephkb], [~mengxr]

I guess a first step would be to add a {{_to_java}} method to the base 
{{Transformer}} class that simply raises {{NotImplementedError}}.

Ultimately though, is there a way to have the base class handle this work 
automatically, or do custom transformers need to each implement their own 
{{_to_java}} method?


was (Author: nchammas):
cc [~josephkb], [~mengxr]

I guess a first step be to add a {{_to_java}} method to the base Transformer 
class that simply raises {{NotImplementedError}}.

Is there a way to have the base class handle this work automatically, or do 
custom transformers need to each implement their own {{_to_java}} method?

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).






[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788
 ] 

Nicholas Chammas edited comment on SPARK-17025 at 8/11/16 7:27 PM:
---

cc [~josephkb], [~mengxr]

I guess a first step be to add a {{_to_java}} method to the base Transformer 
class that simply raises {{NotImplementedError}}.

Is there a way to have the base class handle this work automatically, or do 
custom transformers need to each implement their own {{_to_java}} method?


was (Author: nchammas):
cc [~josephkb] [~mengxr]

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).






[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788
 ] 

Nicholas Chammas commented on SPARK-17025:
--

cc [~josephkb] [~mengxr]

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).






[jira] [Created] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-11 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-17025:


 Summary: Cannot persist PySpark ML Pipeline model that includes 
custom Transformer
 Key: SPARK-17025
 URL: https://issues.apache.org/jira/browse/SPARK-17025
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 2.0.0
Reporter: Nicholas Chammas
Priority: Minor


Following the example in [this Databricks blog 
post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
 under "Python tuning", I'm trying to save an ML Pipeline model.

This pipeline, however, includes a custom transformer. When I try to save the 
model, the operation fails because the custom transformer doesn't have a 
{{_to_java}} attribute.

{code}
Traceback (most recent call last):
  File ".../file.py", line 56, in 
model.bestModel.save('model')
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
 line 222, in save
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
 line 217, in write
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
 line 93, in __init__
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
 line 254, in _to_java
AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
{code}

Looking at the source code for 
[ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
 I see that not even the base Transformer class has such an attribute.

I'm assuming this is missing functionality that is intended to be patched up 
(i.e. [like 
this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).

I'm not sure if there is an existing JIRA for this (my searches didn't turn up 
clear results).
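
To make the gap concrete, here is a rough sketch (reusing the {{PeoplePairFeaturizer}} name from the traceback; the body is made up) of a custom transformer that at least fails with a clear message instead of an {{AttributeError}}:

{code}
from pyspark.ml import Transformer

class PeoplePairFeaturizer(Transformer):
    def _transform(self, dataset):
        return dataset  # placeholder transform logic for the sketch

    def _to_java(self):
        # Hypothetical stub: there is no Java counterpart to wrap, so persisting
        # a pipeline that contains this transformer cannot succeed.
        raise NotImplementedError(
            "Python-only Transformers have no _to_java conversion.")
{code}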






[jira] [Commented] (SPARK-16921) RDD/DataFrame persist() and cache() should return Python context managers

2016-08-09 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414067#comment-15414067
 ] 

Nicholas Chammas commented on SPARK-16921:
--

[~holdenk] - Probably won't be able to do it myself for some time, so feel free 
to take that crack at it and just ping me as a reviewer. 

> RDD/DataFrame persist() and cache() should return Python context managers
> -
>
> Key: SPARK-16921
> URL: https://issues.apache.org/jira/browse/SPARK-16921
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core, SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> [Context 
> managers|https://docs.python.org/3/reference/datamodel.html#context-managers] 
> are a natural way to capture closely related setup and teardown code in 
> Python.
> For example, they are commonly used when doing file I/O:
> {code}
> with open('/path/to/file') as f:
> contents = f.read()
> ...
> {code}
> Once the program exits the with block, {{f}} is automatically closed.
> I think it makes sense to apply this pattern to persisting and unpersisting 
> DataFrames and RDDs. There are many cases when you want to persist a 
> DataFrame for a specific set of operations and then unpersist it immediately 
> afterwards.
> For example, take model training. Today, you might do something like this:
> {code}
> labeled_data.persist()
> model = pipeline.fit(labeled_data)
> labeled_data.unpersist()
> {code}
> If {{persist()}} returned a context manager, you could rewrite this as 
> follows:
> {code}
> with labeled_data.persist():
> model = pipeline.fit(labeled_data)
> {code}
> Upon exiting the {{with}} block, {{labeled_data}} would automatically be 
> unpersisted.
> This can be done in a backwards-compatible way since {{persist()}} would 
> still return the parent DataFrame or RDD as it does today, but add two 
> methods to the object: {{\_\_enter\_\_()}} and {{\_\_exit\_\_()}}






[jira] [Created] (SPARK-16921) RDD/DataFrame persist() and cache() should return Python context managers

2016-08-05 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-16921:


 Summary: RDD/DataFrame persist() and cache() should return Python 
context managers
 Key: SPARK-16921
 URL: https://issues.apache.org/jira/browse/SPARK-16921
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core, SQL
Reporter: Nicholas Chammas
Priority: Minor


[Context 
managers|https://docs.python.org/3/reference/datamodel.html#context-managers] 
are a natural way to capture closely related setup and teardown code in Python.

For example, they are commonly used when doing file I/O:

{code}
with open('/path/to/file') as f:
contents = f.read()
...
{code}

Once the program exits the with block, {{f}} is automatically closed.

I think it makes sense to apply this pattern to persisting and unpersisting 
DataFrames and RDDs. There are many cases when you want to persist a DataFrame 
for a specific set of operations and then unpersist it immediately afterwards.

For example, take model training. Today, you might do something like this:

{code}
labeled_data.persist()
model = pipeline.fit(labeled_data)
labeled_data.unpersist()
{code}

If {{persist()}} returned a context manager, you could rewrite this as follows:

{code}
with labeled_data.persist():
model = pipeline.fit(labeled_data)
{code}

Upon exiting the {{with}} block, {{labeled_data}} would automatically be 
unpersisted.

This can be done in a backwards-compatible way since {{persist()}} would still 
return the parent DataFrame or RDD as it does today, but add two methods to the 
object: {{\_\_enter\_\_()}} and {{\_\_exit\_\_()}}
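
Until something like this lands, a small user-side helper can approximate the proposed behavior (a sketch built on today's API; {{persisted}} is a hypothetical name, not part of PySpark):

{code}
import contextlib

@contextlib.contextmanager
def persisted(df):
    # Persist on entry, unpersist on exit, even if the block raises.
    df.persist()
    try:
        yield df
    finally:
        df.unpersist()

# Usage, mirroring the model-training example above:
# with persisted(labeled_data) as cached:
#     model = pipeline.fit(cached)
{code}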






[jira] [Closed] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.

2016-08-05 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas closed SPARK-7505.
---
Resolution: Invalid

Closing this as invalid as I believe these issues are no longer important.

> Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, 
> etc.
> 
>
> Key: SPARK-7505
> URL: https://issues.apache.org/jira/browse/SPARK-7505
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark, SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The PySpark docs for DataFrame need the following fixes and improvements:
> # Per [SPARK-7035], we should encourage the use of {{\_\_getitem\_\_}} over 
> {{\_\_getattr\_\_}} and change all our examples accordingly.
> # *We should say clearly that the API is experimental.* (That is currently 
> not the case for the PySpark docs.)
> # We should provide an example of how to join and select from 2 DataFrames 
> that have identically named columns, because it is not obvious:
>   {code}
> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
> >>> df12.select(df1['a'], df2['other']).show()
> a other   
> 
> 4 I dunno  {code}
> # 
> [{{DF.orderBy}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy]
>  and 
> [{{DF.sort}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort]
>  should be marked as aliases if that's what they are.






[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs

2016-08-05 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409767#comment-15409767
 ] 

Nicholas Chammas commented on SPARK-5312:
-

[~boyork] - Shall we close this? It doesn't look like it has any momentum.

> Use sbt to detect new or changed public classes in PRs
> --
>
> Key: SPARK-5312
> URL: https://issues.apache.org/jira/browse/SPARK-5312
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We currently use an [unwieldy grep/sed 
> contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
>  to detect new public classes in PRs.
> -Apparently, sbt lets you get a list of public classes [much more 
> directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
> {{show compile:discoveredMainClasses}}. We should use that instead.-
> There is a tool called [ClassUtil|http://software.clapper.org/classutil/] 
> that seems to help give this kind of information much more directly. We 
> should look into using that.






[jira] [Comment Edited] (SPARK-7146) Should ML sharedParams be a public API?

2016-08-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405300#comment-15405300
 ] 

Nicholas Chammas edited comment on SPARK-7146 at 8/3/16 4:45 AM:
-

A quick update from a PySpark user: I am using HasInputCol, HasInputCols, 
HasLabelCol, and HasOutputCol to create custom transformers, and I find them 
very handy.

I know Python does not have a notion of "private" classes, but knowing these 
are part of the public API would be good.

In summary: The updated proposal looks good to me, with the caveat that I only 
just started learning the new ML Pipeline API.
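
For context, a rough sketch of that usage from the Python side (the class and its logic are made up; assumes the {{pyspark.ml.param.shared}} mixins as they exist in 2.0):

{code}
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol

class ColumnCopier(Transformer, HasInputCol, HasOutputCol):
    """Toy custom transformer that reuses the shared Param mixins."""

    def __init__(self, inputCol, outputCol):
        super(ColumnCopier, self).__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, dataset):
        # Copy the input column to the output column, using the getters
        # provided by the shared Param traits.
        return dataset.withColumn(self.getOutputCol(),
                                  dataset[self.getInputCol()])
{code}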


was (Author: nchammas):
A quick update from a PySpark user: I am using HasInputCol, HasInputCols, 
HasLabelCol, and HasOutputCol to create custom transformers, and I find them 
very handy.

I know Python does not have a notion of "private" classes, but knowing these 
are part of the public API would be good.

I summary: The updated proposal looks good to me, with the caveat that I only 
just started learning the new ML Pipeline API.

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Proposal: Make most of the Param traits in sharedParams.scala public.  Mark 
> them as DeveloperApi.
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)






[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?

2016-08-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405300#comment-15405300
 ] 

Nicholas Chammas commented on SPARK-7146:
-

A quick update from a PySpark user: I am using HasInputCol, HasInputCols, 
HasLabelCol, and HasOutputCol to create custom transformers, and I find them 
very handy.

I know Python does not have a notion of "private" classes, but knowing these 
are part of the public API would be good.

I summary: The updated proposal looks good to me, with the caveat that I only 
just started learning the new ML Pipeline API.

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Proposal: Make most of the Param traits in sharedParams.scala public.  Mark 
> them as DeveloperApi.
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)






[jira] [Commented] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings

2016-08-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402515#comment-15402515
 ] 

Nicholas Chammas commented on SPARK-16782:
--

Poking around a bit more, it seems like a possible approach would be to 
implement a new decorator that duplicates an existing docstring, optionally 
with some modifications. Not sure if that's something we would be interested in 
doing.
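
A minimal sketch of what such a decorator could look like ({{copy_docstring}} is a hypothetical name, not an existing Spark or Sphinx utility):

{code}
def copy_docstring(source, note=None):
    """Copy source's docstring onto the decorated function, optionally appending a note."""
    def decorator(func):
        doc = source.__doc__ or ""
        func.__doc__ = doc + ("\n\n" + note if note else "")
        return func
    return decorator

# Usage sketch: share SQLContext.createDataFrame's docstring with another method.
# @copy_docstring(SQLContext.createDataFrame, note="Alias of SQLContext.createDataFrame.")
# def createDataFrame(self, data, schema=None):
#     ...
{code}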

> Use Sphinx autodoc to eliminate duplication of Python docstrings
> 
>
> Key: SPARK-16782
> URL: https://issues.apache.org/jira/browse/SPARK-16782
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In several cases it appears that we duplicate docstrings for methods that are 
> exposed in multiple places.
> For example, here are two instances of {{createDataFrame}} with identical 
> docstrings:
> * 
> https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218
> * 
> https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406
> I believe we can eliminate this duplication by using the [Sphinx autodoc 
> extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html].






[jira] [Commented] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings

2016-08-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402477#comment-15402477
 ] 

Nicholas Chammas commented on SPARK-16782:
--

Hmm never mind. I think I've misunderstood the purpose of autodoc; it doesn't 
seem like we can use it to eliminate docstring duplication.

> Use Sphinx autodoc to eliminate duplication of Python docstrings
> 
>
> Key: SPARK-16782
> URL: https://issues.apache.org/jira/browse/SPARK-16782
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In several cases it appears that we duplicate docstrings for methods that are 
> exposed in multiple places.
> For example, here are two instances of {{createDataFrame}} with identical 
> docstrings:
> * 
> https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218
> * 
> https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406
> I believe we can eliminate this duplication by using the [Sphinx autodoc 
> extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html].






[jira] [Closed] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings

2016-08-01 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas closed SPARK-16782.

Resolution: Invalid

> Use Sphinx autodoc to eliminate duplication of Python docstrings
> 
>
> Key: SPARK-16782
> URL: https://issues.apache.org/jira/browse/SPARK-16782
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In several cases it appears that we duplicate docstrings for methods that are 
> exposed in multiple places.
> For example, here are two instances of {{createDataFrame}} with identical 
> docstrings:
> * 
> https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218
> * 
> https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406
> I believe we can eliminate this duplication by using the [Sphinx autodoc 
> extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html].






[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs

2016-07-31 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401198#comment-15401198
 ] 

Nicholas Chammas commented on SPARK-12157:
--

OK. I've raised the issue of documenting this in PySpark here: 
https://issues.apache.org/jira/browse/SPARK-16824

> Support numpy types as return values of Python UDFs
> ---
>
> Key: SPARK-12157
> URL: https://issues.apache.org/jira/browse/SPARK-12157
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
>Reporter: Justin Uang
>
> Currently, if I have a python UDF
> {code}
> import pyspark.sql.types as T
> import pyspark.sql.functions as F
> from pyspark.sql import Row
> import numpy as np
> argmax = F.udf(lambda x: np.argmax(x), T.IntegerType())
> df = sqlContext.createDataFrame([Row(array=[1,2,3])])
> df.select(argmax("array")).count()
> {code}
> I get an exception that is fairly opaque:
> {code}
> Caused by: net.razorvine.pickle.PickleException: expected zero arguments for 
> construction of ClassDict (for numpy.dtype)
> at 
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403)
> {code}
> Numpy types like np.int and np.float64 should automatically be cast to the 
> proper dtypes.
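
(Editorial note, not part of the quoted report: a common workaround, not taken from this thread, is to cast the numpy scalar to a builtin Python type inside the UDF so the JVM-side unpickler never has to reconstruct a numpy.dtype.)

{code}
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Workaround sketch: int() turns np.int64 into a plain Python int, which the
# declared IntegerType() can handle.
argmax = F.udf(lambda x: int(np.argmax(x)), T.IntegerType())
{code}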






[jira] [Commented] (SPARK-16824) Add API docs for VectorUDT

2016-07-31 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401197#comment-15401197
 ] 

Nicholas Chammas commented on SPARK-16824:
--

cc [~josephkb] [~mengxr] - Should this type be publicly documented in PySpark?

> Add API docs for VectorUDT
> --
>
> Key: SPARK-16824
> URL: https://issues.apache.org/jira/browse/SPARK-16824
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following on the [discussion 
> here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153],
>  it appears that {{VectorUDT}} is missing documentation, at least in PySpark. 
> I'm not sure if this is intentional or not.






[jira] [Created] (SPARK-16824) Add API docs for VectorUDT

2016-07-31 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-16824:


 Summary: Add API docs for VectorUDT
 Key: SPARK-16824
 URL: https://issues.apache.org/jira/browse/SPARK-16824
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 2.0.0
Reporter: Nicholas Chammas
Priority: Minor


Following on the [discussion 
here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153],
 it appears that {{VectorUDT}} is missing documentation, at least in PySpark. 
I'm not sure if this is intentional or not.






[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs

2016-07-31 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401179#comment-15401179
 ] 

Nicholas Chammas commented on SPARK-12157:
--

Thanks for the pointer, Maciej. It appears that {{VectorUDT}} is 
[undocumented|http://spark.apache.org/docs/latest/api/python/search.html?q=VectorUDT_keywords=yes=default].
 Do you know if that is intentional?

> Support numpy types as return values of Python UDFs
> ---
>
> Key: SPARK-12157
> URL: https://issues.apache.org/jira/browse/SPARK-12157
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
>Reporter: Justin Uang
>
> Currently, if I have a python UDF
> {code}
> import pyspark.sql.types as T
> import pyspark.sql.functions as F
> from pyspark.sql import Row
> import numpy as np
> argmax = F.udf(lambda x: np.argmax(x), T.IntegerType())
> df = sqlContext.createDataFrame([Row(array=[1,2,3])])
> df.select(argmax("array")).count()
> {code}
> I get an exception that is fairly opaque:
> {code}
> Caused by: net.razorvine.pickle.PickleException: expected zero arguments for 
> construction of ClassDict (for numpy.dtype)
> at 
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403)
> {code}
> Numpy types like np.int and np.float64 should automatically be cast to the 
> proper dtypes.






[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs

2016-07-29 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399743#comment-15399743
 ] 

Nicholas Chammas commented on SPARK-12157:
--

It appears that it's not possible to have a UDF that returns a {{Vector}}.

For example, consider this UDF:

{code}
featurize_udf = udf(
lambda person1, person2: featurize(person1, person2),
ArrayType(elementType=FloatType(), containsNull=False)
)
{code}

{{featurize()}} returns a {{DenseVector}}, which I understand is a wrapper for 
some numpy array type.

Trying to use this UDF on a DataFrame yields:

{code}
Traceback (most recent call last):
  File ".../thing.py", line 134, in 
.alias('pair_features'))
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
 line 310, in collect
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
 line 933, in __call__
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
 line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
o94.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in 
stage 3.0 failed 1 times, most recent failure: Lost task 21.0 in stage 3.0 (TID 
34, localhost): net.razorvine.pickle.PickleException: expected zero arguments 
for construction of ClassDict (for pyspark.ml.linalg.DenseVector)
at 
net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at 
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$5.apply(BatchEvalPythonExec.scala:137)
at 
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$5.apply(BatchEvalPythonExec.scala:136)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
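
(Editorial aside, a workaround sketch rather than anything from this thread: if downstream code is happy with an array column, the {{DenseVector}} returned by {{featurize()}} (the poster's own function, not defined here) can be converted to plain Python floats before it leaves the UDF, so the value matches the declared {{ArrayType(FloatType())}}.)

{code}
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

# featurize(person1, person2) is assumed to return a DenseVector;
# toArray().tolist() yields builtin Python floats, which the declared
# ArrayType(FloatType()) can serialize without custom pickling.
featurize_udf = udf(
    lambda person1, person2: featurize(person1, person2).toArray().tolist(),
    ArrayType(elementType=FloatType(), containsNull=False),
)
{code}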

> Support numpy types as return values of Python UDFs
> ---
>
>   

[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs

2016-07-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398267#comment-15398267
 ] 

Nicholas Chammas commented on SPARK-12157:
--

I'm looking to define a UDF in PySpark that returns a 
{{pyspark.ml.linalg.Vector}}. Since {{Vector}} is a wrapper for numpy types, I 
believe this issue covers what I'm looking for.

My use case is that I want a UDF that takes in several DataFrame columns and 
extracts/computes features, returning them as a new {{Vector}} column. I 
believe {{VectorAssembler}} is for when you already have the features and you 
just want them put in a {{Vector}}.

[~josephkb] [~zjffdu] So is it possible to do that today? Have I misunderstood 
how to approach my use case?
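
(Editorial aside, not from the thread: one commonly suggested pattern for this, assuming {{VectorUDT}} from Spark 2.0's {{pyspark.ml.linalg}} can be imported even though it is undocumented, is to declare the UDF's return type as {{VectorUDT()}}. A rough sketch, with {{compute_features}} standing in for the actual feature-extraction logic and {{df}} for the input DataFrame:)

{code}
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT


def compute_features(age, income):
    # Stand-in for whatever per-row feature computation is needed.
    return Vectors.dense([float(age), float(income)])


features_udf = udf(compute_features, VectorUDT())

# The result is a proper ml Vector column, usable by downstream ML stages.
df_with_features = df.withColumn("features", features_udf("age", "income"))
{code}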

> Support numpy types as return values of Python UDFs
> ---
>
> Key: SPARK-12157
> URL: https://issues.apache.org/jira/browse/SPARK-12157
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
>Reporter: Justin Uang
>
> Currently, if I have a python UDF
> {code}
> import pyspark.sql.types as T
> import pyspark.sql.functions as F
> from pyspark.sql import Row
> import numpy as np
> argmax = F.udf(lambda x: np.argmax(x), T.IntegerType())
> df = sqlContext.createDataFrame([Row(array=[1,2,3])])
> df.select(argmax("array")).count()
> {code}
> I get an exception that is fairly opaque:
> {code}
> Caused by: net.razorvine.pickle.PickleException: expected zero arguments for 
> construction of ClassDict (for numpy.dtype)
> at 
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403)
> {code}
> Numpy types like np.int and np.float64 should automatically be cast to the 
> proper dtypes.






[jira] [Commented] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings

2016-07-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398059#comment-15398059
 ] 

Nicholas Chammas commented on SPARK-16782:
--

[~davies] [~joshrosen] - I can take this on if the idea sounds good to you.

One thing I will have to look into is how the doctests show up, since we don't 
want those to be identical (or do we?) due to the different invocation methods. 
I'm not sure if autodoc will make that possible.

> Use Sphinx autodoc to eliminate duplication of Python docstrings
> 
>
> Key: SPARK-16782
> URL: https://issues.apache.org/jira/browse/SPARK-16782
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In several cases it appears that we duplicate docstrings for methods that are 
> exposed in multiple places.
> For example, here are two instances of {{createDataFrame}} with identical 
> docstrings:
> * 
> https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218
> * 
> https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406
> I believe we can eliminate this duplication by using the [Sphinx autodoc 
> extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html].






[jira] [Created] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings

2016-07-28 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-16782:


 Summary: Use Sphinx autodoc to eliminate duplication of Python 
docstrings
 Key: SPARK-16782
 URL: https://issues.apache.org/jira/browse/SPARK-16782
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Reporter: Nicholas Chammas
Priority: Minor


In several cases it appears that we duplicate docstrings for methods that are 
exposed in multiple places.

For example, here are two instances of {{createDataFrame}} with identical 
docstrings:
* 
https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218
* 
https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406

I believe we can eliminate this duplication by using the [Sphinx autodoc 
extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html].






[jira] [Updated] (SPARK-16772) Correct API doc references to PySpark classes + formatting fixes

2016-07-28 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-16772:
-
Summary: Correct API doc references to PySpark classes + formatting fixes  
(was: Correct API doc references to PySpark classes)

> Correct API doc references to PySpark classes + formatting fixes
> 
>
> Key: SPARK-16772
> URL: https://issues.apache.org/jira/browse/SPARK-16772
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: Nicholas Chammas
>Priority: Trivial
>







[jira] [Updated] (SPARK-16772) Correct API doc references to PySpark classes

2016-07-28 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-16772:
-
Summary: Correct API doc references to PySpark classes  (was: Correct API 
doc references to DataType + other minor doc tweaks)

> Correct API doc references to PySpark classes
> -
>
> Key: SPARK-16772
> URL: https://issues.apache.org/jira/browse/SPARK-16772
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: Nicholas Chammas
>Priority: Trivial
>







[jira] [Created] (SPARK-16772) Correct API doc references to DataType + other minor doc tweaks

2016-07-28 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-16772:


 Summary: Correct API doc references to DataType + other minor doc 
tweaks
 Key: SPARK-16772
 URL: https://issues.apache.org/jira/browse/SPARK-16772
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Reporter: Nicholas Chammas
Priority: Trivial









[jira] [Commented] (SPARK-7481) Add spark-cloud module to pull in aws+azure object store FS accessors; test integration

2016-07-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389589#comment-15389589
 ] 

Nicholas Chammas commented on SPARK-7481:
-

[~ste...@apache.org] - Some relevant reading for you from the Internets about 
the trouble people currently go through to get Spark + S3A working:

http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/

> Add spark-cloud module to pull in aws+azure object store FS accessors; test 
> integration
> ---
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right and to add s3a, swift & azure, the dependencies 
> of Spark in a 2.6+ profile need to include the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure).
> This adds more stuff to the client bundle, but it means a single Spark 
> package can talk to all of the stores.






[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2016-07-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384535#comment-15384535
 ] 

Nicholas Chammas commented on SPARK-12661:
--

OK, sounds good to me.

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html






[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2016-07-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384513#comment-15384513
 ] 

Nicholas Chammas commented on SPARK-12661:
--

Yes, I mean communicating our intention to drop support in 2.1 by deprecating 
in 2.0.

It seems wrong to release 2.0 without any indication that Python 2.6 support is 
going to be dropped in 2.1. People may assume that since 2.0 was released with 
Python 2.6 support, then the 2.x line will continue that support.

Since it seems like 2.0 will have an RC5, shall we add in a brief note to the 
programming guide + 2.0 release notes about deprecating Python 2.6 support?

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html






[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2016-07-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384343#comment-15384343
 ] 

Nicholas Chammas commented on SPARK-12661:
--

To clarify what I mean by drop vs. deprecate, because I'm not sure if my 
distinction is standard:

* Deprecate means communicating to users to get off 2.6, but still keeping code 
and tests for 2.6 active.
* Drop means removing tests for 2.6 and potentially (though not necessarily) 
any code that exists only to support Python 2.6.

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html






[jira] [Comment Edited] (SPARK-12661) Drop Python 2.6 support in PySpark

2016-07-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384343#comment-15384343
 ] 

Nicholas Chammas edited comment on SPARK-12661 at 7/19/16 3:23 PM:
---

To clarify what I mean by drop vs. deprecate, because I'm not sure if my 
distinction is standard:

* Deprecate means communicating to users to get off 2.6, but still keeping code 
and tests for 2.6 active.
* Drop means removing tests for 2.6 and potentially (though not necessarily) 
also removing any code that exists only to support Python 2.6.


was (Author: nchammas):
To clarify what I mean by drop vs. deprecate, because I'm not sure if my 
distinction is standard:

* Deprecate means communicating to users to get off 2.6, but still keeping code 
and tests for 2.6 active.
* Drop means removing tests for 2.6 and potentially (though not necessarily) 
any code that exists only to support Python 2.6.

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html






[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2016-07-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384339#comment-15384339
 ] 

Nicholas Chammas commented on SPARK-12661:
--

Just double-checking on something: Is it OK to drop Python 2.6 support in a 
minor release (2.1) without officially deprecating it in a major release (2.0)?

As far as I can tell from the history on this ticket, that's our current plan, 
but it seems a bit off.

Shouldn't we make the deprecation notice with 2.0? Even just as a release note 
+ minor prose changes to the [Programming 
Guide|https://github.com/apache/spark/blob/branch-2.0/docs/programming-guide.md#linking-with-spark]
 would be sufficient, I think.

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html






[jira] [Closed] (SPARK-16427) Expand documentation on the various RDD storage levels

2016-07-11 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas closed SPARK-16427.

Resolution: Invalid

> Expand documentation on the various RDD storage levels
> --
>
> Key: SPARK-16427
> URL: https://issues.apache.org/jira/browse/SPARK-16427
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Looking at the docs here
> http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel
> A newcomer to Spark won’t understand the meaning of {{_2}}, or the meaning of 
> {{_SER}} (or its value), and won’t understand how exactly memory and disk 
> play together when something like {{MEMORY_AND_DISK}} is selected.
> We should expand this documentation to explain what the various levels mean 
> and perhaps even when a user might want to use them.






[jira] [Commented] (SPARK-16427) Expand documentation on the various RDD storage levels

2016-07-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371321#comment-15371321
 ] 

Nicholas Chammas commented on SPARK-16427:
--

Oh nevermind, this information is all available here: 
http://spark.apache.org/docs/1.6.2/programming-guide.html#rdd-persistence

> Expand documentation on the various RDD storage levels
> --
>
> Key: SPARK-16427
> URL: https://issues.apache.org/jira/browse/SPARK-16427
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Looking at the docs here
> http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel
> A newcomer to Spark won’t understand the meaning of {{_2}}, or the meaning of 
> {{_SER}} (or its value), and won’t understand how exactly memory and disk 
> play together when something like {{MEMORY_AND_DISK}} is selected.
> We should expand this documentation to explain what the various levels mean 
> and perhaps even when a user might want to use them.






[jira] [Commented] (SPARK-16427) Expand documentation on the various RDD storage levels

2016-07-07 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15366711#comment-15366711
 ] 

Nicholas Chammas commented on SPARK-16427:
--

My first question about this would be, how many places do we need to update 
this information? Just the Python and Scala docs? Is there a way to write the 
docs in one place and source them to another?

> Expand documentation on the various RDD storage levels
> --
>
> Key: SPARK-16427
> URL: https://issues.apache.org/jira/browse/SPARK-16427
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Looking at the docs here
> http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel
> A newcomer to Spark won’t understand the meaning of {{_2}}, or the meaning of 
> {{_SER}} (or its value), and won’t understand how exactly memory and disk 
> play together when something like {{MEMORY_AND_DISK}} is selected.
> We should expand this documentation to explain what the various levels mean 
> and perhaps even when a user might want to use them.






[jira] [Updated] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2016-07-07 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-3181:

Component/s: (was: MLilb)
 MLlib

> Add Robust Regression Algorithm with Huber Estimator
> 
>
> Key: SPARK-3181
> URL: https://issues.apache.org/jira/browse/SPARK-3181
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Fan Jiang
>Assignee: Yanbo Liang
>  Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors have a normal distribution and 
> can behave badly when the errors are heavy-tailed. In practice we get 
> various types of data, so we need to include robust regression to employ a 
> fitting criterion that is not as vulnerable as least squares.
> In 1973, Huber introduced M-estimation for regression which stands for 
> "maximum likelihood type". The method is resistant to outliers in the 
> response variable and has been widely used.
> The new feature for MLlib will contain 3 new files
> /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
> /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
> /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
> and one new class HuberRobustGradient in 
> /main/scala/org/apache/spark/mllib/optimization/Gradient.scala
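
(Editorial note, not part of the quoted ticket: for reference, the Huber loss that such an M-estimator minimizes, with threshold \delta on the residual r, is quadratic near zero, like least squares, and linear in the tails, which is what makes it resistant to outliers.)

{code}
L_\delta(r) =
\begin{cases}
  \tfrac{1}{2} r^2,                                & |r| \le \delta \\
  \delta \left( |r| - \tfrac{1}{2} \delta \right), & |r| > \delta
\end{cases}
{code}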






[jira] [Updated] (SPARK-16156) RowMatrıx Covariance

2016-07-07 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-16156:
-
Component/s: (was: MLilb)
 MLlib

> RowMatrıx Covariance
> 
>
> Key: SPARK-16156
> URL: https://issues.apache.org/jira/browse/SPARK-16156
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
> Environment: Spark MLLIB
>Reporter: Hayri Volkan Agun
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Spark doesn't provide a good solution for computing the covariance of a RowMatrix with many 
> columns. This can be fixed by using efficient, numerically stable computations and 
> approximating the true mean.






[jira] [Updated] (SPARK-16074) Expose VectorUDT/MatrixUDT in a public API

2016-07-07 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-16074:
-
Component/s: (was: MLilb)
 MLlib

> Expose VectorUDT/MatrixUDT in a public API
> --
>
> Key: SPARK-16074
> URL: https://issues.apache.org/jira/browse/SPARK-16074
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 2.0.0
>
>
> Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself 
> is private in Spark. However, in order to let developers implement their own 
> transformers and estimators, we should expose both types in a public API to 
> simplify the implementation of transformSchema, transform, etc. Otherwise, they 
> need to get the data types using reflection.
> Note that this doesn't mean to expose VectorUDT/MatrixUDT classes. We can 
> just have a method or a static value that returns VectorUDT/MatrixUDT 
> instance with DataType as the return type. There are two ways to implement 
> this:
> 1. following DataTypes.java in SQL, so Java users don't need the extra "()".
> 2. Define DataTypes in Scala.






[jira] [Updated] (SPARK-16377) Spark MLlib: MultilayerPerceptronClassifier - error while training

2016-07-07 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-16377:
-
Component/s: (was: MLilb)
 MLlib

> Spark MLlib: MultilayerPerceptronClassifier - error while training
> --
>
> Key: SPARK-16377
> URL: https://issues.apache.org/jira/browse/SPARK-16377
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.5.2
>Reporter: Mikhail Shiryaev
>
> Hi, 
> I am trying to train model by MultilayerPerceptronClassifier. 
> It works on sample data from 
> data/mllib/sample_multiclass_classification_data.txt with 4 features, 3 
> classes and layers [4, 4, 3]. 
> But when I try to use other input files with other features and classes (from 
> here for example: 
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html) 
> then I get errors. 
> Example: 
> Input file aloi (128 features, 1000 classes, layers [128, 128, 1000]): 
> with block size = 1: 
> ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. 
> Decreasing step size to Infinity 
> ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: 
> Line search failed 
> ERROR LBFGS: Failure again! Giving up and returning. Maybe the objective is 
> just poorly behaved? 
> with default block size = 128: 
>  java.lang.ArrayIndexOutOfBoundsException 
>   at java.lang.System.arraycopy(Native Method) 
>   at 
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:629)
>  
>   at 
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:628)
>  
>at scala.collection.immutable.List.foreach(List.scala:381) 
>at 
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:628)
>  
>at 
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:624)
>  
> Even if I modify the sample_multiclass_classification_data.txt file (renaming all 
> 4th features to 5th) and run with layers [5, 5, 3], I also get the same 
> errors as for the file above. 
> To summarize: 
> I can't run training with the default block size and with more than 4 features. 
> If I set the block size to 1 then some actions do happen, but I get errors 
> from LBFGS. 
> It is reproducible with Spark 1.5.2 and with the master branch on GitHub (from 
> July 4th). 
> Has somebody already run into this behavior? 
> Is there a bug in MultilayerPerceptronClassifier, or am I using it incorrectly? 
> Thanks.






[jira] [Updated] (SPARK-16290) text type features column for classification

2016-07-07 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-16290:
-
Component/s: (was: MLilb)
 MLlib

> text type features column for classification
> 
>
> Key: SPARK-16290
> URL: https://issues.apache.org/jira/browse/SPARK-16290
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 1.6.2
>Reporter: mahendra singh
>  Labels: features
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> We should improve Spark ML and MLlib's handling of feature columns, meaning we 
> should be able to pass text-typed values as features too. 
> Suppose we have 4 feature values: 
> id, dept_name, score, result. 
> We can see that dept_name will be a text type, so Spark should handle it internally, 
> i.e. convert the text column to a numerical column. 






[jira] [Updated] (SPARK-16232) Getting error by making columns using DataFrame

2016-07-07 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-16232:
-
Component/s: (was: MLilb)
 MLlib

> Getting error by making columns using DataFrame 
> 
>
> Key: SPARK-16232
> URL: https://issues.apache.org/jira/browse/SPARK-16232
> Project: Spark
>  Issue Type: Question
>  Components: MLlib, PySpark
>Affects Versions: 1.5.1
> Environment: Winodws, ipython notebook
>Reporter: Inam Ur Rehman
>  Labels: ipython, pandas, pyspark, python
>
> I am using pyspark in ipython notebook for analysis. 
> I am following this example tutorial: 
> http://nbviewer.jupyter.org/github/bensadeghi/pyspark-churn-prediction/blob/master/churn-prediction.ipynb
> I am getting an error at this step in the 7th cell of the notebook:
> pd.DataFrame(CV_data.take(5), columns=CV_data.columns) 
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 11.0 (TID 18, localhost): org.apache.spark.api.python.PythonException: 
> Traceback (most recent call last):
> Here is the full error :
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 pd.DataFrame(CV_data.take(5), columns=CV_data.columns)
> C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\dataframe.py
>  in take(self, num)
> 303 with SCCallSiteSync(self._sc) as css:
> 304 port = 
> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(
> --> 305 self._jdf, num)
> 306 return list(_load_from_socket(port, 
> BatchedSerializer(PickleSerializer(
> 307 
> C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py
>  in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\utils.py 
> in deco(*a, **kw)
>  34 def deco(*a, **kw):
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
>  38 s = e.java_exception.toString()
> C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 11.0 (TID 18, localhost): org.apache.spark.api.python.PythonException: 
> Traceback (most recent call last):
>   File 
> "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py",
>  line 111, in main
>   File 
> "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py",
>  line 106, in process
>   File 
> "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\serializers.py",
>  line 263, in dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File 
> "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\functions.py",
>  line 1417, in 
> func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
>   File "", line 5, in 
> KeyError: False
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:397)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
>   at 
> 

[jira] [Created] (SPARK-16427) Expand documentation on the various RDD storage levels

2016-07-07 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-16427:


 Summary: Expand documentation on the various RDD storage levels
 Key: SPARK-16427
 URL: https://issues.apache.org/jira/browse/SPARK-16427
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Nicholas Chammas
Priority: Minor


Looking at the docs here

http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel

A newcomer to Spark won’t understand the meaning of {{_2}}, or the meaning of 
{{_SER}} (or its value), and won’t understand how exactly memory and disk play 
together when something like {{MEMORY_AND_DISK}} is selected.

We should expand this documentation to explain what the various levels mean and 
perhaps even when a user might want to use them.
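
For concreteness (an illustrative sketch, not proposed documentation text), here is how the levels are used from the PySpark shell, where {{sc}} already exists:

{code}
from pyspark import StorageLevel

rdd = sc.parallelize(range(1000))

# MEMORY_AND_DISK: keep partitions in memory and spill the ones that don't
# fit to disk. The `_2` suffix means each partition is replicated on two
# nodes; the `_SER` variants (e.g. MEMORY_ONLY_SER in the 1.6-era API) store
# partitions as serialized bytes, trading CPU time for space.
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
rdd.count()  # the first action materializes and caches the RDD
{code}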






[jira] [Commented] (SPARK-15760) Documentation missing for package-related config options

2016-07-07 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15366702#comment-15366702
 ] 

Nicholas Chammas commented on SPARK-15760:
--

Updating component since it seems we are using "Documentation" and not "docs".

> Documentation missing for package-related config options
> 
>
> Key: SPARK-15760
> URL: https://issues.apache.org/jira/browse/SPARK-15760
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.0.0
>
>
> There's no documentation about the config options that correlate to the 
> "--packages" (and friends) arguments of spark-submit.






[jira] [Updated] (SPARK-15441) dataset outer join seems to return incorrect result

2016-07-07 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-15441:
-
Component/s: (was: sq;)
 SQL

> dataset outer join seems to return incorrect result
> ---
>
> Key: SPARK-15441
> URL: https://issues.apache.org/jira/browse/SPARK-15441
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> See notebook
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html
> {code}
> import org.apache.spark.sql.functions
> val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS()
> val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS()
> // The last row _1 should be null, rather than (null, -1)
> left.toDF("k", "v").as[(String, Int)].alias("left")
>   .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"), 
> functions.col("left.k") === functions.col("right.k"), "right_outer")
>   .show()
> {code}
> The returned result currently is
> {code}
> +---------+-----+
> |       _1|   _2|
> +---------+-----+
> |    (a,2)|(a,x)|
> |    (a,1)|(a,x)|
> |    (b,3)|(b,y)|
> |(null,-1)|(d,z)|
> +---------+-----+
> {code}






[jira] [Updated] (SPARK-15772) Improve Scala API docs

2016-07-07 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-15772:
-
Component/s: (was: docs)

> Improve Scala API docs 
> ---
>
> Key: SPARK-15772
> URL: https://issues.apache.org/jira/browse/SPARK-15772
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: nirav patel
>
> Hi, I just found out that the Spark Python APIs are much more elaborate than 
> their Scala counterparts, e.g. 
> https://spark.apache.org/docs/1.4.1/api/python/pyspark.html?highlight=treereduce#pyspark.RDD.treeReduce
> https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=treereduce#pyspark.RDD
> There are clear explanations of parameters, and there are examples as well. I 
> think this would be a great improvement for the Scala API too. It would make the 
> API friendlier from the start.






[jira] [Updated] (SPARK-15760) Documentation missing for package-related config options

2016-07-07 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-15760:
-
Component/s: (was: docs)
 Documentation

> Documentation missing for package-related config options
> 
>
> Key: SPARK-15760
> URL: https://issues.apache.org/jira/browse/SPARK-15760
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.0.0
>
>
> There's no documentation about the config options that correlate to the 
> "--packages" (and friends) arguments of spark-submit.






[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2016-06-30 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357535#comment-15357535
 ] 

Nicholas Chammas commented on SPARK-:
-

{quote}
Python itself has no compile time type safety.
{quote}

Practically speaking, this is no longer true. You can get a decent measure of 
"compile" time type safety using recent additions to Python (both the language 
itself and the ecosystem).

Specifically, optional static type checking has been a focus in Python since 
3.5+, and according to Python's BDFL both Google and Dropbox are updating large 
parts of their codebases to use Python's new typing features. Static type 
checkers for Python like [mypy|http://mypy-lang.org/] are already in use and 
are backed by several core Python developers, including Guido van Rossum 
(Python's creator/BDFL).

So I don't think Datasets are a critical feature for PySpark just yet, and it 
will take some time for the general Python community to learn and take 
advantage of Python's new optional static typing features and tools, but I 
would keep this on the radar.
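
For a concrete (illustrative, non-Spark) example of what that looks like: type hints are ordinary Python 3.5+ syntax, and a checker such as mypy rejects ill-typed calls without running the code.

{code}
from typing import List


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


mean([1.0, 2.0, 3.0])   # fine
# mean(["a", "b"])      # flagged by mypy as an incompatible list item type
{code}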

> Dataset API on top of Catalyst/DataFrame
> 
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However, 
> it will take a few more future releases to flush everything out.






[jira] [Commented] (SPARK-11744) bin/pyspark --version doesn't return version and exit

2016-06-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347297#comment-15347297
 ] 

Nicholas Chammas commented on SPARK-11744:
--

This is not the appropriate place to ask random PySpark questions. Please post 
a question on Stack Overflow or on the Spark user list.

> bin/pyspark --version doesn't return version and exit
> -
>
> Key: SPARK-11744
> URL: https://issues.apache.org/jira/browse/SPARK-11744
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Nicholas Chammas
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 1.6.0
>
>
> {{bin/pyspark \-\-help}} offers a {{\-\-version}} option:
> {code}
> $ ./spark/bin/pyspark --help
> Usage: ./bin/pyspark [options]
> Options:
> ...
>   --version,  Print the version of current Spark
> ...
> {code}
> However, trying to get the version in this way doesn't yield the expected 
> results.
> Instead of printing the version and exiting, we get the version, a stack 
> trace, and then get dropped into a broken PySpark shell.
> {code}
> $ ./spark/bin/pyspark --version
> Python 2.7.10 (default, Aug 11 2015, 23:39:10) 
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>       /_/
> 
> Type --help for more information.
> Traceback (most recent call last):
>   File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in <module>
> sc = SparkContext(pyFiles=add_files)
>   File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
>   File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in 
> _ensure_initialized
> SparkContext._gateway = gateway or launch_gateway()
>   File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in 
> launch_gateway
> raise Exception("Java gateway process exited before sending the driver 
> its port number")
> Exception: Java gateway process exited before sending the driver its port 
> number
> >>> 
> >>> sc
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> NameError: name 'sc' is not defined
> {code}






[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2016-05-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292599#comment-15292599
 ] 

Nicholas Chammas commented on SPARK-3821:
-

You can deploy Spark on Docker today just fine. It's just that Spark itself 
does not maintain any official Dockerfiles and likely never will, since the 
project is trying to push deployment tooling outside the main project 
(which is why spark-ec2 was moved out; you will not see spark-ec2 in the 
official docs once Spark 2.0 comes out). You may be more interested in the 
Apache Bigtop project, which focuses on big data system deployment (including 
Spark) and may have Docker-specific tooling.

Mesos is a separate matter, because it's a resource manager (analogous to YARN) 
that integrates with Spark at a low level.

If you still think Spark should host and maintain an official Dockerfile and 
Docker images that are suitable for production use, please open a separate 
issue. I think the maintainers will reject it on the grounds that I have 
explained here, though. (Can't say for sure; after all I'm just a random 
contributor.)

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
> Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.






[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2016-05-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292198#comment-15292198
 ] 

Nicholas Chammas commented on SPARK-3821:
-

Not sure if there is renewed interest, but at this point this issue is outside 
the scope of the Spark project. The original impetus for this issue was to 
create AMIs for spark-ec2 in an automated fashion, and spark-ec2 has been moved 
out of the main Spark project.

spark-ec2 now lives here: https://github.com/amplab/spark-ec2

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
> Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.






[jira] [Commented] (SPARK-15072) Remove SparkSession.withHiveSupport

2016-05-16 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285112#comment-15285112
 ] 

Nicholas Chammas commented on SPARK-15072:
--

Brief note from [~yhuai] on the motivation behind this issue: 
https://github.com/apache/spark/pull/13069#issuecomment-219516577

> Remove SparkSession.withHiveSupport
> ---
>
> Key: SPARK-15072
> URL: https://issues.apache.org/jira/browse/SPARK-15072
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sandeep Singh
> Fix For: 2.0.0
>
>







[jira] [Commented] (SPARK-10899) Support JDBC pushdown for additional commands

2016-05-12 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282239#comment-15282239
 ] 

Nicholas Chammas commented on SPARK-10899:
--

Is {{COUNT}} also something that can be pushed down? If I have a DataFrame 
defined against a JDBC source and call {{df.count()}}, I think we should be able 
to push the count down into the database.

By the way, I would say {{LIMIT}} is probably the most urgent clause that needs 
to be pushed down. Right now, if you call {{df.take(1)}}, it appears that Spark 
reads the entire table, which makes interactive development and testing against 
large tables impossible.
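
For now, a common workaround (not a fix) is to push the {{LIMIT}} into the 
JDBC subquery yourself by passing a subquery as the table. A rough PySpark 
sketch, with a hypothetical {{events}} table and hypothetical connection details:

{code}
url = "jdbc:postgresql://db-host:5432/mydb"        # hypothetical URL
props = {"user": "spark", "password": "secret"}    # hypothetical credentials

# The database applies the LIMIT before Spark reads any rows.
limited = sqlContext.read.jdbc(
    url=url,
    table="(SELECT * FROM events LIMIT 100) AS tmp",  # alias required by most databases
    properties=props,
)
limited.show(1)
{code}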

> Support JDBC pushdown for additional commands
> -
>
> Key: SPARK-10899
> URL: https://issues.apache.org/jira/browse/SPARK-10899
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Victor May
>Priority: Minor
>
> Spark DataFrames currently support predicate push-down with JDBC sources, but 
> the term predicate is used in its strict SQL meaning: it covers only the WHERE 
> clause. Moreover, it looks like it is limited to logical conjunction (no 
> IN or OR, I am afraid) and simple predicates.
> This creates a situation where a simple query such as "select * from table 
> limit 100" could easily result in the database being overloaded when the 
> table is large.
> This feature request is to expand the support for push-downs to additional 
> SQL commands:
> - LIMIT
> - WHERE IN
> - WHERE NOT IN
> - GROUP BY






[jira] [Comment Edited] (SPARK-7506) pyspark.sql.types.StructType.fromJson() is incorrectly named

2016-05-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280606#comment-15280606
 ] 

Nicholas Chammas edited comment on SPARK-7506 at 5/11/16 6:51 PM:
--

[~davies] - Would you be interested in a PR that adds a new method called 
{{fromDict()}} and makes {{fromJson()}} an alias of that method? It would 
preserve compatibility, while (kinda) correcting the naming problem here.


was (Author: nchammas):
[~davies] - Would you be interested in a PR that adds new methods called 
{{toDict()}} and {{fromDict()}}, and made {{json()}} and {{fromJson()}} aliases 
of those methods? It would preserve compatibility, while correcting the naming 
problem here.

> pyspark.sql.types.StructType.fromJson() is incorrectly named
> 
>
> Key: SPARK-7506
> URL: https://issues.apache.org/jira/browse/SPARK-7506
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.3.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> {code}
> >>> json_rdd = sqlContext.jsonRDD(sc.parallelize(['{"name": "Nick"}']))
> >>> json_rdd.schema
> StructType(List(StructField(name,StringType,true)))
> >>> type(json_rdd.schema)
> <class 'pyspark.sql.types.StructType'>
> >>> json_rdd.schema.json()
> '{"fields":[{"metadata":{},"name":"name","nullable":true,"type":"string"}],"type":"struct"}'
> >>> pyspark.sql.types.StructType.fromJson(json_rdd.schema.json())
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Applications/apache-spark/spark-1.3.1-bin-hadoop2.4/python/pyspark/sql/types.py",
>  line 346, in fromJson
> return StructType([StructField.fromJson(f) for f in json["fields"]])
> TypeError: string indices must be integers, not str
> >>> import json
> >>> pyspark.sql.types.StructType.fromJson(json.loads(json_rdd.schema.json()))
> StructType(List(StructField(name,StringType,true)))
> >>>
> {code}
> So {{fromJson()}} doesn't actually expect JSON, which is a string. It expects 
> a dictionary.
> This method should probably be renamed.






[jira] [Commented] (SPARK-7506) pyspark.sql.types.StructType.fromJson() is incorrectly named

2016-05-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280606#comment-15280606
 ] 

Nicholas Chammas commented on SPARK-7506:
-

[~davies] - Would you be interested in a PR that adds new methods called 
{{toDict()}} and {{fromDict()}}, and makes {{json()}} and {{fromJson()}} aliases 
of those methods? It would preserve compatibility, while correcting the naming 
problem here.
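
Roughly what I have in mind, as an illustrative sketch only (these names are 
part of the proposal, not actual API today):

{code}
from pyspark.sql.types import StructType

# Illustrative sketch of the proposal (not actual API): dict-based names,
# with the existing JSON-named methods kept as aliases for compatibility.
def _from_dict(cls, d):
    # fromJson() already expects a dict today, so fromDict() is just an honest name.
    return cls.fromJson(d)

StructType.fromDict = classmethod(_from_dict)
StructType.toDict = StructType.jsonValue  # jsonValue() already returns the dict form
{code}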

> pyspark.sql.types.StructType.fromJson() is incorrectly named
> 
>
> Key: SPARK-7506
> URL: https://issues.apache.org/jira/browse/SPARK-7506
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.3.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> {code}
> >>> json_rdd = sqlContext.jsonRDD(sc.parallelize(['{"name": "Nick"}']))
> >>> json_rdd.schema
> StructType(List(StructField(name,StringType,true)))
> >>> type(json_rdd.schema)
> <class 'pyspark.sql.types.StructType'>
> >>> json_rdd.schema.json()
> '{"fields":[{"metadata":{},"name":"name","nullable":true,"type":"string"}],"type":"struct"}'
> >>> pyspark.sql.types.StructType.fromJson(json_rdd.schema.json())
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Applications/apache-spark/spark-1.3.1-bin-hadoop2.4/python/pyspark/sql/types.py",
>  line 346, in fromJson
> return StructType([StructField.fromJson(f) for f in json["fields"]])
> TypeError: string indices must be integers, not str
> >>> import json
> >>> pyspark.sql.types.StructType.fromJson(json.loads(json_rdd.schema.json()))
> StructType(List(StructField(name,StringType,true)))
> >>>
> {code}
> So {{fromJson()}} doesn't actually expect JSON, which is a string. It expects 
> a dictionary.
> This method should probably be renamed.






[jira] [Updated] (SPARK-15256) Clarify the docstring for DataFrameReader.jdbc()

2016-05-10 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-15256:
-
Description: 
The doc for the {{properties}} parameter [currently 
reads|https://github.com/apache/spark/blob/d37c7f7f042f7943b5b684e53cf4284c601fb347/python/pyspark/sql/readwriter.py#L437-L439]:

{quote}
:param properties: JDBC database connection arguments, a list of 
arbitrary string
   tag/value. Normally at least a "user" and "password" 
property
   should be included.
{quote}

This is incorrect, since {{properties}} is expected to be a dictionary.

Some of the other parameters have cryptic descriptions. I'll try to clarify 
those as well.

  was:
The doc for the {{properties}} parameter [currently 
reads|https://github.com/apache/spark/blob/d37c7f7f042f7943b5b684e53cf4284c601fb347/python/pyspark/sql/readwriter.py#L437-L439]:

{quote}
:param properties: JDBC database connection arguments, a list of 
arbitrary string
   tag/value. Normally at least a "user" and "password" 
property
   should be included.
{quote}

This is incorrect, since {{properties}} is expected to be a dictionary.


> Clarify the docstring for DataFrameReader.jdbc()
> 
>
> Key: SPARK-15256
> URL: https://issues.apache.org/jira/browse/SPARK-15256
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The doc for the {{properties}} parameter [currently 
> reads|https://github.com/apache/spark/blob/d37c7f7f042f7943b5b684e53cf4284c601fb347/python/pyspark/sql/readwriter.py#L437-L439]:
> {quote}
> :param properties: JDBC database connection arguments, a list of 
> arbitrary string
>tag/value. Normally at least a "user" and 
> "password" property
>should be included.
> {quote}
> This is incorrect, since {{properties}} is expected to be a dictionary.
> Some of the other parameters have cryptic descriptions. I'll try to clarify 
> those as well.






[jira] [Updated] (SPARK-15256) Clarify the docstring for DataFrameReader.jdbc()

2016-05-10 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-15256:
-
Summary: Clarify the docstring for DataFrameReader.jdbc()  (was: Correct 
the docstring for DataFrameReader.jdbc())

> Clarify the docstring for DataFrameReader.jdbc()
> 
>
> Key: SPARK-15256
> URL: https://issues.apache.org/jira/browse/SPARK-15256
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The doc for the {{properties}} parameter [currently 
> reads|https://github.com/apache/spark/blob/d37c7f7f042f7943b5b684e53cf4284c601fb347/python/pyspark/sql/readwriter.py#L437-L439]:
> {quote}
> :param properties: JDBC database connection arguments, a list of 
> arbitrary string
>tag/value. Normally at least a "user" and 
> "password" property
>should be included.
> {quote}
> This is incorrect, since {{properties}} is expected to be a dictionary.






[jira] [Created] (SPARK-15256) Correct the docstring for DataFrameReader.jdbc()

2016-05-10 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-15256:


 Summary: Correct the docstring for DataFrameReader.jdbc()
 Key: SPARK-15256
 URL: https://issues.apache.org/jira/browse/SPARK-15256
 Project: Spark
  Issue Type: Bug
  Components: Documentation, PySpark
Affects Versions: 1.6.1
Reporter: Nicholas Chammas
Priority: Minor


The doc for the {{properties}} parameter [currently 
reads|https://github.com/apache/spark/blob/d37c7f7f042f7943b5b684e53cf4284c601fb347/python/pyspark/sql/readwriter.py#L437-L439]:

{quote}
:param properties: JDBC database connection arguments, a list of 
arbitrary string
   tag/value. Normally at least a "user" and "password" 
property
   should be included.
{quote}

This is incorrect, since {{properties}} is expected to be a dictionary.
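
For example, a correct call passes a dict (hypothetical connection details):

{code}
df = sqlContext.read.jdbc(
    url="jdbc:postgresql://db-host:5432/mydb",           # hypothetical URL
    table="my_table",                                     # hypothetical table
    properties={"user": "spark", "password": "secret"},  # a dict, not a list
)
{code}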






[jira] [Comment Edited] (SPARK-15193) samplingRatio should default to 1.0 across the board

2016-05-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278393#comment-15278393
 ] 

Nicholas Chammas edited comment on SPARK-15193 at 5/10/16 4:27 PM:
---

Nope, a sampling ratio of 1.0 and None mean [different 
things|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame]:

{quote}
If schema inference is needed, samplingRatio is used to determined the ratio of 
rows used for schema inference. The first row will be used if samplingRatio is 
None.
{quote}

createDataFrame() is not consistent with jsonRDD() as of 1.6.1.

Looking in master, I can't seem to find jsonRDD() for Python anymore, so 
perhaps that got removed. So actually you're right, it looks like the default 
_did_ get standardized, but to None. :/


was (Author: nchammas):
Nope, a sampling ratio of 1.0 and None mean [different 
things|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame]:

{quote}
If schema inference is needed, samplingRatio is used to determined the ratio of 
rows used for schema inference. The first row will be used if samplingRatio is 
None.
{quote}

createDataFrame() is not consistent with jsonRDD() as of 1.6.1.

Looking in master, I can't seem to find jsonRDD() for Python anymore, so 
perhaps that got removed. So actually, it looks like the default _did_ get 
standardized, but to None. :/

> samplingRatio should default to 1.0 across the board
> 
>
> Key: SPARK-15193
> URL: https://issues.apache.org/jira/browse/SPARK-15193
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The default sampling ratio for {{jsonRDD}} is 
> [1.0|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.jsonRDD],
>  whereas for {{createDataFrame}} it's 
> [{{None}}|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame].
> I think the default sampling ratio should be 1.0 across the board. Users 
> should have to explicitly supply a lower sampling ratio if they know their 
> dataset has a consistent structure. Otherwise, I think the "safer" thing to 
> default to is to check all the data.
> Targeting this for 2.0 in case we consider it a breaking change that would be 
> more difficult to get in later.






[jira] [Commented] (SPARK-15193) samplingRatio should default to 1.0 across the board

2016-05-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278393#comment-15278393
 ] 

Nicholas Chammas commented on SPARK-15193:
--

Nope, a sampling ratio of 1.0 and None mean [different 
things|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame]:

{quote}
If schema inference is needed, samplingRatio is used to determined the ratio of 
rows used for schema inference. The first row will be used if samplingRatio is 
None.
{quote}

createDataFrame() is not consistent with jsonRDD() as of 1.6.1.

Looking in master, I can't seem to find jsonRDD() for Python anymore, so 
perhaps that got removed. So actually, it looks like the default _did_ get 
standardized, but to None. :/
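
To make the difference concrete, a small sketch with hypothetical data:

{code}
# Hypothetical rows where the first row does not show the full structure.
# (RDDs of dicts emit a deprecation warning, but they illustrate the point.)
rows = sc.parallelize([{"a": 1}, {"a": 2, "b": "x"}])

# samplingRatio=None: the schema is inferred from the first row only, so "b" is missed.
df_default = sqlContext.createDataFrame(rows)

# samplingRatio=1.0: every row is scanned, so "b" is picked up as well.
df_full = sqlContext.createDataFrame(rows, samplingRatio=1.0)
{code}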

> samplingRatio should default to 1.0 across the board
> 
>
> Key: SPARK-15193
> URL: https://issues.apache.org/jira/browse/SPARK-15193
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The default sampling ratio for {{jsonRDD}} is 
> [1.0|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.jsonRDD],
>  whereas for {{createDataFrame}} it's 
> [{{None}}|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame].
> I think the default sampling ratio should be 1.0 across the board. Users 
> should have to explicitly supply a lower sampling ratio if they know their 
> dataset has a consistent structure. Otherwise, I think the "safer" thing to 
> default to is to check all the data.
> Targeting this for 2.0 in case we consider it a breaking change that would be 
> more difficult to get in later.






[jira] [Created] (SPARK-15238) Clarify Python 3 support in docs

2016-05-09 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-15238:


 Summary: Clarify Python 3 support in docs
 Key: SPARK-15238
 URL: https://issues.apache.org/jira/browse/SPARK-15238
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Reporter: Nicholas Chammas
Priority: Trivial


The [current doc|http://spark.apache.org/docs/1.6.1/] reads:

{quote}
Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.6.1 
uses Scala 2.10. You will need to use a compatible Scala version (2.10.x).
{quote}

Projects that support Python 3 generally mention that explicitly. A casual 
Python user might assume from this line that Spark supports Python 2.6 and 2.7 
but not 3+.

More specifically, I gather from SPARK-4897 that Spark actually supports 3.4+ 
and not earlier versions of Python 3.






[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2016-05-09 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277450#comment-15277450
 ] 

Nicholas Chammas commented on SPARK-12661:
--

[~davies] / [~joshrosen] - Has this been settled on? The dev list discussion 
from January seemed to converge almost unanimously on dropping Python 2.6 
support in Spark 2.0, but I don't think an official decision was ever announced.

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html






[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2016-05-09 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277448#comment-15277448
 ] 

Nicholas Chammas commented on SPARK-12661:
--

[~shivaram] - Can you confirm that spark-ec2 will drop support for Python 2.6 
starting with Spark 2.0?

For the record, I don't think it will affect things much either way anymore 
since spark-ec2 is now a separate project.

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html






[jira] [Commented] (SPARK-15204) Nullable is not correct for Aggregator

2016-05-07 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275394#comment-15275394
 ] 

Nicholas Chammas commented on SPARK-15204:
--

Loosely related: SPARK-15191

> Nullable is not correct for Aggregator
> --
>
> Key: SPARK-15204
> URL: https://issues.apache.org/jira/browse/SPARK-15204
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark-2.0.0-SNAPSHOT
>Reporter: koert kuipers
>Priority: Minor
>
> {noformat}
> object SimpleSum extends Aggregator[Row, Int, Int] {
>   def zero: Int = 0
>   def reduce(b: Int, a: Row) = b + a.getInt(1)
>   def merge(b1: Int, b2: Int) = b1 + b2
>   def finish(b: Int) = b
>   def bufferEncoder: Encoder[Int] = Encoders.scalaInt
>   def outputEncoder: Encoder[Int] = Encoders.scalaInt
> }
> val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v")
> val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1")
> df1.printSchema
> df1.show
> root
>  |-- k: string (nullable = true)
>  |-- v1: integer (nullable = true)
> +---+---+
> |  k| v1|
> +---+---+
> |  a|  6|
> +---+---+
> {noformat}
> Notice how v1 has nullable set to true. The default (and expected) behavior 
> for Spark SQL is to give an int column nullable = false. For example, if I 
> had used a built-in aggregator like "sum" instead, it would have reported 
> nullable = false.






[jira] [Updated] (SPARK-15191) createDataFrame() should mark fields that are known not to be null as not nullable

2016-05-06 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-15191:
-
Affects Version/s: 1.6.1

> createDataFrame() should mark fields that are known not to be null as not 
> nullable
> --
>
> Key: SPARK-15191
> URL: https://issues.apache.org/jira/browse/SPARK-15191
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a brief reproduction:
> {code}
> >>> numbers = sqlContext.createDataFrame(
> ... data=[(1,), (2,), (3,), (4,), (5,)],
> ... samplingRatio=1  # go through all the data please!
> ... )
> >>> numbers.printSchema()
> root
>  |-- _1: long (nullable = true)
> {code}
> The field is marked as nullable even though none of the data is null and we 
> had {{createDataFrame()}} go through all the data.
> In situations like this, shouldn't {{createDataFrame()}} return a DataFrame 
> with the field marked as not nullable?






[jira] [Commented] (SPARK-15191) createDataFrame() should mark fields that are known not to be null as not nullable

2016-05-06 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274785#comment-15274785
 ] 

Nicholas Chammas commented on SPARK-15191:
--

[~yhuai] - This loosely relates to the discussion in SPARK-11319.

> createDataFrame() should mark fields that are known not to be null as not 
> nullable
> --
>
> Key: SPARK-15191
> URL: https://issues.apache.org/jira/browse/SPARK-15191
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a brief reproduction:
> {code}
> >>> numbers = sqlContext.createDataFrame(
> ... data=[(1,), (2,), (3,), (4,), (5,)],
> ... samplingRatio=1  # go through all the data please!
> ... )
> >>> numbers.printSchema()
> root
>  |-- _1: long (nullable = true)
> {code}
> The field is marked as nullable even though none of the data is null and we 
> had {{createDataFrame()}} go through all the data.
> In situations like this, shouldn't {{createDataFrame()}} return a DataFrame 
> with the field marked as not nullable?







[jira] [Commented] (SPARK-15193) samplingRatio should default to 1.0 across the board

2016-05-06 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274783#comment-15274783
 ] 

Nicholas Chammas commented on SPARK-15193:
--

[~yhuai] - What do you think of this proposed change?

> samplingRatio should default to 1.0 across the board
> 
>
> Key: SPARK-15193
> URL: https://issues.apache.org/jira/browse/SPARK-15193
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The default sampling ratio for {{jsonRDD}} is 
> [1.0|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.jsonRDD],
>  whereas for {{createDataFrame}} it's 
> [{{None}}|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame].
> I think the default sampling ratio should be 1.0 across the board. Users 
> should have to explicitly supply a lower sampling ratio if they know their 
> dataset has a consistent structure. Otherwise, I think the "safer" thing to 
> default to is to check all the data.
> Targeting this for 2.0 in case we consider it a breaking change that would be 
> more difficult to get in later.






[jira] [Created] (SPARK-15193) samplingRatio should default to 1.0 across the board

2016-05-06 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-15193:


 Summary: samplingRatio should default to 1.0 across the board
 Key: SPARK-15193
 URL: https://issues.apache.org/jira/browse/SPARK-15193
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Reporter: Nicholas Chammas
Priority: Minor


The default sampling ratio for {{jsonRDD}} is 
[1.0|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.jsonRDD],
 whereas for {{createDataFrame}} it's 
[{{None}}|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame].

I think the default sampling ratio should be 1.0 across the board. Users should 
have to explicitly supply a lower sampling ratio if they know their dataset has 
a consistent structure. Otherwise, I think the "safer" thing to default to is 
to check all the data.

Targeting this for 2.0 in case we consider it a breaking change that would be 
more difficult to get in later.






[jira] [Created] (SPARK-15191) createDataFrame() should mark fields that are known not to be null as not nullable

2016-05-06 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-15191:


 Summary: createDataFrame() should mark fields that are known not 
to be null as not nullable
 Key: SPARK-15191
 URL: https://issues.apache.org/jira/browse/SPARK-15191
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Reporter: Nicholas Chammas
Priority: Minor


Here's a brief reproduction:

{code}
>>> numbers = sqlContext.createDataFrame(
... data=[(1,), (2,), (3,), (4,), (5,)],
... samplingRatio=1  # go through all the data please!
... )
>>> numbers.printSchema()
root
 |-- _1: long (nullable = true)
{code}

The field is marked as nullable even though none of the data is null and we had 
{{createDataFrame()}} go through all the data.

In situations like this, shouldn't {{createDataFrame()}} return a DataFrame 
with the field marked as not nullable?
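
For comparison, the only way I know of to get nullable = false today is to 
supply an explicit schema, e.g.:

{code}
from pyspark.sql.types import StructType, StructField, LongType

schema = StructType([StructField("_1", LongType(), nullable=False)])
numbers = sqlContext.createDataFrame([(1,), (2,), (3,), (4,), (5,)], schema=schema)
numbers.printSchema()
# root
#  |-- _1: long (nullable = false)
{code}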






[jira] [Commented] (SPARK-13740) add null check for _verify_type in types.py

2016-05-06 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274738#comment-15274738
 ] 

Nicholas Chammas commented on SPARK-13740:
--

I noticed the PR only modifies PySpark. Are similar changes not required for 
the other language APIs?

> add null check for _verify_type in types.py
> ---
>
> Key: SPARK-13740
> URL: https://issues.apache.org/jira/browse/SPARK-13740
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>







[jira] [Commented] (SPARK-11319) PySpark silently accepts null values in non-nullable DataFrame fields.

2016-05-06 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274728#comment-15274728
 ] 

Nicholas Chammas commented on SPARK-11319:
--

[~marmbrus] / [~yhuai] - Does SPARK-13740 resolve the problem described here? 
If I read that issue correctly, it appears that it's no longer possible to have 
null data in a field that is marked as not nullable.

> PySpark silently accepts null values in non-nullable DataFrame fields.
> --
>
> Key: SPARK-11319
> URL: https://issues.apache.org/jira/browse/SPARK-11319
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Kevin Cox
>
> Running the following code with a null value in a non-nullable column 
> silently works. This makes the code incredibly hard to trust.
> {code}
> In [2]: from pyspark.sql.types import *
> In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", 
> TimestampType(), False)])).collect()
> Out[3]: [Row(a=None)]
> {code}






[jira] [Commented] (SPARK-14932) Allow DataFrame.replace() to replace values with None

2016-04-26 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259266#comment-15259266
 ] 

Nicholas Chammas commented on SPARK-14932:
--

[~marmbrus] - Not sure if you're a good person to ask about this, but does this 
request seem reasonable from an API standpoint?

> Allow DataFrame.replace() to replace values with None
> -
>
> Key: SPARK-14932
> URL: https://issues.apache.org/jira/browse/SPARK-14932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Current doc: 
> http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> I would like to specify {{None}} as the value to substitute in. This is 
> currently 
> [disallowed|https://github.com/apache/spark/blob/9797cc20c0b8fb34659df11af8eccb9ed293c52c/python/pyspark/sql/dataframe.py#L1144-L1145].
>  My use case is for replacing bad values with {{None}} so I can then ignore 
> them with {{dropna()}}.
> For example, I have a dataset that incorrectly includes empty strings where 
> there should be {{None}} values. I would like to replace the empty strings 
> with {{None}} and then drop all null data with {{dropna()}}.






[jira] [Created] (SPARK-14932) Allow DataFrame.replace() to replace values with None

2016-04-26 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-14932:


 Summary: Allow DataFrame.replace() to replace values with None
 Key: SPARK-14932
 URL: https://issues.apache.org/jira/browse/SPARK-14932
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Nicholas Chammas
Priority: Minor


Current doc: 
http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace

I would like to specify {{None}} as the value to substitute in. This is 
currently 
[disallowed|https://github.com/apache/spark/blob/9797cc20c0b8fb34659df11af8eccb9ed293c52c/python/pyspark/sql/dataframe.py#L1144-L1145].
 My use case is for replacing bad values with {{None}} so I can then ignore 
them with {{dropna()}}.

For example, I have a dataset that incorrectly includes empty strings where 
there should be {{None}} values. I would like to replace the empty strings with 
{{None}} and then drop all null data with {{dropna()}}.
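
In the meantime, a workaround sketch using {{when()}}/{{otherwise()}}, with a 
hypothetical column name {{col1}}:

{code}
from pyspark.sql import functions as F

df = sqlContext.createDataFrame([("a",), ("",), ("b",)], ["col1"])
cleaned = (
    df.withColumn("col1", F.when(F.col("col1") == "", None).otherwise(F.col("col1")))
      .dropna()
)
{code}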






[jira] [Created] (SPARK-14742) Redirect spark-ec2 doc to new location

2016-04-19 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-14742:


 Summary: Redirect spark-ec2 doc to new location
 Key: SPARK-14742
 URL: https://issues.apache.org/jira/browse/SPARK-14742
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, EC2
Reporter: Nicholas Chammas
Priority: Minor


See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453

We need to redirect this page

http://spark.apache.org/docs/latest/ec2-scripts.html

to this page

https://github.com/amplab/spark-ec2#readme






[jira] [Commented] (SPARK-8327) Ganglia failed to start while starting standalone on EC 2 spark with spark-ec2

2016-04-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249144#comment-15249144
 ] 

Nicholas Chammas commented on SPARK-8327:
-

[~vvladymyrov] - Is this still an issue? If so, I suggest migrating this issue 
to the spark-ec2 tracker: https://github.com/amplab/spark-ec2/issues

> Ganglia failed to start while starting standalone on EC 2 spark with 
> spark-ec2 
> ---
>
> Key: SPARK-8327
> URL: https://issues.apache.org/jira/browse/SPARK-8327
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.3.1
>Reporter: Vladimir Vladimirov
>Priority: Minor
>
> exception shown
> {code}
> [FAILED] Starting httpd: httpd: Syntax error on line 199 of 
> /etc/httpd/conf/httpd.conf: Cannot load modules/libphp-5.5.so into server: 
> /etc/httpd/modules/libphp-5.5.so: cannot open shared object file: No such 
> file or directory [FAILED] 
> {code}






[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3

2016-04-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249141#comment-15249141
 ] 

Nicholas Chammas commented on SPARK-6527:
-

Did the s3a suggestion work? If not, did anybody file an issue as Steve 
suggested with more detail?
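
For anyone landing here, the s3a suggestion amounts to roughly the following in 
PySpark; the bucket and credentials are hypothetical, and hadoop-aws must be on 
the classpath:

{code}
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")   # hypothetical credentials
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

files = sc.binaryFiles("s3a://my-bucket/images/")          # hypothetical bucket/path
print(files.count())
{code}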

> sc.binaryFiles can not access files on s3
> -
>
> Key: SPARK-6527
> URL: https://issues.apache.org/jira/browse/SPARK-6527
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Input/Output
>Affects Versions: 1.2.0, 1.3.0
> Environment: I am running Spark on EC2
>Reporter: Zhao Zhang
>Priority: Minor
>
> The sc.binaryFiles() method cannot access files stored on S3. It can correctly 
> list the number of files, but it reports "file does not exist" when processing 
> them. I also tried sc.textFile(), which works fine.






[jira] [Comment Edited] (SPARK-6527) sc.binaryFiles can not access files on s3

2016-04-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249141#comment-15249141
 ] 

Nicholas Chammas edited comment on SPARK-6527 at 4/20/16 2:27 AM:
--

Did the s3a suggestion work? If not, did anybody file an issue as Steve 
suggested?


was (Author: nchammas):
Did the s3a suggestion work? If not, did anybody file an issue as Steve 
suggested with more detail?

> sc.binaryFiles can not access files on s3
> -
>
> Key: SPARK-6527
> URL: https://issues.apache.org/jira/browse/SPARK-6527
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Input/Output
>Affects Versions: 1.2.0, 1.3.0
> Environment: I am running Spark on EC2
>Reporter: Zhao Zhang
>Priority: Minor
>
> The sc.binaryFiles() method cannot access files stored on S3. It can correctly 
> list the number of files, but it reports "file does not exist" when processing 
> them. I also tried sc.textFile(), which works fine.






[jira] [Closed] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2016-04-05 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas closed SPARK-3821.
---
Resolution: Won't Fix

I'm resolving this as "Won't Fix" due to lack of interest, both on my part and 
on the part of the Spark / spark-ec2 project maintainers.

If anyone's interested in picking this up, the code is here: 
https://github.com/nchammas/spark-ec2/tree/packer/image-build

I've mostly moved on from spark-ec2 to work on 
[Flintrock|https://github.com/nchammas/flintrock], which doesn't require custom 
AMIs.

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
> Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.






[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

2016-03-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214577#comment-15214577
 ] 

Nicholas Chammas commented on SPARK-3533:
-

I've added two workarounds for this issue to the description body.
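
For PySpark users, workaround 2 looks roughly like this (a sketch that assumes 
an RDD of strings {{rdd}}, a key function {{gen_key}}, and an output {{path}}):

{code}
from pyspark.sql import Row

# rdd, gen_key, and path are assumed to exist; see the note above.
df = rdd.map(lambda t: Row(key=gen_key(t), text=t)).toDF()
df.write.partitionBy("key").text(path)
{code}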

> Add saveAsTextFileByKey() method to RDDs
> 
>
> Key: SPARK-3533
> URL: https://issues.apache.org/jira/browse/SPARK-3533
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Nicholas Chammas
>
> Users often have a single RDD of key-value pairs that they want to save to 
> multiple locations based on the keys.
> For example, say I have an RDD like this:
> {code}
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
> >>> 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>> a.keys().distinct().collect()
> ['B', 'F', 'N']
> {code}
> Now I want to write the RDD out to different paths depending on the keys, so 
> that I have one output directory per distinct key. Each output directory 
> could potentially have multiple {{part-}} files, one per RDD partition.
> So the output would look something like:
> {code}
> /path/prefix/B [/part-1, /part-2, etc]
> /path/prefix/F [/part-1, /part-2, etc]
> /path/prefix/N [/part-1, /part-2, etc]
> {code}
> Though it may be possible to do this with some combination of 
> {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
> {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
> It's not clear if it's even possible at all in PySpark.
> Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
> that makes it easy to save RDDs out to multiple locations at once.
> ---
> Update: March 2016
> There are two workarounds to this problem:
> 1. See [this answer on Stack 
> Overflow|http://stackoverflow.com/a/26051042/877069], which implements 
> {{MultipleTextOutputFormat}}. (Scala-only)
> 2. See [this comment by Davies 
> Liu|https://github.com/apache/spark/pull/8375#issuecomment-202458325], which 
> uses DataFrames:
> {code}
> val df = rdd.map(t => Row(gen_key(t), t)).toDF("key", "text")
> df.write.partitionBy("key").text(path){code}






[jira] [Updated] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

2016-03-28 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-3533:

Description: 
Users often have a single RDD of key-value pairs that they want to save to 
multiple locations based on the keys.

For example, say I have an RDD like this:
{code}
>>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda 
>>> x: x[0])
>>> a.collect()
[('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
>>> a.keys().distinct().collect()
['B', 'F', 'N']
{code}

Now I want to write the RDD out to different paths depending on the keys, so 
that I have one output directory per distinct key. Each output directory could 
potentially have multiple {{part-}} files, one per RDD partition.

So the output would look something like:

{code}
/path/prefix/B [/part-1, /part-2, etc]
/path/prefix/F [/part-1, /part-2, etc]
/path/prefix/N [/part-1, /part-2, etc]
{code}

Though it may be possible to do this with some combination of 
{{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
{{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
It's not clear if it's even possible at all in PySpark.

Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that 
makes it easy to save RDDs out to multiple locations at once.

---

Update: March 2016

There are two workarounds to this problem:

1. See [this answer on Stack 
Overflow|http://stackoverflow.com/a/26051042/877069], which implements 
{{MultipleTextOutputFormat}}. (Scala-only)
2. See [this comment by Davies 
Liu|https://github.com/apache/spark/pull/8375#issuecomment-202458325], which 
uses DataFrames:
{code}
val df = rdd.map(t => Row(gen_key(t), t)).toDF("key", "text")
df.write.partitionBy("key").text(path){code}


  was:
Users often have a single RDD of key-value pairs that they want to save to 
multiple locations based on the keys.

For example, say I have an RDD like this:
{code}
>>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda 
>>> x: x[0])
>>> a.collect()
[('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
>>> a.keys().distinct().collect()
['B', 'F', 'N']
{code}

Now I want to write the RDD out to different paths depending on the keys, so 
that I have one output directory per distinct key. Each output directory could 
potentially have multiple {{part-}} files, one per RDD partition.

So the output would look something like:

{code}
/path/prefix/B [/part-1, /part-2, etc]
/path/prefix/F [/part-1, /part-2, etc]
/path/prefix/N [/part-1, /part-2, etc]
{code}

Though it may be possible to do this with some combination of 
{{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
{{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
It's not clear if it's even possible at all in PySpark.

Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that 
makes it easy to save RDDs out to multiple locations at once.


> Add saveAsTextFileByKey() method to RDDs
> 
>
> Key: SPARK-3533
> URL: https://issues.apache.org/jira/browse/SPARK-3533
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Nicholas Chammas
>
> Users often have a single RDD of key-value pairs that they want to save to 
> multiple locations based on the keys.
> For example, say I have an RDD like this:
> {code}
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
> >>> 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>> a.keys().distinct().collect()
> ['B', 'F', 'N']
> {code}
> Now I want to write the RDD out to different paths depending on the keys, so 
> that I have one output directory per distinct key. Each output directory 
> could potentially have multiple {{part-}} files, one per RDD partition.
> So the output would look something like:
> {code}
> /path/prefix/B [/part-1, /part-2, etc]
> /path/prefix/F [/part-1, /part-2, etc]
> /path/prefix/N [/part-1, /part-2, etc]
> {code}
> Though it may be possible to do this with some combination of 
> {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
> {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
> It's not clear if it's even possible at all in PySpark.
> Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
> that makes it easy to save RDDs out to multiple locations at once.
> ---
> Update: March 2016
> There are two workarounds to this problem:
> 1. See [this answer on Stack 
> Overflow|http://stackoverflow.com/a/26051042/877069], which implements 
> {{MultipleTextOutputFormat}}. (Scala-only)
> 2. See [this comment by Davies 
> Liu|https://github.com/apache/spark/pull/8375#issuecomment-202458325], which 
> uses DataFrames.

[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2016-03-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197451#comment-15197451
 ] 

Nicholas Chammas commented on SPARK-7481:
-

(Sorry Steve; can't comment on your proposal since I don't know much about 
these kinds of build decisions.)

Just to add some more evidence to the record that this problem appears to 
affect many people, take a look at this: 
http://stackoverflow.com/search?q=%5Bapache-spark%5D+S3+Hadoop+2.6

Lots of confusion about how to access S3, with the recommended solution as 
before being to [use Spark built against Hadoop 
2.4|http://stackoverflow.com/a/30852341/877069].

> Add Hadoop 2.6+ profile to pull in object store FS accessors
> 
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies 
> of spark in a 2.6+ profile need to add the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure)
> this adds more stuff to the client bundle, but will mean a single spark 
> package can talk to all of the stores.






[jira] [Commented] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.

2016-03-05 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181776#comment-15181776
 ] 

Nicholas Chammas commented on SPARK-7505:
-

I believe items 1, 3, and 4 still apply. They're minor documentation issues, 
but I think they should still be addressed.
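
To illustrate item 1, bracket access avoids collisions with DataFrame 
attributes (hypothetical DataFrame):

{code}
# Hypothetical DataFrame with a column whose name collides with a DataFrame method.
df = sqlContext.createDataFrame([(1, 10)], ["id", "count"])

df["count"]   # __getitem__: always the column named "count"
df.count      # __getattr__: resolves to the DataFrame.count method, not the column
{code}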

> Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, 
> etc.
> 
>
> Key: SPARK-7505
> URL: https://issues.apache.org/jira/browse/SPARK-7505
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark, SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The PySpark docs for DataFrame need the following fixes and improvements:
> # Per [SPARK-7035], we should encourage the use of {{\_\_getitem\_\_}} over 
> {{\_\_getattr\_\_}} and change all our examples accordingly.
> # *We should say clearly that the API is experimental.* (That is currently 
> not the case for the PySpark docs.)
> # We should provide an example of how to join and select from 2 DataFrames 
> that have identically named columns, because it is not obvious:
>   {code}
> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
> >>> df12.select(df1['a'], df2['other']).show()
> a other   
> 
> 4 I dunno  {code}
> # 
> [{{DF.orderBy}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy]
>  and 
> [{{DF.sort}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort]
>  should be marked as aliases if that's what they are.






[jira] [Commented] (SPARK-13596) Move misc top-level build files into appropriate subdirs

2016-03-04 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180072#comment-15180072
 ] 

Nicholas Chammas commented on SPARK-13596:
--

Looks like {{tox.ini}} is only used by {{pep8}}, so if you move it into 
{{dev/}}, where the Python lint checks run from, that should work.

> Move misc top-level build files into appropriate subdirs
> 
>
> Key: SPARK-13596
> URL: https://issues.apache.org/jira/browse/SPARK-13596
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>
> I'd like to file away a bunch of misc files that are in the top level of the 
> project in order to further tidy the build for 2.0.0. See also SPARK-13529, 
> SPARK-13548.
> Some of these may turn out to be difficult or impossible to move.
> I'd ideally like to move these files into {{build/}}:
> - {{.rat-excludes}}
> - {{checkstyle.xml}}
> - {{checkstyle-suppressions.xml}}
> - {{pylintrc}}
> - {{scalastyle-config.xml}}
> - {{tox.ini}}
> - {{project/}} (or does SBT need this in the root?)
> And ideally, these would go under {{dev/}}
> - {{make-distribution.sh}}
> And remove these
> - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?)
> Edited to add: apparently this can go in {{.github}} now:
> - {{CONTRIBUTING.md}}
> Other files in the top level seem to need to stay there, like {{README.md}}.






[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2016-03-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176559#comment-15176559
 ] 

Nicholas Chammas commented on SPARK-7481:
-

I'm not comfortable working with Maven, so I can't comment on the details of the 
approach we should take, but I would appreciate any progress towards making Spark 
built against Hadoop 2.6+ work with S3 out of the box, or as close to out of the 
box as possible.

Given how closely Spark is tied to S3 and EC2 (at least as far as Spark's user 
base is concerned), a good out-of-the-box experience here is critical. Many 
people simply expect it.
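
For reference, here's a rough sketch of the workaround users need today; the 
connector version below is an assumption and has to match whatever Hadoop the 
Spark package was built against:

{code}
# Launch PySpark with the object store connector pulled in at runtime, e.g.:
#   pyspark --packages org.apache.hadoop:hadoop-aws:2.6.0
# AWS credentials still have to be supplied (e.g. via environment variables).
# 'my-bucket' and the key prefix below are placeholders.
rdd = sc.textFile('s3a://my-bucket/some/prefix/')
print(rdd.count())
{code}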

> Add Hadoop 2.6+ profile to pull in object store FS accessors
> 
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right and to add s3a, Swift, and Azure support, the 
> Spark dependencies in a 2.6+ profile need to include the relevant object store 
> packages (hadoop-aws, hadoop-openstack, hadoop-azure). This adds more to the 
> client bundle, but it means a single Spark package can talk to all of these stores.






[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2016-03-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176551#comment-15176551
 ] 

Nicholas Chammas commented on SPARK-7481:
-

{quote}
One issue here that hadoop 2.6's hadoop-aws pulls in the whole AWT toolkit, 
which is pretty weighty, for s3a ... which isn't something I'd use in 2.6 
anyway.
{quote}

Did you mean something other than s3a here?

> Add Hadoop 2.6+ profile to pull in object store FS accessors
> 
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right and to add s3a, Swift, and Azure support, the 
> Spark dependencies in a 2.6+ profile need to include the relevant object store 
> packages (hadoop-aws, hadoop-openstack, hadoop-azure). This adds more to the 
> client bundle, but it means a single Spark package can talk to all of these stores.






[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2016-03-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174438#comment-15174438
 ] 

Nicholas Chammas commented on SPARK-7481:
-

Many people seem to be downgrading to Spark built against Hadoop 2.4 
because the Spark / Hadoop 2.6 package doesn't work with S3 out of the box.

* [Example 
1|https://issues.apache.org/jira/browse/SPARK-7442?focusedCommentId=14582965=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14582965]
* [Example 
2|https://issues.apache.org/jira/browse/SPARK-7442?focusedCommentId=14903750=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14903750]
* [Example 
3|https://github.com/nchammas/flintrock/issues/88#issuecomment-190905262]

If this proposal eliminates that bit of friction for users without being too 
burdensome on the team, then I'm for it.

Ideally, we want people using Spark built against the latest version of Hadoop 
anyway, right? This proposal would nudge people in that direction.

> Add Hadoop 2.6+ profile to pull in object store FS accessors
> 
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right and to add s3a, Swift, and Azure support, the 
> Spark dependencies in a 2.6+ profile need to include the relevant object store 
> packages (hadoop-aws, hadoop-openstack, hadoop-azure). This adds more to the 
> client bundle, but it means a single Spark package can talk to all of these stores.






[jira] [Commented] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master

2016-01-27 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119220#comment-15119220
 ] 

Nicholas Chammas commented on SPARK-5189:
-

FWIW, I found this issue to be practically unsolvable without rewriting most of 
spark-ec2, so I started a new project that aims to replace spark-ec2 for most of 
its use cases: [Flintrock|https://github.com/nchammas/flintrock].

> Reorganize EC2 scripts so that nodes can be provisioned independent of Spark 
> master
> ---
>
> Key: SPARK-5189
> URL: https://issues.apache.org/jira/browse/SPARK-5189
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>
> As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
> then setting up all the slaves together. This includes broadcasting files 
> from the lonely master to potentially hundreds of slaves.
> There are 2 main problems with this approach:
> # Broadcasting files from the master to all slaves using 
> [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
> (e.g. during [ephemeral-hdfs 
> init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
>  or during [Spark 
> setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
>  takes a long time. This time increases as the number of slaves increases.
>  I did some testing in {{us-east-1}}. This is, concretely, what the problem 
> looks like:
>  || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
> | 1 | 8m 44s |
> | 10 | 13m 45s |
> | 25 | 22m 50s |
> | 50 | 37m 30s |
> | 75 | 51m 30s |
> | 99 | 1h 5m 30s |
>  Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, 
> but I think the point is clear enough.
>  We can extrapolate from this data that *every additional slave adds roughly 
> 35 seconds to the launch time* (so a cluster with 100 slaves would take 1h 6m 
> 5s to launch).
> # It's more complicated to add slaves to an existing cluster (a la 
> [SPARK-2008]), since slaves are only configured through the master during the 
> setup of the master itself.
> Logically, the operations we want to implement are:
> * Provision a Spark node
> * Join a node to a cluster (including an empty cluster) as either a master or 
> a slave
> * Remove a node from a cluster
> We need our scripts to roughly be organized to match the above operations. 
> The goals would be:
> # When launching a cluster, enable all cluster nodes to be provisioned in 
> parallel, removing the master-to-slave file broadcast bottleneck.
> # Facilitate cluster modifications like adding or removing nodes.
> # Enable exploration of infrastructure tools like 
> [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
> internals and perhaps even allow us to build [one tool that launches Spark 
> clusters on several different cloud 
> platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].
> More concretely, the modifications we need to make are:
> * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with 
> equivalent, slave-side operations.
> * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure 
> it fully creates a node that can be used as either a master or slave.
> * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, 
> configures it as a master or slave, and joins it to a cluster.
> * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete 
> that script.






[jira] [Commented] (SPARK-12824) Failure to maintain consistent RDD references in pyspark

2016-01-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098887#comment-15098887
 ] 

Nicholas Chammas commented on SPARK-12824:
--

Ah, good catch. This appears to be the known late-binding closure gotcha with 
lambdas in Python: 
http://docs.python-guide.org/en/latest/writing/gotchas/#late-binding-closures
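
In other words, every lambda created in the loop closes over the loop *variable* 
rather than its value at definition time. A minimal sketch of the gotcha and the 
usual fix, in plain Python with no Spark involved:

{code}
colors = ['red', 'green', 'blue']

# Gotcha: all three lambdas share the same 'color' variable, which ends up as 'blue'.
filters = {color: (lambda row: row['color'] == color) for color in colors}
print(filters['red']({'color': 'blue'}))   # True -- not what was intended

# Fix: bind the current value through a default argument.
filters = {color: (lambda row, c=color: row['color'] == c) for color in colors}
print(filters['red']({'color': 'blue'}))   # False, as intended
{code}

The same default-argument trick should apply to the {{rdd.filter}} lambdas in the 
reproduction scripts on this issue.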

> Failure to maintain consistent RDD references in pyspark
> 
>
> Key: SPARK-12824
> URL: https://issues.apache.org/jira/browse/SPARK-12824
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
>Reporter: Paul Shearer
>
> Below is a simple {{pyspark}} script that tries to split an RDD into a 
> dictionary containing several RDDs. 
> As the *sample run* shows, the script only works if we do a {{collect()}} on 
> the intermediate RDDs as they are created. Of course I would not want to do 
> that in practice, since it doesn't scale.
> What's really strange is, I'm not assigning the intermediate {{collect()}} 
> results to any variable. So the difference in behavior is due solely to a 
> hidden side-effect of the computation triggered by the {{collect()}} call. 
> Spark is supposed to be a very functional framework with minimal side 
> effects. Why is it only possible to get the desired behavior by triggering 
> some mysterious side effect using {{collect()}}? 
> It seems that all the keys in the dictionary are referencing the same object 
> even though in the code they are clearly supposed to be different objects.
> The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
> h3. spark_script.py
> {noformat}
> from pprint import PrettyPrinter
> pp = PrettyPrinter(indent=4).pprint
> logger = sc._jvm.org.apache.log4j
> logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
> logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )
> 
> def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
>     d = dict()
>     for key_value in key_values:
>         d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
>         if collect_in_loop:
>             d[key_value].collect()
>     return d
> def print_results(d):
>     for k in d:
>         print k
>         pp(d[k].collect())
> 
> rdd = sc.parallelize([
>     {'color': 'red', 'size': 3},
>     {'color': 'red', 'size': 7},
>     {'color': 'red', 'size': 8},
>     {'color': 'red', 'size': 10},
>     {'color': 'green', 'size': 9},
>     {'color': 'green', 'size': 5},
>     {'color': 'green', 'size': 50},
>     {'color': 'blue', 'size': 4},
>     {'color': 'purple', 'size': 6}])
> key_field = 'color'
> key_values = ['red', 'green', 'blue', 'purple']
> 
> print '### run WITH collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
> print_results(d)
> print '### run WITHOUT collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
> print_results(d)
> {noformat}
> h3. Sample run in IPython shell
> {noformat}
> In [1]: execfile('spark_script.py')
> ### run WITH collect in loop: 
> blue
> [{   'color': 'blue', 'size': 4}]
> purple
> [{   'color': 'purple', 'size': 6}]
> green
> [   {   'color': 'green', 'size': 9},
>     {   'color': 'green', 'size': 5},
>     {   'color': 'green', 'size': 50}]
> red
> [   {   'color': 'red', 'size': 3},
>     {   'color': 'red', 'size': 7},
>     {   'color': 'red', 'size': 8},
>     {   'color': 'red', 'size': 10}]
> ### run WITHOUT collect in loop: 
> blue
> [{   'color': 'purple', 'size': 6}]
> purple
> [{   'color': 'purple', 'size': 6}]
> green
> [{   'color': 'purple', 'size': 6}]
> red
> [{   'color': 'purple', 'size': 6}]
> {noformat}






[jira] [Commented] (SPARK-12824) Failure to maintain consistent RDD references in pyspark

2016-01-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098336#comment-15098336
 ] 

Nicholas Chammas commented on SPARK-12824:
--

I can reproduce this issue. Here's a more concise reproduction:

{code}
from __future__ import print_function

rdd = sc.parallelize([
    {'color': 'red', 'size': 3},
    {'color': 'red', 'size': 7},
    {'color': 'red', 'size': 8},
    {'color': 'red', 'size': 10},
    {'color': 'green', 'size': 9},
    {'color': 'green', 'size': 5},
    {'color': 'green', 'size': 50},
    {'color': 'blue', 'size': 4},
    {'color': 'purple', 'size': 6}])


colors = ['purple', 'red', 'green', 'blue']

# Defer collect() till print
color_rdds = {
    color: rdd.filter(lambda x: x['color'] == color)
    for color in colors
}
for k, v in color_rdds.items():
    print(k, v.collect())


# collect() upfront
color_rdds = {
    color: rdd.filter(lambda x: x['color'] == color).collect()
    for color in colors
}
for k, v in color_rdds.items():
    print(k, v)
{code}

Output:

{code}
# Defer collect() till print
purple [{'color': 'blue', 'size': 4}]
blue [{'color': 'blue', 'size': 4}]
green [{'color': 'blue', 'size': 4}]
red [{'color': 'blue', 'size': 4}]

---

# collect() upfront
purple [{'color': 'purple', 'size': 6}]
blue [{'color': 'blue', 'size': 4}]
green [{'color': 'green', 'size': 9}, {'color': 'green', 'size': 5}, {'color': 'green', 'size': 50}]
red [{'color': 'red', 'size': 3}, {'color': 'red', 'size': 7}, {'color': 'red', 'size': 8}, {'color': 'red', 'size': 10}]
{code}

Observations:
* The color that gets repeated in the first block of output is always the last 
color in {{colors}}.
* This happens on Python 2 and 3, and with both {{items()}} and {{iteritems()}}.

This smells like an RDD naming issue, or something related to lazy evaluation. 
The filtered RDDs that get generated in the first block under {{color_rdds}} 
don't have names. Then, when they all get {{collect()}}-ed at once, they all 
evaluate to the last filtered RDD.
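
If the cause is the late-binding closure behavior noted elsewhere on this issue, 
then a minimal sketch of a workaround (using {{rdd}} and {{colors}} from the 
reproduction above) would be to bind the loop value explicitly:

{code}
# Give each filter closure its own copy of 'color' via a default argument,
# so the deferred collect() no longer sees only the last color.
color_rdds = {
    color: rdd.filter(lambda x, c=color: x['color'] == c)
    for color in colors
}
for k, v in color_rdds.items():
    print(k, v.collect())   # each color now returns only its own rows
{code}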

cc [~davies] / [~joshrosen]

> Failure to maintain consistent RDD references in pyspark
> 
>
> Key: SPARK-12824
> URL: https://issues.apache.org/jira/browse/SPARK-12824
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
>Reporter: Paul Shearer
>
> Below is a simple {{pyspark}} script that tries to split an RDD into a 
> dictionary containing several RDDs. 
> As the *sample run* shows, the script only works if we do a {{collect()}} on 
> the intermediate RDDs as they are created. Of course I would not want to do 
> that in practice, since it doesn't scale.
> What's really strange is, I'm not assigning the intermediate {{collect()}} 
> results to any variable. So the difference in behavior is due solely to a 
> hidden side-effect of the computation triggered by the {{collect()}} call. 
> Spark is supposed to be a very functional framework with minimal side 
> effects. Why is it only possible to get the desired behavior by triggering 
> some mysterious side effect using {{collect()}}? 
> It seems that all the keys in the dictionary are referencing the same object 
> even though in the code they are clearly supposed to be different objects.
> The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
> h3. spark_script.py
> {noformat}
> from pprint import PrettyPrinter
> pp = PrettyPrinter(indent=4).pprint
> logger = sc._jvm.org.apache.log4j
> logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
> logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )
> 
> def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
>     d = dict()
>     for key_value in key_values:
>         d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
>         if collect_in_loop:
>             d[key_value].collect()
>     return d
> def print_results(d):
>     for k in d:
>         print k
>         pp(d[k].collect())
> 
> rdd = sc.parallelize([
>     {'color': 'red', 'size': 3},
>     {'color': 'red', 'size': 7},
>     {'color': 'red', 'size': 8},
>     {'color': 'red', 'size': 10},
>     {'color': 'green', 'size': 9},
>     {'color': 'green', 'size': 5},
>     {'color': 'green', 'size': 50},
>     {'color': 'blue', 'size': 4},
>     {'color': 'purple', 'size': 6}])
> key_field = 'color'
> key_values = ['red', 'green', 'blue', 'purple']
> 
> print '### run WITH collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
> print_results(d)
> print '### run WITHOUT collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
> print_results(d)
> {noformat}

[jira] [Comment Edited] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-12-18 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14203280#comment-14203280
 ] 

Nicholas Chammas edited comment on SPARK-3821 at 12/18/15 9:08 PM:
---

After much dilly-dallying, I am happy to present:
* A brief proposal / design doc ([fixed JIRA attachment | 
https://issues.apache.org/jira/secure/attachment/12680371/packer-proposal.html],
 [md file on GitHub | 
https://github.com/nchammas/spark-ec2/blob/packer/image-build/proposal.md])
* [Initial implementation | 
https://github.com/nchammas/spark-ec2/tree/packer/image-build] and [README | 
https://github.com/nchammas/spark-ec2/blob/packer/image-build/README.md]
* New AMIs generated by this implementation: [Base AMIs | 
https://github.com/nchammas/spark-ec2/tree/packer/ami-list/base], [Spark 1.1.0 
Pre-Installed | 
https://github.com/nchammas/spark-ec2/tree/packer/ami-list/1.1.0]

To try out the new AMIs with {{spark-ec2}}, you'll need to update [these | 
https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L47]
 [two | 
https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L593]
 lines (well, really, just the first one) to point to [my {{spark-ec2}} repo on 
the {{packer}} branch | 
https://github.com/nchammas/spark-ec2/tree/packer/image-build].

Your candid feedback and/or improvements are most welcome!


was (Author: nchammas):
After much dilly-dallying, I am happy to present:
* A brief proposal / design doc ([fixed JIRA attachment | 
https://issues.apache.org/jira/secure/attachment/12680371/packer-proposal.html],
 [md file on GitHub | 
https://github.com/nchammas/spark-ec2/blob/packer/packer/proposal.md])
* [Initial implementation | 
https://github.com/nchammas/spark-ec2/tree/packer/packer] and [README | 
https://github.com/nchammas/spark-ec2/blob/packer/packer/README.md]
* New AMIs generated by this implementation: [Base AMIs | 
https://github.com/nchammas/spark-ec2/tree/packer/ami-list/base], [Spark 1.1.0 
Pre-Installed | 
https://github.com/nchammas/spark-ec2/tree/packer/ami-list/1.1.0]

To try out the new AMIs with {{spark-ec2}}, you'll need to update [these | 
https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L47]
 [two | 
https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L593]
 lines (well, really, just the first one) to point to [my {{spark-ec2}} repo on 
the {{packer}} branch | 
https://github.com/nchammas/spark-ec2/tree/packer/packer].

Your candid feedback and/or improvements are most welcome!

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
> Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.





