[jira] [Commented] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access
[ https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606484#comment-15606484 ] Nicholas Chammas commented on SPARK-18084: -- cc [~marmbrus] - Dunno if this is actually bug or just an unsupported or inappropriate use case. > write.partitionBy() does not recognize nested columns that select() can access > -- > > Key: SPARK-18084 > URL: https://issues.apache.org/jira/browse/SPARK-18084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Nicholas Chammas >Priority: Minor > > Here's a simple repro in the PySpark shell: > {code} > from pyspark.sql import Row > rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) > df = spark.createDataFrame(rdd) > df.printSchema() > df.select('a.b').show() # works > df.write.partitionBy('a.b').text('/tmp/test') # doesn't work > {code} > Here's what I see when I run this: > {code} > >>> from pyspark.sql import Row > >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) > >>> df = spark.createDataFrame(rdd) > >>> df.printSchema() > root > |-- a: struct (nullable = true) > ||-- b: long (nullable = true) > >>> df.show() > +---+ > | a| > +---+ > |[5]| > +---+ > >>> df.select('a.b').show() > +---+ > | b| > +---+ > | 5| > +---+ > >>> df.write.partitionBy('a.b').text('/tmp/test') > Traceback (most recent call last): > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o233.text. > : org.apache.spark.sql.AnalysisException: Partition column a.b not found in > schema > StructType(StructField(a,StructType(StructField(b,LongType,true)),true)); > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) > at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) > at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "", line 1, in > File >
[jira] [Updated] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access
[ https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18084: - Issue Type: Bug (was: Improvement) > write.partitionBy() does not recognize nested columns that select() can access > -- > > Key: SPARK-18084 > URL: https://issues.apache.org/jira/browse/SPARK-18084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Nicholas Chammas >Priority: Minor > > Here's a simple repro in the PySpark shell: > {code} > from pyspark.sql import Row > rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) > df = spark.createDataFrame(rdd) > df.printSchema() > df.select('a.b').show() # works > df.write.partitionBy('a.b').text('/tmp/test') # doesn't work > {code} > Here's what I see when I run this: > {code} > >>> from pyspark.sql import Row > >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) > >>> df = spark.createDataFrame(rdd) > >>> df.printSchema() > root > |-- a: struct (nullable = true) > ||-- b: long (nullable = true) > >>> df.show() > +---+ > | a| > +---+ > |[5]| > +---+ > >>> df.select('a.b').show() > +---+ > | b| > +---+ > | 5| > +---+ > >>> df.write.partitionBy('a.b').text('/tmp/test') > Traceback (most recent call last): > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o233.text. > : org.apache.spark.sql.AnalysisException: Partition column a.b not found in > schema > StructType(StructField(a,StructType(StructField(b,LongType,true)),true)); > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) > at 
org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "", line 1, in > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/readwriter.py", > line 656, in text > self._jwrite.text(path) > File >
[jira] [Created] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access
Nicholas Chammas created SPARK-18084: Summary: write.partitionBy() does not recognize nested columns that select() can access Key: SPARK-18084 URL: https://issues.apache.org/jira/browse/SPARK-18084 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.1, 2.0.0 Reporter: Nicholas Chammas Priority: Minor Here's a simple repro in the PySpark shell: {code} from pyspark.sql import Row rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) df = spark.createDataFrame(rdd) df.printSchema() df.select('a.b').show() # works df.write.partitionBy('a.b').text('/tmp/test') # doesn't work {code} Here's what I see when I run this: {code} >>> from pyspark.sql import Row >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) >>> df = spark.createDataFrame(rdd) >>> df.printSchema() root |-- a: struct (nullable = true) ||-- b: long (nullable = true) >>> df.show() +---+ | a| +---+ |[5]| +---+ >>> df.select('a.b').show() +---+ | b| +---+ | 5| +---+ >>> df.write.partitionBy('a.b').text('/tmp/test') Traceback (most recent call last): File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o233.text. : org.apache.spark.sql.AnalysisException: Partition column a.b not found in schema StructType(StructField(a,StructType(StructField(b,LongType,true)),true)); at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367) at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366) at org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "", line 1, in File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/readwriter.py", line 656, in text self._jwrite.text(path) File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", line 69, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: 'Partition column a.b not
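A workaround worth noting (not from the ticket itself, just an illustration): since partitionBy() only matches top-level fields, the nested field can be promoted to a top-level column before writing. A minimal PySpark sketch reusing the repro's DataFrame, with a hypothetical output path; parquet is used here instead of text to sidestep the text sink's single-string-column requirement:

{code}
from pyspark.sql import Row

rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
df = spark.createDataFrame(rdd)

# Promote the nested field a.b to a top-level column, then partition on it.
flat = df.withColumn('b', df['a.b'])
flat.write.partitionBy('b').parquet('/tmp/test_parquet')
{code}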
[jira] [Commented] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads
[ https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603211#comment-15603211 ] Nicholas Chammas commented on SPARK-12757: -- Just to link back, [~josephkb] is reporting that [this GraphFrames issue|https://github.com/graphframes/graphframes/issues/116] may be related to the work done here. > Use reference counting to prevent blocks from being evicted during reads > > > Key: SPARK-12757 > URL: https://issues.apache.org/jira/browse/SPARK-12757 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > As a pre-requisite to off-heap caching of blocks, we need a mechanism to > prevent pages / blocks from being evicted while they are being read. With > on-heap objects, evicting a block while it is being read merely leads to > memory-accounting problems (because we assume that an evicted block is a > candidate for garbage-collection, which will not be true during a read), but > with off-heap memory this will lead to either data corruption or segmentation > faults. > To address this, we should add a reference-counting mechanism to track which > blocks/pages are being read in order to prevent them from being evicted > prematurely. I propose to do this in two phases: first, add a safe, > conservative approach in which all BlockManager.get*() calls implicitly > increment the reference count of blocks and where tasks' references are > automatically freed upon task completion. This will be correct but may have > adverse performance impacts because it will prevent legitimate block > evictions. In phase two, we should incrementally add release() calls in order > to fix the eviction of unreferenced blocks. The latter change may need to > touch many different components, which is why I propose to do it separately > in order to make the changes easier to reason about and review. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
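To make the pin/release mechanism described above concrete, here is a purely conceptual Python sketch (not Spark's actual BlockManager code):

{code}
from collections import defaultdict

class BlockStoreSketch(object):
    """Toy model of the two-phase proposal: get() pins a block, release() unpins it."""

    def __init__(self):
        self.blocks = {}
        self.ref_counts = defaultdict(int)

    def get(self, block_id):
        # Phase 1: every read implicitly increments the block's reference count,
        # so the block cannot be evicted while a task is reading it.
        if block_id in self.blocks:
            self.ref_counts[block_id] += 1
            return self.blocks[block_id]
        return None

    def release(self, block_id):
        # Phase 2: explicit release() calls (or task completion) decrement the
        # count, making unreferenced blocks eligible for eviction again.
        self.ref_counts[block_id] = max(0, self.ref_counts[block_id] - 1)

    def evict(self, block_id):
        # A block that is still being read must never be evicted.
        if self.ref_counts[block_id] > 0:
            return False
        self.blocks.pop(block_id, None)
        return True
{code}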
[jira] [Closed] (SPARK-17976) Global options to spark-submit should not be position-sensitive
[ https://issues.apache.org/jira/browse/SPARK-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas closed SPARK-17976. Resolution: Not A Problem Ah, makes perfect sense. Would have realized that myself if I had held off on reporting this for just a day or so. Apologies. > Global options to spark-submit should not be position-sensitive > --- > > Key: SPARK-17976 > URL: https://issues.apache.org/jira/browse/SPARK-17976 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.0.0, 2.0.1 >Reporter: Nicholas Chammas >Priority: Minor > > It is maddening that this does what you expect: > {code} > spark-submit --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 \ > file.py > {code} > whereas this doesn't because {{--packages}} is totally ignored: > {code} > spark-submit file.py \ > --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 > {code} > Ideally, global options should be valid no matter where they are specified. > If that's too much work, then I think at the very least {{spark-submit}} > should display a warning that some input is being ignored. (Ideally, it > should error out, but that's probably not possible for > backwards-compatibility reasons at this point.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17976) Global options to spark-submit should not be position-sensitive
Nicholas Chammas created SPARK-17976: Summary: Global options to spark-submit should not be position-sensitive Key: SPARK-17976 URL: https://issues.apache.org/jira/browse/SPARK-17976 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 2.0.1, 2.0.0 Reporter: Nicholas Chammas Priority: Minor It is maddening that this does what you expect: {code} spark-submit --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 \ file.py {code} whereas this doesn't because {{--packages}} is totally ignored: {code} spark-submit file.py \ --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 {code} Ideally, global options should be valid no matter where they are specified. If that's too much work, then I think at the very least {{spark-submit}} should display a warning that some input is being ignored. (Ideally, it should error out, but that's probably not possible for backwards-compatibility reasons at this point.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14742) Redirect spark-ec2 doc to new location
[ https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452150#comment-15452150 ] Nicholas Chammas edited comment on SPARK-14742 at 8/31/16 12:50 PM: Sounds good to me. It's working now. was (Author: nchammas): Sounds good to me. > Redirect spark-ec2 doc to new location > -- > > Key: SPARK-14742 > URL: https://issues.apache.org/jira/browse/SPARK-14742 > Project: Spark > Issue Type: Documentation > Components: Documentation, EC2 >Reporter: Nicholas Chammas >Assignee: Sean Owen >Priority: Trivial > Fix For: 2.0.0 > > > See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453 > We need to redirect this page > http://spark.apache.org/docs/latest/ec2-scripts.html > to this page > https://github.com/amplab/spark-ec2#readme -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14742) Redirect spark-ec2 doc to new location
[ https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452150#comment-15452150 ] Nicholas Chammas commented on SPARK-14742: -- Sounds good to me. > Redirect spark-ec2 doc to new location > -- > > Key: SPARK-14742 > URL: https://issues.apache.org/jira/browse/SPARK-14742 > Project: Spark > Issue Type: Documentation > Components: Documentation, EC2 >Reporter: Nicholas Chammas >Assignee: Sean Owen >Priority: Trivial > Fix For: 2.0.0 > > > See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453 > We need to redirect this page > http://spark.apache.org/docs/latest/ec2-scripts.html > to this page > https://github.com/amplab/spark-ec2#readme -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14742) Redirect spark-ec2 doc to new location
[ https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450626#comment-15450626 ] Nicholas Chammas commented on SPARK-14742: -- {quote} Otherwise the only way to get to this link is if you have it bookmarked. {quote} Or any page that has linked to it in the past. All those links are now broken. That's my main concern. > Redirect spark-ec2 doc to new location > -- > > Key: SPARK-14742 > URL: https://issues.apache.org/jira/browse/SPARK-14742 > Project: Spark > Issue Type: Documentation > Components: Documentation, EC2 >Reporter: Nicholas Chammas >Assignee: Sean Owen >Priority: Trivial > Fix For: 2.0.0 > > > See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453 > We need to redirect this page > http://spark.apache.org/docs/latest/ec2-scripts.html > to this page > https://github.com/amplab/spark-ec2#readme -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14742) Redirect spark-ec2 doc to new location
[ https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450602#comment-15450602 ] Nicholas Chammas commented on SPARK-14742: -- http://spark.apache.org/docs/latest/ec2-scripts.html I am seeing that this URL is not redirecting to the new spark-ec2 location on GitHub. [~srowen] - Can you fix that? I can see that we have some kind of redirect setup, but I guess it's not working. https://github.com/apache/spark/blob/master/docs/ec2-scripts.md > Redirect spark-ec2 doc to new location > -- > > Key: SPARK-14742 > URL: https://issues.apache.org/jira/browse/SPARK-14742 > Project: Spark > Issue Type: Documentation > Components: Documentation, EC2 >Reporter: Nicholas Chammas >Assignee: Sean Owen >Priority: Trivial > Fix For: 2.0.0 > > > See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453 > We need to redirect this page > http://spark.apache.org/docs/latest/ec2-scripts.html > to this page > https://github.com/amplab/spark-ec2#readme -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17220) Upgrade Py4J to 0.10.3
[ https://issues.apache.org/jira/browse/SPARK-17220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-17220: - Component/s: PySpark > Upgrade Py4J to 0.10.3 > -- > > Key: SPARK-17220 > URL: https://issues.apache.org/jira/browse/SPARK-17220 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Weiqing Yang >Priority: Minor > > Py4J 0.10.3 has landed. It includes some important bug fixes. For example: > Both sides: fixed memory leak issue with ClientServer and potential deadlock > issue by creating a memory leak test suite. (Py4J 0.10.2) > Both sides: added more memory leak tests and fixed a potential memory leak > related to listeners. (Py4J 0.10.3) > So it's time to upgrade py4j from 0.10.1 to 0.10.3. The changelog is > available at https://www.py4j.org/changelog.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437616#comment-15437616 ] Nicholas Chammas commented on SPARK-14241: -- [~marmbrus] - Would it be tough to make this function deterministic, or somehow "stable"? The linked Stack Overflow question shows some pretty surprising behavior from an end-user perspective. If this would be tough to change, what are some alternatives you would recommend? Do you think, for example, it would be possible to make a window function that _is_ deterministic and does effectively the same thing? Maybe something like {{row_number()}}, except the {{WindowSpec}} would not need to specify any partitioning or ordering. (Required ordering would be the main downside of using {{row_number()}} instead of {{monotonically_increasing_id()}}.) > Output of monotonically_increasing_id lacks stable relation with rows of > DataFrame > -- > > Key: SPARK-14241 > URL: https://issues.apache.org/jira/browse/SPARK-14241 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Paul Shearer > > If you use monotonically_increasing_id() to append a column of IDs to a > DataFrame, the IDs do not have a stable, deterministic relationship to the > rows they are appended to. A given ID value can land on different rows > depending on what happens in the task graph: > http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321 > From a user perspective this behavior is very unexpected, and many things one > would normally like to do with an ID column are in fact only possible under > very narrow circumstances. The function should either be made deterministic, > or there should be a prominent warning note in the API docs regarding its > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
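For illustration, two deterministic alternatives that come up in this discussion, sketched for a Spark 2.x-style {{spark}} session; {{df}} and the unique ordering column {{some_unique_key}} are hypothetical:

{code}
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Option 1: row_number() over an explicit, total ordering (requires a unique key
# and pulls all rows into a single window partition).
w = Window.orderBy('some_unique_key')
df_with_id = df.withColumn('id', F.row_number().over(w))

# Option 2: RDD zipWithIndex(), which assigns ids by partition order and by
# position within each partition.
df_with_id2 = spark.createDataFrame(
    df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1],)),
    df.columns + ['id'])
{code}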
[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428988#comment-15428988 ] Nicholas Chammas commented on SPARK-17025: -- {quote} We'd need to figure out a good design for this, especially since it will require a Python process to be started up for what might otherwise be pure Scala applications. {quote} Ah, I guess since a Transformer using some Python code may be persisted and then loaded into a Scala application, right? Sounds hairy. Anyway, thanks for chiming in Joseph. I'll watch the linked issue. > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. > {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788 ] Nicholas Chammas edited comment on SPARK-17025 at 8/11/16 7:33 PM: --- cc [~josephkb], [~mengxr] I guess a first step would be to add a {{_to_java}} method to the base {{Transformer}} class that simply raises {{NotImplementedError}}. Ultimately though, is there a way to have the base class handle this work automatically, or do custom transformers need to each implement their own {{_to_java}} method? was (Author: nchammas): cc [~josephkb], [~mengxr] I guess a first step be to add a {{_to_java}} method to the base Transformer class that simply raises {{NotImplementedError}}. Is there a way to have the base class handle this work automatically, or do custom transformers need to each implement their own {{_to_java}} method? > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. > {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788 ] Nicholas Chammas edited comment on SPARK-17025 at 8/11/16 7:27 PM: --- cc [~josephkb], [~mengxr] I guess a first step be to add a {{_to_java}} method to the base Transformer class that simply raises {{NotImplementedError}}. Is there a way to have the base class handle this work automatically, or do custom transformers need to each implement their own {{_to_java}} method? was (Author: nchammas): cc [~josephkb] [~mengxr] > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. > {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788 ] Nicholas Chammas commented on SPARK-17025: -- cc [~josephkb] [~mengxr] > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. > {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
Nicholas Chammas created SPARK-17025: Summary: Cannot persist PySpark ML Pipeline model that includes custom Transformer Key: SPARK-17025 URL: https://issues.apache.org/jira/browse/SPARK-17025 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 2.0.0 Reporter: Nicholas Chammas Priority: Minor Following the example in [this Databricks blog post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] under "Python tuning", I'm trying to save an ML Pipeline model. This pipeline, however, includes a custom transformer. When I try to save the model, the operation fails because the custom transformer doesn't have a {{_to_java}} attribute. {code} Traceback (most recent call last): File ".../file.py", line 56, in model.bestModel.save('model') File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 222, in save File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 217, in write File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", line 93, in __init__ File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 254, in _to_java AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' {code} Looking at the source code for [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], I see that not even the base Transformer class has such an attribute. I'm assuming this is missing functionality that is intended to be patched up (i.e. [like this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). I'm not sure if there is an existing JIRA for this (my searches didn't turn up clear results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
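For context, a minimal sketch of what such a Python-only transformer looks like; the body is hypothetical (the {{PeoplePairFeaturizer}} name is borrowed from the traceback), but it shows why persistence fails: nothing here creates a JVM counterpart, so {{PipelineModel.save()}} finds no {{_to_java}} to call.

{code}
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql import functions as F

class PeoplePairFeaturizer(Transformer, HasInputCol, HasOutputCol):
    """Pure-Python transformer: all of its logic runs through the DataFrame API."""

    def _transform(self, dataset):
        # No Java-side Transformer is ever created for this stage, so saving a
        # pipeline that contains it fails as shown in the description above.
        return dataset.withColumn(self.getOutputCol(),
                                  F.length(F.col(self.getInputCol())))
{code}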
[jira] [Commented] (SPARK-16921) RDD/DataFrame persist() and cache() should return Python context managers
[ https://issues.apache.org/jira/browse/SPARK-16921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414067#comment-15414067 ] Nicholas Chammas commented on SPARK-16921: -- [~holdenk] - Probably won't be able to do it myself for some time, so feel free to take that crack at it and just ping me as a reviewer. > RDD/DataFrame persist() and cache() should return Python context managers > - > > Key: SPARK-16921 > URL: https://issues.apache.org/jira/browse/SPARK-16921 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core, SQL >Reporter: Nicholas Chammas >Priority: Minor > > [Context > managers|https://docs.python.org/3/reference/datamodel.html#context-managers] > are a natural way to capture closely related setup and teardown code in > Python. > For example, they are commonly used when doing file I/O: > {code} > with open('/path/to/file') as f: > contents = f.read() > ... > {code} > Once the program exits the with block, {{f}} is automatically closed. > I think it makes sense to apply this pattern to persisting and unpersisting > DataFrames and RDDs. There are many cases when you want to persist a > DataFrame for a specific set of operations and then unpersist it immediately > afterwards. > For example, take model training. Today, you might do something like this: > {code} > labeled_data.persist() > model = pipeline.fit(labeled_data) > labeled_data.unpersist() > {code} > If {{persist()}} returned a context manager, you could rewrite this as > follows: > {code} > with labeled_data.persist(): > model = pipeline.fit(labeled_data) > {code} > Upon exiting the {{with}} block, {{labeled_data}} would automatically be > unpersisted. > This can be done in a backwards-compatible way since {{persist()}} would > still return the parent DataFrame or RDD as it does today, but add two > methods to the object: {{\_\_enter\_\_()}} and {{\_\_exit\_\_()}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16921) RDD/DataFrame persist() and cache() should return Python context managers
Nicholas Chammas created SPARK-16921: Summary: RDD/DataFrame persist() and cache() should return Python context managers Key: SPARK-16921 URL: https://issues.apache.org/jira/browse/SPARK-16921 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core, SQL Reporter: Nicholas Chammas Priority: Minor [Context managers|https://docs.python.org/3/reference/datamodel.html#context-managers] are a natural way to capture closely related setup and teardown code in Python. For example, they are commonly used when doing file I/O: {code} with open('/path/to/file') as f: contents = f.read() ... {code} Once the program exits the with block, {{f}} is automatically closed. I think it makes sense to apply this pattern to persisting and unpersisting DataFrames and RDDs. There are many cases when you want to persist a DataFrame for a specific set of operations and then unpersist it immediately afterwards. For example, take model training. Today, you might do something like this: {code} labeled_data.persist() model = pipeline.fit(labeled_data) labeled_data.unpersist() {code} If {{persist()}} returned a context manager, you could rewrite this as follows: {code} with labeled_data.persist(): model = pipeline.fit(labeled_data) {code} Upon exiting the {{with}} block, {{labeled_data}} would automatically be unpersisted. This can be done in a backwards-compatible way since {{persist()}} would still return the parent DataFrame or RDD as it does today, but add two methods to the object: {{\_\_enter\_\_()}} and {{\_\_exit\_\_()}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
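A rough user-land approximation of the proposal, using {{contextlib}} instead of adding {{\_\_enter\_\_()}} and {{\_\_exit\_\_()}} to the classes themselves; the {{persisted}} helper is hypothetical, not an existing PySpark API:

{code}
from contextlib import contextmanager

@contextmanager
def persisted(df, storage_level=None):
    # Pin the DataFrame/RDD for the duration of the block, then release it.
    if storage_level is None:
        df.persist()
    else:
        df.persist(storage_level)
    try:
        yield df
    finally:
        df.unpersist()

# Usage:
# with persisted(labeled_data):
#     model = pipeline.fit(labeled_data)
{code}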
[jira] [Closed] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.
[ https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas closed SPARK-7505. --- Resolution: Invalid Closing this as invalid as I believe these issues are no longer important. > Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, > etc. > > > Key: SPARK-7505 > URL: https://issues.apache.org/jira/browse/SPARK-7505 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Nicholas Chammas >Priority: Minor > > The PySpark docs for DataFrame need the following fixes and improvements: > # Per [SPARK-7035], we should encourage the use of {{\_\_getitem\_\_}} over > {{\_\_getattr\_\_}} and change all our examples accordingly. > # *We should say clearly that the API is experimental.* (That is currently > not the case for the PySpark docs.) > # We should provide an example of how to join and select from 2 DataFrames > that have identically named columns, because it is not obvious: > {code} > >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}'])) > >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}'])) > >>> df12 = df1.join(df2, df1['a'] == df2['a']) > >>> df12.select(df1['a'], df2['other']).show() > a other > > 4 I dunno {code} > # > [{{DF.orderBy}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy] > and > [{{DF.sort}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort] > should be marked as aliases if that's what they are. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs
[ https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409767#comment-15409767 ] Nicholas Chammas commented on SPARK-5312: - [~boyork] - Shall we close this? It doesn't look like it has any momentum. > Use sbt to detect new or changed public classes in PRs > -- > > Key: SPARK-5312 > URL: https://issues.apache.org/jira/browse/SPARK-5312 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > We currently use an [unwieldy grep/sed > contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] > to detect new public classes in PRs. > -Apparently, sbt lets you get a list of public classes [much more > directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via > {{show compile:discoveredMainClasses}}. We should use that instead.- > There is a tool called [ClassUtil|http://software.clapper.org/classutil/] > that seems to help give this kind of information much more directly. We > should look into using that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405300#comment-15405300 ] Nicholas Chammas edited comment on SPARK-7146 at 8/3/16 4:45 AM: - A quick update from a PySpark user: I am using HasInputCol, HasInputCols, HasLabelCol, and HasOutputCol to create custom transformers, and I find them very handy. I know Python does not have a notion of "private" classes, but knowing these are part of the public API would be good. In summary: The updated proposal looks good to me, with the caveat that I only just started learning the new ML Pipeline API. was (Author: nchammas): A quick update from a PySpark user: I am using HasInputCol, HasInputCols, HasLabelCol, and HasOutputCol to create custom transformers, and I find them very handy. I know Python does not have a notion of "private" classes, but knowing these are part of the public API would be good. I summary: The updated proposal looks good to me, with the caveat that I only just started learning the new ML Pipeline API. > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > Proposal: Make most of the Param traits in sharedParams.scala public. Mark > them as DeveloperApi. > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private. > h3. UPDATED proposal > * Some Params are clearly safe to make public. We will do so. > * Some Params could be made public but may require caveats in the trait doc. > * Some Params have turned out not to be shared in practice. We can move > those Params to the classes which use them. > *Public shared params*: > * I/O column params > ** HasFeaturesCol > ** HasInputCol > ** HasInputCols > ** HasLabelCol > ** HasOutputCol > ** HasPredictionCol > ** HasProbabilityCol > ** HasRawPredictionCol > ** HasVarianceCol > ** HasWeightCol > * Algorithm settings > ** HasCheckpointInterval > ** HasElasticNetParam > ** HasFitIntercept > ** HasMaxIter > ** HasRegParam > ** HasSeed > ** HasStandardization (less common) > ** HasStepSize > ** HasTol > *Questionable params*: > * HasHandleInvalid (only used in StringIndexer, but might be more widely used > later on) > * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but > same meaning as Optimizer in LDA) > *Params to be removed from sharedParams*: > * HasThreshold (only used in LogisticRegression) > * HasThresholds (only used in ProbabilisticClassifier) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405300#comment-15405300 ] Nicholas Chammas commented on SPARK-7146: - A quick update from a PySpark user: I am using HasInputCol, HasInputCols, HasLabelCol, and HasOutputCol to create custom transformers, and I find them very handy. I know Python does not have a notion of "private" classes, but knowing these are part of the public API would be good. I summary: The updated proposal looks good to me, with the caveat that I only just started learning the new ML Pipeline API. > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > Proposal: Make most of the Param traits in sharedParams.scala public. Mark > them as DeveloperApi. > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private. > h3. UPDATED proposal > * Some Params are clearly safe to make public. We will do so. > * Some Params could be made public but may require caveats in the trait doc. > * Some Params have turned out not to be shared in practice. We can move > those Params to the classes which use them. > *Public shared params*: > * I/O column params > ** HasFeaturesCol > ** HasInputCol > ** HasInputCols > ** HasLabelCol > ** HasOutputCol > ** HasPredictionCol > ** HasProbabilityCol > ** HasRawPredictionCol > ** HasVarianceCol > ** HasWeightCol > * Algorithm settings > ** HasCheckpointInterval > ** HasElasticNetParam > ** HasFitIntercept > ** HasMaxIter > ** HasRegParam > ** HasSeed > ** HasStandardization (less common) > ** HasStepSize > ** HasTol > *Questionable params*: > * HasHandleInvalid (only used in StringIndexer, but might be more widely used > later on) > * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but > same meaning as Optimizer in LDA) > *Params to be removed from sharedParams*: > * HasThreshold (only used in LogisticRegression) > * HasThresholds (only used in ProbabilisticClassifier) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings
[ https://issues.apache.org/jira/browse/SPARK-16782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402515#comment-15402515 ] Nicholas Chammas commented on SPARK-16782: -- Poking around a bit more, it seems like a possible approach would be to implement a new decorator that duplicates an existing docstring, optionally with some modifications. Not sure if that's something we would be interested in doing. > Use Sphinx autodoc to eliminate duplication of Python docstrings > > > Key: SPARK-16782 > URL: https://issues.apache.org/jira/browse/SPARK-16782 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Minor > > In several cases it appears that we duplicate docstrings for methods that are > exposed in multiple places. > For example, here are two instances of {{createDataFrame}} with identical > docstrings: > * > https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218 > * > https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406 > I believe we can eliminate this duplication by using the [Sphinx autodoc > extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
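A sketch of the decorator idea floated in the comment above; {{copy_doc}} is a hypothetical name, not an existing utility:

{code}
def copy_doc(source, append=None):
    """Copy source's docstring onto the decorated function, optionally appending to it."""
    def decorator(target):
        doc = source.__doc__ or ''
        if append:
            doc = doc + '\n\n' + append
        target.__doc__ = doc
        return target
    return decorator

# SparkSession.createDataFrame could then reuse SQLContext.createDataFrame's
# docstring instead of duplicating it, e.g.:
#
# @copy_doc(SQLContext.createDataFrame)
# def createDataFrame(self, data, schema=None, samplingRatio=None):
#     ...
{code}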
[jira] [Commented] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings
[ https://issues.apache.org/jira/browse/SPARK-16782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402477#comment-15402477 ] Nicholas Chammas commented on SPARK-16782: -- Hmm never mind. I think I've misunderstood the purpose of autodoc; it doesn't seem like we can use it to eliminate docstring duplication. > Use Sphinx autodoc to eliminate duplication of Python docstrings > > > Key: SPARK-16782 > URL: https://issues.apache.org/jira/browse/SPARK-16782 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Minor > > In several cases it appears that we duplicate docstrings for methods that are > exposed in multiple places. > For example, here are two instances of {{createDataFrame}} with identical > docstrings: > * > https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218 > * > https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406 > I believe we can eliminate this duplication by using the [Sphinx autodoc > extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings
[ https://issues.apache.org/jira/browse/SPARK-16782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas closed SPARK-16782. Resolution: Invalid > Use Sphinx autodoc to eliminate duplication of Python docstrings > > > Key: SPARK-16782 > URL: https://issues.apache.org/jira/browse/SPARK-16782 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Minor > > In several cases it appears that we duplicate docstrings for methods that are > exposed in multiple places. > For example, here are two instances of {{createDataFrame}} with identical > docstrings: > * > https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218 > * > https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406 > I believe we can eliminate this duplication by using the [Sphinx autodoc > extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401198#comment-15401198 ] Nicholas Chammas commented on SPARK-12157: -- OK. I've raised the issue of documenting this in PySpark here: https://issues.apache.org/jira/browse/SPARK-16824 > Support numpy types as return values of Python UDFs > --- > > Key: SPARK-12157 > URL: https://issues.apache.org/jira/browse/SPARK-12157 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.2 >Reporter: Justin Uang > > Currently, if I have a python UDF > {code} > import pyspark.sql.types as T > import pyspark.sql.functions as F > from pyspark.sql import Row > import numpy as np > argmax = F.udf(lambda x: np.argmax(x), T.IntegerType()) > df = sqlContext.createDataFrame([Row(array=[1,2,3])]) > df.select(argmax("array")).count() > {code} > I get an exception that is fairly opaque: > {code} > Caused by: net.razorvine.pickle.PickleException: expected zero arguments for > construction of ClassDict (for numpy.dtype) > at > net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:85) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403) > {code} > Numpy types like np.int and np.float64 should automatically be cast to the > proper dtypes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16824) Add API docs for VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401197#comment-15401197 ] Nicholas Chammas commented on SPARK-16824: -- cc [~josephkb] [~mengxr] - Should this type be publicly documented in PySpark? > Add API docs for VectorUDT > -- > > Key: SPARK-16824 > URL: https://issues.apache.org/jira/browse/SPARK-16824 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following on the [discussion > here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153], > it appears that {{VectorUDT}} is missing documentation, at least in PySpark. > I'm not sure if this is intentional or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16824) Add API docs for VectorUDT
Nicholas Chammas created SPARK-16824: Summary: Add API docs for VectorUDT Key: SPARK-16824 URL: https://issues.apache.org/jira/browse/SPARK-16824 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 2.0.0 Reporter: Nicholas Chammas Priority: Minor Following on the [discussion here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153], it appears that {{VectorUDT}} is missing documentation, at least in PySpark. I'm not sure if this is intentional or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401179#comment-15401179 ] Nicholas Chammas commented on SPARK-12157: -- Thanks for the pointer, Maciej. It appears that {{VectorUDT}} is [undocumented|http://spark.apache.org/docs/latest/api/python/search.html?q=VectorUDT_keywords=yes=default]. Do you know if that is intentional? > Support numpy types as return values of Python UDFs > --- > > Key: SPARK-12157 > URL: https://issues.apache.org/jira/browse/SPARK-12157 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.2 >Reporter: Justin Uang > > Currently, if I have a python UDF > {code} > import pyspark.sql.types as T > import pyspark.sql.functions as F > from pyspark.sql import Row > import numpy as np > argmax = F.udf(lambda x: np.argmax(x), T.IntegerType()) > df = sqlContext.createDataFrame([Row(array=[1,2,3])]) > df.select(argmax("array")).count() > {code} > I get an exception that is fairly opaque: > {code} > Caused by: net.razorvine.pickle.PickleException: expected zero arguments for > construction of ClassDict (for numpy.dtype) > at > net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:85) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403) > {code} > Numpy types like np.int and np.float64 should automatically be cast to the > proper dtypes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399743#comment-15399743 ] Nicholas Chammas commented on SPARK-12157: -- It appears that it's not possible to have a UDF that returns a {{Vector}}. For example, consider this UDF: {code} featurize_udf = udf( lambda person1, person2: featurize(person1, person2), ArrayType(elementType=FloatType(), containsNull=False) ) {code} {{featurize()}} returns a {{DenseVector}}, which I understand is a wrapper for some numpy array type. Trying to use this UDF on a DataFrame yields: {code} Traceback (most recent call last): File ".../thing.py", line 134, in .alias('pair_features')) File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 310, in collect File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o94.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 3.0 failed 1 times, most recent failure: Lost task 21.0 in stage 3.0 (TID 34, localhost): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.ml.linalg.DenseVector) at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$5.apply(BatchEvalPythonExec.scala:137) at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$5.apply(BatchEvalPythonExec.scala:136) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} > Support numpy types as return values of Python UDFs > --- > >
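For reference, a workaround that has been suggested for this kind of error is to declare the UDF's return type as {{VectorUDT}} instead of {{ArrayType}}, so the pickler knows how to handle the returned {{DenseVector}}. A sketch only, assuming Spark 2.0's {{pyspark.ml.linalg}} module and the {{featurize()}} function from the comment above; not verified across versions:
{code}
from pyspark.sql.functions import udf
from pyspark.ml.linalg import VectorUDT

# Returning a DenseVector works as long as the declared return type is VectorUDT.
featurize_udf = udf(
    lambda person1, person2: featurize(person1, person2),
    VectorUDT()
)
{code}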
[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398267#comment-15398267 ] Nicholas Chammas commented on SPARK-12157: -- I'm looking to define a UDF in PySpark that returns a {{pyspark.ml.linalg.Vector}}. Since {{Vector}} is a wrapper for numpy types, I believe this issue covers what I'm looking for. My use case is that I want a UDF that takes in several DataFrame columns and extracts/computes features, returning them as a new {{Vector}} column. I believe {{VectorAssembler}} is for when you already have the features and you just want them put in a {{Vector}}. [~josephkb] [~zjffdu] So is it possible to do that today? Have I misunderstood how to approach my use case? > Support numpy types as return values of Python UDFs > --- > > Key: SPARK-12157 > URL: https://issues.apache.org/jira/browse/SPARK-12157 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.2 >Reporter: Justin Uang > > Currently, if I have a python UDF > {code} > import pyspark.sql.types as T > import pyspark.sql.functions as F > from pyspark.sql import Row > import numpy as np > argmax = F.udf(lambda x: np.argmax(x), T.IntegerType()) > df = sqlContext.createDataFrame([Row(array=[1,2,3])]) > df.select(argmax("array")).count() > {code} > I get an exception that is fairly opaque: > {code} > Caused by: net.razorvine.pickle.PickleException: expected zero arguments for > construction of ClassDict (for numpy.dtype) > at > net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:85) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403) > {code} > Numpy types like np.int and np.float64 should automatically be cast to the > proper dtypes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings
[ https://issues.apache.org/jira/browse/SPARK-16782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398059#comment-15398059 ] Nicholas Chammas commented on SPARK-16782: -- [~davies] [~joshrosen] - I can take this on if the idea sounds good to you. One thing I will have to look into is how the doctests show up, since we don't want those to be identical (or do we?) due to the different invocation methods. I'm not sure if autodoc will make that possible. > Use Sphinx autodoc to eliminate duplication of Python docstrings > > > Key: SPARK-16782 > URL: https://issues.apache.org/jira/browse/SPARK-16782 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Minor > > In several cases it appears that we duplicate docstrings for methods that are > exposed in multiple places. > For example, here are two instances of {{createDataFrame}} with identical > docstrings: > * > https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218 > * > https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406 > I believe we can eliminate this duplication by using the [Sphinx autodoc > extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings
Nicholas Chammas created SPARK-16782: Summary: Use Sphinx autodoc to eliminate duplication of Python docstrings Key: SPARK-16782 URL: https://issues.apache.org/jira/browse/SPARK-16782 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Reporter: Nicholas Chammas Priority: Minor In several cases it appears that we duplicate docstrings for methods that are exposed in multiple places. For example, here are two instances of {{createDataFrame}} with identical docstrings: * https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218 * https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406 I believe we can eliminate this duplication by using the [Sphinx autodoc extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16772) Correct API doc references to PySpark classes + formatting fixes
[ https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-16772: - Summary: Correct API doc references to PySpark classes + formatting fixes (was: Correct API doc references to PySpark classes) > Correct API doc references to PySpark classes + formatting fixes > > > Key: SPARK-16772 > URL: https://issues.apache.org/jira/browse/SPARK-16772 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16772) Correct API doc references to PySpark classes
[ https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-16772: - Summary: Correct API doc references to PySpark classes (was: Correct API doc references to DataType + other minor doc tweaks) > Correct API doc references to PySpark classes > - > > Key: SPARK-16772 > URL: https://issues.apache.org/jira/browse/SPARK-16772 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16772) Correct API doc references to DataType + other minor doc tweaks
Nicholas Chammas created SPARK-16772: Summary: Correct API doc references to DataType + other minor doc tweaks Key: SPARK-16772 URL: https://issues.apache.org/jira/browse/SPARK-16772 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Reporter: Nicholas Chammas Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7481) Add spark-cloud module to pull in aws+azure object store FS accessors; test integration
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389589#comment-15389589 ] Nicholas Chammas commented on SPARK-7481: - [~ste...@apache.org] - Some relevant reading for you from the Internets about the trouble people currently go through to get Spark + S3A working: http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/ > Add spark-cloud module to pull in aws+azure object store FS accessors; test > integration > --- > > Key: SPARK-7481 > URL: https://issues.apache.org/jira/browse/SPARK-7481 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.3.1 >Reporter: Steve Loughran > > To keep the s3n classpath right, to add s3a, swift & azure, the dependencies > of spark in a 2.6+ profile need to add the relevant object store packages > (hadoop-aws, hadoop-openstack, hadoop-azure) > this adds more stuff to the client bundle, but will mean a single spark > package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384535#comment-15384535 ] Nicholas Chammas commented on SPARK-12661: -- OK, sounds good to me. > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384513#comment-15384513 ] Nicholas Chammas commented on SPARK-12661: -- Yes, I mean communicating our intention to drop support in 2.1 by deprecating in 2.0. It seems wrong to release 2.0 without any indication that Python 2.6 support is going to be dropped in 2.1. People may assume that since 2.0 was released with Python 2.6 support, then the 2.x line will continue that support. Since it seems like 2.0 will have an RC5, shall we add in a brief note to the programming guide + 2.0 release notes about deprecating Python 2.6 support? > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384343#comment-15384343 ] Nicholas Chammas commented on SPARK-12661: -- To clarify what I mean by drop vs. deprecate, because I'm not sure if my distinction is standard: * Deprecate means communicating to users to get off 2.6, but still keeping code and tests for 2.6 active. * Drop means removing tests for 2.6 and potentially (though not necessarily) any code that exists only to support Python 2.6. > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384343#comment-15384343 ] Nicholas Chammas edited comment on SPARK-12661 at 7/19/16 3:23 PM: --- To clarify what I mean by drop vs. deprecate, because I'm not sure if my distinction is standard: * Deprecate means communicating to users to get off 2.6, but still keeping code and tests for 2.6 active. * Drop means removing tests for 2.6 and potentially (though not necessarily) also removing any code that exists only to support Python 2.6. was (Author: nchammas): To clarify what I mean by drop vs. deprecate, because I'm not sure if my distinction is standard: * Deprecate means communicating to users to get off 2.6, but still keeping code and tests for 2.6 active. * Drop means removing tests for 2.6 and potentially (though not necessarily) any code that exists only to support Python 2.6. > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384339#comment-15384339 ] Nicholas Chammas commented on SPARK-12661: -- Just double-checking on something: Is it OK to drop Python 2.6 support in a minor release (2.1) without officially deprecating it in a major release (2.0)? As far as I can tell from the history on this ticket, that's our current plan, but it seems a bit off. Shouldn't we make the deprecation notice with 2.0? Even just as a release note + minor prose changes to the [Programming Guide|https://github.com/apache/spark/blob/branch-2.0/docs/programming-guide.md#linking-with-spark] would be sufficient, I think. > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-16427) Expand documentation on the various RDD storage levels
[ https://issues.apache.org/jira/browse/SPARK-16427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas closed SPARK-16427. Resolution: Invalid > Expand documentation on the various RDD storage levels > -- > > Key: SPARK-16427 > URL: https://issues.apache.org/jira/browse/SPARK-16427 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Nicholas Chammas >Priority: Minor > > Looking at the docs here > http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel > A newcomer to Spark won’t understand the meaning of {{_2}}, or the meaning of > {{_SER}} (or its value), and won’t understand how exactly memory and disk > play together when something like {{MEMORY_AND_DISK}} is selected. > We should expand this documentation to explain what the various levels mean > and perhaps even when a user might want to use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16427) Expand documentation on the various RDD storage levels
[ https://issues.apache.org/jira/browse/SPARK-16427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371321#comment-15371321 ] Nicholas Chammas commented on SPARK-16427: -- Oh nevermind, this information is all available here: http://spark.apache.org/docs/1.6.2/programming-guide.html#rdd-persistence > Expand documentation on the various RDD storage levels > -- > > Key: SPARK-16427 > URL: https://issues.apache.org/jira/browse/SPARK-16427 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Nicholas Chammas >Priority: Minor > > Looking at the docs here > http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel > A newcomer to Spark won’t understand the meaning of {{_2}}, or the meaning of > {{_SER}} (or its value), and won’t understand how exactly memory and disk > play together when something like {{MEMORY_AND_DISK}} is selected. > We should expand this documentation to explain what the various levels mean > and perhaps even when a user might want to use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16427) Expand documentation on the various RDD storage levels
[ https://issues.apache.org/jira/browse/SPARK-16427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15366711#comment-15366711 ] Nicholas Chammas commented on SPARK-16427: -- My first question about this would be, how many places do we need to update this information? Just the Python and Scala docs? Is there a way to write the docs in one place and source them to another? > Expand documentation on the various RDD storage levels > -- > > Key: SPARK-16427 > URL: https://issues.apache.org/jira/browse/SPARK-16427 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Nicholas Chammas >Priority: Minor > > Looking at the docs here > http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel > A newcomer to Spark won’t understand the meaning of {{_2}}, or the meaning of > {{_SER}} (or its value), and won’t understand how exactly memory and disk > play together when something like {{MEMORY_AND_DISK}} is selected. > We should expand this documentation to explain what the various levels mean > and perhaps even when a user might want to use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator
[ https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-3181: Component/s: (was: MLilb) MLlib > Add Robust Regression Algorithm with Huber Estimator > > > Key: SPARK-3181 > URL: https://issues.apache.org/jira/browse/SPARK-3181 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Fan Jiang >Assignee: Yanbo Liang > Labels: features > Original Estimate: 0h > Remaining Estimate: 0h > > Linear least square estimates assume the error has normal distribution and > can behave badly when the errors are heavy-tailed. In practical we get > various types of data. We need to include Robust Regression to employ a > fitting criterion that is not as vulnerable as least square. > In 1973, Huber introduced M-estimation for regression which stands for > "maximum likelihood type". The method is resistant to outliers in the > response variable and has been widely used. > The new feature for MLlib will contain 3 new files > /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala > /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala > /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala > and one new class HuberRobustGradient in > /main/scala/org/apache/spark/mllib/optimization/Gradient.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16156) RowMatrix Covariance
[ https://issues.apache.org/jira/browse/SPARK-16156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-16156: - Component/s: (was: MLilb) MLlib > RowMatrix Covariance > > > Key: SPARK-16156 > URL: https://issues.apache.org/jira/browse/SPARK-16156 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 > Environment: Spark MLLIB >Reporter: Hayri Volkan Agun > Original Estimate: 168h > Remaining Estimate: 168h > > Spark doesn't provide a good solution for computing the covariance of a RowMatrix with many > columns. This can be fixed by using efficient, stable computations that approximate the true mean. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16074) Expose VectorUDT/MatrixUDT in a public API
[ https://issues.apache.org/jira/browse/SPARK-16074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-16074: - Component/s: (was: MLilb) MLlib > Expose VectorUDT/MatrixUDT in a public API > -- > > Key: SPARK-16074 > URL: https://issues.apache.org/jira/browse/SPARK-16074 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > Fix For: 2.0.0 > > > Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself > is private in Spark. However, in order to let developers implement their own > transformers and estimators, we should expose both types in a public API to > simplify the implementation of transformSchema, transform, etc. Otherwise, they > need to get the data types using reflection. > Note that this doesn't mean exposing the VectorUDT/MatrixUDT classes. We can > just have a method or a static value that returns a VectorUDT/MatrixUDT > instance with DataType as the return type. There are two ways to implement > this: > 1. following DataTypes.java in SQL, so Java users don't need the extra "()". > 2. Define DataTypes in Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16377) Spark MLlib: MultilayerPerceptronClassifier - error while training
[ https://issues.apache.org/jira/browse/SPARK-16377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-16377: - Component/s: (was: MLilb) MLlib > Spark MLlib: MultilayerPerceptronClassifier - error while training > -- > > Key: SPARK-16377 > URL: https://issues.apache.org/jira/browse/SPARK-16377 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.5.2 >Reporter: Mikhail Shiryaev > > Hi, > I am trying to train model by MultilayerPerceptronClassifier. > It works on sample data from > data/mllib/sample_multiclass_classification_data.txt with 4 features, 3 > classes and layers [4, 4, 3]. > But when I try to use other input files with other features and classes (from > here for example: > https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html) > then I get errors. > Example: > Input file aloi (128 features, 1000 classes, layers [128, 128, 1000]): > with block size = 1: > ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. > Decreasing step size to Infinity > ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: > Line search failed > ERROR LBFGS: Failure again! Giving up and returning. Maybe the objective is > just poorly behaved? > with default block size = 128: > java.lang.ArrayIndexOutOfBoundsException > at java.lang.System.arraycopy(Native Method) > at > org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:629) > > at > org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:628) > >at scala.collection.immutable.List.foreach(List.scala:381) >at > org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:628) > >at > org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:624) > > Even if I modify sample_multiclass_classification_data.txt file (rename all > 4-th features to 5-th) and run with layers [5, 5, 3] then I also get the same > errors as for file above. > So to resume: > I can't run training with default block size and with more than 4 features. > If I set block size to 1 then some actions are happened but I get errors > from LBFGS. > It is reproducible with Spark 1.5.2 and from master branch on github (from > 4-th July). > Did somebody already met with such behavior? > Is there bug in MultilayerPerceptronClassifier or I use it incorrectly? > Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16290) text type features column for classification
[ https://issues.apache.org/jira/browse/SPARK-16290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-16290: - Component/s: (was: MLilb) MLlib > text type features column for classification > > > Key: SPARK-16290 > URL: https://issues.apache.org/jira/browse/SPARK-16290 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 1.6.2 >Reporter: mahendra singh > Labels: features > Original Estimate: 504h > Remaining Estimate: 504h > > We should improve Spark ML and MLlib's handling of feature columns: it should > be possible to include text-type values among the features. > Suppose we have 4 feature columns: > id, dept_name, score, result. > Since dept_name is a text column, Spark should handle it internally by > converting the text to a numerical column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16232) Getting error by making columns using DataFrame
[ https://issues.apache.org/jira/browse/SPARK-16232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-16232: - Component/s: (was: MLilb) MLlib > Getting error by making columns using DataFrame > > > Key: SPARK-16232 > URL: https://issues.apache.org/jira/browse/SPARK-16232 > Project: Spark > Issue Type: Question > Components: MLlib, PySpark >Affects Versions: 1.5.1 > Environment: Winodws, ipython notebook >Reporter: Inam Ur Rehman > Labels: ipython, pandas, pyspark, python > > I am using pyspark in ipython notebook for analysis. > I am following an example toturial this > http://nbviewer.jupyter.org/github/bensadeghi/pyspark-churn-prediction/blob/master/churn-prediction.ipynb > I am getting error on this step in the 7th cell of notebook > pd.DataFrame(CV_data.take(5), columns=CV_data.columns) > Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage > 11.0 (TID 18, localhost): org.apache.spark.api.python.PythonException: > Traceback (most recent call last): > Here is the full error : > Py4JJavaError Traceback (most recent call last) > in () > > 1 pd.DataFrame(CV_data.take(5), columns=CV_data.columns) > C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\dataframe.py > in take(self, num) > 303 with SCCallSiteSync(self._sc) as css: > 304 port = > self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe( > --> 305 self._jdf, num) > 306 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 307 > C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py > in __call__(self, *args) > 536 answer = self.gateway_client.send_command(command) > 537 return_value = get_return_value(answer, self.gateway_client, > --> 538 self.target_id, self.name) > 539 > 540 for temp_arg in temp_args: > C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\utils.py > in deco(*a, **kw) > 34 def deco(*a, **kw): > 35 try: > ---> 36 return f(*a, **kw) > 37 except py4j.protocol.Py4JJavaError as e: > 38 s = e.java_exception.toString() > C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage > 11.0 (TID 18, localhost): org.apache.spark.api.python.PythonException: > Traceback (most recent call last): > File > "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", > line 111, in main > File > "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", > line 106, in process > File > "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\serializers.py", > line 263, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File > "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\functions.py", > line 1417, in > func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it) > File "", line 5, in > KeyError: False > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) > at > org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:397) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:362) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706) > at >
[jira] [Created] (SPARK-16427) Expand documentation on the various RDD storage levels
Nicholas Chammas created SPARK-16427: Summary: Expand documentation on the various RDD storage levels Key: SPARK-16427 URL: https://issues.apache.org/jira/browse/SPARK-16427 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Nicholas Chammas Priority: Minor Looking at the docs here http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel A newcomer to Spark won’t understand the meaning of {{_2}}, or the meaning of {{_SER}} (or its value), and won’t understand how exactly memory and disk play together when something like {{MEMORY_AND_DISK}} is selected. We should expand this documentation to explain what the various levels mean and perhaps even when a user might want to use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
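For context, a minimal PySpark sketch of what the suffixes refer to, assuming an active SparkContext {{sc}}:
{code}
from pyspark import StorageLevel

rdd = sc.parallelize(range(1000))

# MEMORY_AND_DISK: keep partitions in memory, spilling to disk when they don't fit.
# The _2 suffix replicates each cached partition on two nodes;
# the _SER variants keep partitions in serialized form, trading CPU for memory.
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
{code}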
[jira] [Commented] (SPARK-15760) Documentation missing for package-related config options
[ https://issues.apache.org/jira/browse/SPARK-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15366702#comment-15366702 ] Nicholas Chammas commented on SPARK-15760: -- Updating component since it seems we are using "Documentation" and not "docs". > Documentation missing for package-related config options > > > Key: SPARK-15760 > URL: https://issues.apache.org/jira/browse/SPARK-15760 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.6.1, 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.0.0 > > > There's no documentation about the config options that correlate to the > "--packages" (and friends) arguments of spark-submit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15441) dataset outer join seems to return incorrect result
[ https://issues.apache.org/jira/browse/SPARK-15441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-15441: - Component/s: (was: sq;) SQL > dataset outer join seems to return incorrect result > --- > > Key: SPARK-15441 > URL: https://issues.apache.org/jira/browse/SPARK-15441 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > See notebook > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html > {code} > import org.apache.spark.sql.functions > val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS() > val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS() > // The last row _1 should be null, rather than (null, -1) > left.toDF("k", "v").as[(String, Int)].alias("left") > .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"), > functions.col("left.k") === functions.col("right.k"), "right_outer") > .show() > {code} > The returned result currently is > {code} > +-+-+ > | _1| _2| > +-+-+ > |(a,2)|(a,x)| > |(a,1)|(a,x)| > |(b,3)|(b,y)| > |(null,-1)|(d,z)| > +-+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15772) Improve Scala API docs
[ https://issues.apache.org/jira/browse/SPARK-15772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-15772: - Component/s: (was: docs) > Improve Scala API docs > --- > > Key: SPARK-15772 > URL: https://issues.apache.org/jira/browse/SPARK-15772 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: nirav patel > > Hi, I just found out that the Spark Python APIs are much more elaborate than > their Scala counterparts. e.g. > https://spark.apache.org/docs/1.4.1/api/python/pyspark.html?highlight=treereduce#pyspark.RDD.treeReduce > https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=treereduce#pyspark.RDD > There are clear explanations of parameters, and there are examples as well. I > think this would be a great improvement to the Scala API as well. It would make the API > friendlier in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15760) Documentation missing for package-related config options
[ https://issues.apache.org/jira/browse/SPARK-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-15760: - Component/s: (was: docs) Documentation > Documentation missing for package-related config options > > > Key: SPARK-15760 > URL: https://issues.apache.org/jira/browse/SPARK-15760 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.6.1, 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.0.0 > > > There's no documentation about the config options that correlate to the > "--packages" (and friends) arguments of spark-submit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357535#comment-15357535 ] Nicholas Chammas commented on SPARK-: - {quote} Python itself has no compile time type safety. {quote} Practically speaking, this is no longer true. You can get a decent measure of "compile" time type safety using recent additions to Python (both the language itself and the ecosystem). Specifically, optional static type checking has been a focus in Python since 3.5+, and according to Python's BDFL both Google and Dropbox are updating large parts of their codebases to use Python's new typing features. Static type checkers for Python like [mypy|http://mypy-lang.org/] are already in use and are backed by several core Python developers, including Guido van Rossum (Python's creator/BDFL). So I don't think Datasets are a critical feature for PySpark just yet, and it will take some time for the general Python community to learn and take advantage of Python's new optional static typing features and tools, but I would keep this on the radar. > Dataset API on top of Catalyst/DataFrame > > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > Fix For: 2.0.0 > > > The RDD API is very flexible, and as a result harder to optimize its > execution in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line-up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. 
> For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] > The initial version of the Dataset API has been merged in Spark 1.6. However, > it will take a few more future releases to flush everything out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
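As a concrete illustration of the checking mypy provides, a small sketch using the standard {{typing}} module (not Spark-specific code):
{code}
from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

mean([1.0, 2.0, 3.0])  # OK
mean("not a list")     # flagged by mypy before the code ever runs
{code}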
[jira] [Commented] (SPARK-11744) bin/pyspark --version doesn't return version and exit
[ https://issues.apache.org/jira/browse/SPARK-11744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347297#comment-15347297 ] Nicholas Chammas commented on SPARK-11744: -- This is not the appropriate place to ask random PySpark questions. Please post a question on Stack Overflow or on the Spark user list. > bin/pyspark --version doesn't return version and exit > - > > Key: SPARK-11744 > URL: https://issues.apache.org/jira/browse/SPARK-11744 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Nicholas Chammas >Assignee: Saisai Shao >Priority: Minor > Fix For: 1.6.0 > > > {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: > {code} > $ ./spark/bin/pyspark --help > Usage: ./bin/pyspark [options] > Options: > ... > --version, Print the version of current Spark > ... > {code} > However, trying to get the version in this way doesn't yield the expected > results. > Instead of printing the version and exiting, we get the version, a stack > trace, and then get dropped into a broken PySpark shell. > {code} > $ ./spark/bin/pyspark --version > Python 2.7.10 (default, Aug 11 2015, 23:39:10) > [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.5.2 > /_/ > > Type --help for more information. > Traceback (most recent call last): > File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in > sc = SparkContext(pyFiles=add_files) > File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in > _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in > launch_gateway > raise Exception("Java gateway process exited before sending the driver > its port number") > Exception: Java gateway process exited before sending the driver its port > number > >>> > >>> sc > Traceback (most recent call last): > File "", line 1, in > NameError: name 'sc' is not defined > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292599#comment-15292599 ] Nicholas Chammas commented on SPARK-3821: - You can deploy Spark today on Docker just fine. It's just that Spark itself does not maintain any official Dockerfiles and likely never will since the project is actually trying to push deployment stuff outside the main project (hence why spark-ec2 was moved out; you will not see spark-ec2 in the official docs once Spark 2.0 comes out). You may be more interested in the Apache Big Top project, which focuses on big data system deployment (including Spark) and may have stuff for Docker specifically. Mesos is a separate matter, because it's a resource manager (analogous to YARN) that integrates with Spark at a low level. If you still think Spark should host and maintain an official Dockerfile and Docker images that are suitable for production use, please open a separate issue. I think the maintainers will reject it on the grounds that I have explained here, though. (Can't say for sure; after all I'm just a random contributor.) > Develop an automated way of creating Spark images (AMI, Docker, and others) > --- > > Key: SPARK-3821 > URL: https://issues.apache.org/jira/browse/SPARK-3821 > Project: Spark > Issue Type: Improvement > Components: Build, EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas > Attachments: packer-proposal.html > > > Right now the creation of Spark AMIs or Docker containers is done manually. > With tools like [Packer|http://www.packer.io/], we should be able to automate > this work, and do so in such a way that multiple types of machine images can > be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292198#comment-15292198 ] Nicholas Chammas commented on SPARK-3821: - Not sure if there is renewed interest, but at this point this issue is outside the scope of the Spark project. The original impetus for this issue was to create AMIs for spark-ec2 in an automated fashion, and spark-ec2 has been moved out of the main Spark project. spark-ec2 now lives here: https://github.com/amplab/spark-ec2 > Develop an automated way of creating Spark images (AMI, Docker, and others) > --- > > Key: SPARK-3821 > URL: https://issues.apache.org/jira/browse/SPARK-3821 > Project: Spark > Issue Type: Improvement > Components: Build, EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas > Attachments: packer-proposal.html > > > Right now the creation of Spark AMIs or Docker containers is done manually. > With tools like [Packer|http://www.packer.io/], we should be able to automate > this work, and do so in such a way that multiple types of machine images can > be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15072) Remove SparkSession.withHiveSupport
[ https://issues.apache.org/jira/browse/SPARK-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285112#comment-15285112 ] Nicholas Chammas commented on SPARK-15072: -- Brief note from [~yhuai] on the motivation behind this issue: https://github.com/apache/spark/pull/13069#issuecomment-219516577 > Remove SparkSession.withHiveSupport > --- > > Key: SPARK-15072 > URL: https://issues.apache.org/jira/browse/SPARK-15072 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Sandeep Singh > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10899) Support JDBC pushdown for additional commands
[ https://issues.apache.org/jira/browse/SPARK-10899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282239#comment-15282239 ] Nicholas Chammas commented on SPARK-10899: -- Is {{COUNT}} also something that can be pushed down? If I have a DataFrame defined against a JDBC source and say {{df.count()}}, I think we should be able to push the count down into the database. By the way, I would say {{LIMIT}} is probably the most urgent clause that needs to be pushed down. Right now if you do {{df.take(1)}} it appears that Spark will read the whole friggin table. It makes interactive development and testing impossible. > Support JDBC pushdown for additional commands > - > > Key: SPARK-10899 > URL: https://issues.apache.org/jira/browse/SPARK-10899 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.0 >Reporter: Victor May >Priority: Minor > > Spark DataFrames currently support predicate push-down with JDBC sources but > term predicate is used in a strict SQL meaning. It means it covers only WHERE > clause. Moreover it looks like it is limited to the logical conjunction (no > IN and OR I am afraid) and simple predicates. > This creates a situation where a simple query such as "select * from table > limit 100" could easily result in the database being overloaded when the > table is large. > This feature request is to expand the support for push-downs to additional > SQL commands: > -LIMIT > -WHERE IN > -WHERE NOT IN > -GROUP BY -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
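A workaround worth noting while pushdown support is missing: I believe the JDBC data source accepts a parenthesized subquery wherever a table name goes, so you can push the LIMIT (or WHERE IN, GROUP BY, etc.) down by hand. A minimal PySpark sketch with made-up connection details:
{code}
# Sketch only -- URL, table name, and credentials are hypothetical.
# The subquery runs inside the database, so Spark only ever sees 100 rows.
limited = sqlContext.read.jdbc(
    url="jdbc:postgresql://db-host:5432/mydb",
    table="(SELECT * FROM big_table LIMIT 100) AS t",
    properties={"user": "spark", "password": "secret"},
)
limited.show()
{code}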
[jira] [Comment Edited] (SPARK-7506) pyspark.sql.types.StructType.fromJson() is incorrectly named
[ https://issues.apache.org/jira/browse/SPARK-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280606#comment-15280606 ] Nicholas Chammas edited comment on SPARK-7506 at 5/11/16 6:51 PM: -- [~davies] - Would you be interested in a PR that adds a new method called {{fromDict()}} and makes {{fromJson()}} an alias of that method? It would preserve compatibility, while (kinda) correcting the naming problem here. was (Author: nchammas): [~davies] - Would you be interested in a PR that adds new methods called {{toDict()}} and {{fromDict()}}, and made {{json()}} and {{fromJson()}} aliases of those methods? It would preserve compatibility, while correcting the naming problem here. > pyspark.sql.types.StructType.fromJson() is incorrectly named > > > Key: SPARK-7506 > URL: https://issues.apache.org/jira/browse/SPARK-7506 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.3.0, 1.3.1 >Reporter: Nicholas Chammas >Priority: Minor > > {code} > >>> json_rdd = sqlContext.jsonRDD(sc.parallelize(['{"name": "Nick"}'])) > >>> json_rdd.schema > StructType(List(StructField(name,StringType,true))) > >>> type(json_rdd.schema) > > >>> json_rdd.schema.json() > '{"fields":[{"metadata":{},"name":"name","nullable":true,"type":"string"}],"type":"struct"}' > >>> pyspark.sql.types.StructType.fromJson(json_rdd.schema.json()) > Traceback (most recent call last): > File "", line 1, in > File > "/Applications/apache-spark/spark-1.3.1-bin-hadoop2.4/python/pyspark/sql/types.py", > line 346, in fromJson > return StructType([StructField.fromJson(f) for f in json["fields"]]) > TypeError: string indices must be integers, not str > >>> import json > >>> pyspark.sql.types.StructType.fromJson(json.loads(json_rdd.schema.json())) > StructType(List(StructField(name,StringType,true))) > >>> > {code} > So {{fromJson()}} doesn't actually expect JSON, which is a string. It expects > a dictionary. > This method should probably be renamed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
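To make the shape of the proposal concrete, here's a toy sketch of the alias pattern in plain Python (not actual Spark source; names and bodies are illustrative only):
{code}
# Toy illustration of the proposed compatibility-preserving rename:
# add the honestly-named classmethod and keep the old name as an alias.
class Schema(object):

    def __init__(self, fields):
        self.fields = fields

    @classmethod
    def fromDict(cls, d):
        """The input really is a dict, so say so in the name."""
        return cls(d["fields"])

    # Old, misleading name kept so existing callers don't break.
    fromJson = fromDict


s1 = Schema.fromDict({"fields": ["a", "b"]})
s2 = Schema.fromJson({"fields": ["a", "b"]})  # still works via the alias
print(s1.fields == s2.fields)                 # True
{code}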
[jira] [Commented] (SPARK-7506) pyspark.sql.types.StructType.fromJson() is incorrectly named
[ https://issues.apache.org/jira/browse/SPARK-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280606#comment-15280606 ] Nicholas Chammas commented on SPARK-7506: - [~davies] - Would you be interested in a PR that adds new methods called {{toDict()}} and {{fromDict()}}, and made {{json()}} and {{fromJson()}} aliases of those methods? It would preserve compatibility, while correcting the naming problem here. > pyspark.sql.types.StructType.fromJson() is incorrectly named > > > Key: SPARK-7506 > URL: https://issues.apache.org/jira/browse/SPARK-7506 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.3.0, 1.3.1 >Reporter: Nicholas Chammas >Priority: Minor > > {code} > >>> json_rdd = sqlContext.jsonRDD(sc.parallelize(['{"name": "Nick"}'])) > >>> json_rdd.schema > StructType(List(StructField(name,StringType,true))) > >>> type(json_rdd.schema) > > >>> json_rdd.schema.json() > '{"fields":[{"metadata":{},"name":"name","nullable":true,"type":"string"}],"type":"struct"}' > >>> pyspark.sql.types.StructType.fromJson(json_rdd.schema.json()) > Traceback (most recent call last): > File "", line 1, in > File > "/Applications/apache-spark/spark-1.3.1-bin-hadoop2.4/python/pyspark/sql/types.py", > line 346, in fromJson > return StructType([StructField.fromJson(f) for f in json["fields"]]) > TypeError: string indices must be integers, not str > >>> import json > >>> pyspark.sql.types.StructType.fromJson(json.loads(json_rdd.schema.json())) > StructType(List(StructField(name,StringType,true))) > >>> > {code} > So {{fromJson()}} doesn't actually expect JSON, which is a string. It expects > a dictionary. > This method should probably be renamed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15256) Clarify the docstring for DataFrameReader.jdbc()
[ https://issues.apache.org/jira/browse/SPARK-15256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-15256: - Description: The doc for the {{properties}} parameter [currently reads|https://github.com/apache/spark/blob/d37c7f7f042f7943b5b684e53cf4284c601fb347/python/pyspark/sql/readwriter.py#L437-L439]: {quote} :param properties: JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. {quote} This is incorrect, since {{properties}} is expected to be a dictionary. Some of the other parameters have cryptic descriptions. I'll try to clarify those as well. was: The doc for the {{properties}} parameter [currently reads|https://github.com/apache/spark/blob/d37c7f7f042f7943b5b684e53cf4284c601fb347/python/pyspark/sql/readwriter.py#L437-L439]: {quote} :param properties: JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. {quote} This is incorrect, since {{properties}} is expected to be a dictionary. > Clarify the docstring for DataFrameReader.jdbc() > > > Key: SPARK-15256 > URL: https://issues.apache.org/jira/browse/SPARK-15256 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 1.6.1 >Reporter: Nicholas Chammas >Priority: Minor > > The doc for the {{properties}} parameter [currently > reads|https://github.com/apache/spark/blob/d37c7f7f042f7943b5b684e53cf4284c601fb347/python/pyspark/sql/readwriter.py#L437-L439]: > {quote} > :param properties: JDBC database connection arguments, a list of > arbitrary string >tag/value. Normally at least a "user" and > "password" property >should be included. > {quote} > This is incorrect, since {{properties}} is expected to be a dictionary. > Some of the other parameters have cryptic descriptions. I'll try to clarify > those as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
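For reference, the corrected docstring should describe a call like this (connection details are made up), where {{properties}} is a plain Python dict:
{code}
# Hypothetical URL, table, and credentials -- the point is only that
# `properties` is a dict, not "a list of arbitrary string tag/value".
df = sqlContext.read.jdbc(
    url="jdbc:mysql://db-host:3306/mydb",
    table="accounts",
    properties={"user": "spark", "password": "secret"},
)
df.printSchema()
{code}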
[jira] [Updated] (SPARK-15256) Clarify the docstring for DataFrameReader.jdbc()
[ https://issues.apache.org/jira/browse/SPARK-15256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-15256: - Summary: Clarify the docstring for DataFrameReader.jdbc() (was: Correct the docstring for DataFrameReader.jdbc()) > Clarify the docstring for DataFrameReader.jdbc() > > > Key: SPARK-15256 > URL: https://issues.apache.org/jira/browse/SPARK-15256 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 1.6.1 >Reporter: Nicholas Chammas >Priority: Minor > > The doc for the {{properties}} parameter [currently > reads|https://github.com/apache/spark/blob/d37c7f7f042f7943b5b684e53cf4284c601fb347/python/pyspark/sql/readwriter.py#L437-L439]: > {quote} > :param properties: JDBC database connection arguments, a list of > arbitrary string >tag/value. Normally at least a "user" and > "password" property >should be included. > {quote} > This is incorrect, since {{properties}} is expected to be a dictionary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15256) Correct the docstring for DataFrameReader.jdbc()
Nicholas Chammas created SPARK-15256: Summary: Correct the docstring for DataFrameReader.jdbc() Key: SPARK-15256 URL: https://issues.apache.org/jira/browse/SPARK-15256 Project: Spark Issue Type: Bug Components: Documentation, PySpark Affects Versions: 1.6.1 Reporter: Nicholas Chammas Priority: Minor The doc for the {{properties}} parameter [currently reads|https://github.com/apache/spark/blob/d37c7f7f042f7943b5b684e53cf4284c601fb347/python/pyspark/sql/readwriter.py#L437-L439]: {quote} :param properties: JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. {quote} This is incorrect, since {{properties}} is expected to be a dictionary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15193) samplingRatio should default to 1.0 across the board
[ https://issues.apache.org/jira/browse/SPARK-15193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278393#comment-15278393 ] Nicholas Chammas edited comment on SPARK-15193 at 5/10/16 4:27 PM: --- Nope, a sampling ratio of 1.0 and None mean [different things|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame]: {quote} If schema inference is needed, samplingRatio is used to determined the ratio of rows used for schema inference. The first row will be used if samplingRatio is None. {quote} createDataFrame() is not consistent with jsonRDD() as of 1.6.1. Looking in master, I can't seem to find jsonRDD() for Python anymore, so perhaps that got removed. So actually you're right, it looks like the default _did_ get standardized, but to None. :/ was (Author: nchammas): Nope, a sampling ratio of 1.0 and None mean [different things|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame]: {quote} If schema inference is needed, samplingRatio is used to determined the ratio of rows used for schema inference. The first row will be used if samplingRatio is None. {quote} createDataFrame() is not consistent with jsonRDD() as of 1.6.1. Looking in master, I can't seem to find jsonRDD() for Python anymore, so perhaps that got removed. So actually, it looks like the default _did_ get standardized, but to None. :/ > samplingRatio should default to 1.0 across the board > > > Key: SPARK-15193 > URL: https://issues.apache.org/jira/browse/SPARK-15193 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Nicholas Chammas >Priority: Minor > > The default sampling ratio for {{jsonRDD}} is > [1.0|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.jsonRDD], > whereas for {{createDataFrame}} it's > [{{None}}|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame]. > I think the default sampling ratio should be 1.0 across the board. Users > should have to explicitly supply a lower sampling ratio if they know their > dataset has a consistent structure. Otherwise, I think the "safer" thing to > default to is to check all the data. > Targeting this for 2.0 in case we consider it a breaking change that would be > more difficult to get in later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
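To make the difference concrete, here's a small sketch (made-up data) of what each default means for schema inference:
{code}
from pyspark.sql import Row

rdd = sc.parallelize([Row(a=1), Row(a=2), Row(a=3)])

# samplingRatio=None (createDataFrame's current default): the schema is
# inferred from the first row only.
df_first_row = sqlContext.createDataFrame(rdd)

# samplingRatio=1.0 (what this ticket proposes as the default everywhere):
# every row is scanned during inference, which is slower but catches rows
# that disagree with the first one instead of failing later at runtime.
df_all_rows = sqlContext.createDataFrame(rdd, samplingRatio=1.0)
{code}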
[jira] [Commented] (SPARK-15193) samplingRatio should default to 1.0 across the board
[ https://issues.apache.org/jira/browse/SPARK-15193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278393#comment-15278393 ] Nicholas Chammas commented on SPARK-15193: -- Nope, a sampling ratio of 1.0 and None mean [different things|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame]: {quote} If schema inference is needed, samplingRatio is used to determined the ratio of rows used for schema inference. The first row will be used if samplingRatio is None. {quote} createDataFrame() is not consistent with jsonRDD() as of 1.6.1. Looking in master, I can't seem to find jsonRDD() for Python anymore, so perhaps that got removed. So actually, it looks like the default _did_ get standardized, but to None. :/ > samplingRatio should default to 1.0 across the board > > > Key: SPARK-15193 > URL: https://issues.apache.org/jira/browse/SPARK-15193 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Nicholas Chammas >Priority: Minor > > The default sampling ratio for {{jsonRDD}} is > [1.0|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.jsonRDD], > whereas for {{createDataFrame}} it's > [{{None}}|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame]. > I think the default sampling ratio should be 1.0 across the board. Users > should have to explicitly supply a lower sampling ratio if they know their > dataset has a consistent structure. Otherwise, I think the "safer" thing to > default to is to check all the data. > Targeting this for 2.0 in case we consider it a breaking change that would be > more difficult to get in later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15238) Clarify Python 3 support in docs
Nicholas Chammas created SPARK-15238: Summary: Clarify Python 3 support in docs Key: SPARK-15238 URL: https://issues.apache.org/jira/browse/SPARK-15238 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Reporter: Nicholas Chammas Priority: Trivial The [current doc|http://spark.apache.org/docs/1.6.1/] reads: {quote} Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.6.1 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x). {quote} Projects that support Python 3 generally mention that explicitly. A casual Python user might assume from this line that Spark supports Python 2.6 and 2.7 but not 3+. More specifically, I gather from SPARK-4897 that Spark actually supports 3.4+ and not earlier versions of Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277450#comment-15277450 ] Nicholas Chammas commented on SPARK-12661: -- [~davies] / [~joshrosen] - Has this been settled on? The dev list discussion from January seemed to converge almost unanimously on dropping Python 2.6 support in Spark 2.0, but I don't think an official decision was ever announced. > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277448#comment-15277448 ] Nicholas Chammas commented on SPARK-12661: -- [~shivaram] - Can you confirm that spark-ec2 will drop support for Python 2.6 starting with Spark 2.0? For the record, I don't think it will affect things much either way anymore since spark-ec2 is now a separate project. > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15204) Nullable is not correct for Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275394#comment-15275394 ] Nicholas Chammas commented on SPARK-15204: -- Loosely related: SPARK-15191 > Nullable is not correct for Aggregator > -- > > Key: SPARK-15204 > URL: https://issues.apache.org/jira/browse/SPARK-15204 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: spark-2.0.0-SNAPSHOT >Reporter: koert kuipers >Priority: Minor > > {noformat} > object SimpleSum extends Aggregator[Row, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Row) = b + a.getInt(1) > def merge(b1: Int, b2: Int) = b1 + b2 > def finish(b: Int) = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v") > val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1") > df1.printSchema > df1.show > root > |-- k: string (nullable = true) > |-- v1: integer (nullable = true) > +---+---+ > | k| v1| > +---+---+ > | a| 6| > +---+---+ > {noformat} > notice how v1 has nullable set to true. the default (and expected) behavior > for spark sql is to give an int column false for nullable. for example if i > had uses a built-in aggregator like "sum" instead if would have reported > nullable = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15191) createDataFrame() should mark fields that are known not to be null as not nullable
[ https://issues.apache.org/jira/browse/SPARK-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-15191: - Affects Version/s: 1.6.1 > createDataFrame() should mark fields that are known not to be null as not > nullable > -- > > Key: SPARK-15191 > URL: https://issues.apache.org/jira/browse/SPARK-15191 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.6.1 >Reporter: Nicholas Chammas >Priority: Minor > > Here's a brief reproduction: > {code} > >>> numbers = sqlContext.createDataFrame( > ... data=[(1,), (2,), (3,), (4,), (5,)], > ... samplingRatio=1 # go through all the data please! > ... ) > >>> numbers.printSchema() > root > |-- _1: long (nullable = true) > {code} > The field is marked as nullable even though none of the data is null and we > had {{createDataFrame()}} go through all the data. > In situations like this, shouldn't {{createDataFrame()}} return a DataFrame > with the field marked as not nullable? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15191) createDataFrame() should mark fields that are known not to be null as not nullable
[ https://issues.apache.org/jira/browse/SPARK-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274785#comment-15274785 ] Nicholas Chammas commented on SPARK-15191: -- [~yhuai] - This loosely relates to the discussion in SPARK-11319. > createDataFrame() should mark fields that are known not to be null as not > nullable > -- > > Key: SPARK-15191 > URL: https://issues.apache.org/jira/browse/SPARK-15191 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Nicholas Chammas >Priority: Minor > > Here's a brief reproduction: > {code} > >>> numbers = sqlContext.createDataFrame( > ... data=[(1,), (2,), (3,), (4,), (5,)], > ... samplingRatio=1 # go through all the data please! > ... ) > >>> numbers.printSchema() > root > |-- _1: long (nullable = true) > {code} > The field is marked as nullable even though none of the data is null and we > had {{createDataFrame()}} go through all the data. > In situations like this, shouldn't {{createDataFrame()}} return a DataFrame > with the field marked as not nullable? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15193) samplingRatio should default to 1.0 across the board
[ https://issues.apache.org/jira/browse/SPARK-15193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274783#comment-15274783 ] Nicholas Chammas commented on SPARK-15193: -- [~yhuai] - What do you think of this proposed change? > samplingRatio should default to 1.0 across the board > > > Key: SPARK-15193 > URL: https://issues.apache.org/jira/browse/SPARK-15193 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Nicholas Chammas >Priority: Minor > > The default sampling ratio for {{jsonRDD}} is > [1.0|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.jsonRDD], > whereas for {{createDataFrame}} it's > [{{None}}|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame]. > I think the default sampling ratio should be 1.0 across the board. Users > should have to explicitly supply a lower sampling ratio if they know their > dataset has a consistent structure. Otherwise, I think the "safer" thing to > default to is to check all the data. > Targeting this for 2.0 in case we consider it a breaking change that would be > more difficult to get in later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15193) samplingRatio should default to 1.0 across the board
Nicholas Chammas created SPARK-15193: Summary: samplingRatio should default to 1.0 across the board Key: SPARK-15193 URL: https://issues.apache.org/jira/browse/SPARK-15193 Project: Spark Issue Type: Improvement Components: PySpark, SQL Reporter: Nicholas Chammas Priority: Minor The default sampling ratio for {{jsonRDD}} is [1.0|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.jsonRDD], whereas for {{createDataFrame}} it's [{{None}}|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame]. I think the default sampling ratio should be 1.0 across the board. Users should have to explicitly supply a lower sampling ratio if they know their dataset has a consistent structure. Otherwise, I think the "safer" thing to default to is to check all the data. Targeting this for 2.0 in case we consider it a breaking change that would be more difficult to get in later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15191) createDataFrame() should mark fields that are known not to be null as not nullable
Nicholas Chammas created SPARK-15191: Summary: createDataFrame() should mark fields that are known not to be null as not nullable Key: SPARK-15191 URL: https://issues.apache.org/jira/browse/SPARK-15191 Project: Spark Issue Type: Improvement Components: PySpark, SQL Reporter: Nicholas Chammas Priority: Minor Here's a brief reproduction: {code} >>> numbers = sqlContext.createDataFrame( ... data=[(1,), (2,), (3,), (4,), (5,)], ... samplingRatio=1 # go through all the data please! ... ) >>> numbers.printSchema() root |-- _1: long (nullable = true) {code} The field is marked as nullable even though none of the data is null and we had {{createDataFrame()}} go through all the data. In situations like this, shouldn't {{createDataFrame()}} return a DataFrame with the field marked as not nullable? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
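For anyone who needs the tighter schema today, the only route I know of is to skip inference and declare the schema yourself, e.g.:
{code}
from pyspark.sql.types import StructType, StructField, LongType

# Workaround sketch: declare the field as non-nullable explicitly instead
# of relying on createDataFrame()'s inference.
schema = StructType([StructField("_1", LongType(), nullable=False)])
numbers = sqlContext.createDataFrame(
    data=[(1,), (2,), (3,), (4,), (5,)],
    schema=schema,
)
numbers.printSchema()  # root |-- _1: long (nullable = false)
{code}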
[jira] [Commented] (SPARK-13740) add null check for _verify_type in types.py
[ https://issues.apache.org/jira/browse/SPARK-13740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274738#comment-15274738 ] Nicholas Chammas commented on SPARK-13740: -- I noticed the PR only modifies PySpark. Are similar changes not required for the other language APIs? > add null check for _verify_type in types.py > --- > > Key: SPARK-13740 > URL: https://issues.apache.org/jira/browse/SPARK-13740 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11319) PySpark silently accepts null values in non-nullable DataFrame fields.
[ https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274728#comment-15274728 ] Nicholas Chammas commented on SPARK-11319: -- [~marmbrus] / [~yhuai] - Does SPARK-13740 resolve the problem described here? If I read that issue correctly, it appears that it's no longer possible to have null data in a field that is marked as not nullable. > PySpark silently accepts null values in non-nullable DataFrame fields. > -- > > Key: SPARK-11319 > URL: https://issues.apache.org/jira/browse/SPARK-11319 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Kevin Cox > > Running the following code with a null value in a non-nullable column > silently works. This makes the code incredibly hard to trust. > {code} > In [2]: from pyspark.sql.types import * > In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", > TimestampType(), False)])).collect() > Out[3]: [Row(a=None)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14932) Allow DataFrame.replace() to replace values with None
[ https://issues.apache.org/jira/browse/SPARK-14932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259266#comment-15259266 ] Nicholas Chammas commented on SPARK-14932: -- [~marmbrus] - Not sure if you're a good person to ask about this, but does this request seem reasonable from an API standpoint? > Allow DataFrame.replace() to replace values with None > - > > Key: SPARK-14932 > URL: https://issues.apache.org/jira/browse/SPARK-14932 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nicholas Chammas >Priority: Minor > > Current doc: > http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace > I would like to specify {{None}} as the value to substitute in. This is > currently > [disallowed|https://github.com/apache/spark/blob/9797cc20c0b8fb34659df11af8eccb9ed293c52c/python/pyspark/sql/dataframe.py#L1144-L1145]. > My use case is for replacing bad values with {{None}} so I can then ignore > them with {{dropna()}}. > For example, I have a dataset that incorrectly includes empty strings where > there should be {{None}} values. I would like to replace the empty strings > with {{None}} and then drop all null data with {{dropna()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
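In the meantime, here is a workaround sketch that avoids {{replace()}} entirely (toy column name and data), for anyone else hitting this:
{code}
from pyspark.sql import functions as F

# Toy data: empty strings standing in for missing values.
df = sqlContext.createDataFrame([("a",), ("",), ("b",)], ["col"])

# Rewrite the column with when/otherwise so empty strings become real
# nulls, then drop the nulls -- the same end state that
# replace(..., None) followed by dropna() would give.
cleaned = (
    df.withColumn("col", F.when(F.col("col") == "", None).otherwise(F.col("col")))
      .dropna()
)
cleaned.show()
{code}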
[jira] [Created] (SPARK-14932) Allow DataFrame.replace() to replace values with None
Nicholas Chammas created SPARK-14932: Summary: Allow DataFrame.replace() to replace values with None Key: SPARK-14932 URL: https://issues.apache.org/jira/browse/SPARK-14932 Project: Spark Issue Type: Improvement Components: SQL Reporter: Nicholas Chammas Priority: Minor Current doc: http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace I would like to specify {{None}} as the value to substitute in. This is currently [disallowed|https://github.com/apache/spark/blob/9797cc20c0b8fb34659df11af8eccb9ed293c52c/python/pyspark/sql/dataframe.py#L1144-L1145]. My use case is for replacing bad values with {{None}} so I can then ignore them with {{dropna()}}. For example, I have a dataset that incorrectly includes empty strings where there should be {{None}} values. I would like to replace the empty strings with {{None}} and then drop all null data with {{dropna()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14742) Redirect spark-ec2 doc to new location
Nicholas Chammas created SPARK-14742: Summary: Redirect spark-ec2 doc to new location Key: SPARK-14742 URL: https://issues.apache.org/jira/browse/SPARK-14742 Project: Spark Issue Type: Documentation Components: Documentation, EC2 Reporter: Nicholas Chammas Priority: Minor See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453 We need to redirect this page http://spark.apache.org/docs/latest/ec2-scripts.html to this page https://github.com/amplab/spark-ec2#readme -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8327) Ganglia failed to start while starting standalone on EC 2 spark with spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-8327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249144#comment-15249144 ] Nicholas Chammas commented on SPARK-8327: - [~vvladymyrov] - Is this still an issue? If so, I suggest migrating this issue to the spark-ec2 tracker: https://github.com/amplab/spark-ec2/issues > Ganglia failed to start while starting standalone on EC 2 spark with > spark-ec2 > --- > > Key: SPARK-8327 > URL: https://issues.apache.org/jira/browse/SPARK-8327 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.3.1 >Reporter: Vladimir Vladimirov >Priority: Minor > > exception shown > {code} > [FAILED] Starting httpd: httpd: Syntax error on line 199 of > /etc/httpd/conf/httpd.conf: Cannot load modules/libphp-5.5.so into server: > /etc/httpd/modules/libphp-5.5.so: cannot open shared object file: No such > file or directory [FAILED] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3
[ https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249141#comment-15249141 ] Nicholas Chammas commented on SPARK-6527: - Did the s3a suggestion work? If not, did anybody file an issue as Steve suggested with more detail? > sc.binaryFiles can not access files on s3 > - > > Key: SPARK-6527 > URL: https://issues.apache.org/jira/browse/SPARK-6527 > Project: Spark > Issue Type: Bug > Components: EC2, Input/Output >Affects Versions: 1.2.0, 1.3.0 > Environment: I am running Spark on EC2 >Reporter: Zhao Zhang >Priority: Minor > > The sc.binaryFIles() can not access the files stored on s3. It can correctly > list the number of files, but report "file does not exist" when processing > them. I also tried sc.textFile() which works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6527) sc.binaryFiles can not access files on s3
[ https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249141#comment-15249141 ] Nicholas Chammas edited comment on SPARK-6527 at 4/20/16 2:27 AM: -- Did the s3a suggestion work? If not, did anybody file an issue as Steve suggested? was (Author: nchammas): Did the s3a suggestion work? If not, did anybody file an issue as Steve suggested with more detail? > sc.binaryFiles can not access files on s3 > - > > Key: SPARK-6527 > URL: https://issues.apache.org/jira/browse/SPARK-6527 > Project: Spark > Issue Type: Bug > Components: EC2, Input/Output >Affects Versions: 1.2.0, 1.3.0 > Environment: I am running Spark on EC2 >Reporter: Zhao Zhang >Priority: Minor > > The sc.binaryFIles() can not access the files stored on s3. It can correctly > list the number of files, but report "file does not exist" when processing > them. I also tried sc.textFile() which works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas closed SPARK-3821. --- Resolution: Won't Fix I'm resolving this as "Won't Fix" due to lack of interest, both on my part and on part of the Spark / spark-ec2 project maintainers. If anyone's interested in picking this up, the code is here: https://github.com/nchammas/spark-ec2/tree/packer/image-build I've mostly moved on from spark-ec2 to work on [Flintrock|https://github.com/nchammas/flintrock], which doesn't require custom AMIs. > Develop an automated way of creating Spark images (AMI, Docker, and others) > --- > > Key: SPARK-3821 > URL: https://issues.apache.org/jira/browse/SPARK-3821 > Project: Spark > Issue Type: Improvement > Components: Build, EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas > Attachments: packer-proposal.html > > > Right now the creation of Spark AMIs or Docker containers is done manually. > With tools like [Packer|http://www.packer.io/], we should be able to automate > this work, and do so in such a way that multiple types of machine images can > be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs
[ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214577#comment-15214577 ] Nicholas Chammas commented on SPARK-3533: - I've added 2 workaround to this issue to the description body. > Add saveAsTextFileByKey() method to RDDs > > > Key: SPARK-3533 > URL: https://issues.apache.org/jira/browse/SPARK-3533 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 1.1.0 >Reporter: Nicholas Chammas > > Users often have a single RDD of key-value pairs that they want to save to > multiple locations based on the keys. > For example, say I have an RDD like this: > {code} > >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', > >>> 'Frankie']).keyBy(lambda x: x[0]) > >>> a.collect() > [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')] > >>> a.keys().distinct().collect() > ['B', 'F', 'N'] > {code} > Now I want to write the RDD out to different paths depending on the keys, so > that I have one output directory per distinct key. Each output directory > could potentially have multiple {{part-}} files, one per RDD partition. > So the output would look something like: > {code} > /path/prefix/B [/part-1, /part-2, etc] > /path/prefix/F [/part-1, /part-2, etc] > /path/prefix/N [/part-1, /part-2, etc] > {code} > Though it may be possible to do this with some combination of > {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the > {{MultipleTextOutputFormat}} output format class, it isn't straightforward. > It's not clear if it's even possible at all in PySpark. > Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs > that makes it easy to save RDDs out to multiple locations at once. > --- > Update: March 2016 > There are two workarounds to this problem: > 1. See [this answer on Stack > Overflow|http://stackoverflow.com/a/26051042/877069], which implements > {{MultipleTextOutputFormat}}. (Scala-only) > 2. See [this comment by Davies > Liu|https://github.com/apache/spark/pull/8375#issuecomment-202458325], which > uses DataFrames: > {code} > val df = rdd.map(t => Row(gen_key(t), t)).toDF("key", "text") > df.write.partitionBy("key").text(path){code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
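And for PySpark users, workaround 2 translates roughly as follows (untested sketch; the key-extraction function here is just "first letter of the name", mirroring the example in the description):
{code}
from pyspark.sql import Row

# Sketch of the DataFrame-based workaround from PySpark.
rdd = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie'])
gen_key = lambda name: name[0]

df = rdd.map(lambda t: Row(key=gen_key(t), text=t)).toDF()
df.write.partitionBy("key").text("/path/prefix")  # one subdirectory per distinct key
{code}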
[jira] [Updated] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs
[ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-3533: Description: Users often have a single RDD of key-value pairs that they want to save to multiple locations based on the keys. For example, say I have an RDD like this: {code} >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda >>> x: x[0]) >>> a.collect() [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')] >>> a.keys().distinct().collect() ['B', 'F', 'N'] {code} Now I want to write the RDD out to different paths depending on the keys, so that I have one output directory per distinct key. Each output directory could potentially have multiple {{part-}} files, one per RDD partition. So the output would look something like: {code} /path/prefix/B [/part-1, /part-2, etc] /path/prefix/F [/part-1, /part-2, etc] /path/prefix/N [/part-1, /part-2, etc] {code} Though it may be possible to do this with some combination of {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the {{MultipleTextOutputFormat}} output format class, it isn't straightforward. It's not clear if it's even possible at all in PySpark. Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that makes it easy to save RDDs out to multiple locations at once. --- Update: March 2016 There are two workarounds to this problem: 1. See [this answer on Stack Overflow|http://stackoverflow.com/a/26051042/877069], which implements {{MultipleTextOutputFormat}}. (Scala-only) 2. See [this comment by Davies Liu|https://github.com/apache/spark/pull/8375#issuecomment-202458325], which uses DataFrames: {code} val df = rdd.map(t => Row(gen_key(t), t)).toDF("key", "text") df.write.partitionBy("key").text(path){code} was: Users often have a single RDD of key-value pairs that they want to save to multiple locations based on the keys. For example, say I have an RDD like this: {code} >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda >>> x: x[0]) >>> a.collect() [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')] >>> a.keys().distinct().collect() ['B', 'F', 'N'] {code} Now I want to write the RDD out to different paths depending on the keys, so that I have one output directory per distinct key. Each output directory could potentially have multiple {{part-}} files, one per RDD partition. So the output would look something like: {code} /path/prefix/B [/part-1, /part-2, etc] /path/prefix/F [/part-1, /part-2, etc] /path/prefix/N [/part-1, /part-2, etc] {code} Though it may be possible to do this with some combination of {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the {{MultipleTextOutputFormat}} output format class, it isn't straightforward. It's not clear if it's even possible at all in PySpark. Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that makes it easy to save RDDs out to multiple locations at once. > Add saveAsTextFileByKey() method to RDDs > > > Key: SPARK-3533 > URL: https://issues.apache.org/jira/browse/SPARK-3533 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 1.1.0 >Reporter: Nicholas Chammas > > Users often have a single RDD of key-value pairs that they want to save to > multiple locations based on the keys. 
> For example, say I have an RDD like this: > {code} > >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', > >>> 'Frankie']).keyBy(lambda x: x[0]) > >>> a.collect() > [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')] > >>> a.keys().distinct().collect() > ['B', 'F', 'N'] > {code} > Now I want to write the RDD out to different paths depending on the keys, so > that I have one output directory per distinct key. Each output directory > could potentially have multiple {{part-}} files, one per RDD partition. > So the output would look something like: > {code} > /path/prefix/B [/part-1, /part-2, etc] > /path/prefix/F [/part-1, /part-2, etc] > /path/prefix/N [/part-1, /part-2, etc] > {code} > Though it may be possible to do this with some combination of > {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the > {{MultipleTextOutputFormat}} output format class, it isn't straightforward. > It's not clear if it's even possible at all in PySpark. > Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs > that makes it easy to save RDDs out to multiple locations at once. > --- > Update: March 2016 > There are two workarounds to this problem: > 1. See [this answer on Stack > Overflow|http://stackoverflow.com/a/26051042/877069], which implements > {{MultipleTextOutputFormat}}. (Scala-only) > 2. See [this comment by
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197451#comment-15197451 ] Nicholas Chammas commented on SPARK-7481: - (Sorry Steve; can't comment on your proposal since I don't know much about these kinds of build decisions.) Just to add some more evidence to the record that this problem appears to affect many people, take a look at this: http://stackoverflow.com/search?q=%5Bapache-spark%5D+S3+Hadoop+2.6 Lots of confusion about how to access S3, with the recommended solution as before being to [use Spark built against Hadoop 2.4|http://stackoverflow.com/a/30852341/877069]. > Add Hadoop 2.6+ profile to pull in object store FS accessors > > > Key: SPARK-7481 > URL: https://issues.apache.org/jira/browse/SPARK-7481 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.3.1 >Reporter: Steve Loughran > > To keep the s3n classpath right, to add s3a, swift & azure, the dependencies > of spark in a 2.6+ profile need to add the relevant object store packages > (hadoop-aws, hadoop-openstack, hadoop-azure) > this adds more stuff to the client bundle, but will mean a single spark > package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.
[ https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181776#comment-15181776 ] Nicholas Chammas commented on SPARK-7505: - I believe items 1, 3, and 4 still apply. They're minor documentation issues, but I think they should still be addressed. > Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, > etc. > > > Key: SPARK-7505 > URL: https://issues.apache.org/jira/browse/SPARK-7505 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Nicholas Chammas >Priority: Minor > > The PySpark docs for DataFrame need the following fixes and improvements: > # Per [SPARK-7035], we should encourage the use of {{\_\_getitem\_\_}} over > {{\_\_getattr\_\_}} and change all our examples accordingly. > # *We should say clearly that the API is experimental.* (That is currently > not the case for the PySpark docs.) > # We should provide an example of how to join and select from 2 DataFrames > that have identically named columns, because it is not obvious: > {code} > >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}'])) > >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}'])) > >>> df12 = df1.join(df2, df1['a'] == df2['a']) > >>> df12.select(df1['a'], df2['other']).show() > a other > > 4 I dunno {code} > # > [{{DF.orderBy}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy] > and > [{{DF.sort}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort] > should be marked as aliases if that's what they are. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
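On item 1, a quick illustration (toy column name) of why the docs should steer people toward {{\_\_getitem\_\_}}:
{code}
# Toy example of why df['col'] is safer to teach than df.col: attribute
# access can collide with DataFrame methods.
df = sqlContext.createDataFrame([(1,), (2,)], ["count"])

df['count']   # unambiguously the column named "count"
df.count      # the DataFrame.count *method*, not the column
{code}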
[jira] [Commented] (SPARK-13596) Move misc top-level build files into appropriate subdirs
[ https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180072#comment-15180072 ] Nicholas Chammas commented on SPARK-13596: -- Looks like {{tox.ini}} is only used by {{pep8}}, so if you move it into {{dev/}}, where the Python lint checks run from, that should work. > Move misc top-level build files into appropriate subdirs > > > Key: SPARK-13596 > URL: https://issues.apache.org/jira/browse/SPARK-13596 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.0.0 >Reporter: Sean Owen > > I'd like to file away a bunch of misc files that are in the top level of the > project in order to further tidy the build for 2.0.0. See also SPARK-13529, > SPARK-13548. > Some of these may turn out to be difficult or impossible to move. > I'd ideally like to move these files into {{build/}}: > - {{.rat-excludes}} > - {{checkstyle.xml}} > - {{checkstyle-suppressions.xml}} > - {{pylintrc}} > - {{scalastyle-config.xml}} > - {{tox.ini}} > - {{project/}} (or does SBT need this in the root?) > And ideally, these would go under {{dev/}} > - {{make-distribution.sh}} > And remove these > - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?) > Edited to add: apparently this can go in {{.github}} now: > - {{CONTRIBUTING.md}} > Other files in the top level seem to need to be there, like {{README.md}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176559#comment-15176559 ] Nicholas Chammas commented on SPARK-7481: - I'm not comfortable working with Maven so I can't comment on the details of the approach we should take, but I will appreciate any progress towards making Spark built against Hadoop 2.6+ work with S3 out of the box, or as close to out of the box as possible. Given Spark's close relation to S3 and EC2 (as far as Spark's user base is concerned), a good out of the box experience here is critical. Many people just expect it. > Add Hadoop 2.6+ profile to pull in object store FS accessors > > > Key: SPARK-7481 > URL: https://issues.apache.org/jira/browse/SPARK-7481 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.3.1 >Reporter: Steve Loughran > > To keep the s3n classpath right, to add s3a, swift & azure, the dependencies > of spark in a 2.6+ profile need to add the relevant object store packages > (hadoop-aws, hadoop-openstack, hadoop-azure) > this adds more stuff to the client bundle, but will mean a single spark > package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176551#comment-15176551 ] Nicholas Chammas commented on SPARK-7481: - {quote} One issue here that hadoop 2.6's hadoop-aws pulls in the whole AWT toolkit, which is pretty weighty, for s3a ... which isn't something I'd use in 2.6 anyway. {quote} Did you mean something other than s3a here? > Add Hadoop 2.6+ profile to pull in object store FS accessors > > > Key: SPARK-7481 > URL: https://issues.apache.org/jira/browse/SPARK-7481 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.3.1 >Reporter: Steve Loughran > > To keep the s3n classpath right, to add s3a, swift & azure, the dependencies > of spark in a 2.6+ profile need to add the relevant object store packages > (hadoop-aws, hadoop-openstack, hadoop-azure) > this adds more stuff to the client bundle, but will mean a single spark > package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174438#comment-15174438 ] Nicholas Chammas commented on SPARK-7481: - Many people seem to be downgrading to use Spark built against Hadoop 2.4 because the Spark / Hadoop 2.6 package doesn't work against S3 out of the box. * [Example 1|https://issues.apache.org/jira/browse/SPARK-7442?focusedCommentId=14582965=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14582965] * [Example 2|https://issues.apache.org/jira/browse/SPARK-7442?focusedCommentId=14903750=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14903750] * [Example 3|https://github.com/nchammas/flintrock/issues/88#issuecomment-190905262] If this proposal eliminates that bit of friction for users without being too burdensome on the team, then I'm for it. Ideally, we want people using Spark built against the latest version of Hadoop anyway, right? This proposal would nudge people in that direction. > Add Hadoop 2.6+ profile to pull in object store FS accessors > > > Key: SPARK-7481 > URL: https://issues.apache.org/jira/browse/SPARK-7481 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.3.1 >Reporter: Steve Loughran > > To keep the s3n classpath right, to add s3a, swift & azure, the dependencies > of spark in a 2.6+ profile need to add the relevant object store packages > (hadoop-aws, hadoop-openstack, hadoop-azure) > this adds more stuff to the client bundle, but will mean a single spark > package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
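For the record, the workaround people end up rediscovering looks roughly like this (bucket, paths, and keys below are placeholders; assumes the Hadoop AWS module was pulled onto the classpath at launch, e.g. with {{--packages org.apache.hadoop:hadoop-aws:2.6.0}}):
{code}
# Placeholder bucket and credentials. Assumes hadoop-aws (and its
# transitive AWS SDK dependency) is already on the classpath.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # or rely on IAM roles / env vars
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

lines = sc.textFile("s3a://some-bucket/some/prefix/*.gz")
print(lines.count())
{code}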
[jira] [Commented] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master
[ https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119220#comment-15119220 ] Nicholas Chammas commented on SPARK-5189: - FWIW, I found this issue to be practically unsolvable without rewriting most of spark-ec2, so I started a new project that aims to replace spark-ec2 for most of its use cases: [Flintrock|https://github.com/nchammas/flintrock] > Reorganize EC2 scripts so that nodes can be provisioned independent of Spark > master > --- > > Key: SPARK-5189 > URL: https://issues.apache.org/jira/browse/SPARK-5189 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas > > As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, > then setting up all the slaves together. This includes broadcasting files > from the lonely master to potentially hundreds of slaves. > There are 2 main problems with this approach: > # Broadcasting files from the master to all slaves using > [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] > (e.g. during [ephemeral-hdfs > init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36], > or during [Spark > setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3]) > takes a long time. This time increases as the number of slaves increases. > I did some testing in {{us-east-1}}. This is, concretely, what the problem > looks like: > || number of slaves ({{m3.large}}) || launch time (best of 6 tries) || > | 1 | 8m 44s | > | 10 | 13m 45s | > | 25 | 22m 50s | > | 50 | 37m 30s | > | 75 | 51m 30s | > | 99 | 1h 5m 30s | > Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, > but I think the point is clear enough. > We can extrapolate from this data that *every additional slave adds roughly > 35 seconds to the launch time* (so a cluster with 100 slaves would take 1h 6m > 5s to launch). > # It's more complicated to add slaves to an existing cluster (a la > [SPARK-2008]), since slaves are only configured through the master during the > setup of the master itself. > Logically, the operations we want to implement are: > * Provision a Spark node > * Join a node to a cluster (including an empty cluster) as either a master or > a slave > * Remove a node from a cluster > We need our scripts to roughly be organized to match the above operations. > The goals would be: > # When launching a cluster, enable all cluster nodes to be provisioned in > parallel, removing the master-to-slave file broadcast bottleneck. > # Facilitate cluster modifications like adding or removing nodes. > # Enable exploration of infrastructure tools like > [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} > internals and perhaps even allow us to build [one tool that launches Spark > clusters on several different cloud > platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw]. > More concretely, the modifications we need to make are: > * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with > equivalent, slave-side operations. > * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure > it fully creates a node that can be used as either a master or slave. > * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, > configures it as a master or slave, and joins it to a cluster. 
> * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete > that script.
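For a rough sense of how the launch times in the table above scale, here is a minimal Python sketch (illustrative only, not part of {{spark-ec2}}) of the linear extrapolation behind the "roughly 35 seconds per additional slave" figure:

{code}
# Launch times measured above in us-east-1 (m3.large slaves), converted to seconds.
sizes = [1, 10, 25, 50, 75, 99]
launch_seconds = [8*60 + 44, 13*60 + 45, 22*60 + 50, 37*60 + 30, 51*60 + 30, 65*60 + 30]

# Slope between the smallest and largest clusters: extra seconds per added slave.
per_slave = (launch_seconds[-1] - launch_seconds[0]) / float(sizes[-1] - sizes[0])
print('{:.1f} seconds per additional slave'.format(per_slave))  # ~34.8

# Extrapolated launch time for a 100-slave cluster.
estimate = launch_seconds[0] + per_slave * (100 - sizes[0])
print('{}m {}s'.format(int(estimate // 60), int(estimate % 60)))  # 66m 4s, i.e. roughly the 1h 6m quoted above
{code}

The fit is crude (it only uses the two endpoint measurements), but it is consistent with the per-slave cost cited in the description.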
[jira] [Commented] (SPARK-12824) Failure to maintain consistent RDD references in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098887#comment-15098887 ] Nicholas Chammas commented on SPARK-12824: -- Ah, good catch. This appears to be a known behavior of lambdas in Python: http://docs.python-guide.org/en/latest/writing/gotchas/#late-binding-closures > Failure to maintain consistent RDD references in pyspark > > > Key: SPARK-12824 > URL: https://issues.apache.org/jira/browse/SPARK-12824 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 > Environment: Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. >Reporter: Paul Shearer > > Below is a simple {{pyspark}} script that tries to split an RDD into a > dictionary containing several RDDs. > As the *sample run* shows, the script only works if we do a {{collect()}} on > the intermediate RDDs as they are created. Of course I would not want to do > that in practice, since it doesn't scale. > What's really strange is, I'm not assigning the intermediate {{collect()}} > results to any variable. So the difference in behavior is due solely to a > hidden side-effect of the computation triggered by the {{collect()}} call. > Spark is supposed to be a very functional framework with minimal side > effects. Why is it only possible to get the desired behavior by triggering > some mysterious side effect using {{collect()}}? > It seems that all the keys in the dictionary are referencing the same object > even though in the code they are clearly supposed to be different objects. > The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. > h3. spark_script.py > {noformat} > from pprint import PrettyPrinter > pp = PrettyPrinter(indent=4).pprint > logger = sc._jvm.org.apache.log4j > logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR ) > logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR ) > > def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False): > d = dict() > for key_value in key_values: > d[key_value] = rdd.filter(lambda row: row[key_field] == key_value) > if collect_in_loop: > d[key_value].collect() > return d > def print_results(d): > for k in d: > print k > pp(d[k].collect()) > > rdd = sc.parallelize([ > {'color':'red','size':3}, > {'color':'red', 'size':7}, > {'color':'red', 'size':8}, > {'color':'red', 'size':10}, > {'color':'green', 'size':9}, > {'color':'green', 'size':5}, > {'color':'green', 'size':50}, > {'color':'blue', 'size':4}, > {'color':'purple', 'size':6}]) > key_field = 'color' > key_values = ['red', 'green', 'blue', 'purple'] > > print '### run WITH collect in loop: ' > d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True) > print_results(d) > print '### run WITHOUT collect in loop: ' > d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False) > print_results(d) > {noformat} > h3. 
Sample run in IPython shell > {noformat} > In [1]: execfile('spark_script.py') > ### run WITH collect in loop: > blue > [{ 'color': 'blue', 'size': 4}] > purple > [{ 'color': 'purple', 'size': 6}] > green > [ { 'color': 'green', 'size': 9}, > { 'color': 'green', 'size': 5}, > { 'color': 'green', 'size': 50}] > red > [ { 'color': 'red', 'size': 3}, > { 'color': 'red', 'size': 7}, > { 'color': 'red', 'size': 8}, > { 'color': 'red', 'size': 10}] > ### run WITHOUT collect in loop: > blue > [{ 'color': 'purple', 'size': 6}] > purple > [{ 'color': 'purple', 'size': 6}] > green > [{ 'color': 'purple', 'size': 6}] > red > [{ 'color': 'purple', 'size': 6}] > {noformat}
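The late-binding behavior linked above can be reproduced without Spark at all. Here is a minimal plain-Python sketch (not from the ticket) showing both the gotcha and the usual default-argument workaround:

{code}
# Each lambda closes over the variable i itself, not its value at definition
# time, so by the time any of them is called the loop is done and i == 2.
late = [lambda: i for i in range(3)]
print([f() for f in late])   # [2, 2, 2]

# Binding i as a default argument captures its value at definition time.
early = [lambda i=i: i for i in range(3)]
print([f() for f in early])  # [0, 1, 2]
{code}

The {{rdd.filter(lambda row: ...)}} calls in the script above hit the same gotcha: by the time the filters actually run, every lambda sees the final value of the loop variable, which matches the sample run above.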
[jira] [Commented] (SPARK-12824) Failure to maintain consistent RDD references in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098336#comment-15098336 ] Nicholas Chammas commented on SPARK-12824: -- I can reproduce this issue. Here's a more concise reproduction: {code} from __future__ import print_function rdd = sc.parallelize([ {'color':'red','size':3}, {'color':'red', 'size':7}, {'color':'red', 'size':8}, {'color':'red', 'size':10}, {'color':'green', 'size':9}, {'color':'green', 'size':5}, {'color':'green', 'size':50}, {'color':'blue', 'size':4}, {'color':'purple', 'size':6}]) colors = ['purple', 'red', 'green', 'blue'] # Defer collect() till print color_rdds = { color: rdd.filter(lambda x: x['color'] == color) for color in colors } for k, v in color_rdds.items(): print(k, v.collect()) # collect() upfront color_rdds = { color: rdd.filter(lambda x: x['color'] == color).collect() for color in colors } for k, v in color_rdds.items(): print(k, v) {code} Output: {code} # Defer collect() till print purple [{'color': 'blue', 'size': 4}] blue [{'color': 'blue', 'size': 4}] green [{'color': 'blue', 'size': 4}] red [{'color': 'blue', 'size': 4}] --- # collect() upfront purple [{'color': 'purple', 'size': 6}] blue [{'color': 'blue', 'size': 4}] green [{'color': 'green', 'size': 9}, {'color': 'green', 'size': 5}, {'color': 'green', 'size': 50}] red [{'color': 'red', 'size': 3}, {'color': 'red', 'size': 7}, {'color': 'red', 'size': 8}, {'color': 'red', 'size': 10}] {code} Observations: * The color that gets repeated in the first block of output is always the last color in {{colors}}. * This happens on Python 2 and 3, and with both {{items()}} and {{iteritems()}}. This smells like an RDD naming issue, or something related to lazy evaluation. The filtered RDDs that get generated in the first block under {{color_rdds}} don't have names. Then, when they all get {{collect()}}-ed at once, they all evaluate to the last filtered RDD. cc [~davies] / [~joshrosen] > Failure to maintain consistent RDD references in pyspark > > > Key: SPARK-12824 > URL: https://issues.apache.org/jira/browse/SPARK-12824 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 > Environment: Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. >Reporter: Paul Shearer > > Below is a simple {{pyspark}} script that tries to split an RDD into a > dictionary containing several RDDs. > As the *sample run* shows, the script only works if we do a {{collect()}} on > the intermediate RDDs as they are created. Of course I would not want to do > that in practice, since it doesn't scale. > What's really strange is, I'm not assigning the intermediate {{collect()}} > results to any variable. So the difference in behavior is due solely to a > hidden side-effect of the computation triggered by the {{collect()}} call. > Spark is supposed to be a very functional framework with minimal side > effects. Why is it only possible to get the desired behavior by triggering > some mysterious side effect using {{collect()}}? > It seems that all the keys in the dictionary are referencing the same object > even though in the code they are clearly supposed to be different objects. > The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. > h3. spark_script.py > {noformat} > from pprint import PrettyPrinter > pp = PrettyPrinter(indent=4).pprint > logger = sc._jvm.org.apache.log4j > logger.LogManager.getLogger("org"). 
setLevel( logger.Level.ERROR ) > logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR ) > > def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False): > d = dict() > for key_value in key_values: > d[key_value] = rdd.filter(lambda row: row[key_field] == key_value) > if collect_in_loop: > d[key_value].collect() > return d > def print_results(d): > for k in d: > print k > pp(d[k].collect()) > > rdd = sc.parallelize([ > {'color':'red','size':3}, > {'color':'red', 'size':7}, > {'color':'red', 'size':8}, > {'color':'red', 'size':10}, > {'color':'green', 'size':9}, > {'color':'green', 'size':5}, > {'color':'green', 'size':50}, > {'color':'blue', 'size':4}, > {'color':'purple', 'size':6}]) > key_field = 'color' > key_values = ['red', 'green', 'blue', 'purple'] > > print '### run WITH collect in loop: ' > d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True) > print_results(d) > print '### run WITHOUT collect in loop: ' > d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False) > print_results(d) > {noformat}
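Assuming the cause is indeed late binding of the loop variable in the lambda (per the late-binding closures link in the comment above), a likely fix for the concise reproduction is to bind {{color}} as a default argument; this is an untested sketch, not a patch from the ticket:

{code}
# Freeze the current value of `color` on each iteration by making it a default
# argument; each filtered RDD then keeps its own predicate instead of all of
# them sharing the last value of the loop variable.
color_rdds = {
    color: rdd.filter(lambda x, color=color: x['color'] == color)
    for color in colors
}

for k, v in color_rdds.items():
    print(k, v.collect())
{code}

The same change would apply to {{split_RDD_by_key}} in the quoted script: {{lambda row, key_value=key_value: row[key_field] == key_value}}.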
[jira] [Comment Edited] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14203280#comment-14203280 ] Nicholas Chammas edited comment on SPARK-3821 at 12/18/15 9:08 PM: --- After much dilly-dallying, I am happy to present: * A brief proposal / design doc ([fixed JIRA attachment | https://issues.apache.org/jira/secure/attachment/12680371/packer-proposal.html], [md file on GitHub | https://github.com/nchammas/spark-ec2/blob/packer/image-build/proposal.md]) * [Initial implementation | https://github.com/nchammas/spark-ec2/tree/packer/image-build] and [README | https://github.com/nchammas/spark-ec2/blob/packer/image-build/README.md] * New AMIs generated by this implementation: [Base AMIs | https://github.com/nchammas/spark-ec2/tree/packer/ami-list/base], [Spark 1.1.0 Pre-Installed | https://github.com/nchammas/spark-ec2/tree/packer/ami-list/1.1.0] To try out the new AMIs with {{spark-ec2}}, you'll need to update [these | https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L47] [two | https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L593] lines (well, really, just the first one) to point to [my {{spark-ec2}} repo on the {{packer}} branch | https://github.com/nchammas/spark-ec2/tree/packer/image-build]. Your candid feedback and/or improvements are most welcome! was (Author: nchammas): After much dilly-dallying, I am happy to present: * A brief proposal / design doc ([fixed JIRA attachment | https://issues.apache.org/jira/secure/attachment/12680371/packer-proposal.html], [md file on GitHub | https://github.com/nchammas/spark-ec2/blob/packer/packer/proposal.md]) * [Initial implementation | https://github.com/nchammas/spark-ec2/tree/packer/packer] and [README | https://github.com/nchammas/spark-ec2/blob/packer/packer/README.md] * New AMIs generated by this implementation: [Base AMIs | https://github.com/nchammas/spark-ec2/tree/packer/ami-list/base], [Spark 1.1.0 Pre-Installed | https://github.com/nchammas/spark-ec2/tree/packer/ami-list/1.1.0] To try out the new AMIs with {{spark-ec2}}, you'll need to update [these | https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L47] [two | https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L593] lines (well, really, just the first one) to point to [my {{spark-ec2}} repo on the {{packer}} branch | https://github.com/nchammas/spark-ec2/tree/packer/packer]. Your candid feedback and/or improvements are most welcome! > Develop an automated way of creating Spark images (AMI, Docker, and others) > --- > > Key: SPARK-3821 > URL: https://issues.apache.org/jira/browse/SPARK-3821 > Project: Spark > Issue Type: Improvement > Components: Build, EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas > Attachments: packer-proposal.html > > > Right now the creation of Spark AMIs or Docker containers is done manually. > With tools like [Packer|http://www.packer.io/], we should be able to automate > this work, and do so in such a way that multiple types of machine images can > be created from a single template.