[jira] [Commented] (SPARK-1982) saveToParquetFile doesn't support ByteType
[ https://issues.apache.org/jira/browse/SPARK-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014945#comment-14014945 ]

Andre Schumacher commented on SPARK-1982:
-----------------------------------------

It turns out that ByteType primitives weren't being handled correctly. Since Parquet doesn't have these, one fix is to use fixed-length byte arrays (which are also treated as primitives). This is fine until nested types are supported; even then, I think one may want to treat these as actual arrays rather than primitives. Anyway, PR available here: https://github.com/apache/spark/pull/934

saveToParquetFile doesn't support ByteType
------------------------------------------

Key: SPARK-1982
URL: https://issues.apache.org/jira/browse/SPARK-1982
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Andre Schumacher

{code}
java.lang.RuntimeException: Unsupported datatype ByteType
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetRelation.scala:201)
    ...
{code}
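The mapping the comment describes could look roughly like the sketch below. This is illustrative only, not the actual patch: ParquetPrimitive and its cases are hypothetical stand-ins for Parquet's schema classes, and only a few Catalyst types are shown.

{code}
// Sketch: map Catalyst types to Parquet primitives, encoding ByteType as a
// length-1 fixed byte array since Parquet has no single-byte primitive.
sealed trait ParquetPrimitive
case object PqInt32 extends ParquetPrimitive
case object PqInt64 extends ParquetPrimitive
case object PqBoolean extends ParquetPrimitive
case class PqFixedLenByteArray(length: Int) extends ParquetPrimitive

import org.apache.spark.sql.catalyst.types._

def fromDataType(dt: DataType): ParquetPrimitive = dt match {
  case IntegerType => PqInt32
  case LongType    => PqInt64
  case BooleanType => PqBoolean
  // ByteType has no direct Parquet counterpart; use a 1-byte fixed array.
  case ByteType    => PqFixedLenByteArray(1)
  case other       => sys.error(s"Unsupported datatype $other")
}
{code}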
[jira] [Created] (SPARK-1983) Expose private `inferSchema` method in SQLContext for Scala and Java API
Kuldeep created SPARK-1983:
---------------------------

Summary: Expose private `inferSchema` method in SQLContext for Scala and Java API
Key: SPARK-1983
URL: https://issues.apache.org/jira/browse/SPARK-1983
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.0.0
Reporter: Kuldeep
[jira] [Updated] (SPARK-1983) Expose private `inferSchema` method in SQLContext for Scala and Java API
[ https://issues.apache.org/jira/browse/SPARK-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuldeep updated SPARK-1983:
---------------------------

Description:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L298
Exposing the above method in the Scala and Java APIs would make Spark SQL usable without creating classes: it would let one create tables from a simple RDD of Maps, with no class definitions at all.

was:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L298
The above method could usefully be exposed in the Scala and Java APIs to make Spark SQL work without creating classes.

Expose private `inferSchema` method in SQLContext for Scala and Java API
-------------------------------------------------------------------------

Key: SPARK-1983
URL: https://issues.apache.org/jira/browse/SPARK-1983
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.0.0
Reporter: Kuldeep
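For illustration, a hypothetical usage sketch in spark-shell (where sc is available), assuming the private method referenced above, which takes an RDD of Maps, were simply made public under the same name:

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sc.parallelize(Seq(
  Map("name" -> "alice", "age" -> 30),
  Map("name" -> "bob",   "age" -> 25)))

// No case class needed: the schema is inferred from the Maps' contents.
// inferSchema being public is the assumption this issue proposes.
val table = sqlContext.inferSchema(people)
table.registerAsTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 26").collect()
{code}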
[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015020#comment-14015020 ]

Madhu Siddalingaiah commented on SPARK-983:
-------------------------------------------

I tested some additions locally that seem to work well so far. I created a SortedPartitionsRDD and a sortPartitions(...) method in [RDD|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala]:

{code}
/**
 * Return a new RDD containing sorted partitions in this RDD.
 */
def sortPartitions(lt: (T, T) => Boolean): RDD[T] = new SortedPartitionsRDD(this, sc.clean(lt))
{code}

I haven't added the spill/merge code to SortedPartitionsRDD yet. I wanted to get some buy-in on this method first, since it's an addition to the API. It fits nicely with [OrderedRDDFunctions|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala] and passes all tests in [SortingSuite|https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/rdd/SortingSuite.scala]. I think this method can be used to address [SPARK-1021|https://issues.apache.org/jira/browse/SPARK-1021] as well as many use cases outside of sortByKey(). Does everyone agree? If so, I'll move forward with the external sort in SortedPartitionsRDD and the necessary tests.

Support external sorting for RDD#sortByKey()
--------------------------------------------

Key: SPARK-983
URL: https://issues.apache.org/jira/browse/SPARK-983
Project: Spark
Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Reynold Xin
Assignee: Madhu Siddalingaiah

Currently, RDD#sortByKey() is implemented with a mapPartitions that creates a buffer to hold the entire partition, then sorts it. This will cause an OOM if an entire partition cannot fit in memory, which is especially problematic for skewed data. Rather than OOMing, the behavior should be similar to the [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], where we fall back to disk if we detect memory pressure.
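For context, a minimal sketch of what the proposed SortedPartitionsRDD could look like before the spill/merge logic lands; the in-memory sortWith below is a placeholder for the eventual external sort, not the proposed implementation:

{code}
package org.apache.spark.rdd

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}

private[spark] class SortedPartitionsRDD[T: ClassTag](
    prev: RDD[T],
    lt: (T, T) => Boolean)
  extends RDD[T](prev) {

  // Same partitioning as the parent; only the order within each partition changes.
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    // In-memory placeholder; the external spill/merge would replace this.
    firstParent[T].iterator(split, context).toSeq.sortWith(lt).iterator
  }
}
{code}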
[jira] [Created] (SPARK-1984) Maven build requires SCALA_HOME to be set even though it's not needed
Patrick Wendell created SPARK-1984:
-----------------------------------

Summary: Maven build requires SCALA_HOME to be set even though it's not needed
Key: SPARK-1984
URL: https://issues.apache.org/jira/browse/SPARK-1984
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
[jira] [Updated] (SPARK-1984) Maven build requires SCALA_HOME to be set even though it's not needed
[ https://issues.apache.org/jira/browse/SPARK-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-1984:
-----------------------------------

Fix Version/s: 1.1.0

Maven build requires SCALA_HOME to be set even though it's not needed
---------------------------------------------------------------------

Key: SPARK-1984
URL: https://issues.apache.org/jira/browse/SPARK-1984
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Fix For: 1.1.0
[jira] [Commented] (SPARK-1984) Maven build requires SCALA_HOME to be set even though it's not needed
[ https://issues.apache.org/jira/browse/SPARK-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015054#comment-14015054 ]

Patrick Wendell commented on SPARK-1984:
----------------------------------------

master: https://github.com/apache/spark/commit/d8c005d5371f81a2a06c5d27c7021e1ae43d7193
1.0: https://github.com/apache/spark/commit/a54a48f83674bb3c6f9aca9f736448338b029dfd

Maven build requires SCALA_HOME to be set even though it's not needed
---------------------------------------------------------------------

Key: SPARK-1984
URL: https://issues.apache.org/jira/browse/SPARK-1984
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Fix For: 1.0.1, 1.1.0
[jira] [Updated] (SPARK-1984) Maven build requires SCALA_HOME to be set even though it's not needed
[ https://issues.apache.org/jira/browse/SPARK-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-1984:
-----------------------------------

Fix Version/s: 1.0.1
               (was: 1.0.0)

Maven build requires SCALA_HOME to be set even though it's not needed
---------------------------------------------------------------------

Key: SPARK-1984
URL: https://issues.apache.org/jira/browse/SPARK-1984
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Fix For: 1.0.1, 1.1.0
[jira] [Created] (SPARK-1985) SPARK_HOME shouldn't be required when spark.executor.uri is provided
Gerard Maas created SPARK-1985:
-------------------------------

Summary: SPARK_HOME shouldn't be required when spark.executor.uri is provided
Key: SPARK-1985
URL: https://issues.apache.org/jira/browse/SPARK-1985
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.0
Environment: MESOS
Reporter: Gerard Maas

When trying to run the simple example from [1] on a Mesos installation, I get an error that SPARK_HOME is not set. A local Spark installation should not be required to run a job on Mesos; all that's needed is the executor package, i.e. the assembly.tar.gz at a reachable location (HDFS/S3/HTTP). Looking into the code, there is indeed a check on SPARK_HOME [2] regardless of the presence of the assembly, even though SPARK_HOME is actually only used when the assembly is not provided (a kind of best-effort recovery strategy).

Current flow:

{code}
if (!SPARK_HOME) fail("No SPARK_HOME")
else if (assembly) use assembly
else try to use SPARK_HOME to build spark_executor
{code}

Should be:

{code}
sparkExecutor =
  if (assembly) assembly
  else if (SPARK_HOME) try to use SPARK_HOME to build spark_executor
  else fail("No executor found. Please provide spark.executor.uri (preferred) or spark.home")
{code}

[1] http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-with-Spark-Mesos-spark-shell-works-fine-td6165.html
[2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L89
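A sketch of the proposed resolution order is below. This is illustrative only, not the actual MesosSchedulerBackend code; the helper name and the command construction are assumptions:

{code}
import org.apache.spark.SparkConf

// Prefer spark.executor.uri, fall back to spark.home / SPARK_HOME, and only
// fail when neither is available.
def resolveExecutorCommand(conf: SparkConf): String = {
  val executorUri = conf.getOption("spark.executor.uri")
  val sparkHome = conf.getOption("spark.home").orElse(sys.env.get("SPARK_HOME"))

  executorUri match {
    case Some(uri) =>
      // Mesos fetches and unpacks the archive; run the executor from there.
      val basename = uri.split('/').last.split('.').head
      s"cd $basename*; ./bin/spark-executor"
    case None =>
      sparkHome match {
        case Some(home) => new java.io.File(home, "bin/spark-executor").getPath
        case None => sys.error(
          "No executor found. Please provide spark.executor.uri (preferred) or spark.home")
      }
  }
}
{code}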
[jira] [Created] (SPARK-1986) lib.Analytics should be in org.apache.spark.examples
Ankur Dave created SPARK-1986:
------------------------------

Summary: lib.Analytics should be in org.apache.spark.examples
Key: SPARK-1986
URL: https://issues.apache.org/jira/browse/SPARK-1986
Project: Spark
Issue Type: Bug
Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

The org.apache.spark.graphx.lib.Analytics driver is currently hard to run; the user has to figure out the correct invocation involving spark-submit. Instead, it should be moved into the examples package so it can be run using bin/run-example. Here is how Analytics must be invoked currently:

```
~/spark/bin/spark-submit \
  --master spark://$(wget -q -O - http://169.254.169.254/latest/meta-data/public-hostname):7077 \
  --class org.apache.spark.graphx.lib.Analytics \
  ~/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar \
  triangles /soc-LiveJournal1.txt --numEPart=256
```

Any JAR can be supplied in place of the assembly jar, as long as it exists.
[jira] [Updated] (SPARK-1986) lib.Analytics should be in org.apache.spark.examples
[ https://issues.apache.org/jira/browse/SPARK-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur Dave updated SPARK-1986:
------------------------------

Issue Type: Improvement  (was: Bug)

lib.Analytics should be in org.apache.spark.examples
-----------------------------------------------------

Key: SPARK-1986
URL: https://issues.apache.org/jira/browse/SPARK-1986
Project: Spark
Issue Type: Improvement
Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

The org.apache.spark.graphx.lib.Analytics driver is currently hard to run; the user has to figure out the correct invocation involving spark-submit. Instead, it should be moved into the examples package so it can be run using bin/run-example. Here is how Analytics must be invoked currently:

```
~/spark/bin/spark-submit \
  --master spark://$(wget -q -O - http://169.254.169.254/latest/meta-data/public-hostname):7077 \
  --class org.apache.spark.graphx.lib.Analytics \
  ~/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar \
  triangles /soc-LiveJournal1.txt --numEPart=256
```

Any JAR can be supplied in place of the assembly jar, as long as it exists.
[jira] [Created] (SPARK-1988) Enable storing edges out-of-core
Ankur Dave created SPARK-1988:
------------------------------

Summary: Enable storing edges out-of-core
Key: SPARK-1988
URL: https://issues.apache.org/jira/browse/SPARK-1988
Project: Spark
Issue Type: Improvement
Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

A graph's edges are usually the largest component of the graph, and a cluster may not have enough memory to hold them. For example, a graph with 20 billion edges requires at least 400 GB of memory, because each edge takes 20 bytes. GraphX only ever accesses the edges using full table scans or cluster scans using the clustered index on source vertex ID, so the edges are amenable to being stored on disk. EdgePartition should provide the option of storing edges on disk transparently and streaming through them as needed.
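One possible shape for this is sketched below: a hypothetical DiskEdgePartition that writes fixed-width edge records to a local file and streams through them on each full scan. The class name, record layout, and Double-typed attribute are illustrative assumptions, not the proposed design (the real EdgePartition keeps column arrays in memory).

{code}
import java.io.{DataInputStream, DataOutputStream, FileInputStream, FileOutputStream}

class DiskEdgePartition(path: String) {
  /** Write edges as fixed-width (srcId, dstId, attr) records. */
  def write(edges: Iterator[(Long, Long, Double)]): Unit = {
    val out = new DataOutputStream(new FileOutputStream(path))
    try edges.foreach { case (src, dst, attr) =>
      out.writeLong(src); out.writeLong(dst); out.writeDouble(attr)
    } finally out.close()
  }

  /** Full scan: stream edges from disk without materializing them all. */
  def iterator: Iterator[(Long, Long, Double)] = new Iterator[(Long, Long, Double)] {
    private val in = new DataInputStream(new FileInputStream(path))
    private var closed = false
    def hasNext: Boolean = {
      if (!closed && in.available() == 0) { in.close(); closed = true }
      !closed
    }
    def next(): (Long, Long, Double) = (in.readLong(), in.readLong(), in.readDouble())
  }
}
{code}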
[jira] [Created] (SPARK-1989) Exit executors faster if they get into a cycle of heavy GC
Matei Zaharia created SPARK-1989:
---------------------------------

Summary: Exit executors faster if they get into a cycle of heavy GC
Key: SPARK-1989
URL: https://issues.apache.org/jira/browse/SPARK-1989
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: Matei Zaharia
Fix For: 1.1.0

I've seen situations where an application is allocating too much memory across its tasks + cache to proceed, but Java gets into a cycle where it repeatedly runs full GCs, frees up a bit of the heap, and continues instead of giving up. This then leads to timeouts and confusing error messages. It would be better to crash with OOM sooner. The JVM has options to support this: http://java.dzone.com/articles/tracking-excessive-garbage. The right solution would probably be:

- Add some config options used by spark-submit to set -XX:GCTimeLimit and -XX:GCHeapFreeLimit, with more conservative values than the defaults (e.g. 90% time limit, 5% free limit)
- Make sure we pass these into the Java options for executors in each deployment mode
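For example, the proposed defaults could be wired through the existing spark.executor.extraJavaOptions setting. A minimal sketch, using the limits suggested above:

{code}
import org.apache.spark.SparkConf

// GCTimeLimit/GCHeapFreeLimit make the JVM throw an OutOfMemoryError once
// more than 90% of time goes to GC while less than 5% of the heap is freed
// (the HotSpot defaults are 98% and 2%).
val conf = new SparkConf()
  .setAppName("gc-overhead-demo")
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseGCOverheadLimit -XX:GCTimeLimit=90 -XX:GCHeapFreeLimit=5")
{code}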
[jira] [Created] (SPARK-1990) spark-ec2 should only need Python 2.6, not 2.7
Matei Zaharia created SPARK-1990:
---------------------------------

Summary: spark-ec2 should only need Python 2.6, not 2.7
Key: SPARK-1990
URL: https://issues.apache.org/jira/browse/SPARK-1990
Project: Spark
Issue Type: Improvement
Reporter: Matei Zaharia
Fix For: 1.0.1, 1.1.0

There have been some posts on the mailing lists reporting that spark-ec2 does not work with Python 2.6. In addition, we should check the Python version at the top of the script and exit if it's too old.
[jira] [Updated] (SPARK-1790) Update EC2 scripts to support r3 instance types
[ https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated SPARK-1790:
---------------------------------

Fix Version/s: 1.0.1

Update EC2 scripts to support r3 instance types
-----------------------------------------------

Key: SPARK-1790
URL: https://issues.apache.org/jira/browse/SPARK-1790
Project: Spark
Issue Type: Improvement
Components: EC2
Affects Versions: 0.9.0, 0.9.1, 1.0.0
Reporter: Matei Zaharia
Assignee: Sujeet Varakhedi
Labels: Starter
Fix For: 1.0.1

These were recently added by Amazon as a cheaper high-memory option.
[jira] [Commented] (SPARK-1990) spark-ec2 should only need Python 2.6, not 2.7
[ https://issues.apache.org/jira/browse/SPARK-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015146#comment-14015146 ]

Matei Zaharia commented on SPARK-1990:
--------------------------------------

BTW here is the first error this gets (subprocess.check_output was only added in Python 2.7, so it doesn't exist on 2.6):

{code}
Connection to ec2-54-186-88-202.us-west-2.compute.amazonaws.com closed.
Traceback (most recent call last):
  File "spark_ec2.py", line 824, in <module>
    main()
  File "spark_ec2.py", line 816, in main
    real_main()
  File "spark_ec2.py", line 701, in real_main
    setup_cluster(conn, master_nodes, slave_nodes, opts, True)
  File "spark_ec2.py", line 430, in setup_cluster
    dot_ssh_tar = ssh_read(master, opts, ['tar', 'c', '.ssh'])
  File "spark_ec2.py", line 638, in ssh_read
    return subprocess.check_output(
AttributeError: 'module' object has no attribute 'check_output'
{code}

spark-ec2 should only need Python 2.6, not 2.7
----------------------------------------------

Key: SPARK-1990
URL: https://issues.apache.org/jira/browse/SPARK-1990
Project: Spark
Issue Type: Improvement
Reporter: Matei Zaharia
Labels: Starter
Fix For: 1.0.1, 1.1.0

There have been some posts on the mailing lists reporting that spark-ec2 does not work with Python 2.6. In addition, we should check the Python version at the top of the script and exit if it's too old.
[jira] [Created] (SPARK-1991) Support custom StorageLevels for vertices and edges
Ankur Dave created SPARK-1991:
------------------------------

Summary: Support custom StorageLevels for vertices and edges
Key: SPARK-1991
URL: https://issues.apache.org/jira/browse/SPARK-1991
Project: Spark
Issue Type: Improvement
Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

Large graphs may not fit entirely in memory. If we supported custom storage levels for the vertices and edges of a graph, the user could specify MEMORY_AND_DISK and then repartition the graph to use many small partitions, each of which does fit in memory. Spark would then automatically load partitions from disk as needed. Also, the replicated storage levels would be helpful for fault tolerance, and the serialized ones would improve efficiency for non-primitive vertex and edge attributes.
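A hypothetical usage sketch of the feature (e.g. in spark-shell, where sc is available); the edgeStorageLevel and vertexStorageLevel parameters are assumptions about what the API could look like, not an existing API:

{code}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.storage.StorageLevel

// Many small partitions, each fitting in memory; partitions spill to disk and
// are reloaded as needed. The two storage-level parameters are hypothetical.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///graphs/edges.txt",
  minEdgePartitions = 1024,
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK_SER)
{code}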
[jira] [Commented] (SPARK-1958) Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.
[ https://issues.apache.org/jira/browse/SPARK-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015158#comment-14015158 ]

Cheng Lian commented on SPARK-1958:
-----------------------------------

PR: https://github.com/apache/spark/pull/939

Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.
---------------------------------------------------------------------------------------------

Key: SPARK-1958
URL: https://issues.apache.org/jira/browse/SPARK-1958
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Cheng Lian
Fix For: 1.1.0

In some cases (like LIMIT), executeCollect() makes optimizations that execute().collect() will not.
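The shape of the fix could be as small as overriding collect() in SchemaRDD to go through the optimized physical plan; a sketch (see the PR above for the actual change):

{code}
// Inside org.apache.spark.sql.SchemaRDD: delegate to the physical plan's
// executeCollect() so plan-level optimizations such as LIMIT take effect
// on collect() as well.
override def collect(): Array[Row] = queryExecution.executedPlan.executeCollect()
{code}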