[jira] [Commented] (SPARK-1982) saveToParquetFile doesn't support ByteType

2014-06-01 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014945#comment-14014945
 ] 

Andre Schumacher commented on SPARK-1982:
-

It turns out that ByteType primitive types weren't handled correctly earlier. 
Since Parquet doesn't have this type, one fix is to use fixed-length byte arrays 
(which are also treated as primitives). This is fine until nested types are 
supported; even then, I think one may want to treat these as actual arrays and 
not primitives.

Anyway, the PR is available here: https://github.com/apache/spark/pull/934
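
For context, a minimal sketch of the kind of mapping involved (illustrative only, 
not the code in the PR; the import follows the 1.0-era package layout):

{code}
import org.apache.spark.sql.catalyst.types._

// Map a Catalyst type to the name of a Parquet primitive type. Parquet has no
// 1-byte integer primitive, so ByteType is stored as a fixed-length byte array
// of length 1, as described above.
def toParquetPrimitive(dt: DataType): String = dt match {
  case BooleanType => "BOOLEAN"
  case IntegerType => "INT32"
  case LongType    => "INT64"
  case FloatType   => "FLOAT"
  case DoubleType  => "DOUBLE"
  case StringType  => "BINARY (UTF8)"
  case ByteType    => "FIXED_LEN_BYTE_ARRAY(1)"
  case other       => sys.error(s"Unsupported datatype $other")
}
{code}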

 saveToParquetFile doesn't support ByteType
 --

 Key: SPARK-1982
 URL: https://issues.apache.org/jira/browse/SPARK-1982
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Andre Schumacher

 {code}
 java.lang.RuntimeException: Unsupported datatype ByteType
   at scala.sys.package$.error(package.scala:27)
   at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetRelation.scala:201)
 ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1983) Expose private `inferSchema` method in SQLContext for Scala and Java API

2014-06-01 Thread Kuldeep (JIRA)
Kuldeep created SPARK-1983:
--

 Summary: Expose private `inferSchema` method in SQLContext for 
Scala and Java API
 Key: SPARK-1983
 URL: https://issues.apache.org/jira/browse/SPARK-1983
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Kuldeep






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1983) Expose private `inferSchema` method in SQLContext for Scala and Java API

2014-06-01 Thread Kuldeep (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuldeep updated SPARK-1983:
---

Description: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L298

The above method would be useful to expose in the Scala and Java APIs for making 
SparkSQL work without creating classes. It would let one create tables from a 
simple RDD of Maps without defining classes.

  was:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L298

The above method can be useful to expose in Scala and Java API for making 
SparkSQL work without creating classes.


 Expose private `inferSchema` method in SQLContext for Scala and Java API
 

 Key: SPARK-1983
 URL: https://issues.apache.org/jira/browse/SPARK-1983
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Kuldeep

 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L298
 The above method would be useful to expose in the Scala and Java APIs for 
 making SparkSQL work without creating classes. It would let one create 
 tables from a simple RDD of Maps without defining classes.
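
For illustration, a hypothetical usage sketch of what the exposed API might enable 
(the method call and its signature are assumptions based on the description above, 
not the current public API):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "inferSchema-sketch")
val sqlContext = new SQLContext(sc)

// An RDD of Maps, with no case class defined for the rows.
val people = sc.parallelize(Seq(
  Map("name" -> "alice", "age" -> 30),
  Map("name" -> "bob",   "age" -> 25)))

// If inferSchema were public, this could become a table directly:
val schemaRDD = sqlContext.inferSchema(people)
schemaRDD.registerAsTable("people")
{code}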



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-06-01 Thread Madhu Siddalingaiah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015020#comment-14015020
 ] 

Madhu Siddalingaiah commented on SPARK-983:
---

I tested some additions locally that seem to work well so far. I created a 
SortedPartitionsRDD and a sortPartitions(...) method in 
[RDD|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala]:

{code}
  /**
   * Return a new RDD containing sorted partitions in this RDD.
   */
  def sortPartitions(lt: (T, T) => Boolean): RDD[T] =
    new SortedPartitionsRDD(this, sc.clean(lt))
{code}

I haven't added the spill/merge code to SortedPartitionsRDD yet. I wanted to 
get some buy-in on this method as it's an addition to the API. It fits nicely 
with 
[OrderedRDDFunctions|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala]
 and passes all tests in 
[SortingSuite|https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/rdd/SortingSuite.scala].

I think this method can be used to address 
[SPARK-1021|https://issues.apache.org/jira/browse/SPARK-1021] as well as many 
use cases outside of sortByKey(). Does everyone agree? If so, I'll move forward 
with external sort in SortedPartitionsRDD and necessary tests.
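
For example, a hypothetical usage once sortPartitions is in place (a sketch only; 
sortPartitions is the proposed method above, not an existing API):

{code}
import org.apache.spark.{SparkContext, RangePartitioner}
import org.apache.spark.SparkContext._  // pair-RDD implicits (1.0-era)

val sc = new SparkContext("local", "sortPartitions-sketch")
val pairs = sc.parallelize(Seq(3 -> "c", 1 -> "a", 2 -> "b"))

// sortByKey() could become: range-partition by key, then sort each partition
// locally (eventually with spill-to-disk inside SortedPartitionsRDD).
val sorted = pairs
  .partitionBy(new RangePartitioner(2, pairs))
  .sortPartitions((x, y) => x._1 < y._1)
{code}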

 Support external sorting for RDD#sortByKey()
 

 Key: SPARK-983
 URL: https://issues.apache.org/jira/browse/SPARK-983
 Project: Spark
  Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Reynold Xin
Assignee: Madhu Siddalingaiah

 Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
 buffer to hold the entire partition, then sorts it. This will cause an OOM if 
 an entire partition cannot fit in memory, which is especially problematic for 
 skewed data. Rather than OOMing, the behavior should be similar to the 
 [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
  where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1984) Maven build requires SCALA_HOME to be set even though it's not needed

2014-06-01 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1984:
--

 Summary: Maven build requires SCALA_HOME to be set even though 
it's not needed
 Key: SPARK-1984
 URL: https://issues.apache.org/jira/browse/SPARK-1984
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1984) Maven build requires SCALA_HOME to be set even though it's not needed

2014-06-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1984:
---

Fix Version/s: 1.1.0

 Maven build requires SCALA_HOME to be set even though it's not needed
 -

 Key: SPARK-1984
 URL: https://issues.apache.org/jira/browse/SPARK-1984
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1984) Maven build requires SCALA_HOME to be set even though it's not needed

2014-06-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015054#comment-14015054
 ] 

Patrick Wendell commented on SPARK-1984:


master: 
https://github.com/apache/spark/commit/d8c005d5371f81a2a06c5d27c7021e1ae43d7193

1.0:
https://github.com/apache/spark/commit/a54a48f83674bb3c6f9aca9f736448338b029dfd

 Maven build requires SCALA_HOME to be set even though it's not needed
 -

 Key: SPARK-1984
 URL: https://issues.apache.org/jira/browse/SPARK-1984
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.0.1, 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1984) Maven build requires SCALA_HOME to be set even though it's not needed

2014-06-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1984:
---

Fix Version/s: (was: 1.0.0)
   1.0.1

 Maven build requires SCALA_HOME to be set even though it's not needed
 -

 Key: SPARK-1984
 URL: https://issues.apache.org/jira/browse/SPARK-1984
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.0.1, 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1985) SPARK_HOME shouldn't be required when spark.executor.uri is provided

2014-06-01 Thread Gerard Maas (JIRA)
Gerard Maas created SPARK-1985:
--

 Summary: SPARK_HOME shouldn't be required when spark.executor.uri 
is provided
 Key: SPARK-1985
 URL: https://issues.apache.org/jira/browse/SPARK-1985
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: MESOS
Reporter: Gerard Maas


When trying to run the simple example from [1] on a Mesos installation, I get an 
error that SPARK_HOME is not set. A local Spark installation should not be 
required to run a job on Mesos. All that's needed is the executor package, i.e. 
the assembly .tar.gz at a reachable location (HDFS/S3/HTTP).

I went looking into the code and indeed there's a check on SPARK_HOME [2] 
regardless of whether the assembly is present, even though SPARK_HOME is only 
used when the assembly is not provided (a kind of best-effort recovery strategy).

Current flow:

if (!SPARK_HOME) { fail("No SPARK_HOME") }
else if (assembly) { use assembly }
else { try to use SPARK_HOME to build spark_executor }

Should be:

sparkExecutor =
  if (assembly) { assembly }
  else if (SPARK_HOME) { try to use SPARK_HOME to build spark_executor }
  else { fail("No executor found. Please provide spark.executor.uri (preferred) or spark.home") }
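
In Scala terms, the proposed flow might look roughly like this (a sketch only; the 
helper name and the SPARK_HOME-derived command are placeholders, not the actual 
MesosSchedulerBackend code):

{code}
def resolveExecutor(executorUri: Option[String], sparkHome: Option[String]): String =
  executorUri                                                     // spark.executor.uri, preferred
    .orElse(sparkHome.map(home => s"$home/sbin/spark-executor"))  // placeholder path
    .getOrElse(sys.error(
      "No executor found. Please provide spark.executor.uri (preferred) or spark.home"))
{code}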


[1] 
http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-with-Spark-Mesos-spark-shell-works-fine-td6165.html

[2] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L89



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1986) lib.Analytics should be in org.apache.spark.examples

2014-06-01 Thread Ankur Dave (JIRA)
Ankur Dave created SPARK-1986:
-

 Summary: lib.Analytics should be in org.apache.spark.examples
 Key: SPARK-1986
 URL: https://issues.apache.org/jira/browse/SPARK-1986
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave


The org.apache.spark.graphx.lib.Analytics driver is currently hard to run; the 
user has to figure out the correct invocation involving spark-submit. Instead, 
it should be put into the examples package to enable running it using 
bin/run-example.

Here is how Analytics must be invoked currently:
```
~/spark/bin/spark-submit --master spark://$(wget -q -O - 
http://169.254.169.254/latest/meta-data/public-hostname):7077 --class 
org.apache.spark.graphx.lib.Analytics 
~/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar
 triangles /soc-LiveJournal1.txt --numEPart=256
```
Any JAR can be supplied in place of the assembly jar, as long as it exists.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1986) lib.Analytics should be in org.apache.spark.examples

2014-06-01 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-1986:
--

Issue Type: Improvement  (was: Bug)

 lib.Analytics should be in org.apache.spark.examples
 

 Key: SPARK-1986
 URL: https://issues.apache.org/jira/browse/SPARK-1986
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

 The org.apache.spark.graphx.lib.Analytics driver is currently hard to run; 
 the user has to figure out the correct invocation involving spark-submit. 
 Instead, it should be put into the examples package to enable running it 
 using bin/run-example.
 Here is how Analytics must be invoked currently:
 ```
 ~/spark/bin/spark-submit --master spark://$(wget -q -O - 
 http://169.254.169.254/latest/meta-data/public-hostname):7077 --class 
 org.apache.spark.graphx.lib.Analytics 
 ~/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar
  triangles /soc-LiveJournal1.txt --numEPart=256
 ```
 Any JAR can be supplied in place of the assembly jar, as long as it exists.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1988) Enable storing edges out-of-core

2014-06-01 Thread Ankur Dave (JIRA)
Ankur Dave created SPARK-1988:
-

 Summary: Enable storing edges out-of-core
 Key: SPARK-1988
 URL: https://issues.apache.org/jira/browse/SPARK-1988
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave


A graph's edges are usually the largest component of the graph, and a cluster 
may not have enough memory to hold them. For example, a graph with 20 billion 
edges requires at least 400 GB of memory, because each edge takes 20 bytes.

GraphX only ever accesses the edges using full table scans or cluster scans 
using the clustered index on source vertex ID. The edges are therefore amenable 
to being stored on disk. EdgePartition should provide the option of storing 
edges on disk transparently and streaming through them as needed.
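
As a rough illustration (the names here are hypothetical; the real change would 
live inside EdgePartition), edges could be streamed from a flat binary file of 
(srcId, dstId) pairs instead of being held in an in-memory array:

{code}
import java.io.{BufferedInputStream, DataInputStream, EOFException, FileInputStream}

// Stream (srcId, dstId) pairs from disk, one edge at a time.
def edgeStream(path: String): Iterator[(Long, Long)] = new Iterator[(Long, Long)] {
  private val in =
    new DataInputStream(new BufferedInputStream(new FileInputStream(path)))
  private var nextEdge = read()

  private def read(): Option[(Long, Long)] =
    try Some((in.readLong(), in.readLong()))
    catch { case _: EOFException => in.close(); None }

  def hasNext: Boolean = nextEdge.isDefined
  def next(): (Long, Long) = { val e = nextEdge.get; nextEdge = read(); e }
}
{code}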



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1989) Exit executors faster if they get into a cycle of heavy GC

2014-06-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1989:


 Summary: Exit executors faster if they get into a cycle of heavy GC
 Key: SPARK-1989
 URL: https://issues.apache.org/jira/browse/SPARK-1989
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Matei Zaharia
 Fix For: 1.1.0


I've seen situations where an application is allocating too much memory across 
its tasks + cache to proceed, but Java gets into a cycle where it repeatedly 
runs full GCs, frees up a bit of the heap, and continues instead of giving up. 
This then leads to timeouts and confusing error messages. It would be better to 
crash with OOM sooner. The JVM has options to support this: 
http://java.dzone.com/articles/tracking-excessive-garbage.

The right solution would probably be:
- Add some config options used by spark-submit to set -XX:GCTimeLimit and 
-XX:GCHeapFreeLimit, with more conservative values than the defaults (e.g. 90% 
time limit, 5% free limit), as sketched below
- Make sure we pass these into the Java options for executors in each 
deployment mode
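
A minimal sketch of the first step, assuming the flags are passed through 
spark.executor.extraJavaOptions (the eventual config names chosen may differ):

{code}
import org.apache.spark.SparkConf

// Mirror the suggested 90% GC-time / 5% heap-free limits on executors.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-XX:GCTimeLimit=90 -XX:GCHeapFreeLimit=5")
{code}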



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1990) spark-ec2 should only need Python 2.6, not 2.7

2014-06-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1990:


 Summary: spark-ec2 should only need Python 2.6, not 2.7
 Key: SPARK-1990
 URL: https://issues.apache.org/jira/browse/SPARK-1990
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
 Fix For: 1.0.1, 1.1.0


There were some posts on the lists reporting that spark-ec2 does not work with 
Python 2.6. In addition, we should check the Python version at the top of the 
script and exit if it's too old.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1790) Update EC2 scripts to support r3 instance types

2014-06-01 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1790:
-

Fix Version/s: 1.0.1

 Update EC2 scripts to support r3 instance types
 ---

 Key: SPARK-1790
 URL: https://issues.apache.org/jira/browse/SPARK-1790
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 0.9.0, 0.9.1, 1.0.0
Reporter: Matei Zaharia
Assignee: Sujeet Varakhedi
  Labels: Starter
 Fix For: 1.0.1


 These were recently added by Amazon as a cheaper high-memory option.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1990) spark-ec2 should only need Python 2.6, not 2.7

2014-06-01 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015146#comment-14015146
 ] 

Matei Zaharia commented on SPARK-1990:
--

BTW here is the first error this gets:
{code}
Connection to ec2-54-186-88-202.us-west-2.compute.amazonaws.com closed.
Traceback (most recent call last):
  File "spark_ec2.py", line 824, in <module>
    main()
  File "spark_ec2.py", line 816, in main
    real_main()
  File "spark_ec2.py", line 701, in real_main
    setup_cluster(conn, master_nodes, slave_nodes, opts, True)
  File "spark_ec2.py", line 430, in setup_cluster
    dot_ssh_tar = ssh_read(master, opts, ['tar', 'c', '.ssh'])
  File "spark_ec2.py", line 638, in ssh_read
    return subprocess.check_output(
AttributeError: 'module' object has no attribute 'check_output'
{code}

 spark-ec2 should only need Python 2.6, not 2.7
 --

 Key: SPARK-1990
 URL: https://issues.apache.org/jira/browse/SPARK-1990
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
  Labels: Starter
 Fix For: 1.0.1, 1.1.0


 There were some posts on the lists reporting that spark-ec2 does not work with 
 Python 2.6. In addition, we should check the Python version at the top of the 
 script and exit if it's too old.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1991) Support custom StorageLevels for vertices and edges

2014-06-01 Thread Ankur Dave (JIRA)
Ankur Dave created SPARK-1991:
-

 Summary: Support custom StorageLevels for vertices and edges
 Key: SPARK-1991
 URL: https://issues.apache.org/jira/browse/SPARK-1991
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave


Large graphs may not fit entirely in memory. If we supported custom storage 
levels for the vertices and edges of a graph, the user could specify 
MEMORY_AND_DISK and then repartition the graph to use many small partitions, 
each of which does fit in memory. Spark would then automatically load 
partitions from disk as needed.

Also, the replicated storage levels would be helpful for fault tolerance, and 
the serialized ones would improve efficiency for non-primitive vertex and edge 
attributes.
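
A hypothetical sketch of what such an API could look like (the parameter names 
here are assumptions for illustration, not an existing signature):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.graphx.GraphLoader

val sc = new SparkContext("local", "graph-storage-sketch")
// Spill partitions that don't fit in memory to disk instead of recomputing them.
val graph = GraphLoader.edgeListFile(
  sc, "hdfs:///data/edges.txt",
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
{code}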



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1958) Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.

2014-06-01 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015158#comment-14015158
 ] 

Cheng Lian commented on SPARK-1958:
---

PR: https://github.com/apache/spark/pull/939

 Calling .collect() on a SchemaRDD should call executeCollect() on the 
 underlying query plan.
 

 Key: SPARK-1958
 URL: https://issues.apache.org/jira/browse/SPARK-1958
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Cheng Lian
 Fix For: 1.1.0


 In some cases (like LIMIT) executeCollect() makes optimizations that 
 execute().collect() will not.



--
This message was sent by Atlassian JIRA
(v6.2#6252)