[jira] [Resolved] (SPARK-2390) Files in .sparkStaging on HDFS cannot be deleted and wastes the space of HDFS

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2390.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1326
[https://github.com/apache/spark/pull/1326]

> Files in .sparkStaging on HDFS cannot be deleted and wastes the space of HDFS
> -
>
> Key: SPARK-2390
> URL: https://issues.apache.org/jira/browse/SPARK-2390
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Kousuke Saruta
> Fix For: 1.1.0
>
>
> When running jobs in YARN cluster mode with the HistoryServer enabled, the files 
> in the staging directory cannot be deleted.
> HistoryServer uses the directory where the event log is written, and that directory 
> is represented as an instance of o.a.h.f.FileSystem created via 
> FileSystem.get.
> {code:title=FileLogger.scala}
> private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
> {code}
> {code:title=Utils.getHadoopFileSystem}
> def getHadoopFileSystem(path: URI): FileSystem = {
>   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
> }
> {code}
> On the other hand, ApplicationMaster has an instance named fs, which is also 
> created via FileSystem.get.
> {code:title=ApplicationMaster}
> private val fs = FileSystem.get(yarnConf)
> {code}
> FileSystem.get returns the same cached instance when the URI passed to the method 
> refers to the same file system and the method is called by the same user.
> Because of this behavior, when the event log directory is on HDFS, the fs of 
> ApplicationMaster and the fileSystem of FileLogger are the same instance.
> When the ApplicationMaster shuts down, fileSystem.close is called in 
> FileLogger#stop, which is invoked indirectly by SparkContext#stop.
> {code:title=FileLogger.stop}
> def stop() {
>   hadoopDataStream.foreach(_.close())
>   writer.foreach(_.close())
>   fileSystem.close()
> }
> {code}
> ApplicationMaster#cleanupStagingDir is also called by a JVM shutdown hook, and it 
> invokes fs.delete(stagingDirPath).
> Because fs.delete in ApplicationMaster is called after fileSystem.close in 
> FileLogger, fs.delete fails and the files in the staging directory are never 
> deleted.
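
The root cause is Hadoop's FileSystem cache. The sketch below is illustrative only, not the actual fix in PR 1326; the namenode URI and the cache-disabling key are assumptions. It shows how the cache hands out a shared instance, and two ways a caller could avoid sharing one.

{code:title=FileSystemCacheSketch.scala}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val conf = new Configuration()
val uri = new URI("hdfs://namenode:8020/")   // placeholder address

// FileSystem.get caches instances per (scheme, authority, user), so both
// callers receive the same object, and close() by either breaks the other.
val shared1 = FileSystem.get(uri, conf)
val shared2 = FileSystem.get(uri, conf)
assert(shared1 eq shared2)

// Option 1: ask for an uncached, private instance that can be closed safely.
val privateFs = FileSystem.newInstance(uri, conf)
privateFs.close()   // does not affect shared1/shared2

// Option 2: disable the cache for the scheme via configuration.
conf.setBoolean("fs.hdfs.impl.disable.cache", true)
{code}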



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2390) Files in .sparkStaging on HDFS cannot be deleted and wastes the space of HDFS

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2390:
---

Assignee: Kousuke Saruta

> Files in .sparkStaging on HDFS cannot be deleted and wastes the space of HDFS
> -
>
> Key: SPARK-2390
> URL: https://issues.apache.org/jira/browse/SPARK-2390
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
> Fix For: 1.1.0
>
>
> When running jobs in YARN cluster mode with the HistoryServer enabled, the files 
> in the staging directory cannot be deleted.
> HistoryServer uses the directory where the event log is written, and that directory 
> is represented as an instance of o.a.h.f.FileSystem created via 
> FileSystem.get.
> {code:title=FileLogger.scala}
> private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
> {code}
> {code:title=Utils.getHadoopFileSystem}
> def getHadoopFileSystem(path: URI): FileSystem = {
>   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
> }
> {code}
> On the other hand, ApplicationMaster has an instance named fs, which is also 
> created via FileSystem.get.
> {code:title=ApplicationMaster}
> private val fs = FileSystem.get(yarnConf)
> {code}
> FileSystem.get returns the same cached instance when the URI passed to the method 
> refers to the same file system and the method is called by the same user.
> Because of this behavior, when the event log directory is on HDFS, the fs of 
> ApplicationMaster and the fileSystem of FileLogger are the same instance.
> When the ApplicationMaster shuts down, fileSystem.close is called in 
> FileLogger#stop, which is invoked indirectly by SparkContext#stop.
> {code:title=FileLogger.stop}
> def stop() {
>   hadoopDataStream.foreach(_.close())
>   writer.foreach(_.close())
>   fileSystem.close()
> }
> {code}
> ApplicationMaster#cleanupStagingDir is also called by a JVM shutdown hook, and it 
> invokes fs.delete(stagingDirPath).
> Because fs.delete in ApplicationMaster is called after fileSystem.close in 
> FileLogger, fs.delete fails and the files in the staging directory are never 
> deleted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2399) Add support for LZ4 compression

2014-07-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2399:
---

Target Version/s: 1.1.0

> Add support for LZ4 compression
> ---
>
> Key: SPARK-2399
> URL: https://issues.apache.org/jira/browse/SPARK-2399
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Greg Bowyer
>Assignee: Reynold Xin
>  Labels: compression, lz4
> Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch
>
>
> LZ4 is a compression codec built on the same ideas as Google's Snappy, but it has 
> some advantages:
> * It is faster than Snappy with a similar compression ratio
> * The implementation is Apache-licensed and not GPL
> It has shown promise in both the Lucene and Hadoop communities, and it looks 
> like it would be an easy addition to Spark's I/O compression.
> Attached is a patch that does this.
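
For context, here is a minimal sketch of what such a codec could look like, assuming the lz4-java library's block streams and Spark's CompressionCodec interface (compressedOutputStream/compressedInputStream). The class name and the configuration key are illustrative; this is not the attached patch.

{code:title=LZ4CodecSketch.scala}
import java.io.{InputStream, OutputStream}

import net.jpountz.lz4.{LZ4BlockInputStream, LZ4BlockOutputStream}
import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Sketch: wrap lz4-java's block streams behind Spark's codec interface.
class LZ4CompressionCodec(conf: SparkConf) extends CompressionCodec {
  // Block size is tunable; 32 KB here is just an illustrative default.
  private val blockSize = conf.getInt("spark.io.compression.lz4.block.size", 32768)

  override def compressedOutputStream(s: OutputStream): OutputStream =
    new LZ4BlockOutputStream(s, blockSize)

  override def compressedInputStream(s: InputStream): InputStream =
    new LZ4BlockInputStream(s)
}
{code}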



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2014-07-14 Thread wangfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061767#comment-14061767
 ] 

wangfei commented on SPARK-2243:


What do you mean by "but it's something we could support in the future"? Can 
you give some examples? Thanks.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 1.0.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:18

[jira] [Created] (SPARK-2488) Model SerDe in MLlib

2014-07-14 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-2488:


 Summary: Model SerDe in MLlib
 Key: SPARK-2488
 URL: https://issues.apache.org/jira/browse/SPARK-2488
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng


Support model serialization/deserialization in MLlib. The first version could 
be text-based.
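
As a rough illustration of what a text-based first version might look like, here is a sketch for a simple linear model. The saveAsText/loadFromText helpers and the line format are hypothetical, not a proposal from this ticket.

{code:title=TextModelSerDeSketch.scala}
import java.io.{File, PrintWriter}
import scala.io.Source

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

// Hypothetical format: one line of comma-separated weights, one line of intercept.
def saveAsText(model: LinearRegressionModel, path: String): Unit = {
  val out = new PrintWriter(new File(path))
  try {
    out.println(model.weights.toArray.mkString(","))
    out.println(model.intercept)
  } finally {
    out.close()
  }
}

def loadFromText(path: String): LinearRegressionModel = {
  val lines = Source.fromFile(path).getLines().toIndexedSeq
  val weights = Vectors.dense(lines(0).split(",").map(_.toDouble))
  new LinearRegressionModel(weights, lines(1).toDouble)
}
{code}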



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2399) Add support for LZ4 compression

2014-07-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2399:
---

Assignee: Reynold Xin

> Add support for LZ4 compression
> ---
>
> Key: SPARK-2399
> URL: https://issues.apache.org/jira/browse/SPARK-2399
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Greg Bowyer
>Assignee: Reynold Xin
>  Labels: compression, lz4
> Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch
>
>
> LZ4 is a compression codec built on the same ideas as Google's Snappy, but it has 
> some advantages:
> * It is faster than Snappy with a similar compression ratio
> * The implementation is Apache-licensed and not GPL
> It has shown promise in both the Lucene and Hadoop communities, and it looks 
> like it would be an easy addition to Spark's I/O compression.
> Attached is a patch that does this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1994) Aggregates return incorrect results on first execution

2014-07-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1994:


Fix Version/s: 1.1.0
   1.0.1

> Aggregates return incorrect results on first execution
> --
>
> Key: SPARK-1994
> URL: https://issues.apache.org/jira/browse/SPARK-1994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.0.1, 1.1.0
>
>
> [~adav] has a full reproduction: he has found a case where the first run 
> returns corrupted results but the second run does not. The same does not 
> occur when reading from HDFS a second time...
> {code}
> sql("SELECT lang, COUNT(*) AS cnt FROM tweetTable GROUP BY lang ORDER BY cnt 
> DESC").collect.foreach(println)
> [bg,16636]
> [16266,16266]
> [16223,16223]
> [16161,16161]
> [16047,16047]
> [lt,11405]
> [hu,11380]
> [el,10845]
> [da,10289]
> [fi,10261]
> [9897,9897]
> [9765,9765]
> [9751,9751]
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2399) Add support for LZ4 compression

2014-07-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061745#comment-14061745
 ] 

Reynold Xin commented on SPARK-2399:


Actually never mind. I will submit a pull request based on your change. Thanks!


> Add support for LZ4 compression
> ---
>
> Key: SPARK-2399
> URL: https://issues.apache.org/jira/browse/SPARK-2399
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Greg Bowyer
>  Labels: compression, lz4
> Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch
>
>
> LZ4 is a compression codec built on the same ideas as Google's Snappy, but it has 
> some advantages:
> * It is faster than Snappy with a similar compression ratio
> * The implementation is Apache-licensed and not GPL
> It has shown promise in both the Lucene and Hadoop communities, and it looks 
> like it would be an easy addition to Spark's I/O compression.
> Attached is a patch that does this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2469) Lower shuffle compression buffer memory usage (replace LZF with Snappy for default compression codec)

2014-07-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2469:
---

Summary: Lower shuffle compression buffer memory usage (replace LZF with 
Snappy for default compression codec)  (was: Lower shuffle compression buffer 
memory usage)

> Lower shuffle compression buffer memory usage (replace LZF with Snappy for 
> default compression codec)
> -
>
> Key: SPARK-2469
> URL: https://issues.apache.org/jira/browse/SPARK-2469
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>
> I was looking into the memory usage of shuffle, and one annoying thing about the 
> default compression codec (LZF) is that the implementation we use allocates 
> buffers pretty generously. I did a simple experiment and found that creating 
> 1000 LZFOutputStreams allocated 198976424 bytes (~190MB). If we have a shuffle 
> task that uses 10k reducers and 32 threads running concurrently, the memory used 
> by the LZF streams alone would be ~60GB.
> In comparison, Snappy only allocates ~65MB for every 1k SnappyOutputStreams. 
> However, Snappy's compression ratio is slightly lower than LZF's; in my experience, 
> it leads to a 10-20% increase in size. Compression ratio does matter here 
> because we are sending data across the network.
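
A rough sketch of the kind of measurement described above, assuming the com.ning.compress.lzf and org.xerial.snappy stream classes that Spark's LZF and Snappy codecs wrap; the heap-based measurement is crude and illustrative only.

{code:title=CompressionBufferSketch.scala}
import java.io.ByteArrayOutputStream

import com.ning.compress.lzf.LZFOutputStream
import org.xerial.snappy.SnappyOutputStream

def usedHeap(): Long = {
  System.gc()
  Runtime.getRuntime.totalMemory - Runtime.getRuntime.freeMemory
}

// Keep 1000 streams of each kind alive and compare the heap growth.
val base = usedHeap()
val lzf = (1 to 1000).map(_ => new LZFOutputStream(new ByteArrayOutputStream()))
val afterLzf = usedHeap()
val snappy = (1 to 1000).map(_ => new SnappyOutputStream(new ByteArrayOutputStream()))
val afterSnappy = usedHeap()

println(s"LZF:    ~${(afterLzf - base) / (1024 * 1024)} MB for 1000 streams")
println(s"Snappy: ~${(afterSnappy - afterLzf) / (1024 * 1024)} MB for 1000 streams")
{code}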



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2399) Add support for LZ4 compression

2014-07-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061736#comment-14061736
 ] 

Reynold Xin commented on SPARK-2399:


Do you mind submitting a pull request for this? I was actually looking into lz4 
and wanted to add one.

> Add support for LZ4 compression
> ---
>
> Key: SPARK-2399
> URL: https://issues.apache.org/jira/browse/SPARK-2399
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Greg Bowyer
>  Labels: compression, lz4
> Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch
>
>
> LZ4 is a compression codec built on the same ideas as Google's Snappy, but it has 
> some advantages:
> * It is faster than Snappy with a similar compression ratio
> * The implementation is Apache-licensed and not GPL
> It has shown promise in both the Lucene and Hadoop communities, and it looks 
> like it would be an easy addition to Spark's I/O compression.
> Attached is a patch that does this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2486) Utils.getCallSite can crash under JVMTI profilers

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2486:
---

Assignee: William Benton

> Utils.getCallSite can crash under JVMTI profilers
> -
>
> Key: SPARK-2486
> URL: https://issues.apache.org/jira/browse/SPARK-2486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.1
> Environment: running under profilers (observed on OS X under YourKit 
> with CPU profiling and/or object allocation site tracking enabled)
>Reporter: William Benton
>Assignee: William Benton
>Priority: Minor
> Fix For: 1.1.0
>
>
> When running under an instrumenting profiler, Utils.getCallSite sometimes 
> crashes with an NPE while examining stack trace elements.
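
For illustration, this is the kind of defensive guard involved (a sketch, not the patch in PR 1413): some instrumenting agents inject synthetic frames whose accessors can return null, so the stack walk needs null checks before inspecting each element.

{code:title=GetCallSiteGuardSketch.scala}
// Sketch only: drop null frames and null method names before the kind of
// "skip getStackTrace frames" filtering that getCallSite performs.
val trace = Thread.currentThread.getStackTrace.filter { ste =>
  ste != null && ste.getMethodName != null &&
    !ste.getMethodName.contains("getStackTrace")
}
{code}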



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2486) Utils.getCallSite can crash under JVMTI profilers

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2486.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1413
[https://github.com/apache/spark/pull/1413]

> Utils.getCallSite can crash under JVMTI profilers
> -
>
> Key: SPARK-2486
> URL: https://issues.apache.org/jira/browse/SPARK-2486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.1
> Environment: running under profilers (observed on OS X under YourKit 
> with CPU profiling and/or object allocation site tracking enabled)
>Reporter: William Benton
>Priority: Minor
> Fix For: 1.1.0
>
>
> When running under an instrumenting profiler, Utils.getCallSite sometimes 
> crashes with an NPE while examining stack trace elements.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2481) The environment variables SPARK_HISTORY_OPTS is covered in start-history-server.sh

2014-07-14 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2481:
---

Description: 
If we have the following code in the conf/spark-env.sh  
{{export SPARK_HISTORY_OPTS="-DSpark.history.XX=XX"}}
The environment variables SPARK_HISTORY_OPTS is covered in 
[start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]
 
{code}
if [ $# != 0 ]; then
  echo "Using command line arguments for setting the log directory is 
deprecated. Please "
  echo "set the spark.history.fs.logDirectory configuration option instead."
  export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS 
-Dspark.history.fs.logDirectory=$1"
fi
{code}

  was:
In 
If we have the following code in the conf/spark-env.sh  
{{export SPARK_HISTORY_OPTS="-DSpark.history.XX=XX"}}
The environment variables SPARK_HISTORY_OPTS is covered in 
[start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]
 
{code}
if [ $# != 0 ]; then
  echo "Using command line arguments for setting the log directory is 
deprecated. Please "
  echo "set the spark.history.fs.logDirectory configuration option instead."
  export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS 
-Dspark.history.fs.logDirectory=$1"
fi
{code}


> The environment variables SPARK_HISTORY_OPTS is covered in 
> start-history-server.sh
> --
>
> Key: SPARK-2481
> URL: https://issues.apache.org/jira/browse/SPARK-2481
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>
> If we have the following code in the conf/spark-env.sh  
> {{export SPARK_HISTORY_OPTS="-DSpark.history.XX=XX"}}
> The environment variables SPARK_HISTORY_OPTS is covered in 
> [start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]
>  
> {code}
> if [ $# != 0 ]; then
>   echo "Using command line arguments for setting the log directory is 
> deprecated. Please "
>   echo "set the spark.history.fs.logDirectory configuration option instead."
>   export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS 
> -Dspark.history.fs.logDirectory=$1"
> fi
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2481) The environment variables SPARK_HISTORY_OPTS is covered in start-history-server.sh

2014-07-14 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2481:
---

Description: 
In 
If we have the following code in the conf/spark-env.sh  
{{export SPARK_HISTORY_OPTS="-DSpark.history.XX=XX"}}
The environment variables SPARK_HISTORY_OPTS is covered in 
[start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]
 
{code}
if [ $# != 0 ]; then
  echo "Using command line arguments for setting the log directory is 
deprecated. Please "
  echo "set the spark.history.fs.logDirectory configuration option instead."
  export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS 
-Dspark.history.fs.logDirectory=$1"
fi
{code}

  was:
In 
If we have the following code in the conf/spark-env.sh  
{{export SPARK_HISTORY_OPTS="-DSpark.history.XX=XX"}}

Environment variable will be overwritten
[start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]
{code}
if [ $# != 0 ]; then
  echo "Using command line arguments for setting the log directory is 
deprecated. Please "
  echo "set the spark.history.fs.logDirectory configuration option instead."
  export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS 
-Dspark.history.fs.logDirectory=$1"
fi
{code}


> The environment variables SPARK_HISTORY_OPTS is covered in 
> start-history-server.sh
> --
>
> Key: SPARK-2481
> URL: https://issues.apache.org/jira/browse/SPARK-2481
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>
> In 
> If we have the following code in the conf/spark-env.sh  
> {{export SPARK_HISTORY_OPTS="-DSpark.history.XX=XX"}}
> The environment variables SPARK_HISTORY_OPTS is covered in 
> [start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]
>  
> {code}
> if [ $# != 0 ]; then
>   echo "Using command line arguments for setting the log directory is 
> deprecated. Please "
>   echo "set the spark.history.fs.logDirectory configuration option instead."
>   export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS 
> -Dspark.history.fs.logDirectory=$1"
> fi
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2467) Revert SparkBuild to publish-local to both .m2 and .ivy2.

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2467.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1398
[https://github.com/apache/spark/pull/1398]

> Revert SparkBuild to publish-local to both .m2 and .ivy2.
> -
>
> Key: SPARK-2467
> URL: https://issues.apache.org/jira/browse/SPARK-2467
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2481) The environment variables SPARK_HISTORY_OPTS is covered in start-history-server.sh

2014-07-14 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2481:
---

Description: 
In 
If we have the following code in the conf/spark-env.sh  
{{export SPARK_HISTORY_OPTS="-DSpark.history.XX=XX"}}

Environment variable will be overwritten
[start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]
{code}
if [ $# != 0 ]; then
  echo "Using command line arguments for setting the log directory is 
deprecated. Please "
  echo "set the spark.history.fs.logDirectory configuration option instead."
  export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS 
-Dspark.history.fs.logDirectory=$1"
fi
{code}

  was:
In 
If we have the following code in the conf/spark-env.sh  
{{export SPARK_HISTORY_OPTS="-DSpark.history.XX=XX"}}

[start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]
{code}
if [ $# != 0 ]; then
  echo "Using command line arguments for setting the log directory is 
deprecated. Please "
  echo "set the spark.history.fs.logDirectory configuration option instead."
  export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS 
-Dspark.history.fs.logDirectory=$1"
fi
{code}


> The environment variables SPARK_HISTORY_OPTS is covered in 
> start-history-server.sh
> --
>
> Key: SPARK-2481
> URL: https://issues.apache.org/jira/browse/SPARK-2481
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>
> In 
> If we have the following code in the conf/spark-env.sh  
> {{export SPARK_HISTORY_OPTS="-DSpark.history.XX=XX"}}
> Environment variable will be overwritten
> [start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]
> {code}
> if [ $# != 0 ]; then
>   echo "Using command line arguments for setting the log directory is 
> deprecated. Please "
>   echo "set the spark.history.fs.logDirectory configuration option instead."
>   export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS 
> -Dspark.history.fs.logDirectory=$1"
> fi
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2481) The environment variables SPARK_HISTORY_OPTS is covered in start-history-server.sh

2014-07-14 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2481:
---

Description: 
In 
If we have the following code in the conf/spark-env.sh  
{{export SPARK_HISTORY_OPTS="-DSpark.history.XX=XX"}}

[start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]
{code}
if [ $# != 0 ]; then
  echo "Using command line arguments for setting the log directory is 
deprecated. Please "
  echo "set the spark.history.fs.logDirectory configuration option instead."
  export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS 
-Dspark.history.fs.logDirectory=$1"
fi
{code}

> The environment variables SPARK_HISTORY_OPTS is covered in 
> start-history-server.sh
> --
>
> Key: SPARK-2481
> URL: https://issues.apache.org/jira/browse/SPARK-2481
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>
> In 
> If we have the following code in the conf/spark-env.sh  
> {{export SPARK_HISTORY_OPTS="-DSpark.history.XX=XX"}}
> [start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]
> {code}
> if [ $# != 0 ]; then
>   echo "Using command line arguments for setting the log directory is 
> deprecated. Please "
>   echo "set the spark.history.fs.logDirectory configuration option instead."
>   export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS 
> -Dspark.history.fs.logDirectory=$1"
> fi
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2471) SBT assembly does not include runtime dependencies

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2471:
---

Summary: SBT assembly does not include runtime dependencies  (was: 
Dependency issues after having sbt read from pom)

> SBT assembly does not include runtime dependencies
> --
>
> Key: SPARK-2471
> URL: https://issues.apache.org/jira/browse/SPARK-2471
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: diff.list
>
>
> After SPARK-1776 (PR #772), the content in the assembly jar changed slightly. 
> I built the assembly jar with sbt and found jets3t is missing after the 
> change, along with some others (commons/httpclient and commons/daemon). 
> jets3t is required to access S3 data. I don't know whether others are 
> important as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2467) Revert SparkBuild to publish-local to both .m2 and .ivy2.

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2467:
---

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-2487

> Revert SparkBuild to publish-local to both .m2 and .ivy2.
> -
>
> Key: SPARK-2467
> URL: https://issues.apache.org/jira/browse/SPARK-2467
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2481) The environment variables SPARK_HISTORY_OPTS is covered in start-history-server.sh

2014-07-14 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2481:
---

Summary: The environment variables SPARK_HISTORY_OPTS is covered in 
start-history-server.sh  (was: The environment variables SPARK_HISTORY_OPTS is 
covered in spark-env.sh )

> The environment variables SPARK_HISTORY_OPTS is covered in 
> start-history-server.sh
> --
>
> Key: SPARK-2481
> URL: https://issues.apache.org/jira/browse/SPARK-2481
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2487) Follow up from SBT build refactor (i.e. SPARK-1776)

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2487:
---

Summary: Follow up from SBT build refactor (i.e. SPARK-1776)  (was: Follow 
up from SBT build refactor)

> Follow up from SBT build refactor (i.e. SPARK-1776)
> ---
>
> Key: SPARK-2487
> URL: https://issues.apache.org/jira/browse/SPARK-2487
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Patrick Wendell
>
> This is to track follow-up issues relating to SPARK-1776, which was a major 
> refactoring of the SBT build in Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2476) Have sbt-assembly include runtime dependencies in jar

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2476:
---

Issue Type: Sub-task  (was: Task)
Parent: SPARK-2487

> Have sbt-assembly include runtime dependencies in jar
> -
>
> Key: SPARK-2476
> URL: https://issues.apache.org/jira/browse/SPARK-2476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Prashant Sharma
>
> If possible, we should try to contribute the ability to include 
> runtime-scoped dependencies in the assembly jar created with sbt-assembly.
> Currently it only reads compile-scoped dependencies:
> https://github.com/sbt/sbt-assembly/blob/master/src/main/scala/sbtassembly/Plugin.scala#L495
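
Until then, a commonly suggested workaround is to point the assembly task at the runtime classpath in the build definition. This is a sketch under sbt 0.13-era syntax; the import path and key scoping are assumptions about the plugin version in use, not a verified recipe.

{code:title=build.sbt}
// Sketch (assumed sbt-assembly 0.11.x import path and key scoping):
// have the assembly task package the Runtime classpath, not just Compile.
import sbtassembly.Plugin.AssemblyKeys._

fullClasspath in assembly := (fullClasspath in Runtime).value
{code}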



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2467) Revert SparkBuild to publish-local to both .m2 and .ivy2.

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2467:
---

Assignee: Takuya Ueshin

> Revert SparkBuild to publish-local to both .m2 and .ivy2.
> -
>
> Key: SPARK-2467
> URL: https://issues.apache.org/jira/browse/SPARK-2467
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2487) Follow up from SBT build refactor

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2487:
---

Description: This is to track follow-up issues relating to SPARK-1776, which 
was a major refactoring of the SBT build in Spark.

> Follow up from SBT build refactor
> -
>
> Key: SPARK-2487
> URL: https://issues.apache.org/jira/browse/SPARK-2487
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Patrick Wendell
>
> This is to track follow-up issues relating to SPARK-1776, which was a major 
> refactoring of the SBT build in Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2487) Follow up from SBT build refactor

2014-07-14 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-2487:
--

 Summary: Follow up from SBT build refactor
 Key: SPARK-2487
 URL: https://issues.apache.org/jira/browse/SPARK-2487
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2471) Dependency issues after having sbt read from pom

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2471:
---

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-2487

> Dependency issues after having sbt read from pom
> 
>
> Key: SPARK-2471
> URL: https://issues.apache.org/jira/browse/SPARK-2471
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: diff.list
>
>
> After SPARK-1776 (PR #772), the content in the assembly jar changed slightly. 
> I built the assembly jar with sbt and found jets3t is missing after the 
> change, along with some others (commons/httpclient and commons/daemon). 
> jets3t is required to access S3 data. I don't know whether others are 
> important as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2480) Resolve sbt warnings "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"

2014-07-14 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2480:
---

Summary: Resolve sbt warnings "NOTE: SPARK_YARN is deprecated, please use 
-Pyarn flag"  (was: Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn 
flag")

> Resolve sbt warnings "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"
> -
>
> Key: SPARK-2480
> URL: https://issues.apache.org/jira/browse/SPARK-2480
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (SPARK-2480) Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"

2014-07-14 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li reopened SPARK-2480:



> Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"
> ---
>
> Key: SPARK-2480
> URL: https://issues.apache.org/jira/browse/SPARK-2480
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2480) Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"

2014-07-14 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061707#comment-14061707
 ] 

Guoqiang Li edited comment on SPARK-2480 at 7/15/14 5:50 AM:
-

I'm sorry, my description was not correct. I mean we should use -Pyarn instead 
of SPARK_YARN in run-tests.
PR: https://github.com/apache/spark/pull/1404


was (Author: gq):
I'm sorry, my description is not correct . I mean we should use -Pyarn instead 
of SPARK_YARN
PR: https://github.com/apache/spark/pull/1404

> Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"
> ---
>
> Key: SPARK-2480
> URL: https://issues.apache.org/jira/browse/SPARK-2480
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2480) Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"

2014-07-14 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061707#comment-14061707
 ] 

Guoqiang Li commented on SPARK-2480:


I'm sorry, my description was not correct. I mean we should use -Pyarn instead 
of SPARK_YARN.
PR: https://github.com/apache/spark/pull/1404

> Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"
> ---
>
> Key: SPARK-2480
> URL: https://issues.apache.org/jira/browse/SPARK-2480
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2470) Fix PEP 8 violations

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2470:
---

Component/s: PySpark

> Fix PEP 8 violations
> 
>
> Key: SPARK-2470
> URL: https://issues.apache.org/jira/browse/SPARK-2470
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Reynold Xin
>Assignee: Prashant Sharma
>
> Let's fix all our pep8 violations so we can turn the pep8 checker on in 
> continuous integration. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2467) Revert SparkBuild to publish-local to both .m2 and .ivy2.

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2467:
---

Component/s: Build

> Revert SparkBuild to publish-local to both .m2 and .ivy2.
> -
>
> Key: SPARK-2467
> URL: https://issues.apache.org/jira/browse/SPARK-2467
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Takuya Ueshin
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2480) Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2480.


Resolution: Not a Problem

This is left there intentionally to guide users... why do you want to remove it?

> Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"
> ---
>
> Key: SPARK-2480
> URL: https://issues.apache.org/jira/browse/SPARK-2480
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2477) Using appendBias for adding intercept in GeneralizedLinearAlgorithm

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2477:
---

Component/s: MLlib

> Using appendBias for adding intercept in GeneralizedLinearAlgorithm
> ---
>
> Key: SPARK-2477
> URL: https://issues.apache.org/jira/browse/SPARK-2477
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: DB Tsai
>
> Instead of using prependOne, as GeneralizedLinearAlgorithm currently does, we would 
> like to use appendBias in order to 1) keep the indices of the original training set 
> unchanged by adding the intercept as the last element of the vector, and 2) 
> use the same public API for consistently adding the intercept.
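
To make the difference concrete, here is a small standalone sketch of the two layouts (these helpers are illustrative, not MLlib's actual implementations).

{code:title=AppendBiasSketch.scala}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// prependOne shifts every original feature index by one;
// appendBias leaves the original indices unchanged.
def prependOne(v: Vector): Vector = Vectors.dense(1.0 +: v.toArray)
def appendBias(v: Vector): Vector = Vectors.dense(v.toArray :+ 1.0)

val features = Vectors.dense(2.0, 3.0)
prependOne(features)  // [1.0, 2.0, 3.0] -- indices of 2.0 and 3.0 shifted
appendBias(features)  // [2.0, 3.0, 1.0] -- indices of 2.0 and 3.0 unchanged
{code}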



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2014-07-14 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061698#comment-14061698
 ] 

Patrick Wendell commented on SPARK-2243:


I think the Spark JobServer shares a single SparkContext. As I said, I think 
certain uses of static state make it impossible to have multiple 
SparkContexts in one JVM.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 1.0.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)

[jira] [Commented] (SPARK-2459) the user should be able to configure the resources used by JDBC server

2014-07-14 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061688#comment-14061688
 ] 

Michael Armbrust commented on SPARK-2459:
-

Is the correct thing to do here to just use spark-submit?

> the user should be able to configure the resources used by JDBC server
> --
>
> Key: SPARK-2459
> URL: https://issues.apache.org/jira/browse/SPARK-2459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.1
>Reporter: Nan Zhu
>
> I'm trying the JDBC server.
> I found that the JDBC server always occupies all cores in the cluster.
> The reason is that when creating the HiveContext, it doesn't set anything related 
> to spark.cores.max or spark.executor.memory:
> SparkSQLEnv.scala(https://github.com/apache/spark/blob/8032fe2fae3ac40a02c6018c52e76584a14b3438/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala)
>   L41-L43
> [~liancheng] 
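
For reference, this is the kind of configuration being asked for (a sketch; the property values are placeholders, and how SparkSQLEnv would expose them is exactly the open question here).

{code:title=ResourceConfigSketch.scala}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: cap the resources the JDBC server's context grabs instead of
// letting it occupy every core in the cluster.
val conf = new SparkConf()
  .setAppName("SparkSQL JDBC server")
  .set("spark.cores.max", "8")          // placeholder value
  .set("spark.executor.memory", "4g")   // placeholder value

val sc = new SparkContext(conf)
{code}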



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2459) the user should be able to configure the resources used by JDBC server

2014-07-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2459:


 Target Version/s: 1.1.0
Affects Version/s: (was: 1.1.0)
   1.0.1

> the user should be able to configure the resources used by JDBC server
> --
>
> Key: SPARK-2459
> URL: https://issues.apache.org/jira/browse/SPARK-2459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.1
>Reporter: Nan Zhu
>
> I'm trying the JDBC server.
> I found that the JDBC server always occupies all cores in the cluster.
> The reason is that when creating the HiveContext, it doesn't set anything related 
> to spark.cores.max or spark.executor.memory:
> SparkSQLEnv.scala(https://github.com/apache/spark/blob/8032fe2fae3ac40a02c6018c52e76584a14b3438/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala)
>   L41-L43
> [~liancheng] 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2486) Utils.getCallSite can crash under JVMTI profilers

2014-07-14 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061625#comment-14061625
 ] 

William Benton commented on SPARK-2486:
---

A (trivial but functional) workaround is here:  
https://github.com/apache/spark/pull/1413

> Utils.getCallSite can crash under JVMTI profilers
> -
>
> Key: SPARK-2486
> URL: https://issues.apache.org/jira/browse/SPARK-2486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.1
> Environment: running under profilers (observed on OS X under YourKit 
> with CPU profiling and/or object allocation site tracking enabled)
>Reporter: William Benton
>Priority: Minor
>
> When running under an instrumenting profiler, Utils.getCallSite sometimes 
> crashes with an NPE while examining stack trace elements.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2486) Utils.getCallSite can crash under JVMTI profilers

2014-07-14 Thread William Benton (JIRA)
William Benton created SPARK-2486:
-

 Summary: Utils.getCallSite can crash under JVMTI profilers
 Key: SPARK-2486
 URL: https://issues.apache.org/jira/browse/SPARK-2486
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
 Environment: running under profilers (observed on OS X under YourKit 
with CPU profiling and/or object allocation site tracking enabled)
Reporter: William Benton
Priority: Minor


When running under an instrumenting profiler, Utils.getCallSite sometimes 
crashes with an NPE while examining stack trace elements.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2014-07-14 Thread wangfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061620#comment-14061620
 ] 

wangfei commented on SPARK-2243:


The Spark Job Server may create multiple SparkContexts in one JVM process, so we 
should consider supporting multiple SparkContexts now. What do you think?

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 1.0.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSeri

[jira] [Created] (SPARK-2485) Usage of HiveClient not threadsafe.

2014-07-14 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2485:
---

 Summary: Usage of HiveClient not threadsafe.
 Key: SPARK-2485
 URL: https://issues.apache.org/jira/browse/SPARK-2485
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust


When making concurrent queries against the Hive metastore, we sometimes get an 
exception that includes the following stack trace:

{code}
Caused by: java.lang.Throwable: get_table failed: out of sequence response
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:76)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:936)
{code}

Likely, we need to synchronize our use of HiveClient.
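
A minimal sketch of that synchronization (the lock object and helper are illustrative, not the eventual fix):

{code:title=HiveClientSyncSketch.scala}
// Sketch: funnel every call to the shared, non-threadsafe Hive client
// through a single lock so responses cannot interleave on the Thrift socket.
object HiveClientLock

def withHiveClient[A](body: => A): A = HiveClientLock.synchronized(body)

// usage (illustrative): withHiveClient { client.getTable(databaseName, tableName) }
{code}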



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2484) By default does not run hive compatibility tests

2014-07-14 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-2484:
--

 Summary: By default does not run hive compatibility tests
 Key: SPARK-2484
 URL: https://issues.apache.org/jira/browse/SPARK-2484
 Project: Spark
  Issue Type: Improvement
Reporter: Guoqiang Li


The Hive compatibility tests take a long time; in some cases, we don't need to 
run them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2483) HiveQL parses accessing struct fields in an array incorrectly.

2014-07-14 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2483:
---

 Summary: HiveQL parses accessing struct fields in an array 
incorrectly.
 Key: SPARK-2483
 URL: https://issues.apache.org/jira/browse/SPARK-2483
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust


Test case:
{code}
case class Data(a: Int, B: Int, n: Nested, nestedArray: Seq[Nested])
case class Nested(a: Int, B: Int)

test("nested repeated resolution") {
  TestHive.sparkContext
    .parallelize(Data(1, 2, Nested(1, 2), Seq(Nested(1, 2))) :: Nil)
    .registerAsTable("nestedRepeatedTest")
  hql("SELECT nestedArray[0].a FROM nestedRepeatedTest").collect()
}
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2458) Make failed application log visible on History Server

2014-07-14 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061536#comment-14061536
 ] 

Masayoshi TSUZUKI commented on SPARK-2458:
--

The history server uses FsHistoryProvider as its ApplicationHistoryProvider by default.
{code:title=HistoryServer.scala|borderStyle=solid}
val providerName = conf.getOption("spark.history.provider")
  .getOrElse(classOf[FsHistoryProvider].getName())
val provider = Class.forName(providerName)
  .getConstructor(classOf[SparkConf])
  .newInstance(conf)
  .asInstanceOf[ApplicationHistoryProvider]
{code}

While FsHistoryProvider continuously checks for new log directories, it filters 
out directories that don't contain an APPLICATION_COMPLETE file.
{code:title=FsHistoryProvider.scala|borderStyle=solid}
  val logInfos = logDirs.filter {
dir => fs.isFile(new Path(dir.getPath(), 
EventLoggingListener.APPLICATION_COMPLETE))
  }
{code}
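One possible direction, sketched only to make the proposal concrete (this is not the eventual patch; it reuses the names from the snippet above):

{code:title=FsHistoryProvider.scala (sketch)|borderStyle=solid}
// Illustrative only: keep every log directory and remember whether the
// APPLICATION_COMPLETE marker exists, so failed/incomplete applications can
// still be listed (and labeled) instead of being filtered out.
val logInfos = logDirs.map { dir =>
  val completed = fs.isFile(
    new Path(dir.getPath(), EventLoggingListener.APPLICATION_COMPLETE))
  (dir, completed)
}
{code}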


> Make failed application log visible on History Server
> -
>
> Key: SPARK-2458
> URL: https://issues.apache.org/jira/browse/SPARK-2458
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Masayoshi TSUZUKI
>
> The history server is very helpful for debugging application correctness and 
> performance after an application has finished. However, when the application 
> fails, its link is not listed on the history server UI and the history can't be 
> viewed.
> It would be very useful if we could check the history of failed applications.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2482) Resolve sbt warnings during build

2014-07-14 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-2482:
--

 Summary: Resolve sbt warnings during build
 Key: SPARK-2482
 URL: https://issues.apache.org/jira/browse/SPARK-2482
 Project: Spark
  Issue Type: Bug
Reporter: Guoqiang Li


At the same time, importing both scala.language.postfixOps and 
org.scalatest.time.SpanSugar._ causes scala.language.postfixOps to stop working.
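For illustration, the import combination being described (a minimal sketch; the affected suites are not listed here):

{code}
import scala.language.postfixOps       // intended to allow postfix operator syntax
import org.scalatest.time.SpanSugar._  // provides duration helpers such as `millis`

// A postfix-style expression of the kind these imports are used for:
val timeout = 100 millis
{code}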



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2481) The environment variable SPARK_HISTORY_OPTS is covered in spark-env.sh

2014-07-14 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-2481:
--

 Summary: The environment variable SPARK_HISTORY_OPTS is covered 
in spark-env.sh 
 Key: SPARK-2481
 URL: https://issues.apache.org/jira/browse/SPARK-2481
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Guoqiang Li






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2480) Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"

2014-07-14 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-2480:
--

 Summary: Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn 
flag"
 Key: SPARK-2480
 URL: https://issues.apache.org/jira/browse/SPARK-2480
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Guoqiang Li
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2419) Misc updates to streaming programming guide

2014-07-14 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2419:
-

Description: 
This JIRA collects together a number of small issues that should be added to 
the streaming programming guide

- Receivers consume an executor slot and highlight the fact that # cores > # 
receivers is necessary
- Classes of spark-streaming-XYZ cannot be accessed from Spark Shell
- Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and 
its dependencies to be packaged with application JAR
- Ordering and parallelism of the output operations
- Receivers should be serializable
- Add more information on how socketStream: input stream => iterator function.

  was:
This JIRA collects together a number of small issues that should be added to 
the streaming programming guide

- Receivers consume an executor slot and highlight the fact that # cores > # 
receivers is necessary
- spark-streaming-XYZ cannot be accessed from Spark Shell
- Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and 
its dependencies to be packaged with application JAR
- Ordering and parallelism of the output operations
- Receivers should be serializable
- Add more information on how socketStream: input stream => iterator function.


> Misc updates to streaming programming guide
> ---
>
> Key: SPARK-2419
> URL: https://issues.apache.org/jira/browse/SPARK-2419
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> This JIRA collects together a number of small issues that should be added to 
> the streaming programming guide
> - Receivers consume an executor slot and highlight the fact that # cores > # 
> receivers is necessary
> - Classes of spark-streaming-XYZ cannot be accessed from Spark Shell
> - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar 
> and its dependencies to be packaged with application JAR
> - Ordering and parallelism of the output operations
> - Receivers should be serializable
> - Add more information on how socketStream: input stream => iterator function.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2419) Misc updates to streaming programming guide

2014-07-14 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2419:
-

Description: 
This JIRA collects together a number of small issues that should be added to 
the streaming programming guide

- Receivers consume an executor slot and highlight the fact that # cores > # 
receivers is necessary
- spark-streaming-XYZ cannot be accessed from Spark Shell
- Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and 
its dependencies to be packaged with application JAR
- Ordering and parallelism of the output operations
- Receivers should be serializable
- Add more information on how socketStream: input stream => iterator function.

  was:
This JIRA collects together a number of small issues that should be added to 
the streaming programming guide

- Receivers consume an executor slot and highlight the fact that # cores > # 
receivers is necessary
- Deploying requires spark-streaming-XYZ and its dependencies to be packaged 
with application JAR. 
- Ordering and parallelism of the output operations
- Receivers should be serializable
- Add more information on how socketStream: input stream => iterator function.


> Misc updates to streaming programming guide
> ---
>
> Key: SPARK-2419
> URL: https://issues.apache.org/jira/browse/SPARK-2419
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> This JIRA collects together a number of small issues that should be added to 
> the streaming programming guide
> - Receivers consume an executor slot and highlight the fact that # cores > # 
> receivers is necessary
> - spark-streaming-XYZ cannot be accessed from Spark Shell
> - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar 
> and its dependencies to be packaged with application JAR
> - Ordering and parallelism of the output operations
> - Receivers should be serializable
> - Add more information on how socketStream: input stream => iterator function.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1866) Closure cleaner does not null shadowed fields when outer scope is referenced

2014-07-14 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061501#comment-14061501
 ] 

Kan Zhang commented on SPARK-1866:
--

My previous comment may be less readable, let me try again:

The root cause is when the class for line {{sc.parallelize()...}} is generated, 
variable {{instances}} defined in the preceding line gets imported by the 
parser (since it thinks {{instances}} is referenced by this line) and becomes 
part of the outer object for the closure. This outer object is referenced by 
the closure through variable {{x}}. However, currently we choose not to null 
(or clone) outer objects when we clean closures since we can't be sure it is 
safe to do so (see commit 
[f346e64|https://github.com/apache/spark/commit/f346e64637fa4f9bd95fcc966caa496bea5feca0]).
 As a result, {{instances}} is not nulled by ClosureCleaner even though it is 
not actually used within the closure. This type of exception will pop up 
whenever a closure references outer objects that are not serializable.
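For reference, the workaround the issue itself points at (rename the outer value so nothing is shadowed) looks like this; it is the same example with only the outer variable renamed:

{code}
// Same shell example as in the issue, but the outer non-serializable value no
// longer shares a name with the one inside the closure, so it is not pulled in.
val x = 5
val hadoopPath = new org.apache.hadoop.fs.Path("/")  // non-serializable, never captured
sc.parallelize(0 until 10).map { _ =>
  val instances = 3
  (instances, x)
}.collect()
{code}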


> Closure cleaner does not null shadowed fields when outer scope is referenced
> 
>
> Key: SPARK-1866
> URL: https://issues.apache.org/jira/browse/SPARK-1866
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Assignee: Kan Zhang
>Priority: Critical
> Fix For: 1.0.1, 1.1.0
>
>
> Take the following example:
> {code}
> val x = 5
> val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
> sc.parallelize(0 until 10).map { _ =>
>   val instances = 3
>   (instances, x)
> }.collect
> {code}
> This produces a "java.io.NotSerializableException: 
> org.apache.hadoop.fs.Path", despite the fact that the outer instances is not 
> actually used within the closure. If you change the name of the outer 
> variable instances to something else, the code executes correctly, indicating 
> that it is the fact that the two variables share a name that causes the issue.
> Additionally, if the outer scope is not used (i.e., we do not reference "x" 
> in the above example), the issue does not appear.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-1866) Closure cleaner does not null shadowed fields when outer scope is referenced

2014-07-14 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1866:
-

Comment: was deleted

(was: Unfortunately this type of error will pop up whenever a closure 
references user objects (any objects other than nested closure objects) that 
are not serializable. Our current approach is we don't clone (or null) user 
objects since we can't be sure it is safe to do so (see commit 
f346e64637fa4f9bd95fcc966caa496bea5feca0). 

Spark shell synthesizes a class for each line. In this case, the class for the 
closure line imports {{instances}} as a field (since the parser thinks it is 
referenced by this line) and the corresponding line object is referenced by the 
closure via {{x}}. 

My take on this is advising users to avoid name collisions as a workaround.)

> Closure cleaner does not null shadowed fields when outer scope is referenced
> 
>
> Key: SPARK-1866
> URL: https://issues.apache.org/jira/browse/SPARK-1866
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Assignee: Kan Zhang
>Priority: Critical
> Fix For: 1.0.1, 1.1.0
>
>
> Take the following example:
> {code}
> val x = 5
> val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
> sc.parallelize(0 until 10).map { _ =>
>   val instances = 3
>   (instances, x)
> }.collect
> {code}
> This produces a "java.io.NotSerializableException: 
> org.apache.hadoop.fs.Path", despite the fact that the outer instances is not 
> actually used within the closure. If you change the name of the outer 
> variable instances to something else, the code executes correctly, indicating 
> that it is the fact that the two variables share a name that causes the issue.
> Additionally, if the outer scope is not used (i.e., we do not reference "x" 
> in the above example), the issue does not appear.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2446) Add BinaryType support to Parquet I/O.

2014-07-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2446.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Takuya Ueshin

> Add BinaryType support to Parquet I/O.
> --
>
> Key: SPARK-2446
> URL: https://issues.apache.org/jira/browse/SPARK-2446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.1.0
>
>
> To support {{BinaryType}}, the following changes are needed:
> - Make {{StringType}} use {{OriginalType.UTF8}}
> - Add {{BinaryType}} using {{PrimitiveTypeName.BINARY}} without 
> {{OriginalType}}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2446) Add BinaryType support to Parquet I/O.

2014-07-14 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061382#comment-14061382
 ] 

Michael Armbrust commented on SPARK-2446:
-

Note that this commit changes the semantics when loading in data that was 
created with prior versions of Spark SQL.  Before, we were writing out strings 
as Binary data without adding any other annotations. Thus, when data is read in 
from prior versions, data that was StringType will now become BinaryType.  
Users that need strings can CAST that column to a String.  It was decided that 
while this breaks compatibility, it does make us compatible with other systems 
(Hive, Thrift, etc) and adds support for Binary data, so this is the right 
decision long term.
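For example, a query along these lines (hypothetical table and column names) recovers string semantics after upgrading:

{code}
// `oldParquetTable` and `value` are hypothetical names for data written by an
// earlier Spark SQL version, whose former StringType column now loads as BinaryType.
hql("SELECT CAST(value AS STRING) FROM oldParquetTable").collect()
{code}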

> Add BinaryType support to Parquet I/O.
> --
>
> Key: SPARK-2446
> URL: https://issues.apache.org/jira/browse/SPARK-2446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
> Fix For: 1.1.0
>
>
> To support {{BinaryType}}, the following changes are needed:
> - Make {{StringType}} use {{OriginalType.UTF8}}
> - Add {{BinaryType}} using {{PrimitiveTypeName.BINARY}} without 
> {{OriginalType}}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2479) Comparing floating-point numbers using relative error in UnitTests

2014-07-14 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-2479:
---

Description: 
Due to rounding errors, most floating-point numbers end up being slightly 
imprecise. As long as this imprecision stays small, it can usually be ignored. 
Otherwise, different machines may have different rounding errors, which will 
cause some false-negative tests.

See the following article for details.
http://floating-point-gui.de/errors/comparison/
For example:
float a = 0.15 + 0.15
float b = 0.1 + 0.2
if(a == b) // can be false!
if(a >= b) // can also be false!


  was:Due to rounding errors, most floating-point numbers end up being slightly 
imprecise. As long as this imprecision stays small, it can usually be ignored. 
Otherwise, different machines may have different rounding errors, which will 
cause some false-negative tests.


> Comparing floating-point numbers using relative error in UnitTests
> --
>
> Key: SPARK-2479
> URL: https://issues.apache.org/jira/browse/SPARK-2479
> Project: Spark
>  Issue Type: Improvement
>Reporter: DB Tsai
>
> Due to rounding errors, most floating-point numbers end up being slightly 
> imprecise. As long as this imprecision stays small, it can usually be 
> ignored. Otherwise, different machines may have different rounding errors, 
> which will cause some false-negative tests.
> See the following article for details.
> http://floating-point-gui.de/errors/comparison/
> For example:
>   float a = 0.15 + 0.15
>   float b = 0.1 + 0.2
>   if(a == b) // can be false!
>   if(a >= b) // can also be false!
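A sketch of the kind of relative-error comparison being proposed for the test helpers (illustrative only, not the final API):

{code}
// Two doubles are treated as equal when their difference is small relative to
// their magnitude, with an absolute fallback near zero.
def almostEqual(a: Double, b: Double, relTol: Double = 1e-9): Boolean = {
  val diff = math.abs(a - b)
  val norm = math.max(math.abs(a), math.abs(b))
  if (norm < relTol) diff < relTol else diff / norm < relTol
}

almostEqual(0.15 + 0.15, 0.1 + 0.2)  // true, even though == can be false
{code}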



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2478) Add Python APIs for decision tree

2014-07-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-2478:
-

Description: 
In v1.0, we only support decision tree in Scala/Java. It would be nice to add 
Python support. It may require some refactoring of the current decision tree 
API to make it easier to construct a decision tree algorithm in Python.

1. Simplify decision tree constructors such that only simple types are used.
  a. Hide the implementation of Impurity from users.
  b. Replace enums by strings.
2. Make separate public decision tree classes for regression & classification 
(with shared internals).  Eliminate algo parameter.
3. Implement wrappers in Python for DecisionTree.
4. Implement wrappers in Python for DecisionTreeModel.

  was:
In v1.0, we only support decision tree in Scala/Java. It would be nice to add 
Python support. It may require some refactoring of the current decision tree 
API to make it easier to construct a decision tree algorithm in Python.

1. Simplify decision tree constructors such that only simple types are used.
  a. Hide the implementation of Impurity from users.
  b. Replace enums by strings.
2. Implement wrappers in Python for DecisionTree.
3. Implement wrappers in Python for DecisionTreeModel.


> Add Python APIs for decision tree
> -
>
> Key: SPARK-2478
> URL: https://issues.apache.org/jira/browse/SPARK-2478
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Joseph K. Bradley
>
> In v1.0, we only support decision tree in Scala/Java. It would be nice to add 
> Python support. It may require some refactoring of the current decision tree 
> API to make it easier to construct a decision tree algorithm in Python.
> 1. Simplify decision tree constructors such that only simple types are used.
>   a. Hide the implementation of Impurity from users.
>   b. Replace enums by strings.
> 2. Make separate public decision tree classes for regression & classification 
> (with shared internals).  Eliminate algo parameter.
> 3. Implement wrappers in Python for DecisionTree.
> 4. Implement wrappers in Python for DecisionTreeModel.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2479) Comparing floating-point numbers using relative error in UnitTests

2014-07-14 Thread DB Tsai (JIRA)
DB Tsai created SPARK-2479:
--

 Summary: Comparing floating-point numbers using relative error in 
UnitTests
 Key: SPARK-2479
 URL: https://issues.apache.org/jira/browse/SPARK-2479
 Project: Spark
  Issue Type: Improvement
Reporter: DB Tsai


Due to rounding errors, most floating-point numbers end up being slightly 
imprecise. As long as this imprecision stays small, it can usually be ignored. 
Otherwise, different machines may have different rounding errors, which will 
cause some false-negative tests.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Subscribe

2014-07-14 Thread Mubarak Seyed



[jira] [Created] (SPARK-2478) Add Python APIs for decision tree

2014-07-14 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-2478:


 Summary: Add Python APIs for decision tree
 Key: SPARK-2478
 URL: https://issues.apache.org/jira/browse/SPARK-2478
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Joseph K. Bradley


In v1.0, we only support decision tree in Scala/Java. It would be nice to add 
Python support. It may require some refactoring of the current decision tree 
API to make it easier to construct a decision tree algorithm in Python.

1. Simplify decision tree constructors such that only simple types are used.
  a. Hide the implementation of Impurity from users.
  b. Replace enums by strings.
2. Implement wrappers in Python for DecisionTree.
3. Implement wrappers in Python for DecisionTreeModel.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-07-14 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-1576:
-

Affects Version/s: (was: 1.0.0)

> Passing of JAVA_OPTS to YARN on command line
> 
>
> Key: SPARK-1576
> URL: https://issues.apache.org/jira/browse/SPARK-1576
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.9.0, 0.9.1
>Reporter: Nishkam Ravi
> Fix For: 0.9.0, 0.9.2
>
> Attachments: SPARK-1576.patch
>
>
> JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) 
> or as config vars (after Patrick's recent change). It would be good to allow 
> the user to pass them on the command line as well, to restrict the scope to a 
> single application invocation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-07-14 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-1576:
-

Fix Version/s: (was: 1.0.0)

> Passing of JAVA_OPTS to YARN on command line
> 
>
> Key: SPARK-1576
> URL: https://issues.apache.org/jira/browse/SPARK-1576
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.9.0, 0.9.1
>Reporter: Nishkam Ravi
> Fix For: 0.9.0, 0.9.2
>
> Attachments: SPARK-1576.patch
>
>
> JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) 
> or as config vars (after Patrick's recent change). It would be good to allow 
> the user to pass them on the command line as well, to restrict the scope to a 
> single application invocation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2477) Using appendBias for adding intercept in GeneralizedLinearAlgorithm

2014-07-14 Thread DB Tsai (JIRA)
DB Tsai created SPARK-2477:
--

 Summary: Using appendBias for adding intercept in 
GeneralizedLinearAlgorithm
 Key: SPARK-2477
 URL: https://issues.apache.org/jira/browse/SPARK-2477
 Project: Spark
  Issue Type: Improvement
Reporter: DB Tsai


Instead of using prependOne, as GeneralizedLinearAlgorithm currently does, we would 
like to use appendBias, for 1) keeping the indices of the original training set 
unchanged by adding the intercept as the last element of the vector, and 2) using 
the same public API for consistently adding the intercept. 
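A small sketch of the intended effect (assuming the appendBias helper referred to is the one in MLUtils; the exact location may differ):

{code}
// The bias/intercept term is appended as the LAST element, so the indices of
// the original features are unchanged.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

val features = Vectors.dense(0.5, 1.5, 2.0)
val withBias = MLUtils.appendBias(features)  // expected: [0.5, 1.5, 2.0, 1.0]
{code}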



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2406) Partitioned Parquet Support

2014-07-14 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061232#comment-14061232
 ] 

Michael Armbrust commented on SPARK-2406:
-

I think there are two ways we can achieve this, each with their own pros/cons.  
One would just piggyback on the existing hive partitioning API as you propose, 
but use our (possibly more efficient) parquet reader.  The other would give you 
the ability to partition parquet tables, without needing to pull in all of hive.

> Partitioned Parquet Support
> ---
>
> Key: SPARK-2406
> URL: https://issues.apache.org/jira/browse/SPARK-2406
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2475) Check whether #cores > #receivers in local mode

2014-07-14 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061194#comment-14061194
 ] 

Patrick Wendell commented on SPARK-2475:


Good call! I've hit this one a few times :P

> Check whether #cores > #receivers in local mode
> ---
>
> Key: SPARK-2475
> URL: https://issues.apache.org/jira/browse/SPARK-2475
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>
> When the number of slots in local mode is not more than the number of 
> receivers, then the system should throw an error. Otherwise the system just 
> keeps waiting for resources to process the received data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1946) Submit stage after executors have been registered

2014-07-14 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-1946.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

> Submit stage after executors have been registered
> -
>
> Key: SPARK-1946
> URL: https://issues.apache.org/jira/browse/SPARK-1946
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Zhihui
> Fix For: 1.1.0
>
> Attachments: Spark Task Scheduler Optimization Proposal.pptx
>
>
> Because creating the TaskSetManager and registering executors are asynchronous, 
> running a job without enough executors will lead to some issues:
> * early stages' tasks run without preferred locality.
> * the default parallelism in yarn is based on number of executors, 
> * the number of intermediate files per node for shuffle (this can bring the 
> node down btw)
> * and amount of memory consumed on a node for rdd MEMORY persisted data 
> (making the job fail if disk is not specified : like some of the mllib algos 
> ?)
> * and so on ...
> (thanks [~mridulm80] 's [comments | 
> https://github.com/apache/spark/pull/900#issuecomment-45780405])
> A simple solution is sleeping a few seconds in the application, so that executors 
> have enough time to register.
> A better way is to make DAGScheduler submit the stage only after a number of 
> executors have been registered, controlled by configuration properties.
> \# submit the stage only after the ratio of successfully registered executors has 
> reached this value; default value 0 in Standalone mode and 0.9 in Yarn mode
> spark.scheduler.minRegisteredRatio = 0.8
> \# whatever number of executors has registered, submit the stage after 
> maxRegisteredWaitingTime (milliseconds), default value 1
> spark.scheduler.maxRegisteredWaitingTime = 5000



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1453) Improve the way Spark on Yarn waits for executors before starting

2014-07-14 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-1453.
--

Resolution: Duplicate

duplicate of https://issues.apache.org/jira/browse/SPARK-1946

> Improve the way Spark on Yarn waits for executors before starting
> -
>
> Key: SPARK-1453
> URL: https://issues.apache.org/jira/browse/SPARK-1453
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> Currently Spark on Yarn just delays a few seconds between when the spark 
> context is initialized and when it allows the job to start.  If you are on a 
> busy hadoop cluster it might take longer to get the number of executors. 
> At the very least we could make this timeout a configurable value.  It's 
> currently hardcoded to 3 seconds.  
> Better yet would be to allow the user to give a minimum number of executors to 
> wait for, but that looks much more complex. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2476) Have sbt-assembly include runtime dependencies in jar

2014-07-14 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-2476:
--

 Summary: Have sbt-assembly include runtime dependencies in jar
 Key: SPARK-2476
 URL: https://issues.apache.org/jira/browse/SPARK-2476
 Project: Spark
  Issue Type: Task
  Components: Build
 Environment: If possible, we should try to contribute the ability to 
include runtime-scoped dependencies in the assembly jar created with 
sbt-assembly.

Currently it only reads compile-scoped dependencies:
https://github.com/sbt/sbt-assembly/blob/master/src/main/scala/sbtassembly/Plugin.scala#L495
Reporter: Patrick Wendell
Assignee: Prashant Sharma






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2476) Have sbt-assembly include runtime dependencies in jar

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2476:
---

Description: 
If possible, we should try to contribute the ability to include runtime-scoped 
dependencies in the assembly jar created with sbt-assembly.

Currently it only reads compile-scoped dependencies:
https://github.com/sbt/sbt-assembly/blob/master/src/main/scala/sbtassembly/Plugin.scala#L495

> Have sbt-assembly include runtime dependencies in jar
> -
>
> Key: SPARK-2476
> URL: https://issues.apache.org/jira/browse/SPARK-2476
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Prashant Sharma
>
> If possible, we should try to contribute the ability to include 
> runtime-scoped dependencies in the assembly jar created with sbt-assembly.
> Currently it only reads compile-scoped dependencies:
> https://github.com/sbt/sbt-assembly/blob/master/src/main/scala/sbtassembly/Plugin.scala#L495



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2476) Have sbt-assembly include runtime dependencies in jar

2014-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2476:
---

Environment: (was: If possible, we should try to contribute the ability 
to include runtime-scoped dependencies in the assembly jar created with 
sbt-assembly.

Currently it only reads compile-scoped dependencies:
https://github.com/sbt/sbt-assembly/blob/master/src/main/scala/sbtassembly/Plugin.scala#L495)

> Have sbt-assembly include runtime dependencies in jar
> -
>
> Key: SPARK-2476
> URL: https://issues.apache.org/jira/browse/SPARK-2476
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Prashant Sharma
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2443) Reading from Partitioned Tables is Slow

2014-07-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2443:


Fix Version/s: 1.0.2

> Reading from Partitioned Tables is Slow
> ---
>
> Key: SPARK-2443
> URL: https://issues.apache.org/jira/browse/SPARK-2443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Zongheng Yang
> Fix For: 1.1.0, 1.0.2
>
>
> Here are some numbers, all queries return ~20million:
> {code}
> SELECT COUNT(*) FROM 
> 5.496467726 s
> SELECT COUNT(*) FROM 
> 50.26947 s
> SELECT COUNT(*) FROM  instead of through hive>
> 2s
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2443) Reading from Partitioned Tables is Slow

2014-07-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2443.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

> Reading from Partitioned Tables is Slow
> ---
>
> Key: SPARK-2443
> URL: https://issues.apache.org/jira/browse/SPARK-2443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Zongheng Yang
> Fix For: 1.1.0
>
>
> Here are some numbers, all queries return ~20million:
> {code}
> SELECT COUNT(*) FROM 
> 5.496467726 s
> SELECT COUNT(*) FROM 
> 50.26947 s
> SELECT COUNT(*) FROM  instead of through hive>
> 2s
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2419) Misc updates to streaming programming guide

2014-07-14 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2419:
-

Description: 
This JIRA collects together a number of small issues that should be added to 
the streaming programming guide

- Receivers consume an executor slot and highlight the fact that # cores > # 
receivers is necessary
- Deploying requires spark-streaming-XYZ and its dependencies to be packaged 
with application JAR. 
- Ordering and parallelism of the output operations
- Receivers should be serializable
- Add more information on how socketStream: input stream => iterator function.

  was:
This JIRA collects together a number of small issues that should be added to 
the streaming programming guide

- Receivers consume an executor slot
- Ordering and parallelism of the output operations
- Receivers should be serializable
- Add more information on how socketStream: input stream => iterator function.


> Misc updates to streaming programming guide
> ---
>
> Key: SPARK-2419
> URL: https://issues.apache.org/jira/browse/SPARK-2419
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> This JIRA collects together a number of small issues that should be added to 
> the streaming programming guide
> - Receivers consume an executor slot and highlight the fact that # cores > # 
> receivers is necessary
> - Deploying requires spark-streaming-XYZ and its dependencies to be packaged 
> with application JAR. 
> - Ordering and parallelism of the output operations
> - Receivers should be serializable
> - Add more information on how socketStream: input stream => iterator function.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2448) Table name is not getting applied to their attributes after "registerAsTable"

2014-07-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2448.
-

Resolution: Duplicate

Thanks for reporting, I think [~yhuai] also filed the same bug.  So, I'm going 
to close this one.

> Table name is not getting applied to their attributes after "registerAsTable"
> -
>
> Key: SPARK-2448
> URL: https://issues.apache.org/jira/browse/SPARK-2448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Jerry Lam
>
> The following sample code will fail:
> {code}
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> import hiveContext._
> hql("USE test")
> hql("select id from m").registerAsTable("m")
> hql("select s.id from m join s on (s.id=m.id)").collect().foreach(println)
> {code}
> The exception:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 0.0:736 failed 4 times, most recent failure: Exception failure in TID 167 on 
> host node05: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> No function to evaluate expression. type: UnresolvedAttribute, tree: 'm.id
> 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.eval(unresolved.scala:59)
> 
> org.apache.spark.sql.catalyst.expressions.Equals.eval(predicates.scala:151)
> 
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)
> 
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)
> scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> scala.collection.AbstractIterator.to(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
> org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
> 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
> 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> org.apache.spark.scheduler.Task.run(Task.scala:51)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> java.lang.Thread.run(Thread.java:662)
> {noformat}
> The query execution plan:
> {noformat}
> == Logical Plan ==
> Project ['s.id]
>  Join Inner, Some((id#106 = 'm.id))
>   Project [id#96 AS id#62]
>MetastoreRelation test, m, None
>   MetastoreRelation test, s, Some(s)
> == Optimized Logical Plan ==
> Project ['s.id]
>  Join Inner, Some((id#106 = 'm.id))
>   Project []
>MetastoreRelation test, m, None
>   Project [id#106]
>MetastoreRelation test, s, Some(s)
> == Physical Plan ==
> Project ['s.id]
>  Filter (id#106:0 = 'm.id)
>   CartesianProduct
>HiveTableScan [], (MetastoreRelation test, m, None), None
>HiveTableScan [id#106], (MetastoreRelation test, s, Some(s)), None
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2446) Add BinaryType support to Parquet I/O.

2014-07-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2446:


Target Version/s: 1.1.0

> Add BinaryType support to Parquet I/O.
> --
>
> Key: SPARK-2446
> URL: https://issues.apache.org/jira/browse/SPARK-2446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>
> To support {{BinaryType}}, the following changes are needed:
> - Make {{StringType}} use {{OriginalType.UTF8}}
> - Add {{BinaryType}} using {{PrimitiveTypeName.BINARY}} without 
> {{OriginalType}}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2449) Spark sql reflection code requires a constructor taking all the fields for the table

2014-07-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2449:


Target Version/s: 1.1.0

> Spark sql reflection code requires a constructor taking all the fields for 
> the table
> 
>
> Key: SPARK-2449
> URL: https://issues.apache.org/jira/browse/SPARK-2449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Ian O Connell
>
> The reflection code does a lookup for the fields passed to the constructor to 
> make the types for the table. Specifically the code:
>   val params = t.member(nme.CONSTRUCTOR).asMethod.paramss
> in ScalaReflection.scala
> Simple repro case from the spark shell:
> trait PersonTrait extends Product
> case class Person(a: Int) extends PersonTrait
> val l: List[PersonTrait] = List(1, 2, 3, 4).map(Person(_))
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext._
> sc.parallelize(l).registerAsTable("people")
> scala> sc.parallelize(l).registerAsTable("people")
> scala.ScalaReflectionException:  is not a method
>   at scala.reflect.api.Symbols$SymbolApi$class.asMethod(Symbols.scala:279)
>   at 
> scala.reflect.internal.Symbols$SymbolContextApiImpl.asMethod(Symbols.scala:73)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:52)
>   at 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2475) Check whether #cores > #receivers in local mode

2014-07-14 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-2475:


 Summary: Check whether #cores > #receivers in local mode
 Key: SPARK-2475
 URL: https://issues.apache.org/jira/browse/SPARK-2475
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das


When the number of slots in local mode is not more than the number of 
receivers, then the system should throw an error. Otherwise the system just 
keeps waiting for resources to process the received data.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS

2014-07-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061107#comment-14061107
 ] 

Sean Owen commented on SPARK-2465:
--

Forgot to add that when I've implemented this, and used longs for IDs, we used 
a simple zig-zag variable length encoding for integers. This is because, often, 
IDs really were numbers and so tended to be small. Hence an 8-byte long might 
only take a few bytes on disk. Some serialization frameworks like protobuf do 
this kind of thing automatically; we wrote it by hand in Writables. I know Java 
doesn't do this, but don't know about Kryo. Anyway, if the serialized size is 
the issue (and it's not the only issue), there may be ways of getting around 
that. It doesn't help if the values really are hashes since the values go all 
over the range of integers.
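For concreteness, a sketch of the zig-zag plus variable-length encoding described above (not code from any Spark patch):

{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Zig-zag maps the sign bit into the low bit so small magnitudes (positive or
// negative) stay small; the varint loop then writes 7 bits per byte with a
// continuation bit. Hash-like IDs spread over the full range gain nothing.
def writeVarLong(value: Long, out: DataOutputStream): Unit = {
  var zigzag = (value << 1) ^ (value >> 63)
  while ((zigzag & ~0x7FL) != 0L) {
    out.writeByte(((zigzag & 0x7F) | 0x80).toInt)
    zigzag >>>= 7
  }
  out.writeByte(zigzag.toInt)
}

val buf = new ByteArrayOutputStream()
writeVarLong(42L, new DataOutputStream(buf))  // encodes to a single byte
{code}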

> Use long as user / item ID for ALS
> --
>
> Key: SPARK-2465
> URL: https://issues.apache.org/jira/browse/SPARK-2465
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.1
>Reporter: Sean Owen
>Priority: Minor
> Attachments: Screen Shot 2014-07-13 at 8.49.40 PM.png
>
>
> I'd like to float this for consideration: use longs instead of ints for user 
> and product IDs in the ALS implementation.
> The main reason is that identifiers are not generally numeric at all, and 
> will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits 
> means collisions are likely after hundreds of thousands of users and items, 
> which is not unrealistic. Hashing to 64 bits pushes this back to billions.
> It would also mean numeric IDs that happen to be larger than the largest int 
> can be used directly as identifiers.
> On the downside of course: 8 bytes instead of 4 bytes of memory used per 
> Rating.
> Thoughts? I will post a PR so as to show what the change would be.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1215) Clustering: Index out of bounds error

2014-07-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061095#comment-14061095
 ] 

Joseph K. Bradley commented on SPARK-1215:
--

Submitted fix as PR 1407: https://github.com/apache/spark/pull/1407

> Clustering: Index out of bounds error
> -
>
> Key: SPARK-1215
> URL: https://issues.apache.org/jira/browse/SPARK-1215
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: dewshick
>Assignee: Joseph K. Bradley
> Attachments: test.csv
>
>
> code:
> import org.apache.spark.mllib.clustering._
> val test = sc.makeRDD(Array(4,4,4,4,4).map(e => Array(e.toDouble)))
> val kmeans = new KMeans().setK(4)
> kmeans.run(test) evals with java.lang.ArrayIndexOutOfBoundsException
> error:
> 14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at 
> KMeans.scala:243) finished in 0.047 s
> 14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at 
> KMeans.scala:243, took 16.389537116 s
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.simontuffs.onejar.Boot.run(Boot.java:340)
>   at com.simontuffs.onejar.Boot.main(Boot.java:166)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>   at 
> org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.foreach(Range.scala:81)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.map(Range.scala:46)
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244)
>   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124)
>   at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21)
>   at Clustering$$anonfun$1.apply(Clustering.scala:19)
>   at Clustering$$anonfun$1.apply(Clustering.scala:19)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.foreach(Range.scala:78)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.map(Range.scala:46)
>   at Clustering$.main(Clustering.scala:19)
>   at Clustering.main(Clustering.scala)
>   ... 6 more



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1215) Clustering: Index out of bounds error

2014-07-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061095#comment-14061095
 ] 

Joseph K. Bradley edited comment on SPARK-1215 at 7/14/14 7:35 PM:
---

Submitted fix as PR 1407: https://github.com/apache/spark/pull/1407

Made the default behavior still return k clusters, with some duplicated.


was (Author: josephkb):
Submitted fix as PR 1407: https://github.com/apache/spark/pull/1407

> Clustering: Index out of bounds error
> -
>
> Key: SPARK-1215
> URL: https://issues.apache.org/jira/browse/SPARK-1215
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: dewshick
>Assignee: Joseph K. Bradley
> Attachments: test.csv
>
>
> code:
> import org.apache.spark.mllib.clustering._
> val test = sc.makeRDD(Array(4,4,4,4,4).map(e => Array(e.toDouble)))
> val kmeans = new KMeans().setK(4)
> kmeans.run(test) evals with java.lang.ArrayIndexOutOfBoundsException
> error:
> 14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at 
> KMeans.scala:243) finished in 0.047 s
> 14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at 
> KMeans.scala:243, took 16.389537116 s
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.simontuffs.onejar.Boot.run(Boot.java:340)
>   at com.simontuffs.onejar.Boot.main(Boot.java:166)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>   at 
> org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.foreach(Range.scala:81)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.map(Range.scala:46)
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244)
>   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124)
>   at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21)
>   at Clustering$$anonfun$1.apply(Clustering.scala:19)
>   at Clustering$$anonfun$1.apply(Clustering.scala:19)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.foreach(Range.scala:78)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.map(Range.scala:46)
>   at Clustering$.main(Clustering.scala:19)
>   at Clustering.main(Clustering.scala)
>   ... 6 more



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2468) zero-copy shuffle network communication

2014-07-14 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061094#comment-14061094
 ] 

Mridul Muralidharan commented on SPARK-2468:



Ah, small files - those are indeed a problem.

Btw, we do dispose of mmap'ed blocks as soon as we are done with them, so we don't need to 
wait for GC to free them. Also note that the files are closed as soon as they are opened 
and mmap'ed - so they do not count towards the open file count/ulimit.

Agree on 1, 3 and 4 - some of these apply to sendfile too, btw, so they are not 
avoidable; but it is the best we have right now.
Since we use mmap'ed buffers and rarely transfer the same file again, the 
performance jump might not be the order(s) of magnitude other projects claim - 
but then even a 10% (or whatever) improvement in our case would be substantial!

> zero-copy shuffle network communication
> ---
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user space buffer, and then back to a kernel send buffer before it reaches 
> the NIC. It does multiple copies of the data and context switching between 
> kernel/user. It also creates unnecessary buffer in the JVM that increases GC
> Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty NIO.
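For reference, a minimal illustration of the FileChannel.transferTo call mentioned in the issue (illustrative file path and endpoint; not Spark code):

{code}
import java.io.FileInputStream
import java.net.InetSocketAddress
import java.nio.channels.{FileChannel, SocketChannel}

// transferTo lets the kernel move file bytes to the socket without copying
// them through a user-space buffer.
val fileChannel: FileChannel = new FileInputStream("/tmp/shuffle-block").getChannel
val socket: SocketChannel = SocketChannel.open(new InetSocketAddress("localhost", 9999))
var position = 0L
val size = fileChannel.size()
while (position < size) {
  position += fileChannel.transferTo(position, size - position, socket)
}
fileChannel.close()
socket.close()
{code}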



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-07-14 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061007#comment-14061007
 ] 

Chris Fregly commented on SPARK-1981:
-

quick update:

i completed all code, examples, tests, build, and documentation changes this 
weekend.  everything looks good.

however, when i went to merge last night, i noticed this PR:  
https://github.com/apache/spark/pull/772 

this changes the underlying maven and sbt builds a bit - for the better, of 
course!

reverting my build changes and adapting to the new build structure are the last 
step which i plan to tackle today.

almost there!


> Add AWS Kinesis streaming support
> -
>
> Key: SPARK-1981
> URL: https://issues.apache.org/jira/browse/SPARK-1981
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Chris Fregly
>Assignee: Chris Fregly
>
> Add AWS Kinesis support to Spark Streaming.
> Initial discussion occured here:  https://github.com/apache/spark/pull/223
> I discussed this with Parviz from AWS recently and we agreed that I would 
> take this over.
> Look for a new PR that takes into account all the feedback from the earlier 
> PR including spark-1.0-compliant implementation, AWS-license-aware build 
> support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2443) Reading from Partitioned Tables is Slow

2014-07-14 Thread Zongheng Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061001#comment-14061001
 ] 

Zongheng Yang commented on SPARK-2443:
--

I opened a new PR: https://github.com/apache/spark/pull/1408

> Reading from Partitioned Tables is Slow
> ---
>
> Key: SPARK-2443
> URL: https://issues.apache.org/jira/browse/SPARK-2443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Zongheng Yang
>
> Here are some numbers, all queries return ~20million:
> {code}
> SELECT COUNT(*) FROM 
> 5.496467726 s
> SELECT COUNT(*) FROM 
> 50.26947 s
> SELECT COUNT(*) FROM  instead of through hive>
> 2s
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2468) zero-copy shuffle network communication

2014-07-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060937#comment-14060937
 ] 

Reynold Xin edited comment on SPARK-2468 at 7/14/14 5:57 PM:
-

We do use mmap for large blocks. However, most of the shuffle blocks are small 
so a lot of blocks are not mapped. In addition, there are multiple problems 
with memory mapped files:

1. Memory mapped blocks are off-heap and are not managed by the JVM, which 
creates another memory space to tune/manage

2. Memory mapped blocks cannot be reused and are only released at GC. It is 
easy to have too many files opened.

3. On Linux machines with Huge Pages configured (which is increasingly more 
common with large memory), the default behavior is each file will consume 2MB, 
leading to OOM very soon.

4. For large blocks that span multiple pages, it creates page faults which 
leads to unnecessary context switches

The last one is probably much less important.



was (Author: rxin):
We do use mmap for large blocks. However, most shuffle blocks are small, so a 
lot of blocks are not mapped. In addition, there are multiple problems with 
memory-mapped files:

1. Memory-mapped blocks are off-heap and are not managed by the JVM, which 
creates another memory space to tune/manage.

2. Memory-mapped blocks cannot be reused and are only released at GC.

3. On Linux machines with Huge Pages configured (which is increasingly common 
with large memory), the default behavior is that each mapped file consumes 
2 MB, leading to OOM very soon.

4. Large blocks that span multiple pages cause page faults, which lead to 
unnecessary context switches.

The last one is probably much less important.


> zero-copy shuffle network communication
> ---
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user space buffer, and then back to a kernel send buffer before it reaches 
> the NIC. It does multiple copies of the data and context switching between 
> kernel/user. It also creates unnecessary buffers in the JVM that increase GC pressure.
> Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty NIO.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2468) zero-copy shuffle network communication

2014-07-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060937#comment-14060937
 ] 

Reynold Xin commented on SPARK-2468:


We do use mmap for large blocks. However, most shuffle blocks are small, so a 
lot of blocks are not mapped. In addition, there are multiple problems with 
memory-mapped files:

1. Memory-mapped blocks are off-heap and are not managed by the JVM, which 
creates another memory space to tune/manage.

2. Memory-mapped blocks cannot be reused and are only released at GC.

3. On Linux machines with Huge Pages configured (which is increasingly common 
with large memory), the default behavior is that each mapped file consumes 
2 MB, leading to OOM very soon.

4. Large blocks that span multiple pages cause page faults, which lead to 
unnecessary context switches.


> zero-copy shuffle network communication
> ---
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user space buffer, and then back to a kernel send buffer before it reaches 
> the NIC. It does multiple copies of the data and context switching between 
> kernel/user. It also creates unnecessary buffers in the JVM that increase GC pressure.
> Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty NIO.
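
For illustration, here is a minimal sketch (not Spark's actual shuffle path) of 
serving a block file over a plain SocketChannel with FileChannel.transferTo, so 
the bytes stay in kernel space instead of being copied through a user-space 
buffer. The method name and the socket setup are assumptions made for the 
example only:

{code}
import java.io.{File, FileInputStream}
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

// Minimal sketch only: stream a block file straight to a socket. transferTo lets
// the kernel move the bytes (sendfile on Linux), skipping the extra copy through
// a user-space buffer and the associated context switches.
def sendBlockZeroCopy(blockFile: File, dest: InetSocketAddress): Unit = {
  val fileChannel = new FileInputStream(blockFile).getChannel
  val socketChannel = SocketChannel.open(dest)
  try {
    val size = fileChannel.size()
    var position = 0L
    // transferTo may send fewer bytes than requested, so loop until done.
    while (position < size) {
      position += fileChannel.transferTo(position, size - position, socketChannel)
    }
  } finally {
    socketChannel.close()
    fileChannel.close()
  }
}
{code}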



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2468) zero-copy shuffle network communication

2014-07-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060937#comment-14060937
 ] 

Reynold Xin edited comment on SPARK-2468 at 7/14/14 5:55 PM:
-

We do use mmap for large blocks. However, most shuffle blocks are small, so a 
lot of blocks are not mapped. In addition, there are multiple problems with 
memory-mapped files:

1. Memory-mapped blocks are off-heap and are not managed by the JVM, which 
creates another memory space to tune/manage.

2. Memory-mapped blocks cannot be reused and are only released at GC.

3. On Linux machines with Huge Pages configured (which is increasingly common 
with large memory), the default behavior is that each mapped file consumes 
2 MB, leading to OOM very soon.

4. Large blocks that span multiple pages cause page faults, which lead to 
unnecessary context switches.

The last one is probably much less important.



was (Author: rxin):
We do use mmap for large blocks. However, most shuffle blocks are small, so a 
lot of blocks are not mapped. In addition, there are multiple problems with 
memory-mapped files:

1. Memory-mapped blocks are off-heap and are not managed by the JVM, which 
creates another memory space to tune/manage.

2. Memory-mapped blocks cannot be reused and are only released at GC.

3. On Linux machines with Huge Pages configured (which is increasingly common 
with large memory), the default behavior is that each mapped file consumes 
2 MB, leading to OOM very soon.

4. Large blocks that span multiple pages cause page faults, which lead to 
unnecessary context switches.


> zero-copy shuffle network communication
> ---
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user space buffer, and then back to a kernel send buffer before it reaches 
> the NIC. It does multiple copies of the data and context switching between 
> kernel/user. It also creates unnecessary buffers in the JVM that increase GC pressure.
> Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty NIO.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2443) Reading from Partitioned Tables is Slow

2014-07-14 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060894#comment-14060894
 ] 

Michael Armbrust commented on SPARK-2443:
-

[~jerrylam] since Spark SQL is an Alpha component we have been pretty 
aggressive about back porting to the 1.0 branch.  This patch will likely be 
included in the 1.0.2 release if/when we make one.

> Reading from Partitioned Tables is Slow
> ---
>
> Key: SPARK-2443
> URL: https://issues.apache.org/jira/browse/SPARK-2443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Zongheng Yang
>
> Here are some numbers; all queries return ~20 million:
> {code}
> SELECT COUNT(*) FROM <non-partitioned table>
> 5.496467726 s
> SELECT COUNT(*) FROM <partitioned table>
> 50.26947 s
> SELECT COUNT(*) FROM <partitioned table, read directly instead of through hive>
> 2s
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2474) For a registered table in OverrideCatalog, the Analyzer failed to resolve references in the format of "tableName.fieldName"

2014-07-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-2474:


Summary: For a registered table in OverrideCatalog, the Analyzer failed to 
resolve references in the format of "tableName.fieldName"  (was: In some cases, 
the Analyzer failed to resolve a table registered in OverrideCatalog)

> For a registered table in OverrideCatalog, the Analyzer failed to resolve 
> references in the format of "tableName.fieldName"
> ---
>
> Key: SPARK-2474
> URL: https://issues.apache.org/jira/browse/SPARK-2474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.1
>Reporter: Yin Huai
>
> To reproduce the error, execute the following code in hive/console...
> {code}
> val m = hql("select key from src")
> m.registerAsTable("m")
> hql("select m.key from m")
> {code}
> Then, you will see
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
> attributes: 'm.key, tree:
> Project ['m.key]
>  LowerCaseSchema 
>   Project [key#6]
>LowerCaseSchema 
> MetastoreRelation default, src, None
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:71)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
> ...
> {code}
> However, if you run
> {code}
> hql("select tmp.key from m tmp")
> {code}
> We are fine.
> {code}
> SchemaRDD[3] at RDD at SchemaRDD.scala:104
> == Query Plan ==
> HiveTableScan [key#8], (MetastoreRelation default, src, None), None
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2474) In some cases, the Analyzer failed to resolve a table registered in OverrideCatalog

2014-07-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060828#comment-14060828
 ] 

Yin Huai commented on SPARK-2474:
-

I think the problem is the lookupRelation in OverrideCatalog.

The current version is:
{code}
abstract override def lookupRelation(
    databaseName: Option[String],
    tableName: String,
    alias: Option[String] = None): LogicalPlan = {
  val (dbName, tblName) = processDatabaseAndTableName(databaseName, tableName)
  val overriddenTable = overrides.get((dbName, tblName))

  // If an alias was specified by the lookup, wrap the plan in a subquery so
  // that attributes are properly qualified with this alias.
  val withAlias =
    overriddenTable.map(r => alias.map(a => Subquery(a, r)).getOrElse(r))

  withAlias.getOrElse(super.lookupRelation(dbName, tblName, alias))
}
{code}
Notice that we do not wrap the overridden table in a Subquery for the table 
name (i.e. Subquery(tableName, logicalPlan)). SimpleCatalog.lookupRelation does 
not seem to have this issue because it has
{code}
val tableWithQualifiers = Subquery(tblName, table)
{code}
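
For concreteness, here is a minimal sketch of the kind of change this suggests: 
wrapping the overridden relation in a Subquery named after the table, mirroring 
SimpleCatalog. It assumes the same OverrideCatalog trait and Catalyst types as 
the excerpt above and is an illustration only, not the merged patch:

{code}
// Hypothetical sketch only: qualify the overridden relation with the table name
// first so that "tableName.fieldName" references resolve, then apply the alias.
abstract override def lookupRelation(
    databaseName: Option[String],
    tableName: String,
    alias: Option[String] = None): LogicalPlan = {
  val (dbName, tblName) = processDatabaseAndTableName(databaseName, tableName)
  val overriddenTable = overrides.get((dbName, tblName))
  val withQualifiers = overriddenTable.map(r => Subquery(tblName, r))
  val withAlias = withQualifiers.map(r => alias.map(a => Subquery(a, r)).getOrElse(r))
  withAlias.getOrElse(super.lookupRelation(dbName, tblName, alias))
}
{code}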


> In some cases, the Analyzer failed to resolve a table registered in 
> OverrideCatalog
> ---
>
> Key: SPARK-2474
> URL: https://issues.apache.org/jira/browse/SPARK-2474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.1
>Reporter: Yin Huai
>
> To reproduce the error, execute the following code in hive/console...
> {code}
> val m = hql("select key from src")
> m.registerAsTable("m")
> hql("select m.key from m")
> {code}
> Then, you will see
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
> attributes: 'm.key, tree:
> Project ['m.key]
>  LowerCaseSchema 
>   Project [key#6]
>LowerCaseSchema 
> MetastoreRelation default, src, None
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:71)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
> ...
> {code}
> However, if you run
> {code}
> hql("select tmp.key from m tmp")
> {code}
> We are fine.
> {code}
> SchemaRDD[3] at RDD at SchemaRDD.scala:104
> == Query Plan ==
> HiveTableScan [key#8], (MetastoreRelation default, src, None), None
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2474) In some cases, the Analyzer failed to resolve a table registered in OverrideCatalog

2014-07-14 Thread Yin Huai (JIRA)
Yin Huai created SPARK-2474:
---

 Summary: In some cases, the Analyzer failed to resolve a table 
registered in OverrideCatalog
 Key: SPARK-2474
 URL: https://issues.apache.org/jira/browse/SPARK-2474
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.1
Reporter: Yin Huai


To reproduce the error, execute the following code in hive/console...
{code}
val m = hql("select key from src")
m.registerAsTable("m")
hql("select m.key from m")
{code}
Then, you will see
{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: 'm.key, tree:
Project ['m.key]
 LowerCaseSchema 
  Project [key#6]
   LowerCaseSchema 
MetastoreRelation default, src, None

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:71)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
...
{code}

However, if you run
{code}
hql("select tmp.key from m tmp")
{code}
We are fine.
{code}
SchemaRDD[3] at RDD at SchemaRDD.scala:104
== Query Plan ==
HiveTableScan [key#8], (MetastoreRelation default, src, None), None
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2154) Worker goes down.

2014-07-14 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060813#comment-14060813
 ] 

Aaron Davidson commented on SPARK-2154:
---

Created this PR to hopefully fix that: https://github.com/apache/spark/pull/1405

> Worker goes down.
> -
>
> Key: SPARK-2154
> URL: https://issues.apache.org/jira/browse/SPARK-2154
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.1, 0.9.0, 1.0.0
> Environment: Spark on cluster of three nodes on Ubuntu 12.04.4 LTS
>Reporter: siva venkat gogineni
>  Labels: patch
> Attachments: Sccreenhot at various states of driver ..jpg
>
>
> The worker dies when I try to submit more drivers than there are allocated 
> cores. When I submit 9 drivers with one core each on a cluster that has 8 
> cores in total, the worker dies as soon as I submit the 9th driver. It works 
> fine up to 8 cores; as soon as I submit the 9th driver, its status remains 
> "Submitted" and the worker crashes. I understand that we cannot run more 
> drivers than the allocated cores, but the problem is that instead of the 9th 
> driver being queued, it is executed and as a result crashes the worker. Let 
> me know if there is a way to get around this issue or whether it is being 
> fixed in an upcoming version.
> Cluster Details:
> Spark 1.00
> 2 nodes with 4 cores each.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2443) Reading from Partitioned Tables is Slow

2014-07-14 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060790#comment-14060790
 ] 

Jerry Lam commented on SPARK-2443:
--

Could this fix be easily merged into the current Spark release (1.0.1)? We 
desperately need it to perform our benchmark. Thank you!

> Reading from Partitioned Tables is Slow
> ---
>
> Key: SPARK-2443
> URL: https://issues.apache.org/jira/browse/SPARK-2443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Zongheng Yang
>
> Here are some numbers; all queries return ~20 million:
> {code}
> SELECT COUNT(*) FROM <non-partitioned table>
> 5.496467726 s
> SELECT COUNT(*) FROM <partitioned table>
> 50.26947 s
> SELECT COUNT(*) FROM <partitioned table, read directly instead of through hive>
> 2s
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2278) groupBy & groupByKey should support custom comparator

2014-07-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060774#comment-14060774
 ] 

Sean Owen commented on SPARK-2278:
--

The more direct parallel certainly also exists, if you want to write it that 
way. Given an RDD of V, you can first groupBy some derived value of type K to 
get an RDD of (K, Iterable[V]). From there, you can mapValues and apply a reduce 
function yourself. (Something analogous works for groupByKey.)

The big "but" to this approach is that you materialize the values all together 
at once for a key, and then manually apply a reduce function. This is what 
reduceBy is doing under the hood for you, probably more optimally. Still you 
could break it down if you needed more control.

The part where you get to define the value K that determines grouping -- that's 
what you need and why you don't necessarily need a Comparator anywhere to get 
your job done.

Yes, understanding the 'func' is key, and it's more obvious coming from Scala. 
It answers the requirements you have as far as I understand them (with the 
caveat above about ordering and sortBy). I suggest you can resolve this by 
suggesting a doc update somewhere.
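
To make the pattern described above concrete, here is a small Scala sketch 
(using a hypothetical Employee type, not taken from any patch) that groups by a 
derived value and then applies the reduce over each group by hand:

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Hypothetical example type; not part of the Spark API.
case class Employee(name: String, salary: Double)

// Sketch only: group by a derived value (the name), then apply the "reduce"
// by hand over each key's materialized Iterable. reduceByKey would usually be
// more efficient, but this form gives full control over the grouped values.
def totalSalaryByName(employees: RDD[Employee]): RDD[(String, Double)] =
  employees
    .groupBy((e: Employee) => e.name)        // RDD[(String, Iterable[Employee])]
    .mapValues(es => es.map(_.salary).sum)   // manual reduce over the group
{code}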

> groupBy & groupByKey should support custom comparator
> -
>
> Key: SPARK-2278
> URL: https://issues.apache.org/jira/browse/SPARK-2278
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API
>Affects Versions: 1.0.0
>Reporter: Hans Uhlig
>
> To maintain parity with MapReduce you should be able to specify a custom key 
> equality function in groupBy/groupByKey similar to sortByKey. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-546) Support full outer join and multiple join in a single shuffle

2014-07-14 Thread Aaron (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060763#comment-14060763
 ] 

Aaron commented on SPARK-546:
-

I created a PR for a full outer join implementation here:
https://github.com/apache/spark/pull/1395

If there is interest I can also implement multiJoin.

> Support full outer join and multiple join in a single shuffle
> -
>
> Key: SPARK-546
> URL: https://issues.apache.org/jira/browse/SPARK-546
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>
> RDD[(K,V)] now supports left/right outer join but not full outer join.
> Also it'd be nice to provide a way for users to join multiple RDDs on the 
> same key in a single shuffle.
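
For readers following along, here is a minimal Scala sketch of how full outer 
join semantics can be expressed with cogroup. It illustrates the idea only and 
is not the code in the linked PR:

{code}
import scala.reflect.ClassTag

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Sketch only: full outer join expressed with cogroup. Keys seen on only one
// side are paired with None; keys seen on both sides yield the cross product
// of their values, matching the usual full outer join semantics.
def fullOuterJoin[K: ClassTag, V: ClassTag, W: ClassTag](
    left: RDD[(K, V)], right: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))] =
  left.cogroup(right).flatMap { case (k, (vs, ws)) =>
    if (vs.isEmpty) ws.map(w => (k, (None: Option[V], Some(w): Option[W])))
    else if (ws.isEmpty) vs.map(v => (k, (Some(v): Option[V], None: Option[W])))
    else for (v <- vs; w <- ws) yield (k, (Some(v): Option[V], Some(w): Option[W]))
  }
{code}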



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2154) Worker goes down.

2014-07-14 Thread siva venkat gogineni (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

siva venkat gogineni updated SPARK-2154:


Attachment: Sccreenhot at various states of driver ..jpg

> Worker goes down.
> -
>
> Key: SPARK-2154
> URL: https://issues.apache.org/jira/browse/SPARK-2154
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.1, 0.9.0, 1.0.0
> Environment: Spark on cluster of three nodes on Ubuntu 12.04.4 LTS
>Reporter: siva venkat gogineni
>  Labels: patch
> Attachments: Sccreenhot at various states of driver ..jpg
>
>
> The worker dies when I try to submit more drivers than there are allocated 
> cores. When I submit 9 drivers with one core each on a cluster that has 8 
> cores in total, the worker dies as soon as I submit the 9th driver. It works 
> fine up to 8 cores; as soon as I submit the 9th driver, its status remains 
> "Submitted" and the worker crashes. I understand that we cannot run more 
> drivers than the allocated cores, but the problem is that instead of the 9th 
> driver being queued, it is executed and as a result crashes the worker. Let 
> me know if there is a way to get around this issue or whether it is being 
> fixed in an upcoming version.
> Cluster Details:
> Spark 1.00
> 2 nodes with 4 cores each.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2278) groupBy & groupByKey should support custom comparator

2014-07-14 Thread Hans Uhlig (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060735#comment-14060735
 ] 

Hans Uhlig commented on SPARK-2278:
---

I think I now understand my confusion. I was looking for something that 
parallels the existing MapReduce reduce function, which receives a key and the 
associated iterable of values. The documentation doesn't describe the behavior 
of each method, the expected behavior of the function it takes, or the 
transformation itself particularly well. It looks as though you need to chain 
groupByKey and then flatMapValues to accomplish what I am looking for. Still, 
being able to specify a comparator or custom equality operator would be good.
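
As a small illustration of the chaining mentioned above (a sketch only, not 
from the Spark docs), here is groupByKey followed by flatMapValues acting like 
a MapReduce reducer that sees a key with all of its values and can emit any 
number of outputs:

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Sketch only: a MapReduce-style "reduce" that receives each key together with
// the full Iterable of its values and may emit zero or more records per key.
def reducerStyle(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
  pairs
    .groupByKey()                 // RDD[(String, Iterable[Int])]
    .flatMapValues { values =>
      // Emit a total only for keys with more than one value, to show that the
      // "reducer" decides how many records come out per key.
      if (values.size > 1) Seq(values.sum) else Seq.empty[Int]
    }
{code}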

> groupBy & groupByKey should support custom comparator
> -
>
> Key: SPARK-2278
> URL: https://issues.apache.org/jira/browse/SPARK-2278
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API
>Affects Versions: 1.0.0
>Reporter: Hans Uhlig
>
> To maintain parity with MapReduce you should be able to specify a custom key 
> equality function in groupBy/groupByKey similar to sortByKey. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2154) Worker goes down.

2014-07-14 Thread siva venkat gogineni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060728#comment-14060728
 ] 

siva venkat gogineni commented on SPARK-2154:
-

It looks like this has been fixed in 1.0.1, but the fix introduces another bug. 
If we launch a new driver while other drivers are already running in the 
cluster, the new driver shows as submitted but does not move to the running 
state even after the existing drivers have completed. It only changes from 
submitted to running when we submit another driver.

> Worker goes down.
> -
>
> Key: SPARK-2154
> URL: https://issues.apache.org/jira/browse/SPARK-2154
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.1, 0.9.0, 1.0.0
> Environment: Spark on cluster of three nodes on Ubuntu 12.04.4 LTS
>Reporter: siva venkat gogineni
>  Labels: patch
>
> The worker dies when I try to submit more drivers than there are allocated 
> cores. When I submit 9 drivers with one core each on a cluster that has 8 
> cores in total, the worker dies as soon as I submit the 9th driver. It works 
> fine up to 8 cores; as soon as I submit the 9th driver, its status remains 
> "Submitted" and the worker crashes. I understand that we cannot run more 
> drivers than the allocated cores, but the problem is that instead of the 9th 
> driver being queued, it is executed and as a result crashes the worker. Let 
> me know if there is a way to get around this issue or whether it is being 
> fixed in an upcoming version.
> Cluster Details:
> Spark 1.00
> 2 nodes with 4 cores each.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2278) groupBy & groupByKey should support custom comparator

2014-07-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060718#comment-14060718
 ] 

Sean Owen commented on SPARK-2278:
--

groupBy / groupByKey vs collectBy / collectByKey?

(This is kind of a tangent, but a Comparator is not ideal here. Really the 
requirement is to define key equality differently, and that can be defined as 
"when compare() == 0", although ordering is not needed. But yes Comparator gets 
used this way in Java.)

In JavaRDD, you define a function of the key which yields a value whose 
equality matches how you want to group. Again given the hypothetical Employee, 
grouping by name:

{code}
new Function<Employee, String>() {
  public String call(Employee e) {
return e.getName();
  }
}
{code}

There's not a copy here. This may not have been what you had in mind though. 
For Java it would have been:

{code}
new Comparator<Employee>() {
  public int compare(Employee e1, Employee e2) {
return e1.getName().compareTo(e2.getName());
  }
}
{code}

That's the equivalent. Although the disadvantage I see right now in the JavaRDD 
is you can't further define the ordering you want, in cases like sortBy, where 
the right ordering isn't the natural ordering of some function of the values.

What is the role of func vs comp in your example for groupBy (?) though?

> groupBy & groupByKey should support custom comparator
> -
>
> Key: SPARK-2278
> URL: https://issues.apache.org/jira/browse/SPARK-2278
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API
>Affects Versions: 1.0.0
>Reporter: Hans Uhlig
>
> To maintain parity with MapReduce you should be able to specify a custom key 
> equality function in groupBy/groupByKey similar to sortByKey. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2473) Direct graph-database interaction/connection

2014-07-14 Thread David Deisadze (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Deisadze updated SPARK-2473:
--

Priority: Trivial  (was: Critical)

> Direct graph-database interaction/connection
> 
>
> Key: SPARK-2473
> URL: https://issues.apache.org/jira/browse/SPARK-2473
> Project: Spark
>  Issue Type: Question
> Environment: windows, hadoop 12 node cluster
>Reporter: David Deisadze
>Priority: Trivial
>
> Is it possible to integrate neo4j into graphx (so that graphx can do analysis 
> seamlessly) without having to use neo4j's Cypher to generate a JSON file of 
> the graph?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-2473) Direct graph-database interaction/connection

2014-07-14 Thread David Deisadze (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Deisadze closed SPARK-2473.
-

Resolution: Later

I asked the spark mailing list community: 
http://apache-spark-user-list.1001560.n3.nabble.com/Graph-database-integration-for-spark-graphx-td9607.html

> Direct graph-database interaction/connection
> 
>
> Key: SPARK-2473
> URL: https://issues.apache.org/jira/browse/SPARK-2473
> Project: Spark
>  Issue Type: Question
> Environment: windows, hadoop 12 node cluster
>Reporter: David Deisadze
>Priority: Trivial
>
> Is it possible to integrate neo4j into graphx (so that graphx can do analysis 
> seamlessly) without having to use neo4j's Cypher to generate a JSON file of 
> the graph?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2278) groupBy & groupByKey should support custom comparator

2014-07-14 Thread Hans Uhlig (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060691#comment-14060691
 ] 

Hans Uhlig commented on SPARK-2278:
---

I think I am not communicating this very well. Perhaps I am looking for a 
different function entirely:

JavaRDD JavaRDD.collectBy(Iterable Function 
func(Iterable), Comparator comp, Partitioner partitioner, int 
numPartitions)

JavaPairRDD JavaPairRDD.collectByKey( Iterable> 
Function func(Kin,Iterable), Comparator comp, 
Partitioner partitioner, int numPartitions)

> groupBy & groupByKey should support custom comparator
> -
>
> Key: SPARK-2278
> URL: https://issues.apache.org/jira/browse/SPARK-2278
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API
>Affects Versions: 1.0.0
>Reporter: Hans Uhlig
>
> To maintain parity with MapReduce you should be able to specify a custom key 
> equality function in groupBy/groupByKey similar to sortByKey. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2473) Direct graph-database interaction/connection

2014-07-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060679#comment-14060679
 ] 

Sean Owen commented on SPARK-2473:
--

(If you're asking a question, I don't think a JIRA is the right place, even 
though there is unfortunately a "Question" issue type. Ask on the user@ mailing 
list. This isn't "Critical" in any event :) )

> Direct graph-database interaction/connection
> 
>
> Key: SPARK-2473
> URL: https://issues.apache.org/jira/browse/SPARK-2473
> Project: Spark
>  Issue Type: Question
> Environment: windows, hadoop 12 node cluster
>Reporter: David Deisadze
>Priority: Critical
>
> Is it possible to integrate neo4j into graphx (so that graphx can do analysis 
> seamlessly) without having to use neo4j's Cypher to generate a JSON file of 
> the graph?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

