[jira] [Created] (SPARK-4828) sum and avg over empty table should return null
Adrian Wang created SPARK-4828: -- Summary: sum and avg over empty table should return null Key: SPARK-4828 URL: https://issues.apache.org/jira/browse/SPARK-4828 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
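Per standard SQL semantics, SUM and AVG over zero rows yield NULL rather than 0, which is the behavior the issue title asks for. A minimal plain-Java sketch of those semantics (hypothetical helper class, not Spark code):

```java
import java.util.Collections;
import java.util.List;

class EmptySumAvg {
    // SQL semantics: SUM/AVG over zero rows yield NULL (here: null), not 0.
    static Double sum(List<Double> rows) {
        if (rows.isEmpty()) return null;   // no rows -> NULL
        double s = 0.0;
        for (double v : rows) s += v;
        return s;
    }

    static Double avg(List<Double> rows) {
        Double s = sum(rows);
        return s == null ? null : s / rows.size();
    }

    public static void main(String[] args) {
        System.out.println(sum(Collections.<Double>emptyList())); // null
        System.out.println(avg(Collections.<Double>emptyList())); // null
    }
}
```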
[jira] [Commented] (SPARK-4828) sum and avg over empty table should return null
[ https://issues.apache.org/jira/browse/SPARK-4828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242388#comment-14242388 ] Apache Spark commented on SPARK-4828: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/3675 sum and avg over empty table should return null --- Key: SPARK-4828 URL: https://issues.apache.org/jira/browse/SPARK-4828 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Priority: Minor
[jira] [Created] (SPARK-4829) eliminate expressions calculation in count expression
Adrian Wang created SPARK-4829: -- Summary: eliminate expressions calculation in count expression Key: SPARK-4829 URL: https://issues.apache.org/jira/browse/SPARK-4829 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang
[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD
[ https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242402#comment-14242402 ] Sean Owen commented on SPARK-4817: -- [~tianyi] I agree that it would be nice to add a parameter to {{print()}}. But that is precisely what SPARK-3326 already covered. It's fine to want to process all of an RDD, and then print some of it. But this does not require a new method at all. [~surq] I don't understand your examples, relative to what you say you want to do. The first example only prints elements, and does it with needless complexity. Why {{collect()}}? The second example also doesn't do anything but print, yet tries to manually run a new job. Something simple like this seems to be just what you want. It does something with the entire RDD, then prints just the first 100 elements: {code}
stream.foreachRDD { rdd =>
  rdd.foreach(row => /* do whatever you want with every element */)
  rdd.take(100).foreach(println)
}
{code} I think this also works fine: {code}
stream.foreachRDD { rdd =>
  rdd.foreach(row => /* do whatever you want with every element */)
}
stream.print(100)
{code} ... if SPARK-3326 is implemented to add an argument to {{print()}}. [streaming]Print the specified number of data and handle all of the elements in RDD --- Key: SPARK-4817 URL: https://issues.apache.org/jira/browse/SPARK-4817 Project: Spark Issue Type: New Feature Components: Streaming Reporter: 宿荣全 Priority: Minor The {{DStream.print}} function prints 10 elements (internally taking 11). A new function based on {{DStream.print}} is proposed: print the specified number of elements while handling all of the elements in the RDD. A typical scenario: {{val dstream = stream.map(...).filter(...).mapPartitions(...).print(...)}} The data after the filter needs to update a database inside {{mapPartitions}}, but not every record needs to be printed; only the top 20 are needed to inspect the processing.
[jira] [Commented] (SPARK-4829) eliminate expressions calculation in count expression
[ https://issues.apache.org/jira/browse/SPARK-4829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242413#comment-14242413 ] Apache Spark commented on SPARK-4829: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/3676 eliminate expressions calculation in count expression - Key: SPARK-4829 URL: https://issues.apache.org/jira/browse/SPARK-4829 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242436#comment-14242436 ] Paulo Motta commented on SPARK-2984: We're also facing a similar issue when using S3N, described in detail on this thread: https://www.mail-archive.com/user@spark.apache.org/msg17253.html Here is the relevant exception: {code} 2014-12-10 19:05:13,823 ERROR [sparkDriver-akka.actor.default-dispatcher-16] scheduler.JobScheduler (Logging.scala:logError(96)) - Error running job streaming job 141823830 ms.0 java.io.FileNotFoundException: File s3n://BUCKET/_temporary/0/task_201412101900_0039_m_33 does not exist. at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) at org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136) at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:995) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:878) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:845) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:803) at MyDumperClass$$anonfun$main$1.apply(IncrementalDumpsJenkins.scala:100) at MyDumperClass$$anonfun$main$1.apply(IncrementalDumpsJenkins.scala:79) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at
org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2014-12-10 19:05:13,829 INFO [Driver] yarn.ApplicationMaster (Logging.scala:logInfo(59)) - Unregistering ApplicationMaster with FAILED {code} I'm quite sure it has to do with eventual consistency on S3: files published to S3 are often not promptly visible (for instance, when you try {{s3cmd ls s3://mybucket/whatever}} soon after a file is posted); they only appear after a few seconds. Is there already a configuration for retrying to fetch S3 files if they're not found (maybe with some kind of exponential backoff)? Maybe this could be a solution, if it's not yet available. FileNotFoundException on _temporary directory - Key: SPARK-2984 URL: https://issues.apache.org/jira/browse/SPARK-2984 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Ash Priority: Critical We've seen several stacktraces and threads on the user mailing list where people are having issues with a {{FileNotFoundException}} stemming from an HDFS path containing {{_temporary}}. I ([~aash]) think this may be related to {{spark.speculation}}. I think the error condition might manifest in this circumstance: 1) task T starts on an executor E1 2) it takes a long time, so task T' is started on another executor E2 3) T finishes in E1 so moves its data from {{_temporary}} to the final destination and deletes the {{_temporary}} directory during cleanup 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but those files no longer exist!
Some samples: {noformat} 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 140774430 ms.0 java.io.FileNotFoundException: File hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 does not exist. at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708) at
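Paulo asks whether a retry-with-exponential-backoff configuration exists for not-yet-visible S3 files; the thread doesn't cite one, but the idea can be sketched with a generic helper (hypothetical name, not an actual Spark or Hadoop API):

```java
import java.util.concurrent.Callable;

class Retry {
    // Retry an action with exponential backoff (base, 2*base, 4*base, ...);
    // rethrows the last exception once maxAttempts is exhausted.
    static <T> T withBackoff(Callable<T> action, int maxAttempts, long baseDelayMs) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) throw e;
                Thread.sleep(baseDelayMs << (attempt - 1)); // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate an eventually-consistent listing: fails twice, then succeeds.
        int[] calls = {0};
        String r = withBackoff(() -> {
            if (++calls[0] < 3) throw new java.io.FileNotFoundException("not visible yet");
            return "found";
        }, 5, 10);
        System.out.println(r + " after " + calls[0] + " attempts");
    }
}
```

A jittered delay would be preferable in production to avoid synchronized retries across tasks.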
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242486#comment-14242486 ] Meethu Mathew commented on SPARK-4156: -- [~tgaloppo] The current version of the code has no predict function to return the cluster labels, i.e., the index of the cluster to which the point has maximum membership. We have written a predict function to return the cluster labels and the membership values. We would be happy to contribute this to your code. cc [~mengxr] Add expectation maximization for Gaussian mixture models to MLLib clustering Key: SPARK-4156 URL: https://issues.apache.org/jira/browse/SPARK-4156 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Travis Galoppo Assignee: Travis Galoppo As an additional clustering algorithm, implement expectation maximization for Gaussian mixture models
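The predict function described here is essentially an argmax over a point's per-cluster membership (responsibility) values produced by the EM E-step. A minimal sketch of that rule (hypothetical class, not the actual MLlib API):

```java
class GmmPredict {
    // Hard cluster assignment: index of the maximum posterior membership value.
    // membership[k] is the point's responsibility for cluster k (sums to ~1).
    static int predict(double[] membership) {
        int best = 0;
        for (int k = 1; k < membership.length; k++) {
            if (membership[k] > membership[best]) best = k;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(predict(new double[]{0.1, 0.7, 0.2})); // 1
    }
}
```

Returning both the label and the membership vector, as the comment proposes, preserves the soft-assignment information that the argmax alone discards.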
[jira] [Created] (SPARK-4830) Spark Java Application : java.lang.ClassNotFoundException
Mykhaylo Telizhyn created SPARK-4830: Summary: Spark Java Application : java.lang.ClassNotFoundException Key: SPARK-4830 URL: https://issues.apache.org/jira/browse/SPARK-4830 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Mykhaylo Telizhyn
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242494#comment-14242494 ] Travis Galoppo commented on SPARK-4156: --- [~MeethuMathew] This would be great! If possible, please issue a pull request against my repo and I will merge it in as soon as possible. Add expectation maximization for Gaussian mixture models to MLLib clustering Key: SPARK-4156 URL: https://issues.apache.org/jira/browse/SPARK-4156 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Travis Galoppo Assignee: Travis Galoppo As an additional clustering algorithm, implement expectation maximization for Gaussian mixture models
[jira] [Updated] (SPARK-4830) Spark Java Application : java.lang.ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mykhaylo Telizhyn updated SPARK-4830: - Description: We have a Spark Streaming application that consumes messages from RabbitMQ and processes them. When generating hundreds of events on RabbitMQ and running our application on a Spark standalone cluster we see some java.lang.ClassNotFoundException exceptions in the log. Application Overview: Our domain model is a simple POJO that represents the RabbitMQ events we want to consume and contains some custom properties we are interested in: {code:title=Event.java|borderStyle=solid}
class Event implements java.io.Externalizable {
    // custom properties
    // custom implementation of writeExternal(), readExternal() methods
}
{code} We have implemented a custom Spark receiver that just receives messages from a RabbitMQ queue by means of a custom consumer (see Receiving messages by subscription at https://www.rabbitmq.com/api-guide.html), converts them to our custom domain event objects (com.xxx.Event) and stores them in Spark memory: {code:title=RabbitMQReceiver.java|borderStyle=solid}
byte[] body = // data received from Rabbit using custom consumer
Event event = new Event(body);
store(event); // store into Spark
{code} The main program is simple; it just sets up the Spark Streaming context: {code:title=SparkApplication.java|borderStyle=solid}
SparkConf sparkConf = new SparkConf().setAppName(APPLICATION_NAME);
sparkConf.setJars(SparkContext.jarOfClass(Application.class).toList());
{code} Initialize input streams: {code:title=SparkApplication.java|borderStyle=solid}
ReceiverInputDStream<Event> stream = // create input stream from RabbitMQ
JavaReceiverInputDStream<Event> events = new JavaReceiverInputDStream<Event>(stream, classTag(Event.class));
{code} Process events: {code:title=SparkApplication.java|borderStyle=solid}
events.foreachRDD(rdd -> {
    rdd.foreachPartition(partition -> {
        // process partition
    });
});
ssc.start();
ssc.awaitTermination();
{code} Application submission: The application is packaged into a single fat jar using the Maven Shade plugin (http://maven.apache.org/plugins/maven-shade-plugin/). It is compiled with Spark version 1.1.0. We run our application on a Spark 1.1.0 standalone cluster that consists of a driver host, a master host and two worker hosts. We submit the application from the driver host. On one of the workers we see java.lang.ClassNotFoundException exceptions. We see that the worker has downloaded application.jar and added it to the class loader: 14/11/27 10:26:59 INFO Executor: Fetching http://xx.xx.xx.xx:38287/jars/application.jar with timestamp 1417084011213 14/11/27 10:26:59 INFO Utils: Fetching http://xx.xx.xx.xx:38287/jars/application.jar to /tmp/fetchFileTemp8223721356974787443.tmp 14/11/27 10:27:00 INFO BlockManager: Removing RDD 4 14/11/27 10:27:01 INFO Executor: Adding file:/path/to/spark/work/app-20141127102651-0001/1/./application.jar to class loader ... 14/11/27 10:27:10 ERROR BlockManagerWorker: Exception handling buffer message java.lang.ClassNotFoundException: com.xxx.Event at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:344) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at
java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at
[jira] [Commented] (SPARK-4526) Gradient should be added batch computing interface
[ https://issues.apache.org/jira/browse/SPARK-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242504#comment-14242504 ] Apache Spark commented on SPARK-4526: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/3677 Gradient should be added batch computing interface -- Key: SPARK-4526 URL: https://issues.apache.org/jira/browse/SPARK-4526 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li If Gradient supported batch computing, we could use efficient numerical libraries (e.g., BLAS). In some cases this can improve performance by more than ten times.
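The point of a batch interface is that the gradient for a whole batch can be phrased as matrix operations, which a BLAS library can execute far faster than per-example loops. As an illustration only (not the MLlib Gradient interface), a plain-Java sketch of the batched least-squares gradient X^T(Xw - y), with explicit loops standing in for the BLAS GEMV calls:

```java
class BatchGradient {
    // Gradient of 0.5 * ||Xw - y||^2 over a batch is X^T (Xw - y).
    // Phrased this way, both steps map directly onto BLAS matrix-vector ops.
    static double[] gradient(double[][] X, double[] y, double[] w) {
        int n = X.length, d = w.length;
        double[] r = new double[n];               // residuals r = Xw - y
        for (int i = 0; i < n; i++) {
            double dot = 0.0;
            for (int j = 0; j < d; j++) dot += X[i][j] * w[j];
            r[i] = dot - y[i];
        }
        double[] g = new double[d];               // g = X^T r
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++) g[j] += X[i][j] * r[i];
        return g;
    }
}
```

A real implementation would hand X to a native BLAS routine once per batch instead of looping, which is where the order-of-magnitude speedup comes from.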
[jira] [Updated] (SPARK-4830) Spark Java Application : java.lang.ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mykhaylo Telizhyn updated SPARK-4830: - Description: h4. Application Overview: We have a Spark Streaming application that consumes messages from RabbitMQ and processes them. When generating hundreds of events on RabbitMQ and running our application on a Spark standalone cluster we see some {{java.lang.ClassNotFoundException}} exceptions in the log. Our domain model is a simple POJO that represents the RabbitMQ events we want to consume and contains some custom properties we are interested in: {code:title=com.xxx.Event.java|borderStyle=solid}
public class Event implements java.io.Externalizable {
    // custom properties
    // custom implementation of writeExternal(), readExternal() methods
}
{code} We have implemented a custom Spark Streaming receiver that just receives messages from a RabbitMQ queue by means of a custom consumer (see _Receiving messages by subscription_ at https://www.rabbitmq.com/api-guide.html), converts them to our custom domain event objects ({{com.xxx.Event}}) and stores them in Spark memory: {code:title=RabbitMQReceiver.java|borderStyle=solid}
byte[] body = // data received from Rabbit using custom consumer
Event event = new Event(body);
store(event); // store into Spark
{code} The main program is simple; it just sets up the Spark Streaming context: {code:title=Application.java|borderStyle=solid}
SparkConf sparkConf = new SparkConf().setAppName(APPLICATION_NAME);
sparkConf.setJars(SparkContext.jarOfClass(Application.class).toList());
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(BATCH_DURATION_MS));
{code} Initialize input streams: {code:title=Application.java|borderStyle=solid}
ReceiverInputDStream<Event> stream = // create input stream from RabbitMQ
JavaReceiverInputDStream<Event> events = new JavaReceiverInputDStream<Event>(stream, classTag(Event.class));
{code} Process events: {code:title=Application.java|borderStyle=solid}
events.foreachRDD(rdd -> {
    rdd.foreachPartition(partition -> {
        // process partition
    });
});
ssc.start();
ssc.awaitTermination();
{code} h4. Application submission: The application is packaged as a single fat jar file using the Maven Shade plugin (http://maven.apache.org/plugins/maven-shade-plugin/). It is compiled with Spark version _1.1.0_. We run our application on a Spark version _1.1.0_ standalone cluster that consists of a driver host, a master host and two worker hosts. We submit the application from the driver host. On one of the workers we see {{java.lang.ClassNotFoundException}} exceptions: {panel:title=app.log|borderStyle=dashed|borderColor=#ccc|titleBGColor=#e3e4e1|bgColor=#f0f8ff} 14/11/27 10:27:10 ERROR BlockManagerWorker: Exception handling buffer message java.lang.ClassNotFoundException: com.xxx.Event at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:344) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235) at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:126) at
[jira] [Updated] (SPARK-4830) Spark Streaming Java Application : java.lang.ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mykhaylo Telizhyn updated SPARK-4830: - Summary: Spark Streaming Java Application : java.lang.ClassNotFoundException (was: Spark Java Application : java.lang.ClassNotFoundException) Spark Streaming Java Application : java.lang.ClassNotFoundException --- Key: SPARK-4830 URL: https://issues.apache.org/jira/browse/SPARK-4830 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: Mykhaylo Telizhyn
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242547#comment-14242547 ] koert kuipers commented on SPARK-3655: -- i updated the pullreq to use Iterables instead of TraversableOnce i also wanted to take this opportunity to one more time make a pitch for foldLeft. i think we should implement foldLeft because 1) it is a well known operation that perfectly fits many problems such as time series analysis 2) it does not need to make the in-memory assumption for the sorted values, which is crucial for a lot of problems 3) it is (i think?) the most basic api that does not need values in memory, since it uses a repeated operation that uses the values like a Traversable and builds the return value. no Iterator or TraversableOnce is exposed, so it does not have potential strange interactions with things like caching and downstream shuffles. 4) groupByKeysAndSortValues (which does keep values in memory) can be expressed in foldLeft trivially: groupByKeysAndSortValues(valueOrdering) = foldLeftByKey(valueOrdering, new ArrayBuffer[V])(_ += _) Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful.
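The foldLeft pitch above can be made concrete with a single-machine sketch of what foldLeftByKey semantics would look like: each key's values are visited in sorted order exactly once and folded into an accumulator, so the sorted values never need to be materialized as a group. This is a hypothetical signature for illustration; the actual proposal operates on RDDs:

```java
import java.util.*;
import java.util.function.BiFunction;
import java.util.function.Supplier;

class FoldLeftByKey {
    // Fold each key's values in sorted order; the fold consumes values one at a
    // time (like a Traversable), never exposing an in-memory group.
    static <K, V, B> Map<K, B> foldLeftByKey(List<Map.Entry<K, V>> pairs,
                                             Comparator<V> ordering,
                                             Supplier<B> zero,
                                             BiFunction<B, V, B> op) {
        Map<K, List<V>> byKey = new HashMap<>();
        for (Map.Entry<K, V> p : pairs)
            byKey.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        Map<K, B> out = new HashMap<>();
        for (Map.Entry<K, List<V>> e : byKey.entrySet()) {
            e.getValue().sort(ordering);          // secondary sort on values
            B acc = zero.get();
            for (V v : e.getValue()) acc = op.apply(acc, v);  // foldLeft
            out.put(e.getKey(), acc);
        }
        return out;
    }
}
```

As koert's point 4 notes, grouping-with-sorted-values falls out as the special case where the zero is an empty buffer and the operation is append.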
[jira] [Resolved] (SPARK-4806) Update Streaming Programming Guide for Spark 1.2
[ https://issues.apache.org/jira/browse/SPARK-4806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-4806. -- Resolution: Done Target Version/s: 1.2.0, 1.3.0 (was: 1.2.0) Update Streaming Programming Guide for Spark 1.2 Key: SPARK-4806 URL: https://issues.apache.org/jira/browse/SPARK-4806 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Important updates to the streaming programming guide - Make the fault-tolerance properties easier to understand, with information about write ahead logs - Update the information about deploying the spark streaming app with information about Driver HA - Update Receiver guide to discuss reliable vs unreliable receivers.
[jira] [Commented] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
[ https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242611#comment-14242611 ] Cheng Lian commented on SPARK-4814: --- This assertion failure seems to be related to details of the RCFile format. I haven't found the root cause. Tested the original {{alter_merge_2.q}} test case under Hive 0.13.1 by running {code} $ cd itest/qtest $ mvn -Dtest=TestCliDriver -Phadoop-2 -Dqfile=alter_merge_2.q test {code} didn't observe similar {{AssertionError}}. Keep digging. Although in this case the assertion failure doesn't affect correctness, I'm not sure whether it's generally safe to ignore it... Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger Key: SPARK-4814 URL: https://issues.apache.org/jira/browse/SPARK-4814 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.1.0 Reporter: Sean Owen Follow up to SPARK-4159, wherein we noticed that Java tests weren't running in Maven, in part because a Java test actually fails with {{AssertionError}}. That code/test was fixed in SPARK-4850. The reason it wasn't caught by SBT tests was that they don't run with assertions on, and Maven's surefire does. 
Turning on assertions in the SBT build is trivial, adding one line: {code} javaOptions in Test += "-ea", {code} This reveals a test failure in the Scala test suites, though: {code} [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds) [info] Failed to execute query using catalyst: [info] Error: Job aborted due to stage failure: Task 1 in stage 551.0 failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, localhost): java.lang.AssertionError [info] at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.<init>(LazyBinaryInteger.java:51) [info] at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110) [info] at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171) [info] at org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166) [info] at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318) [info] at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132) [info] at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128) [info] at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) [info] at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) [info] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) [info] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
[info] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) [info] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) [info] at org.apache.spark.scheduler.Task.run(Task.scala:56) [info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] at java.lang.Thread.run(Thread.java:745) {code} The items for this JIRA are therefore: - Enable assertions in SBT - Fix this failure - Figure out why Maven scalatest didn't trigger it - may need assertions explicitly turned on too.
[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242656#comment-14242656 ] Zhang, Liye commented on SPARK-4740: Hi [~adav], I missed there is another patch from [~rxin] set connectionsPerPeer to 1, the result in my last comment is with the default value. I have made another test with connectionsPerPeer set to 2 on 8HDDs, the result is a little better than connectionsPerPeer=1, but still can not compete with NIO. Seems the unbalance of Netty is not introduced by rxin's patch. It exists in the original master branch with HDD. Hi [~rxin], I tested your patch https://github.com/apache/spark/pull/3667 with 8HDDs and with spark.executor.memory=48GB, the result shows this patch doesn't get better performance, the reduce time with patch is longer than without the patch (37mins VS 31 mins). Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Assignee: Reynold Xin Attachments: (rxin patch better executor)TestRunner sort-by-key - Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z When testing current spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService takes much longer time than NIO based shuffle transferService. The network throughput of Netty is only about half of that of NIO. 
We tested in standalone mode; the data set used for the test is 20 billion records, about 400 GB in total. The spark-perf test runs on a 4-node cluster with 10G NIC, 48 CPU cores per node, and 64 GB of executor memory per node. The number of reduce tasks is set to 1000.
[jira] [Created] (SPARK-4831) Current directory always on classpath with spark-submit
Daniel Darabos created SPARK-4831: - Summary: Current directory always on classpath with spark-submit Key: SPARK-4831 URL: https://issues.apache.org/jira/browse/SPARK-4831 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.1, 1.2.0 Reporter: Daniel Darabos Priority: Minor We had a situation where we were launching an application with spark-submit, and a file (play.plugins) was on the classpath twice, causing problems (trying to register plugins twice). Upon investigating how it got on the classpath twice, we found that it was present in one of our jars, and also in the current working directory. But the one in the current working directory should not be on the classpath. We never asked spark-submit to put the current directory on the classpath. I think this is caused by a line in [compute-classpath.sh|https://github.com/apache/spark/blob/v1.2.0-rc2/bin/compute-classpath.sh#L28]: {code} CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH {code} Now if SPARK_CLASSPATH is empty, the empty string is added to the classpath, which means the current working directory. We tried setting SPARK_CLASSPATH to a bogus value, but that is [not allowed|https://github.com/apache/spark/blob/v1.2.0-rc2/core/src/main/scala/org/apache/spark/SparkConf.scala#L312]. What is the right solution? Only add SPARK_CLASSPATH if it's non-empty? I can send a pull request for that I think. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2980) Python support for chi-squared test
[ https://issues.apache.org/jira/browse/SPARK-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242718#comment-14242718 ] Apache Spark commented on SPARK-2980: - User 'jbencook' has created a pull request for this issue: https://github.com/apache/spark/pull/3679 Python support for chi-squared test --- Key: SPARK-2980 URL: https://issues.apache.org/jira/browse/SPARK-2980 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Doris Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4831) Current directory always on classpath with spark-submit
[ https://issues.apache.org/jira/browse/SPARK-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242778#comment-14242778 ] Sean Owen commented on SPARK-4831: -- Hm, so I made a quick test, where I put a class {{Foo.class}} inside {{Foo.jar}} and then ran {{java -cp :otherstuff.jar Foo}}. It does not find the class, which suggests to me that it does not interpret that empty entry as meaning the local directory too. It doesn't work even if I put . on the classpath. That makes sense: the working directory contains JARs, in your case, not classes. However, it does find it if I leave {{Foo.class}} in the working directory, *if* I have an empty entry in the classpath. Is it perhaps finding an exploded directory of classes? Otherwise, I can't repro this directly in Java, I suppose. Current directory always on classpath with spark-submit --- Key: SPARK-4831 URL: https://issues.apache.org/jira/browse/SPARK-4831 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.1, 1.2.0 Reporter: Daniel Darabos Priority: Minor We had a situation where we were launching an application with spark-submit, and a file (play.plugins) was on the classpath twice, causing problems (trying to register plugins twice). Upon investigating how it got on the classpath twice, we found that it was present in one of our jars, and also in the current working directory. But the one in the current working directory should not be on the classpath. We never asked spark-submit to put the current directory on the classpath. I think this is caused by a line in [compute-classpath.sh|https://github.com/apache/spark/blob/v1.2.0-rc2/bin/compute-classpath.sh#L28]: {code} CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH {code} Now if SPARK_CLASSPATH is empty, the empty string is added to the classpath, which means the current working directory.
We tried setting SPARK_CLASSPATH to a bogus value, but that is [not allowed|https://github.com/apache/spark/blob/v1.2.0-rc2/core/src/main/scala/org/apache/spark/SparkConf.scala#L312]. What is the right solution? Only add SPARK_CLASSPATH if it's non-empty? I can send a pull request for that I think. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
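The fix Daniel proposes (only add SPARK_CLASSPATH when it's non-empty) boils down to joining classpath fragments while skipping empty ones. The actual script is bash; the sketch below just illustrates the joining logic and why the naive concatenation reproduces the bug (the jar names are made up):

```python
def build_classpath(*parts):
    """Join classpath fragments, skipping empty ones, so no empty
    ':'-delimited entry (which the JVM treats as the current working
    directory) sneaks onto the classpath."""
    return ":".join(p for p in parts if p)

# naive concatenation reproduces the bug: a leading ':' means an empty
# first entry, which the JVM interprets as the current directory
spark_classpath = ""  # the common case: SPARK_CLASSPATH was never set
submit_classpath = "myapp.jar:spark-assembly.jar"
naive = spark_classpath + ":" + submit_classpath
# naive.split(":")[0] is "" -> working directory ends up on the classpath

fixed = build_classpath(spark_classpath, submit_classpath)
# fixed has no empty entry: "myapp.jar:spark-assembly.jar"
```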
[jira] [Reopened] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-2892: -- OK. The issues may have a common cause but that can be deferred until the other JIRA is resolved. If it turns out to resolve this, great. Socket Receiver does not stop when streaming context is stopped --- Key: SPARK-2892 URL: https://issues.apache.org/jira/browse/SPARK-2892 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Running NetworkWordCount with {quote} ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); Thread.sleep(6) {quote} gives the following error {quote} 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 10047 ms on localhost (1/1) 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at ReceiverTracker.scala:275) finished in 10.056 s 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at ReceiverTracker.scala:275, took 10.179263 s 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been terminated 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not deregistered, Map(0 - ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after time 1407375433000 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: {quote} -- This message was sent by 
Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3677) pom.xml and SparkBuild.scala are wrong: Scalastyle is never applied to the sources under yarn/common
[ https://issues.apache.org/jira/browse/SPARK-3677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3677. -- Resolution: Not a Problem Target Version/s: (was: 1.1.2, 1.2.1) Obsoleted by the restructuring of the YARN code. See discussion in the PR. pom.xml and SparkBuild.scala are wrong: Scalastyle is never applied to the sources under yarn/common - Key: SPARK-3677 URL: https://issues.apache.org/jira/browse/SPARK-3677 Project: Spark Issue Type: Bug Components: Build, YARN Affects Versions: 1.2.0 Reporter: Kousuke Saruta When we run sbt -Pyarn scalastyle or mvn package, scalastyle is not applied to the sources under yarn/common. I think this is caused by the directory structure.
[jira] [Resolved] (SPARK-3918) Forget Unpersist in RandomForest.scala(train Method)
[ https://issues.apache.org/jira/browse/SPARK-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3918. -- Resolution: Fixed Great, looks like this was in fact fixed for 1.2 then. Forget Unpersist in RandomForest.scala(train Method) Key: SPARK-3918 URL: https://issues.apache.org/jira/browse/SPARK-3918 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Environment: All Reporter: junlong Assignee: Joseph K. Bradley Labels: decisiontree, train, unpersist Fix For: 1.2.0 Original Estimate: 10m Remaining Estimate: 10m In version 1.1.0 DecisionTree.scala, train Method, treeInput has been persisted in Memory, but without unpersist. It caused heavy DISK usage. In github version(1.2.0 maybe), RandomForest.scala, train Method, baggedInput has been persisted but without unpersisted too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3918) Forget Unpersist in RandomForest.scala(train Method)
[ https://issues.apache.org/jira/browse/SPARK-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3918: - Target Version/s: 1.2.0 (was: 1.1.0) Affects Version/s: (was: 1.2.0) 1.1.0 Fix Version/s: (was: 1.1.0) 1.2.0 Forget Unpersist in RandomForest.scala(train Method) Key: SPARK-3918 URL: https://issues.apache.org/jira/browse/SPARK-3918 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Environment: All Reporter: junlong Assignee: Joseph K. Bradley Labels: decisiontree, train, unpersist Fix For: 1.2.0 Original Estimate: 10m Remaining Estimate: 10m In version 1.1.0 DecisionTree.scala, train Method, treeInput has been persisted in Memory, but without unpersist. It caused heavy DISK usage. In github version(1.2.0 maybe), RandomForest.scala, train Method, baggedInput has been persisted but without unpersisted too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4458) Skip compilation of tests classes when using make-distribution
[ https://issues.apache.org/jira/browse/SPARK-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4458. -- Resolution: Won't Fix Given PR discussion, looks like a WontFix. Skip compilation of tests classes when using make-distribution -- Key: SPARK-4458 URL: https://issues.apache.org/jira/browse/SPARK-4458 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242808#comment-14242808 ] Sean Owen commented on SPARK-4675: -- The lower dimensional space is of course smaller. This makes it faster and more efficient to work with, which is an advantage to be sure at scale. But the real reason is that the original high-dimensional space is extremely sparse. Standard similarity measures are undefined for most pairs, or are 0. It's sort of a symptom of the curse of dimensionality. Find similar products and similar users in MatrixFactorizationModel --- Key: SPARK-4675 URL: https://issues.apache.org/jira/browse/SPARK-4675 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Steven Bourke Priority: Trivial Labels: mllib, recommender Using the latent feature space that is learnt in MatrixFactorizationModel, I have added 2 new functions to find similar products and similar users. A user of the API can for example pass a product ID, and get the closest products. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
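The "similar products" lookup described in SPARK-4675 amounts to ranking items by cosine similarity of their latent factor vectors. A minimal sketch, assuming toy hand-written factors (a real MatrixFactorizationModel would learn these via ALS):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense factor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similar_products(product_factors, query_id, k=2):
    """Rank products by cosine similarity to the query product's latent
    factor vector; the same logic applies to user factors."""
    q = product_factors[query_id]
    scored = [(pid, cosine(q, f))
              for pid, f in product_factors.items() if pid != query_id]
    return sorted(scored, key=lambda t: -t[1])[:k]

# hypothetical 2-dimensional latent factors for three products
factors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}
top = similar_products(factors, 1, k=1)
# product 2 comes out most similar to product 1
```

Working in the dense, low-dimensional factor space sidesteps the sparsity problem Sean describes: every pair of factor vectors has a well-defined similarity.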
[jira] [Commented] (SPARK-4779) PySpark Shuffle Fails Looking for Files that Don't Exist when low on Memory
[ https://issues.apache.org/jira/browse/SPARK-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242831#comment-14242831 ] Ilya Ganelin commented on SPARK-4779: - I've seen this issue on Scala as well. This happens during large shuffles when an intermediate stage of the shuffle map/reduce fails due to memory constraints. I have not received any suggestions on how to resolve it short of increasing available memory and shuffling smaller sizes. PySpark Shuffle Fails Looking for Files that Don't Exist when low on Memory --- Key: SPARK-4779 URL: https://issues.apache.org/jira/browse/SPARK-4779 Project: Spark Issue Type: Bug Components: PySpark, Shuffle Affects Versions: 1.1.0 Environment: ec2 launched cluster with scripts 6 Nodes c3.2xlarge Reporter: Brad Willard When Spark is tight on memory it starts saying files don't exist during shuffle causing tasks to fail and be rebuilt destroying performance. The same code works flawlessly with smaller datasets with less memory pressure I assume. 
14/12/06 18:39:37 WARN scheduler.TaskSetManager: Lost task 292.0 in stage 3.0 (TID 1099, ip-10-13-192-209.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /root/spark/python/pyspark/worker.py, line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File /root/spark/python/pyspark/serializers.py, line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /root/spark/python/pyspark/serializers.py, line 127, in dump_stream for obj in iterator: File /root/spark/python/pyspark/serializers.py, line 185, in _batched for item in iterator: File /root/spark/python/pyspark/shuffle.py, line 370, in _external_items self.mergeCombiners(self.serializer.load_stream(open(p)), IOError: [Errno 2] No such file or directory: '/mnt/spark/spark-local-20141206182702-8748/python/16070/66618000/1/18' org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124) org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:91) org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:87) org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) scala.collection.Iterator$$anon$12.next(Iterator.scala:357) org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) scala.collection.Iterator$$anon$12.next(Iterator.scala:357) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:335) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) 
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs
[ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242913#comment-14242913 ] Ilya Ganelin commented on SPARK-3533: - I am looking into a solution for this. Add saveAsTextFileByKey() method to RDDs Key: SPARK-3533 URL: https://issues.apache.org/jira/browse/SPARK-3533 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Nicholas Chammas Users often have a single RDD of key-value pairs that they want to save to multiple locations based on the keys. For example, say I have an RDD like this: {code} a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0]) a.collect() [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')] a.keys().distinct().collect() ['B', 'F', 'N'] {code} Now I want to write the RDD out to different paths depending on the keys, so that I have one output directory per distinct key. Each output directory could potentially have multiple {{part-}} files, one per RDD partition. So the output would look something like: {code} /path/prefix/B [/part-1, /part-2, etc] /path/prefix/F [/part-1, /part-2, etc] /path/prefix/N [/part-1, /part-2, etc] {code} Though it may be possible to do this with some combination of {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the {{MultipleTextOutputFormat}} output format class, it isn't straightforward. It's not clear if it's even possible at all in PySpark. Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that makes it easy to save RDDs out to multiple locations at once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
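The requested behavior (one output directory per distinct key, each holding part files) can be sketched as a single-process Python function. In real Spark this would go through {{MultipleTextOutputFormat}} and write one part file per partition; the sketch below collapses everything to one part file per key:

```python
import os
import tempfile
from collections import defaultdict

def save_as_text_file_by_key(records, prefix):
    """Local sketch of the proposed saveAsTextFileByKey(): route each
    (key, value) record into <prefix>/<key>/part-00000."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[key].append(value)
    for key, values in buckets.items():
        d = os.path.join(prefix, str(key))
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, "part-00000"), "w") as f:
            f.write("\n".join(values))

out = tempfile.mkdtemp()
data = [("N", "Nick"), ("N", "Nancy"), ("B", "Bob")]
save_as_text_file_by_key(data, out)
# produces <out>/N/part-00000 and <out>/B/part-00000
```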
[jira] [Commented] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242915#comment-14242915 ] Apache Spark commented on SPARK-4728: - User 'rnowling' has created a pull request for this issue: https://github.com/apache/spark/pull/3680 Add exponential, log normal, and gamma distributions to data generator to MLlib --- Key: SPARK-4728 URL: https://issues.apache.org/jira/browse/SPARK-4728 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: RJ Nowling Priority: Minor MLlib supports sampling from normal, uniform, and Poisson distributions. I'd like to add support for sampling from exponential, gamma, and log normal distributions, using the features of math3 like the other generators. Please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
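The three distributions the JIRA asks for are all available in the Python standard library, which makes the expected behavior easy to sanity-check. This uses the stdlib {{random}} samplers as a stand-in for the commons-math3 samplers MLlib would actually use:

```python
import random

random.seed(42)  # deterministic for the sanity checks below

n = 10_000
# exponential with rate lambda = 2 (mean 1/lambda = 0.5)
exponential = [random.expovariate(2.0) for _ in range(n)]
# log normal: exp of N(mu=0, sigma=1), so always positive
log_normal = [random.lognormvariate(0.0, 1.0) for _ in range(n)]
# gamma with shape 9 and scale 0.5 (mean = shape * scale = 4.5)
gamma = [random.gammavariate(9.0, 0.5) for _ in range(n)]

mean_exp = sum(exponential) / n  # should land near 0.5
mean_gam = sum(gamma) / n        # should land near 4.5
```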
[jira] [Commented] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242924#comment-14242924 ] RJ Nowling commented on SPARK-4728: --- I posted a PR for this issue: https://github.com/apache/spark/pull/3680 Add exponential, log normal, and gamma distributions to data generator to MLlib --- Key: SPARK-4728 URL: https://issues.apache.org/jira/browse/SPARK-4728 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: RJ Nowling Priority: Minor MLlib supports sampling from normal, uniform, and Poisson distributions. I'd like to add support for sampling from exponential, gamma, and log normal distributions, using the features of math3 like the other generators. Please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Nowling updated SPARK-4728: -- Comment: was deleted (was: I posted a PR for this issue: https://github.com/apache/spark/pull/3680) Add exponential, log normal, and gamma distributions to data generator to MLlib --- Key: SPARK-4728 URL: https://issues.apache.org/jira/browse/SPARK-4728 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: RJ Nowling Priority: Minor MLlib supports sampling from normal, uniform, and Poisson distributions. I'd like to add support for sampling from exponential, gamma, and log normal distributions, using the features of math3 like the other generators. Please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243026#comment-14243026 ] Debasish Das commented on SPARK-4675: - Is there a metric like MAP / AUC kind of measure that can help us validate similarUsers and similarProducts ? Right now if I run column similarities with sparse vector on matrix factorization datasets for product similarities, it will assume all unvisited entries (which should be ?) as 0 and compute column similarities for...If the sparse vector has ? in place of 0 then basically all similarity calculation is incorrect...so in that sense it makes more sense to compute the similarities on the matrix factors... But then we are back to map-reduce calculation of rowSimilarities. Find similar products and similar users in MatrixFactorizationModel --- Key: SPARK-4675 URL: https://issues.apache.org/jira/browse/SPARK-4675 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Steven Bourke Priority: Trivial Labels: mllib, recommender Using the latent feature space that is learnt in MatrixFactorizationModel, I have added 2 new functions to find similar products and similar users. A user of the API can for example pass a product ID, and get the closest products. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4455) Exclude dependency on hbase-annotations module
[ https://issues.apache.org/jira/browse/SPARK-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243044#comment-14243044 ] Sean Owen commented on SPARK-4455: -- FWIW this change also unfortunately causes compiler warnings, so hopefully it can be undone some day. But it's the right thing for now. {code} [warn] Class org.apache.hadoop.hbase.classification.InterfaceAudience not found - continuing with a stub. [warn] Class org.apache.hadoop.hbase.classification.InterfaceStability not found - continuing with a stub. [warn] Class org.apache.hadoop.hbase.classification.InterfaceAudience not found - continuing with a stub. [warn] Class org.apache.hadoop.hbase.classification.InterfaceStability not found - continuing with a stub. [warn] Class org.apache.hadoop.hbase.classification.InterfaceAudience not found - continuing with a stub. ... {code} Exclude dependency on hbase-annotations module -- Key: SPARK-4455 URL: https://issues.apache.org/jira/browse/SPARK-4455 Project: Spark Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Fix For: 1.2.0 As Patrick mentioned in the thread 'Has anyone else observed this build break?' : The error I've seen is this when building the examples project: {code} spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not find artifact jdk.tools:jdk.tools:jar:1.7 at specified path /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar {code} The reason for this error is that hbase-annotations is using a system scoped dependency in their hbase-annotations pom, and this doesn't work with certain JDK layouts such as that provided on Mac OS: http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom hbase-annotations module is transitively brought in through other HBase modules, we should exclude it from related modules. 
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243048#comment-14243048 ] Debasish Das commented on SPARK-4823: - [~srowen] did you implement map-reduce row similarities for user factors? What's the algorithm that you used? Any pointers will be really helpful... rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12.
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243055#comment-14243055 ] Sean Owen commented on SPARK-4823: -- I don't think MapReduce matters here. You can compute pairs of similarities with any framework, or try to do it on the fly. It's no different from column similarities, right? I don't think there's anything more to it than applying a similarity metric to all pairs of vectors. I think the JIRA is about exposing a method just for API convenience, not because it's conceptually different.
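The "similarity metric applied to all pairs of vectors" idea from the comment above can be sketched in plain Scala (this is purely illustrative and not MLlib's API; the names are hypothetical, and a real rowSimilarities would have to avoid materializing all ~n^2/2 pairs):

```scala
// Brute-force all-pairs cosine similarity over row vectors; a minimal
// sketch of what a rowSimilarities method would compute, not how MLlib
// would (or should) implement it at scale.
object RowSimilaritySketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  // All distinct pairs (i, j) with i < j, with their cosine similarity.
  def rowSimilarities(rows: Array[Array[Double]]): Seq[((Int, Int), Double)] =
    for {
      i <- rows.indices
      j <- (i + 1) until rows.length
    } yield ((i, j), cosine(rows(i), rows(j)))
}
```

This quadratic enumeration is exactly what the issue description warns against for > 10^6 rows, which is why the JIRA asks for something smarter than brute force.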
[jira] [Commented] (SPARK-1412) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243072#comment-14243072 ] Sean Owen commented on SPARK-1412: -- The PRs for SPARK-2253 and SPARK-1412 were abandoned. Are both a WontFix? Disable partial aggregation automatically when reduction factor is low -- Key: SPARK-1412 URL: https://issues.apache.org/jira/browse/SPARK-1412 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Michael Armbrust Priority: Minor Fix For: 1.2.0 Once we have seen enough rows in partial aggregation without observing any reduction, the aggregate operator should just turn off partial aggregation.
[jira] [Commented] (SPARK-4831) Current directory always on classpath with spark-submit
[ https://issues.apache.org/jira/browse/SPARK-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243074#comment-14243074 ] Daniel Darabos commented on SPARK-4831: --- bq. Is it perhaps finding an exploded directory of classes? Yes, that is exactly the situation. One instance of the file is in a jar, another is just there (free-floating) in the directory. It is a configuration file. (Actually it's in a conf directory, but Play looks for both play.plugins and conf/play.plugins with getResources in the classpath. So it finds both the copy inside the generated jar and the one in the conf directory of the project. We can of course work around this in numerous ways.) I think there is no reason for spark-submit to add an empty entry to the classpath. It will just lead to accidents like ours. If the user wants to add an empty entry, they can easily do so. I've sent https://github.com/apache/spark/pull/3678 as a possible fix. Thanks for investigating! Current directory always on classpath with spark-submit --- Key: SPARK-4831 URL: https://issues.apache.org/jira/browse/SPARK-4831 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.1, 1.2.0 Reporter: Daniel Darabos Priority: Minor We had a situation where we were launching an application with spark-submit, and a file (play.plugins) was on the classpath twice, causing problems (trying to register plugins twice). Upon investigating how it got on the classpath twice, we found that it was present in one of our jars, and also in the current working directory. But the one in the current working directory should not be on the classpath. We never asked spark-submit to put the current directory on the classpath.
I think this is caused by a line in [compute-classpath.sh|https://github.com/apache/spark/blob/v1.2.0-rc2/bin/compute-classpath.sh#L28]: {code} CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH {code} Now if SPARK_CLASSPATH is empty, the empty string is added to the classpath, which means the current working directory. We tried setting SPARK_CLASSPATH to a bogus value, but that is [not allowed|https://github.com/apache/spark/blob/v1.2.0-rc2/core/src/main/scala/org/apache/spark/SparkConf.scala#L312]. What is the right solution? Only add SPARK_CLASSPATH if it's non-empty? I can send a pull request for that I think. Thanks!
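The fix proposed above (only include SPARK_CLASSPATH when it is non-empty) amounts to joining classpath fragments while skipping empty ones, so that no implicit current-directory entry sneaks in. A minimal sketch of that idea (illustrative Scala, not the actual shell change in compute-classpath.sh; the function name is hypothetical):

```scala
// Join classpath fragments, dropping empty ones so that a string like
// ":" or "a::b" -- which the JVM treats as a current-directory entry --
// can never be produced.
def joinClasspath(entries: Seq[String]): String =
  entries.filter(_.nonEmpty).mkString(":")
```

With this, an unset SPARK_CLASSPATH simply contributes nothing, instead of contributing an empty entry that resolves to the working directory.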
[jira] [Commented] (SPARK-1412) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243077#comment-14243077 ] Reynold Xin commented on SPARK-1412: I think we should still do it - it's just that the current AppendOnlyMap isn't really built for it. We will probably revisit this in the future.
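The heuristic the ticket describes, watch the first batch of input rows and give up on partial aggregation if the map is not shrinking the data, can be sketched like this (names and thresholds are hypothetical, and this is not Spark's Aggregator or AppendOnlyMap):

```scala
import scala.collection.mutable

// After `sampleSize` input rows, compare the number of distinct keys to
// the number of rows seen; if the reduction factor is too low, stop
// merging and pass rows through unaggregated to save memory.
class AdaptivePartialAgg[K, V](merge: (V, V) => V,
                               sampleSize: Int = 1000,
                               minReduction: Double = 0.5) {
  private val map = mutable.Map.empty[K, V]
  private var rowsSeen = 0
  private var passThrough = false
  private val skipped = mutable.ArrayBuffer.empty[(K, V)]

  def insert(k: K, v: V): Unit = {
    rowsSeen += 1
    if (passThrough) { skipped += ((k, v)); return }
    map.get(k) match {
      case Some(old) => map(k) = merge(old, v)
      case None      => map(k) = v
    }
    // Low reduction factor observed: partial aggregation is not helping.
    if (rowsSeen == sampleSize && map.size > minReduction * rowsSeen)
      passThrough = true
  }

  def output: Seq[(K, V)] = map.toSeq ++ skipped
}
```

For a high-cardinality aggregation the map stops growing after the sample window and rows flow straight to the shuffle; for a low-cardinality one the map keeps reducing as before.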
[jira] [Updated] (SPARK-1412) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1412: --- Fix Version/s: (was: 1.2.0)
[jira] [Updated] (SPARK-1412) [SQL] Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1412: --- Summary: [SQL] Disable partial aggregation automatically when reduction factor is low (was: Disable partial aggregation automatically when reduction factor is low)
[jira] [Resolved] (SPARK-1627) Support external aggregation in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-1627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1627. -- Resolution: Won't Fix Fix Version/s: (was: 1.2.0) The discussion in https://github.com/apache/spark/pull/867 suggests this was subsumed by SPARK-2873. Support external aggregation in Spark SQL - Key: SPARK-1627 URL: https://issues.apache.org/jira/browse/SPARK-1627 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Chen Chao The Spark SQL Aggregator does not yet support external sorting, which is extremely important when data is much larger than memory.
[jira] [Updated] (SPARK-2253) [Core] Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2253: --- Summary: [Core] Disable partial aggregation automatically when reduction factor is low (was: Disable partial aggregation automatically when reduction factor is low) [Core] Disable partial aggregation automatically when reduction factor is low - Key: SPARK-2253 URL: https://issues.apache.org/jira/browse/SPARK-2253 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Fix For: 1.3.0 Once we have seen enough rows in partial aggregation without observing any reduction, Aggregator should just turn off partial aggregation. This reduces memory usage for high cardinality aggregations. This one is for Spark core. There is another ticket tracking this for SQL.
[jira] [Resolved] (SPARK-1581) Allow One Flume Avro RPC Server for Each Worker rather than Just One Worker
[ https://issues.apache.org/jira/browse/SPARK-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1581. -- Resolution: Won't Fix No follow-up from OP explaining the change, and so the PR was closed already. Allow One Flume Avro RPC Server for Each Worker rather than Just One Worker --- Key: SPARK-1581 URL: https://issues.apache.org/jira/browse/SPARK-1581 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Christophe Clapp Priority: Minor Labels: Flume -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1559) Add conf dir to CLASSPATH in compute-classpath.sh dependent on whether SPARK_CONF_DIR is set
[ https://issues.apache.org/jira/browse/SPARK-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1559. -- Resolution: Duplicate The PR discussion suggests it was duplicated by the PR for SPARK-2058. Add conf dir to CLASSPATH in compute-classpath.sh dependent on whether SPARK_CONF_DIR is set Key: SPARK-1559 URL: https://issues.apache.org/jira/browse/SPARK-1559 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Albert Chu Priority: Minor Attachments: SPARK-1559.patch bin/load-spark-env.sh loads spark-env.sh from SPARK_CONF_DIR if it is set, or from $parent_dir/conf if it is not set. However, compute-classpath.sh adds $FWDIR/conf to the CLASSPATH regardless of whether SPARK_CONF_DIR is set. Attached patch fixes this. Pull request on github will also be sent.
[jira] [Commented] (SPARK-1532) provide option for more restrictive firewall rule in ec2/spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243086#comment-14243086 ] Sean Owen commented on SPARK-1532: -- [~foundart] Is this abandoned? your second PR was ready to merge but needed rebasing, then got closed. Looks like it was a good change that can be revived. provide option for more restrictive firewall rule in ec2/spark_ec2.py - Key: SPARK-1532 URL: https://issues.apache.org/jira/browse/SPARK-1532 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 0.9.0 Reporter: Art Peel Priority: Minor When ec2/spark_ec2.py sets up firewall rules for various ports, it uses an extremely lenient hard-coded value for allowed IP addresses: '0.0.0.0/0' It would be very useful for deployments to allow specifying the allowed IP addresses as a command-line option to ec2/spark_ec2.py. This new configuration parameter should have as its default the current hard-coded value, '0.0.0.0/0', so the functionality of ec2/spark_ec2.py will change only for those users who specify the new option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1771) CoarseGrainedSchedulerBackend is not resilient to Akka restarts
[ https://issues.apache.org/jira/browse/SPARK-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1771. -- Resolution: Won't Fix The PR says this was abandoned in favor of SPARK-4004 CoarseGrainedSchedulerBackend is not resilient to Akka restarts --- Key: SPARK-1771 URL: https://issues.apache.org/jira/browse/SPARK-1771 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson The exception reported in SPARK-1769 was propagated through the CoarseGrainedSchedulerBackend, and caused an Actor restart of the DriverActor. Unfortunately, this actor does not seem to have been written with Akka restartability in mind. For instance, the new DriverActor has lost all state about the prior Executors without cleanly disconnecting them. This means that the driver actually has executors attached to it, but doesn't think it does, which leads to mayhem of various sorts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1700) PythonRDD leaks socket descriptors during cancellation
[ https://issues.apache.org/jira/browse/SPARK-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1700. -- Resolution: Fixed Fix Version/s: 1.0.0 The PR was https://github.com/apache/spark/pull/623 and says it was merged in 1.0. PythonRDD leaks socket descriptors during cancellation -- Key: SPARK-1700 URL: https://issues.apache.org/jira/browse/SPARK-1700 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 1.0.0 Sockets from Spark to Python workers are not cleaned up over the duration of a job, causing the total number of opened file descriptors to grow to around the number of partitions in the job. Usually these go away if the job is successful, but in the case of cancellation (and possibly exceptions, though I haven't investigated), the socket file descriptors remain indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1888) enhance MEMORY_AND_DISK mode by dropping blocks in parallel
[ https://issues.apache.org/jira/browse/SPARK-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1888. -- Resolution: Duplicate From the end of the PR discussion, it looks like this was continued in SPARK-3000? The issue looks the same and the change looks like it touches the same files. enhance MEMORY_AND_DISK mode by dropping blocks in parallel --- Key: SPARK-1888 URL: https://issues.apache.org/jira/browse/SPARK-1888 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Wenchen Fan Assignee: Wenchen Fan Sometimes MEMORY_AND_DISK mode is slower than DISK_ONLY mode because of the lock on IO operations(dropping blocks in memory store). As the TODO says, the solution is: only synchronize the selecting of to-be-dropped blocks and do the dropping in parallel. I have a quick fix in my PR: https://github.com/apache/spark/pull/791#issuecomment-43567924 It's fragile currently but I'm working on it to make it more robust. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests
[ https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243101#comment-14243101 ] Ryan Williams commented on SPARK-4746: -- I don't have any experience with test tags, but this approach sounds good to me [~imranr]! integration tests should be separated from faster unit tests Key: SPARK-4746 URL: https://issues.apache.org/jira/browse/SPARK-4746 Project: Spark Issue Type: Bug Reporter: Imran Rashid Priority: Trivial Currently there isn't a good way for a developer to skip the longer integration tests. This can slow down local development. See http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html One option is to use scalatest's notion of test tags to tag all integration tests, so they could easily be skipped -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1880) Eliminate unnecessary job executions.
[ https://issues.apache.org/jira/browse/SPARK-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243103#comment-14243103 ] Sean Owen commented on SPARK-1880: -- Is this now a WontFix? The PR refers to this being subsumed by hash join changes related to SPARK-1800. Eliminate unnecessary job executions. - Key: SPARK-1880 URL: https://issues.apache.org/jira/browse/SPARK-1880 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin There are unnecessary job executions in {{BroadcastNestedLoopJoin}}. For an {{Inner}} or {{LeftOuter}} join, the preparation of {{rightOuterMatches}} for a {{RightOuter}} or {{FullOuter}} join is not necessary. And for {{RightOuter}} or {{FullOuter}}, it should use {{fold}} rather than {{count}} followed by {{reduce}}.
[jira] [Closed] (SPARK-1880) Eliminate unnecessary job executions.
[ https://issues.apache.org/jira/browse/SPARK-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-1880. -- Resolution: Won't Fix
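The SPARK-1880 description's point about {{fold}} versus {{count}}-then-{{reduce}} can be shown with plain Scala collections standing in for RDDs (a sketch, not the BroadcastNestedLoopJoin code): {{reduce}} throws on an empty collection, so guarding it needs a separate emptiness check, which in Spark means a whole extra job over the data, while {{fold}} handles the empty case with its zero element in a single pass.

```scala
// Two ways to sum, illustrating why fold is preferable: reduce needs a
// guarding pass (a count, i.e. an extra Spark job) to be safe on empty
// input, while fold is safe on empty input by construction.
def sumWithCountThenReduce(xs: Seq[Int]): Int =
  if (xs.isEmpty) 0            // the extra "count" pass
  else xs.reduce(_ + _)        // the actual aggregation pass

def sumWithFold(xs: Seq[Int]): Int =
  xs.fold(0)(_ + _)            // one pass, zero element covers empty input
```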
[jira] [Commented] (SPARK-2016) rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large
[ https://issues.apache.org/jira/browse/SPARK-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243109#comment-14243109 ] Sean Owen commented on SPARK-2016: -- Is this and SPARK-2017 now subsumed by SPARK-3644? The PRs for this and SPARK-2017 are closed, and discussion suggested it was to be continued in SPARK-3644. rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large Key: SPARK-2016 URL: https://issues.apache.org/jira/browse/SPARK-2016 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Labels: starter Try running {code} sc.parallelize(1 to 100, 100).cache().count() {code} and open the storage UI for this RDD. It takes forever to load the page. When the number of partitions is very large, I think there are a few alternatives: 0. Only show the top 1000. 1. Pagination 2. Instead of grouping by RDD blocks, group by executors
[jira] [Resolved] (SPARK-2227) Support dfs command
[ https://issues.apache.org/jira/browse/SPARK-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2227. -- Resolution: Fixed Fix Version/s: 1.1.0 Looks like this was in fact merged in https://github.com/apache/spark/commit/51c8168377a89d20d0b2d7b9a28af58593a0fe0c Support dfs command - Key: SPARK-2227 URL: https://issues.apache.org/jira/browse/SPARK-2227 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Minor Fix For: 1.1.0 Potentially just delegate to Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2201) Improve FlumeInputDStream's stability and make it scalable
[ https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2201. -- Resolution: Won't Fix I hope I understood this right, but the PR discussion seemed to end with suggesting that this would not go into Spark, but maybe a contrib repo, and that it was partly already implemented by other changes. Improve FlumeInputDStream's stability and make it scalable -- Key: SPARK-2201 URL: https://issues.apache.org/jira/browse/SPARK-2201 Project: Spark Issue Type: Improvement Components: Streaming Reporter: sunsc Currently: FlumeUtils.createStream(ssc, localhost, port); This means that only one Flume receiver can work with FlumeInputDStream, so the solution is not scalable. I use ZooKeeper to solve this problem: Spark Flume receivers register themselves under a ZooKeeper path when started, and a Flume agent gets the physical hosts and pushes events to them. Some work needs to be done here: 1. Receivers create temporary nodes in ZooKeeper, and listeners just watch those nodes. 2. When Spark FlumeReceivers start, they acquire a physical host (localhost's IP and an idle port) and register themselves with ZooKeeper. 3. A new Flume sink: in its appendEvents method, it gets the physical hosts and pushes data to them in a round-robin manner.
[jira] [Resolved] (SPARK-2193) Improve tasks' preferred locality by sorting tasks into a partial ordering
[ https://issues.apache.org/jira/browse/SPARK-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2193. -- Resolution: Won't Fix Last word appears to be that this was obviated by SPARK-2294 and https://github.com/apache/spark/pull/1313 Improve tasks' preferred locality by sorting tasks into a partial ordering --- Key: SPARK-2193 URL: https://issues.apache.org/jira/browse/SPARK-2193 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Zhihui Attachments: Improve Tasks Preferred Locality.pptx Now, the last executor(s) may not get their preferred task(s), although these tasks have been built into the pendingTasksForHosts map. Because executors pick up tasks sequentially, their preferred task(s) may be picked up by other executors. This can be eliminated by sorting tasks into a partial ordering: an executor picks up tasks by the host order of each task's preferredLocation, that is, it first picks up all tasks for which task.preferredLocations.1 = executor.hostName, then secondly…
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243133#comment-14243133 ] Reza Zadeh commented on SPARK-4823: --- Given that we're talking about RowMatrices, computing rowSimilarities the same way as columnSimilarities would require transposing the matrix, which is dangerous when the original matrix has many rows. RowMatrix assumes a single row should fit in memory on a single machine, but this might not happen after transposing a RowMatrix.
[jira] [Resolved] (SPARK-2127) Use application specific folders to dump metrics via CsvSink
[ https://issues.apache.org/jira/browse/SPARK-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2127. -- Resolution: Duplicate The PR that closes SPARK-3377 also closed the PR for this JIRA, and it looks like this is a subset of SPARK-3377. Use application specific folders to dump metrics via CsvSink Key: SPARK-2127 URL: https://issues.apache.org/jira/browse/SPARK-2127 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Rahul Singhal Priority: Minor Currently when using the CsvSink, all applications' csv metrics are dumped in the root folder (configured via *.sink.csv.directory in metrics.properties). Also, some files that have common names (e.g. jvm.PS-MarkSweep.count.csv) are reused, and if one is running the same application multiple times, the metrics get appended to previously existing files. This makes it harder to parse these files and extract the information that one might be looking for. I suggest that a unique folder is created every time an application is run and used to dump the metrics from that particular run only. This unique folder could be created similar to the one that is currently created for logging application events (e.g. spark-pi-1402484928439).
[jira] [Commented] (SPARK-2016) rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large
[ https://issues.apache.org/jira/browse/SPARK-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243147#comment-14243147 ] Reynold Xin commented on SPARK-2016: cc [~andrewor14] can you comment on this?
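Two of the alternatives listed in the SPARK-2016 description, pagination and grouping by executor, can be sketched over a hypothetical per-block row type (illustrative only; not the actual storage UI code or its types):

```scala
// One table row per cached RDD block, as the storage page renders today.
case class BlockRow(executorId: String, sizeBytes: Long)

// Alternative 1: render only one page of block rows at a time.
def page(blocks: Seq[BlockRow], pageNum: Int, pageSize: Int): Seq[BlockRow] =
  blocks.slice(pageNum * pageSize, (pageNum + 1) * pageSize)

// Alternative 2: collapse blocks to one row per executor
// (block count, total cached bytes), bounding the table by cluster size.
def groupByExecutor(blocks: Seq[BlockRow]): Map[String, (Int, Long)] =
  blocks.groupBy(_.executorId)
    .map { case (exec, bs) => (exec, (bs.size, bs.map(_.sizeBytes).sum)) }
```

Either way, the page no longer renders a row per partition, which is what makes it unresponsive for very large partition counts.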
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243149#comment-14243149 ] Debasish Das commented on SPARK-2426: - [~mengxr] as per our discussion, QuadraticMinimizer and NNLS are both added to breeze and updated with breeze DenseMatrix and DenseVector...Inside breeze I did some interesting comparisons and that motivated me to port NNLS to breeze as well...I added all the testcases for QuadraticMinimizer and NNLS as well based on my experiments with MovieLens dataset... Here is the PR: https://github.com/scalanlp/breeze/pull/321 To run the Quadratic programming variants in breeze: runMain breeze.optimize.quadratic.QuadraticMinimizer 100 1 0.1 0.99 regParam = 0.1, beta = 0.99 is Elastic Net parameter It will randomly generate quadratic problems with 100 variables, 1 equality constraint and lower/upper bounds. This format is similar to PDCO QP generator (please look into my Matlab examples) 0.5x'Hx + c'x s.t Ax = B, lb <= x <= ub 1. Unconstrained minimization: breeze luSolve, cg and qp (dposv added to breeze through this PR) Minimize 0.5x'Hx + c'x ||qp - lu|| norm 4.312577233496585E-10 max-norm 1.3842793578078272E-10 ||cg - lu|| norm 4.167925029822007E-7 max-norm 1.0053204402282745E-7 dim 100 lu 86.007 qp 41.56 cg 102.627 ||qp - lu|| norm 4.267891623199082E-8 max-norm 6.681460718027665E-9 ||cg - lu|| norm 1.94497623480055E-7 max-norm 2.6288773824489908E-8 dim 500 lu 169.993 qp 78.001 cg 443.044 qp is faster than cg for smaller dimensions as expected. I also tried unconstrained BFGS but the results were not good. We are looking into it. 2.
Elastic Net formulation: 0.5 x'Hx + c'x + (1-beta)*L2(x) + beta*regParam*L1(x) beta = 0.99 Strong L1 regParam=0.1 ||owlqn - sparseqp|| norm 0.1653200701235298 inf-norm 0.051855911945906996 sparseQp 61.948 ms iters 227 owlqn 928.11 ms beta = 0.5 average L1 regParam=0.1 ||owlqn - sparseqp|| norm 0.15823773098501168 inf-norm 0.035153837685728107 sparseQp 69.934 ms iters 353 owlqn 882.104 ms beta = 0.01 mostly BFGS regParam=0.1 ||owlqn - sparseqp|| norm 0.17950035092790165 inf-norm 0.04718697692014828 sparseQp 80.411 ms iters 580 owlqn 988.313 ms ADMM based proximal formulation is faster for smaller dimension. Even as I scale dimension, I notice similar behavior: owlqn is taking longer to converge and the results are not the same. Look for example at the dim = 500 case: ||owlqn - sparseqp|| norm 10.946326189397649 inf-norm 1.412726586317294 sparseQp 830.593 ms iters 2417 owlqn 19848.932 ms I validated ADMM through Matlab scripts so there is something funky going on in OWLQN. 3. NNLS formulation: 0.5 x'Hx + c'x s.t x >= 0 Here I compared the ADMM based proximal formulation with the CG based projected gradient in NNLS. NNLS converges much more nicely, but its convergence criterion does not look the same as breeze CG's, though they should be the same. For now I ported it to breeze and we can call NNLS for x >= 0 and QuadraticMinimizer for other formulations dim = 100 posQp 16.367 ms iters 284 nnls 8.854 ms iters 107 dim = 500 posQp 303.184 ms iters 950 nnls 183.543 ms iters 517 NNLS on average looks 2X faster! 4. Bounds formulation: 0.5x'Hx + c'x s.t lb <= x <= ub Validated through Matlab scripts above. Here are the runtime numbers: dim = 100 boundsQp 15.654 ms iters 284 converged true dim = 500 boundsQp 311.613 ms iters 950 converged true 5. Equality and positivity: 0.5 x'Hx + c'x s.t \sum_i x_i = 1, x_i >= 0 Validated through Matlab scripts above.
Here are the runtime numbers:
dim = 100: Qp Equality 13.64 ms iters 184 converged true
dim = 500: Qp Equality 278.525 ms iters 890 converged true
With this change all copyrights are moved to breeze. Once it merges, I will update the Spark PR. With this change we can move the ALS code to breeze DenseMatrix and DenseVector as well. My focus next will be to get a truncated Newton method running for convex cost, since convex cost is required for PLSA, SVM and neural net formulations. I am still puzzled why BFGS/OWLQN is not working well for the unconstrained case / L1 optimization. If TRON works well for the unconstrained case, that's what I will use for NonlinearMinimizer. I am looking more into it. Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime
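To make the bounds formulation above (minimize 0.5x'Hx + c'x s.t. lb <= x <= ub) concrete, here is a minimal projected-gradient sketch in plain Scala. This is an illustration of the technique only, assuming a small dense H; it is not the breeze QuadraticMinimizer API, and all names are made up.

```scala
// Hypothetical sketch of the bounds-constrained QP: minimize 0.5*x'Hx + c'x
// subject to lb <= x <= ub, via projected gradient descent.
object BoundedQpSketch {
  // Repeats the step x <- clip(x - step * (Hx + c), lb, ub).
  def solve(h: Array[Array[Double]], c: Array[Double],
            lb: Array[Double], ub: Array[Double],
            step: Double, iters: Int): Array[Double] = {
    val n = c.length
    var x = Array.fill(n)(0.0)
    for (_ <- 0 until iters) {
      // Gradient of the quadratic objective: Hx + c.
      val grad = Array.tabulate(n) { i =>
        h(i).zip(x).map { case (hij, xj) => hij * xj }.sum + c(i)
      }
      // Gradient step followed by projection onto the box [lb, ub].
      x = Array.tabulate(n) { i =>
        math.min(ub(i), math.max(lb(i), x(i) - step * grad(i)))
      }
    }
    x
  }
}
```

For H = diag(2, 2) and c = (-2, -4), the unconstrained minimizer is (1, 2); with upper bound 1.5 the projected solution is (1, 1.5), which the sketch converges to.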
[jira] [Resolved] (SPARK-2381) streaming receiver crashed,but seems nothing happened
[ https://issues.apache.org/jira/browse/SPARK-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2381. -- Resolution: Won't Fix The PR comments didn't receive follow-up changes either, so per the comments here, this looks like WontFix. streaming receiver crashed, but seems nothing happened - Key: SPARK-2381 URL: https://issues.apache.org/jira/browse/SPARK-2381 Project: Spark Issue Type: Bug Components: Streaming Reporter: sunsc When we submit a streaming job and the receivers don't start normally, the application should stop itself.
[jira] [Commented] (SPARK-2016) rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large
[ https://issues.apache.org/jira/browse/SPARK-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243153#comment-14243153 ] Andrew Or commented on SPARK-2016: -- This was filed before SPARK-2316 (https://github.com/apache/spark/pull/1679) was fixed. At least on the backend side, this should now be much quicker than before. I don't know if we need some CSS magic to make the frontend side blazing fast too. Is this still reproducible? rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large Key: SPARK-2016 URL: https://issues.apache.org/jira/browse/SPARK-2016 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Labels: starter Try running {code} sc.parallelize(1 to 100, 100).cache().count() {code} and open the storage UI for this RDD. It takes forever to load the page. When the number of partitions is very large, I think there are a few alternatives:
0. Only show the top 1000.
1. Pagination.
2. Instead of grouping by RDD blocks, group by executors.
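The pagination alternative above can be sketched in a few lines of Scala: instead of rendering every block row at once, the page slices a fixed-size window out of the full list. This is an illustrative sketch only; the names are made up and it is not the Spark UI code.

```scala
// Illustrative pagination over a large list of UI rows (e.g. RDD block entries):
// render only one fixed-size page at a time instead of the whole list.
object BlockPager {
  // The rows belonging to page `pageIndex` (0-based).
  def page[A](rows: Seq[A], pageSize: Int, pageIndex: Int): Seq[A] =
    rows.slice(pageIndex * pageSize, (pageIndex + 1) * pageSize)

  // Total number of pages needed for `total` rows (ceiling division).
  def pageCount(total: Int, pageSize: Int): Int =
    (total + pageSize - 1) / pageSize
}
```

With 2500 block rows and a page size of 1000, page 2 holds rows 2001 to 2500 and there are 3 pages in total.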
[jira] [Resolved] (SPARK-2368) Improve io.netty related handlers and clients in network.netty
[ https://issues.apache.org/jira/browse/SPARK-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2368. -- Resolution: Won't Fix PR discussion says that the OP abandoned this change as it was covered by other changes to Netty code. Improve io.netty related handlers and clients in network.netty -- Key: SPARK-2368 URL: https://issues.apache.org/jira/browse/SPARK-2368 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.0.0 Reporter: Binh Nguyen Priority: Minor One issue with current implementation is that FileServerHandler will just write to the channel without checking if the channel buffer is full. This could cause OOM on the receiving end.
[jira] [Resolved] (SPARK-2402) DiskBlockObjectWriter should update the initial position when reusing this object
[ https://issues.apache.org/jira/browse/SPARK-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2402. -- Resolution: Won't Fix There was disagreement about whether to merge this change, but it looks like it was closed and not merged in the end, because the object is not supposed to be reusable. DiskBlockObjectWriter should update the initial position when reusing this object - Key: SPARK-2402 URL: https://issues.apache.org/jira/browse/SPARK-2402 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.0.0 Reporter: Saisai Shao Priority: Minor The initial position of DiskBlockObjectWriter is not updated when closing and reopening, so reusing this object to write a file will lead to errors.
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243207#comment-14243207 ] Debasish Das commented on SPARK-4823: - Even for matrix factorization, userFactors are user x rank. With modest ranks of 50 and users at 10M, I don't think it is possible to transpose the matrix and run columnSimilarities. Doing it on the fly is still O(n*n) complexity-wise, right? rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12.
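The quadratic blow-up being discussed is easy to see in a brute-force sketch: all-pairs row cosine similarity on a local matrix produces n*(n-1)/2 entries, which is exactly what makes the naive approach infeasible at 10^6+ rows. This is a plain-Scala illustration, not the RowMatrix API.

```scala
// Brute-force rowSimilarities sketch for a small local matrix: cosine
// similarity for every pair of rows. Output size is quadratic in row count,
// which is the scalability problem the JIRA is about.
object RowSim {
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na = math.sqrt(a.map(x => x * x).sum)
    val nb = math.sqrt(b.map(x => x * x).sum)
    dot / (na * nb)
  }

  // All pairs (i, j) with i < j: n*(n-1)/2 similarity entries.
  def rowSimilarities(rows: Array[Array[Double]]): Seq[((Int, Int), Double)] =
    for {
      i <- rows.indices
      j <- rows.indices if i < j
    } yield ((i, j), cosine(rows(i), rows(j)))
}
```

For 3 rows this already yields 3 pairs; for 10^6 rows it would yield on the order of 10^12, matching the estimate in the issue.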
[jira] [Resolved] (SPARK-2542) Exit Code Class should be renamed and placed package properly
[ https://issues.apache.org/jira/browse/SPARK-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2542. -- Resolution: Won't Fix PR discussion says this is WontFix. Exit Code Class should be renamed and placed package properly - Key: SPARK-2542 URL: https://issues.apache.org/jira/browse/SPARK-2542 Project: Spark Issue Type: Bug Reporter: Kousuke Saruta org.apache.spark.executor.ExecutorExitCode represents some of the exit codes. The name of the class suggests the set of exit codes of the Executor, but the exit codes defined in the class can be used not only by the Executor but also by, e.g., the Driver. Actually, DiskBlockManager uses ExecutorExitCode.DISK_STORE_FAILED_TO_CREATE_DIR, and DiskBlockManager can be used by the Driver. We should rename and move the class to a new package.
[jira] [Resolved] (SPARK-2671) BlockObjectWriter should create parent directory when the directory doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2671. -- Resolution: Won't Fix This is another case where the PR discussion indicates WontFix. BlockObjectWriter should create parent directory when the directory doesn't exist - Key: SPARK-2671 URL: https://issues.apache.org/jira/browse/SPARK-2671 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Kousuke Saruta Priority: Minor BlockObjectWriter#open expects the parent directory to be present. {code}
override def open(): BlockObjectWriter = {
  fos = new FileOutputStream(file, true)
  ts = new TimeTrackingOutputStream(fos)
  channel = fos.getChannel()
  lastValidPosition = initialPosition
  bs = compressStream(new BufferedOutputStream(ts, bufferSize))
  objOut = serializer.newInstance().serializeStream(bs)
  initialized = true
  this
}
{code} Normally, the parent directory is created by DiskBlockManager#createLocalDirs but, just in case, BlockObjectWriter#open should check for the existence of the directory and create it if it does not exist. I think a recoverable error should be recovered.
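The defensive check being proposed can be sketched as below: ensure the parent directory exists before opening the FileOutputStream, instead of assuming DiskBlockManager already created it. This is a minimal stand-alone sketch, not the BlockObjectWriter code; the object name is made up.

```scala
import java.io.{File, FileOutputStream}

// Sketch of "create parent directory when it doesn't exist" before opening
// a file for append, so a missing directory is recovered rather than fatal.
object SafeOpen {
  def open(file: File): FileOutputStream = {
    val parent = file.getParentFile
    if (parent != null && !parent.exists()) {
      // Recover from the missing-directory case instead of failing the write.
      parent.mkdirs()
    }
    new FileOutputStream(file, true)
  }
}
```

With this guard, opening a file under a not-yet-created directory succeeds, which is the recoverable-error behavior the reporter asks for.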
[jira] [Commented] (SPARK-2604) Spark Application hangs on yarn in edge case scenario of executor memory requirement
[ https://issues.apache.org/jira/browse/SPARK-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243221#comment-14243221 ] Sean Owen commented on SPARK-2604: -- PR comments suggest this was fixed by SPARK-2140? https://github.com/apache/spark/pull/1571 Spark Application hangs on yarn in edge case scenario of executor memory requirement Key: SPARK-2604 URL: https://issues.apache.org/jira/browse/SPARK-2604 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Twinkle Sachdeva In a yarn environment, let's say: MaxAM = maximum allocatable memory; ExecMem = executor's memory. If (MaxAM > ExecMem && (MaxAM - ExecMem) < 384m), then maximum resource validation fails w.r.t. executor memory, the application master gets launched, but when resources are allocated and validated again, they are returned and the application appears to hang. The typical use case is to ask for executor memory = maximum allowed memory as per the yarn config.
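The edge case above can be illustrated with a tiny check: YARN adds a memory overhead (384 MB in the report) on top of the requested executor memory, so a request can pass a naive per-container comparison yet never be satisfiable. The constants and names here are illustrative only, not the actual Spark/YARN validation code.

```scala
// Illustration of the hang: a request that passes "executorMem <= max" can
// still be unallocatable once the fixed overhead is added.
object YarnMemCheck {
  val OverheadMb = 384 // overhead figure quoted in the report; illustrative

  // Satisfiable only if executor memory plus overhead fits in a container.
  def fits(maxAllocatableMb: Int, executorMemMb: Int): Boolean =
    executorMemMb + OverheadMb <= maxAllocatableMb

  // Naive check that ignores the overhead: passes validation, then hangs.
  def naiveCheck(maxAllocatableMb: Int, executorMemMb: Int): Boolean =
    executorMemMb <= maxAllocatableMb
}
```

Asking for executor memory equal to the maximum allocatable memory (the typical use case in the report) passes the naive check but can never be allocated, so the application waits forever.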
[jira] [Commented] (SPARK-2770) Rename spark-ganglia-lgpl to ganglia-lgpl
[ https://issues.apache.org/jira/browse/SPARK-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243244#comment-14243244 ] Sean Owen commented on SPARK-2770: -- Is this still active? The PR was attempted but got messed up, and I don't see another one. Rename spark-ganglia-lgpl to ganglia-lgpl - Key: SPARK-2770 URL: https://issues.apache.org/jira/browse/SPARK-2770 Project: Spark Issue Type: Improvement Components: Build Reporter: Chris Fregly Assignee: Chris Fregly Priority: Minor Fix For: 1.2.0
[jira] [Commented] (SPARK-2750) Add Https support for Web UI
[ https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243250#comment-14243250 ] Sean Owen commented on SPARK-2750: -- Shall this be rolled into SPARK-3883? Both have an open PR. I'm not sure which is the better one to pursue, but they overlap. Add Https support for Web UI Key: SPARK-2750 URL: https://issues.apache.org/jira/browse/SPARK-2750 Project: Spark Issue Type: New Feature Components: Web UI Reporter: WangTaoTheTonic Labels: https, ssl, webui Fix For: 1.0.3 Original Estimate: 96h Remaining Estimate: 96h Now I am trying to add https support for the web UI using the Jetty SSL integration. Below is the plan:
1. The web UI includes the Master UI, Worker UI, HistoryServer UI and Spark UI. Users can switch between https and http by configuring spark.http.policy as a JVM property for each process, with http as the default.
2. The web port of Master and Worker is decided in order of launch arguments, JVM property, system env and default port.
3. Below are some other configuration items:
spark.ssl.server.keystore.location The file or URL of the SSL key store
spark.ssl.server.keystore.password The password for the key store
spark.ssl.server.keystore.keypassword The password (if any) for the specific key within the key store
spark.ssl.server.keystore.type The type of the key store (default JKS)
spark.client.https.need-auth True if SSL needs client authentication
spark.ssl.server.truststore.location The file name or URL of the trust store location
spark.ssl.server.truststore.password The password for the trust store
spark.ssl.server.truststore.type The type of the trust store (default JKS)
Any feedback is welcome!
[jira] [Commented] (SPARK-3247) Improved support for external data sources
[ https://issues.apache.org/jira/browse/SPARK-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243253#comment-14243253 ] Matei Zaharia commented on SPARK-3247: -- For those looking to learn about the interface in more detail, there is a meetup video on it at https://www.youtube.com/watch?v=GQSNJAzxOr8. Improved support for external data sources -- Key: SPARK-3247 URL: https://issues.apache.org/jira/browse/SPARK-3247 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0
[jira] [Resolved] (SPARK-2715) ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling
[ https://issues.apache.org/jira/browse/SPARK-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2715. -- Resolution: Won't Fix PR discussion says it is WontFix. ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling -- Key: SPARK-2715 URL: https://issues.apache.org/jira/browse/SPARK-2715 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai Priority: Minor ExternalAppendOnlyMap adds a max limit on the number of spills and a max limit on the disk bytes written for spilling. Therefore, a task with data skew could be made to fail fast instead of running for a long time.
[jira] [Resolved] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class)
[ https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2710. -- Resolution: Won't Fix PR discussion says that this should become an external library, given the new external data source API in 1.2. Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class) --- Key: SPARK-2710 URL: https://issues.apache.org/jira/browse/SPARK-2710 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Teng Qiu Spark SQL can take Parquet files or JSON files as a table directly (without being given a case class to define the schema). As a component named SQL, it should also be able to take a ResultSet from an RDBMS easily. I find that there is a JdbcRDD in core: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala, so I want to make a small change in this file to allow SQLContext to read the metadata from the PreparedStatement (reading metadata does not require actually executing the query). Then, in Spark SQL, SQLContext can create a SchemaRDD from the JdbcRDD and its metadata. Further, maybe we can add a feature to the sql-shell, so that users can use spark-thrift-server to join tables from different sources, such as: {code} CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password initQuery bound ... CREATE TABLE parquet_files AS PARQUET hdfs://tmp/parquet_table/ SELECT parquet_files.colX, jdbc_tbl1.colY FROM parquet_files JOIN jdbc_tbl1 ON (parquet_files.id = jdbc_tbl1.id) {code} I think such a feature will be useful, like the facebook Presto engine does. Oh, and there is a small bug in JdbcRDD's compute(), in method close(): {code} if (null != conn && !stmt.isClosed()) conn.close() {code} should be {code} if (null != conn && !conn.isClosed()) conn.close() {code} Just a small write error :) but such a close method will never be able to close conn...
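The close() bug quoted above is worth seeing in miniature: guarding conn.close() on stmt.isClosed means the connection is never closed once the statement has been closed first. The stand-in class below is illustrative only, not the real JDBC types.

```scala
// Stand-in for a closeable JDBC resource (Connection or Statement).
class FakeResource {
  var closed = false
  def isClosed: Boolean = closed
  def close(): Unit = closed = true
}

object CloseBugDemo {
  // Buggy variant from the report: checks the statement, closes the connection.
  // Once stmt is already closed, conn is never closed (a leak).
  def buggyClose(conn: FakeResource, stmt: FakeResource): Unit =
    if (null != conn && !stmt.isClosed) conn.close()

  // Fixed variant: guard on the same object that is being closed.
  def fixedClose(conn: FakeResource, stmt: FakeResource): Unit =
    if (null != conn && !conn.isClosed) conn.close()
}
```

If the statement is closed before close() runs (the usual teardown order), the buggy guard skips conn.close() entirely, which is exactly the leak the reporter describes.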
[jira] [Resolved] (SPARK-2872) Fix conflict between code and doc in YarnClientSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2872. -- Resolution: Won't Fix Looks like this was obsoleted by subsequent changes to how YARN parses configuration, given the PR. Fix conflict between code and doc in YarnClientSchedulerBackend --- Key: SPARK-2872 URL: https://issues.apache.org/jira/browse/SPARK-2872 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Zhihui The doc says: system properties override environment variables. https://github.com/apache/spark/blob/master/yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala#L71 But the code conflicts with it.
[jira] [Resolved] (SPARK-2947) DAGScheduler resubmit the stage into an infinite loop
[ https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2947. -- Resolution: Duplicate Fix Version/s: (was: 1.0.3) (was: 1.2.0) Discussion indicates this was the same as SPARK-3224, which makes a more comprehensive change and has been resolved. DAGScheduler resubmit the stage into an infinite loop - Key: SPARK-2947 URL: https://issues.apache.org/jira/browse/SPARK-2947 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.2 Reporter: Guoqiang Li Priority: Blocker Stage to resubmit more than 5 times. This seems to be caused by {{FetchFailed.bmAddress}} is null . I don't know how to reproduce it. master log: {noformat} 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as TID 52334 on executor 82: sanshan (PROCESS_LOCAL) 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 3060 bytes in 0 ms 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as TID 52335 on executor 78: tuan231 (PROCESS_LOCAL) 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 3060 bytes in 0 ms 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 1.189:141) 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission -- 5 
times --- 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 1.189, whose tasks have all completed, from pool 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 ms on jilin (progress: 280/280) 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 269) 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 2.1, whose tasks have all completed, from pool 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at DealCF.scala:207) finished in 129.544 s {noformat} worker: log {noformat} /1408/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_57 not found, computing it 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_191 not found, computing it 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18017 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18017 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18151 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18151 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, computing it 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, computing it 14/08/09 21:49:41 INFO 
executor.CoarseGrainedExecutorBackend: Got assigned task 18285 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18285 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18419 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18419 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally 14/08/09 21:49:41 INFO storage.BlockManager:
[jira] [Resolved] (SPARK-3099) Staging Directory is never deleted when we run job with YARN Client Mode
[ https://issues.apache.org/jira/browse/SPARK-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3099. -- Resolution: Won't Fix PR discussion says this was obsoleted by SPARK-2933. Staging Directory is never deleted when we run job with YARN Client Mode Key: SPARK-3099 URL: https://issues.apache.org/jira/browse/SPARK-3099 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta When we run an application in YARN Cluster mode, the class 'ApplicationMaster' is used as the ApplicationMaster, which has a shutdown hook to clean up the staging directory (~/.sparkStaging). But when we run an application in YARN Client mode, the class 'ExecutorLauncher' used as the ApplicationMaster doesn't clean up the staging directory.
[jira] [Resolved] (SPARK-3038) delete history server logs when there are too many logs
[ https://issues.apache.org/jira/browse/SPARK-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3038. -- Resolution: Won't Fix Fix Version/s: (was: 1.2.0) The PR says this is WontFix. delete history server logs when there are too many logs Key: SPARK-3038 URL: https://issues.apache.org/jira/browse/SPARK-3038 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.1 Reporter: wangfei Enhance the history server to delete logs automatically: 1. use spark.history.deletelogs.enable to enable this function; 2. when the number of app logs is greater than spark.history.maxsavedapplication, delete the older logs.
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243398#comment-14243398 ] Valeriy Avanesov commented on SPARK-2426: - What's the normalization constraint? Each row of W should sum up to 1 and each column of H should sum up to 1, with positivity? Yes. That is similar to PLSA, right, except that PLSA will have a bi-concave loss... There's a completely different loss... BTW, we've used a factorisation with the loss you've described as an initial approximation for PLSA. It gave a significant speed-up. Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS
[jira] [Resolved] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3124. -- Resolution: Fixed I'm a little unclear on the outcome, but in master, running {{mvn -Phive -Phive-thriftserver -Dhadoop.version=2.0.0-mr1-cdh4.3.0 dependency:tree}} says there is no Netty 3.2.2 anymore. So it looks fixed. Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flattened into the assembly package; however, the class (NioWorker) signature difference leads to a failure when launching the sparksql CLI/ThriftServer.
[jira] [Commented] (SPARK-3358) PySpark worker fork()ing performance regression in m3.* / PVM instances
[ https://issues.apache.org/jira/browse/SPARK-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243454#comment-14243454 ] Sean Owen commented on SPARK-3358: -- Is this then resolved by one of https://github.com/apache/spark/pull/2244 or https://github.com/apache/spark/pull/2259 ? PySpark worker fork()ing performance regression in m3.* / PVM instances --- Key: SPARK-3358 URL: https://issues.apache.org/jira/browse/SPARK-3358 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: m3.* instances on EC2 Reporter: Josh Rosen SPARK-2764 (and some followup commits) simplified PySpark's worker process structure by removing an intermediate pool of processes forked by daemon.py. Previously, daemon.py forked a fixed-size pool of processes that shared a socket and handled worker launch requests from Java. After my patch, this intermediate pool was removed and launch requests are handled directly in daemon.py. Unfortunately, this seems to have increased PySpark task launch latency when running on m3* class instances in EC2. Most of this difference can be attributed to m3 instances' more expensive fork() system calls. I tried the following microbenchmark on m3.xlarge and r3.xlarge instances: {code}
import os

for x in range(1000):
    if os.fork() == 0:
        exit()
{code} On the r3.xlarge instance: {code}
real 0m0.761s
user 0m0.008s
sys  0m0.144s
{code} And on m3.xlarge: {code}
real 0m1.699s
user 0m0.012s
sys  0m1.008s
{code} I think this is due to HVM vs PVM EC2 instances using different virtualization technologies with different fork costs. It may be the case that this performance difference only appears in certain microbenchmarks and is masked by other performance improvements in PySpark, such as improvements to large group-bys. I'm in the process of re-running spark-perf benchmarks on m3 instances in order to confirm whether this impacts more realistic jobs.
[jira] [Resolved] (SPARK-3352) Rename Flume Polling stream to Pull Based stream
[ https://issues.apache.org/jira/browse/SPARK-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3352. -- Resolution: Won't Fix According to the PR, this is WontFix. Rename Flume Polling stream to Pull Based stream Key: SPARK-3352 URL: https://issues.apache.org/jira/browse/SPARK-3352 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243456#comment-14243456 ] Debasish Das commented on SPARK-2426: - [~akopich] I got good MAP results on recommendation datasets with the approximated PLSA formulation. I have not yet had time to compare that formulation with the Gibbs sampling based LDA PR: https://issues.apache.org/jira/browse/SPARK-1405. Did you compare them? Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS
[jira] [Resolved] (SPARK-3229) spark.shuffle.safetyFraction and spark.storage.safetyFraction is not documented
[ https://issues.apache.org/jira/browse/SPARK-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3229. -- Resolution: Won't Fix Another one that says it's WontFix in the PR. spark.shuffle.safetyFraction and spark.storage.safetyFraction is not documented --- Key: SPARK-3229 URL: https://issues.apache.org/jira/browse/SPARK-3229 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta There are no descriptions about spark.shuffle.safetyFraction and spark.storage.safetyFraction.
[jira] [Resolved] (SPARK-3201) Yarn Client do not support the -X java opts
[ https://issues.apache.org/jira/browse/SPARK-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3201. -- Resolution: Won't Fix Looks like it was abandoned by the OP, possibly in favor of SPARK-1953 or SPARK-1507 Yarn Client do not support the -X java opts - Key: SPARK-3201 URL: https://issues.apache.org/jira/browse/SPARK-3201 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: SaintBacchus In yarn-client mode, it's not allowed to set the spark.driver.extraJavaOptions. I think it's very inconvenient if we want to set the -X java opts in the process of ExecutorLauncher.
[jira] [Resolved] (SPARK-3548) Display cache hit ratio on WebUI
[ https://issues.apache.org/jira/browse/SPARK-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3548. -- Resolution: Won't Fix I think this particular suggestion is WontFix since the idea of a hit ratio was declined in the PR discussion and the PR was closed. Display cache hit ratio on WebUI Key: SPARK-3548 URL: https://issues.apache.org/jira/browse/SPARK-3548 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Kousuke Saruta In Stage page, if cache hit ratio is displayed, it's useful for application / cache strategy tuning.
[jira] [Resolved] (SPARK-3433) Mima false-positives with @DeveloperAPI and @Experimental annotations
[ https://issues.apache.org/jira/browse/SPARK-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3433. -- Resolution: Fixed Fix Version/s: 1.2.0 The PR was https://github.com/apache/spark/pull/2285 and it was merged, looks like for 1.2: https://github.com/apache/spark/commit/ecf0c02935815f0d4018c0e30ec4c784e60a5db0 Mima false-positives with @DeveloperAPI and @Experimental annotations - Key: SPARK-3433 URL: https://issues.apache.org/jira/browse/SPARK-3433 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Prashant Sharma Priority: Minor Fix For: 1.2.0 In https://github.com/apache/spark/pull/2315, I found two cases where {{@DeveloperAPI}} and {{@Experimental}} annotations didn't prevent false-positive warnings from Mima. To reproduce this problem, run dev/mima as of https://github.com/JoshRosen/spark/commit/ec90e21947b615d4ef94a3a54cfd646924ccaf7c. The spurious warnings are listed at the top of https://gist.github.com/JoshRosen/5d8df835516dc367389d.
[jira] [Resolved] (SPARK-3719) Spark UI: complete/failed stages is better to show the total number of stages
[ https://issues.apache.org/jira/browse/SPARK-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3719. -- Resolution: Duplicate Target Version/s: (was: 1.1.1, 1.2.0) Apparently resolved by the very similar SPARK-4168, according to the PR. Spark UI: complete/failed stages is better to show the total number of stages Key: SPARK-3719 URL: https://issues.apache.org/jira/browse/SPARK-3719 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor
[jira] [Resolved] (SPARK-3712) add a new UpdateDStream to update a rdd dynamically
[ https://issues.apache.org/jira/browse/SPARK-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3712. -- Resolution: Won't Fix Target Version/s: (was: 1.1.1, 1.2.0) Withdrawn in the PR by the proposer. add a new UpdateDStream to update a rdd dynamically --- Key: SPARK-3712 URL: https://issues.apache.org/jira/browse/SPARK-3712 Project: Spark Issue Type: Improvement Components: Streaming Reporter: uncleGen Priority: Minor Maybe we can achieve the aim by using the foreachRDD function, but I feel it is awkward that way, because I need to pass a closure, like this: {code} val baseRDD = ... var updatedRDD = ... val inputStream = ... val func = (rdd: RDD[T], t: Time) => { updatedRDD = baseRDD.op(rdd) } inputStream.foreachRDD(func _) {code} In my PR, we can update an RDD like: {code} val updateStream = inputStream.updateRDD(baseRDD, func).asInstanceOf[T, V, U] {code} and obtain the updated RDD like this: {code} val updatedRDD = updateStream.getUpdatedRDD {code}
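The closure-based workaround above — re-deriving a dataset from a static base on every incoming batch — can be modeled without Spark. A toy Python sketch (the class and method names are invented for illustration, not Spark API):

```python
# Minimal model of the foreachRDD workaround: a mutable reference that is
# re-derived from the base dataset each time a batch arrives.
class UpdatableDataset:
    def __init__(self, base, op):
        self.base = base            # the static base dataset
        self.op = op                # how a batch combines with the base
        self.updated = list(base)   # current derived view

    def on_batch(self, batch):
        # analogous to inputStream.foreachRDD(func _): the closure
        # overwrites the derived view using the base and the new batch
        self.updated = self.op(self.base, batch)

    def get_updated(self):
        return self.updated

ds = UpdatableDataset([1, 2, 3], lambda base, batch: base + batch)
ds.on_batch([10, 20])
print(ds.get_updated())  # [1, 2, 3, 10, 20]
```

Each batch replaces, rather than accumulates into, the derived view — which is exactly the behavior the closure in the original snippet produces.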
[jira] [Resolved] (SPARK-3689) FileLogger should create new instance of FileSystem regardless of its scheme
[ https://issues.apache.org/jira/browse/SPARK-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3689. -- Resolution: Not a Problem Target Version/s: (was: 1.1.1, 1.2.0) The PR says it's not a problem actually. FileLogger should create new instance of FileSystem regardless of its scheme - Key: SPARK-3689 URL: https://issues.apache.org/jira/browse/SPARK-3689 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta FileLogger creates a new instance of FileSystem to avoid the effect of FileSystem#close from another module, but it expects only HDFS. We can use another filesystem for the directory where the event log is stored. {code} if (scheme == "hdfs") { conf.setBoolean("fs.hdfs.impl.disable.cache", true) } {code}
[jira] [Resolved] (SPARK-3663) Document SPARK_LOG_DIR and SPARK_PID_DIR
[ https://issues.apache.org/jira/browse/SPARK-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3663. -- Resolution: Fixed Fix Version/s: 1.3.0 The PR was merged, though looks like not for 1.2: https://github.com/apache/spark/commit/5c265ccde0c5594899ec61f9c1ea100ddff52da7 Document SPARK_LOG_DIR and SPARK_PID_DIR Key: SPARK-3663 URL: https://issues.apache.org/jira/browse/SPARK-3663 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Andrew Ash Assignee: Andrew Ash Fix For: 1.3.0 I'm using these two parameters in some puppet scripts for standalone deployment and realized that they're not documented anywhere. We should document them.
[jira] [Resolved] (SPARK-3636) It is not friendly to interrupt a Job when user passes different storageLevels to a RDD
[ https://issues.apache.org/jira/browse/SPARK-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3636. -- Resolution: Won't Fix Target Version/s: (was: 1.1.1) PR discussion says this is WontFix. It is not friendly to interrupt a Job when user passes different storageLevels to a RDD --- Key: SPARK-3636 URL: https://issues.apache.org/jira/browse/SPARK-3636 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor
[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD
[ https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243683#comment-14243683 ] 宿荣全 commented on SPARK-4817: [~srowen] I'm sorry I didn't describe the problem clearly. Consider a scenario with multiple outputs: *data from HDFS files - map - filter - map (each row updates a MySQL DB) - filter - map - print (print 20 rows to the console)* # {color:red}output to MySQL DB{color} # {color:red}output to console{color} With this patch (the function {{processAllAndPrintFirst}} is newly defined): {code} ssc.textFileStream(path).map(func1).filter(func2).map(f => { updataMysql(f) }).filter(func3).map(func4).processAllAndPrintFirst(20) {code} How would this scenario be done using {{foreachRDD}} and {{take}}, or {{print(num)}}? Both [ {{rdd.foreach}} and {{rdd.take}} ] and [ {{rdd.foreach}} and {{stream.print(100)}} ] will launch two jobs in each streaming batch. Compared with a single job, wouldn't that be less efficient? [streaming]Print the specified number of data and handle all of the elements in RDD --- Key: SPARK-4817 URL: https://issues.apache.org/jira/browse/SPARK-4817 Project: Spark Issue Type: New Feature Components: Streaming Reporter: 宿荣全 Priority: Minor DStream.print prints 10 elements but handles 11. A new function based on DStream.print is proposed: print the specified number of elements while handling all elements of the RDD. There is a work scene: val dstream = stream.map-filter-mapPartitions-print; the data after filter needs to update a database in mapPartitions, but we don't need to print every record — only the top 20, to view the data processing.
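The efficiency question above boils down to whether the per-element side effect and the preview of the first N elements can share a single pass over the data. A minimal single-pass sketch in plain Python (not the proposed Spark API; the function and argument names here are illustrative only):

```python
def process_all_and_print_first(items, side_effect, n):
    """Apply side_effect to every element, but keep only the first n for display."""
    head = []
    for item in items:
        side_effect(item)       # e.g. update the database for every row
        if len(head) < n:
            head.append(item)   # remember only the first n for printing
    for item in head:
        print(item)
    return head

seen = []
# all 100 elements get the side effect; only the first 20 are printed
process_all_and_print_first(range(100), seen.append, 20)
```

This is the one-job shape the commenter wants; the two-job alternative would iterate the data once for the side effect and a second time for `take(n)`.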
[jira] [Resolved] (SPARK-4713) SchemaRDD.unpersist() should not raise exception if it is not cached.
[ https://issues.apache.org/jira/browse/SPARK-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4713. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3572 [https://github.com/apache/spark/pull/3572] SchemaRDD.unpersist() should not raise exception if it is not cached. - Key: SPARK-4713 URL: https://issues.apache.org/jira/browse/SPARK-4713 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor Fix For: 1.3.0 Unpersisting an uncached RDD does not raise an exception, for example: {panel} val data = Array(1, 2, 3, 4, 5) val distData = sc.parallelize(data) distData.unpersist(true) {panel} But SchemaRDD throws an exception if it is not cached. Since SchemaRDD is a subclass of RDD, it should follow the same behavior.
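The requested semantics — unpersist as a silent no-op when nothing is cached — can be modeled without Spark. A toy Python sketch (the class and its fields are invented for illustration, not Spark's implementation):

```python
class CachedView:
    """Toy stand-in for an RDD-like object whose unpersist must be idempotent."""
    def __init__(self):
        self._cached = False

    def cache(self):
        self._cached = True
        return self

    def unpersist(self, blocking=True):
        # Desired behavior per the issue: silently do nothing when not
        # cached, instead of raising as the old SchemaRDD did.
        if self._cached:
            self._cached = False
        return self

v = CachedView()
v.unpersist()           # no exception even though nothing was cached
v.cache().unpersist()   # the normal cached -> uncached path still works
```

Making `unpersist` idempotent keeps the subclass substitutable for its parent, which is the core of the bug report.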
[jira] [Resolved] (SPARK-4662) Whitelist more Hive unittest
[ https://issues.apache.org/jira/browse/SPARK-4662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4662. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3522 [https://github.com/apache/spark/pull/3522] Whitelist more Hive unittest Key: SPARK-4662 URL: https://issues.apache.org/jira/browse/SPARK-4662 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Fix For: 1.3.0 Whitelist more Hive unit tests: create_like_tbl_props udf5 udf_java_method decimal_1 udf_pmod udf_to_double udf_to_float udf7 (this will fail in Hive 0.12)
[jira] [Resolved] (SPARK-4639) Pass maxIterations in as a parameter in Analyzer
[ https://issues.apache.org/jira/browse/SPARK-4639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4639. - Resolution: Fixed Issue resolved by pull request 3499 [https://github.com/apache/spark/pull/3499] Pass maxIterations in as a parameter in Analyzer Key: SPARK-4639 URL: https://issues.apache.org/jira/browse/SPARK-4639 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jacky Li Priority: Minor Fix For: 1.3.0 fix a TODO in Analyzer: // TODO: pass this in as a parameter val fixedPoint = FixedPoint(100)
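The hard-coded `FixedPoint(100)` is just the iteration cap of a fixed-point loop — apply the rules until the plan stops changing or the cap is hit — so the fix amounts to threading that cap through as a constructor parameter. A generic sketch (plain Python, not Catalyst's actual `RuleExecutor` classes):

```python
def fixed_point(f, x, max_iterations=100):
    """Apply f repeatedly until the value stops changing or the cap is hit."""
    for _ in range(max_iterations):
        nxt = f(x)
        if nxt == x:        # reached a fixed point: further applications are no-ops
            return x
        x = nxt
    return x                # give up after max_iterations

# integer halving reaches the fixed point 0
print(fixed_point(lambda n: n // 2, 37))  # 0
```

With `max_iterations` exposed as a parameter, callers (like the Analyzer's constructor) can tune the cap instead of inheriting the magic constant 100.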
[jira] [Resolved] (SPARK-4293) Make Cast be able to handle complex types.
[ https://issues.apache.org/jira/browse/SPARK-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4293. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3150 [https://github.com/apache/spark/pull/3150] Make Cast be able to handle complex types. -- Key: SPARK-4293 URL: https://issues.apache.org/jira/browse/SPARK-4293 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Priority: Critical Fix For: 1.3.0 Inserting data of type including {{ArrayType.containsNull == false}} or {{MapType.valueContainsNull == false}} or {{StructType.fields.exists(_.nullable == false)}} into Hive table will fail because {{Cast}} inserted by {{HiveMetastoreCatalog.PreInsertionCasts}} rule of {{Analyzer}} can't handle these types correctly. Complex type cast rule proposal: * Cast for non-complex types should be able to cast the same as before. * Cast for {{ArrayType}} can evaluate if ** Element type can cast ** Nullability rule doesn't break * Cast for {{MapType}} can evaluate if ** Key type can cast ** Nullability for casted key type is {{false}} ** Value type can cast ** Nullability rule for value type doesn't break * Cast for {{StructType}} can evaluate if ** The field size is the same ** Each field can cast ** Nullability rule for each field doesn't break * The nested structure should be the same. Nullability rule: * If the casted type is {{nullable == true}}, the target nullability should be {{true}}
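The nullability rule in the proposal — a cast may widen nullability but never narrow it — can be captured in a small compatibility check. A simplified sketch covering only atomic and array types (the tuple encoding of types is invented for illustration and is not Catalyst's type system):

```python
# Simplified type model:
#   ("atomic",)                       -- any non-complex type
#   ("array", elem_type, nullable)    -- ArrayType with containsNull flag
def nullable_ok(from_nullable, to_nullable):
    # If the source side may contain null, the target must accept null.
    return to_nullable or not from_nullable

def can_cast(src, dst):
    if src[0] == "array" and dst[0] == "array":
        # element types must cast AND nullability may only widen
        return can_cast(src[1], dst[1]) and nullable_ok(src[2], dst[2])
    if src[0] == "atomic" and dst[0] == "atomic":
        return True  # assume atomic casts resolve the same as before
    return False     # nested structure must match

# narrowing containsNull (True -> False) is rejected; widening is fine
print(can_cast(("array", ("atomic",), False), ("array", ("atomic",), True)))   # True
print(can_cast(("array", ("atomic",), True), ("array", ("atomic",), False)))   # False
```

The real proposal extends the same recursion to `MapType` keys/values and `StructType` fields, with the extra condition that casted map keys must be non-nullable.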
[jira] [Resolved] (SPARK-4828) sum and avg over empty table should return null
[ https://issues.apache.org/jira/browse/SPARK-4828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4828. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3675 [https://github.com/apache/spark/pull/3675] sum and avg over empty table should return null --- Key: SPARK-4828 URL: https://issues.apache.org/jira/browse/SPARK-4828 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Priority: Minor Fix For: 1.3.0
[jira] [Resolved] (SPARK-4825) CTAS fails to resolve when created using saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4825. - Resolution: Fixed Fix Version/s: 1.2.1 Target Version/s: 1.2.1 (was: 1.2.0) CTAS fails to resolve when created using saveAsTable Key: SPARK-4825 URL: https://issues.apache.org/jira/browse/SPARK-4825 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust Assignee: Cheng Hao Priority: Critical Fix For: 1.2.1 While writing a test for a different issue, I found that saveAsTable seems to be broken: {code} test("save join to table") { val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString)) sql("CREATE TABLE test1 (key INT, value STRING)") testData.insertInto("test1") sql("CREATE TABLE test2 (key INT, value STRING)") testData.insertInto("test2") testData.insertInto("test2") sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").saveAsTable("test") checkAnswer( table("test"), sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq) } sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").saveAsTable("test") org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved plan found, tree: 'CreateTableAsSelect None, test, false, None Aggregate [], [COUNT(value#336) AS _c0#334L] Join Inner, Some((key#335 = key#339)) MetastoreRelation default, test1, Some(a) MetastoreRelation default, test2, Some(b) {code}
[jira] [Resolved] (SPARK-4742) The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
[ https://issues.apache.org/jira/browse/SPARK-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4742. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3602 [https://github.com/apache/spark/pull/3602] The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded Key: SPARK-4742 URL: https://issues.apache.org/jira/browse/SPARK-4742 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Sasaki Toru Priority: Minor Fix For: 1.3.0 When I use a Parquet file as an output file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero-padded, while RDD#saveAsText does zero padding.
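Zero-padding a partition index is just fixed-width formatting, so Hadoop-style names such as part-00000 sort lexicographically in the same order as their numeric index. A quick illustration (the helper name and default width are made up for this sketch):

```python
def part_file_name(task_id, width=5):
    # e.g. part-00002 instead of part-2, so file names sort in numeric order
    return "part-%0*d" % (width, task_id)

print(part_file_name(2))    # part-00002
print(part_file_name(123))  # part-00123
```

Without the padding, a lexicographic listing would interleave part-10 between part-1 and part-2, which is the inconsistency the issue is about.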
[jira] [Resolved] (SPARK-4829) eliminate expressions calculation in count expression
[ https://issues.apache.org/jira/browse/SPARK-4829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4829. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3676 [https://github.com/apache/spark/pull/3676] eliminate expressions calculation in count expression - Key: SPARK-4829 URL: https://issues.apache.org/jira/browse/SPARK-4829 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Fix For: 1.3.0