[jira] [Commented] (SPARK-4683) Add a beeline.cmd to run on Windows
[ https://issues.apache.org/jira/browse/SPARK-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233995#comment-14233995 ]

Apache Spark commented on SPARK-4683:

User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3599

Add a beeline.cmd to run on Windows
Key: SPARK-4683
URL: https://issues.apache.org/jira/browse/SPARK-4683
Project: Spark
Issue Type: New Feature
Components: SQL
Reporter: Matei Zaharia
Assignee: Cheng Lian
Priority: Critical

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao updated SPARK-4740:
Affects Version/s: 1.2.0

Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
Key: SPARK-4740
URL: https://issues.apache.org/jira/browse/SPARK-4740
Project: Spark
Issue Type: Improvement
Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Zhang, Liye

When testing the current Spark master (1.3.0-SNAPSHOT) with spark-perf (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transfer service takes much longer than the NIO-based one. The network throughput of Netty is only about half that of NIO. We tested in standalone mode; the data set used for the test is 20 billion records, about 400GB in total. The spark-perf test runs on a 4-node cluster with 10G NICs, 48 CPU cores per node, and 64GB of memory per executor. The number of reduce tasks is set to 1000.
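For anyone reproducing the comparison, Spark 1.2 let you switch the shuffle transfer service per run. An illustrative spark-defaults.conf fragment (the property name is recalled from Spark 1.2 and should be treated as an assumption, not something confirmed in this thread):

```properties
# Switch the shuffle transfer service to compare the two implementations
# (property name assumed from Spark 1.2; valid values were "netty" and "nio")
spark.shuffle.blockTransferService   nio
```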
[jira] [Created] (SPARK-4741) Do not destroy and re-create FileInputStream
Liang-Chi Hsieh created SPARK-4741:

Summary: Do not destroy and re-create FileInputStream
Key: SPARK-4741
URL: https://issues.apache.org/jira/browse/SPARK-4741
Project: Spark
Issue Type: Improvement
Reporter: Liang-Chi Hsieh
Priority: Minor

The FileInputStream in DiskMapIterator is destroyed and recreated after each batch read. However, since we can change the reading position on that stream, it is unnecessary and inefficient to destroy and recreate it every time.
[jira] [Commented] (SPARK-4741) Do not destroy and re-create FileInputStream
[ https://issues.apache.org/jira/browse/SPARK-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234016#comment-14234016 ]

Apache Spark commented on SPARK-4741:

User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/3600

Do not destroy and re-create FileInputStream
Key: SPARK-4741
URL: https://issues.apache.org/jira/browse/SPARK-4741
Project: Spark
Issue Type: Improvement
Reporter: Liang-Chi Hsieh
Priority: Minor

The FileInputStream in DiskMapIterator is destroyed and recreated after each batch read. However, since we can change the reading position on that stream, it is unnecessary and inefficient to destroy and recreate it every time.
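The idea behind SPARK-4741 can be sketched in plain Python (the real code is Scala's DiskMapIterator): keep one stream open for all batches and reposition it with seek(), instead of closing and reopening the file per batch.

```python
import os
import tempfile

def read_batches(path, offsets, batch_size):
    """Read fixed-size batches from the given byte offsets,
    reusing a single open file handle for every batch."""
    results = []
    with open(path, "rb") as f:        # one handle for all batches
        for off in offsets:
            f.seek(off)                # reposition instead of reopening
            results.append(f.read(batch_size))
    return results
```

The `read_batches` helper and its parameters are illustrative, not Spark API; the point is that a seekable stream makes destroy-and-recreate unnecessary.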
[jira] [Resolved] (SPARK-4719) Consolidate various narrow dep RDD classes with MapPartitionsRDD
[ https://issues.apache.org/jira/browse/SPARK-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-4719.
Resolution: Fixed
Fix Version/s: 1.3.0

Consolidate various narrow dep RDD classes with MapPartitionsRDD
Key: SPARK-4719
URL: https://issues.apache.org/jira/browse/SPARK-4719
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.2.0
Reporter: Reynold Xin
Assignee: Reynold Xin
Fix For: 1.3.0

Seems like we don't really need MappedRDD, MappedValuesRDD, FlatMappedValuesRDD, FilteredRDD, GlommedRDD. They can all be implemented directly using MapPartitionsRDD.
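Why the specialized classes are redundant can be shown with a plain-Python sketch (Spark's actual code is Scala): map, filter, and glom can all be phrased as one per-partition transformation, so a single MapPartitionsRDD-style primitive suffices.

```python
def map_partitions(partitions, f):
    """Apply f to the iterator of each partition; the one primitive
    the specialized RDD classes can all be built on."""
    return [list(f(iter(p))) for p in partitions]

def mapped(partitions, g):              # what MappedRDD did
    return map_partitions(partitions, lambda it: (g(x) for x in it))

def filtered(partitions, pred):         # what FilteredRDD did
    return map_partitions(partitions, lambda it: (x for x in it if pred(x)))

def glommed(partitions):                # what GlommedRDD did
    return map_partitions(partitions, lambda it: iter([list(it)]))
```

The helper names are hypothetical; they mirror the consolidated RDD operators only in spirit.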
[jira] [Resolved] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
[ https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-4685.
Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3598 [https://github.com/apache/spark/pull/3598]

Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
Key: SPARK-4685
URL: https://issues.apache.org/jira/browse/SPARK-4685
Project: Spark
Issue Type: New Feature
Components: Documentation
Reporter: Matei Zaharia
Priority: Trivial
Fix For: 1.2.0

Right now they're listed under other packages on the homepage of the JavaDoc docs.
[jira] [Updated] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
[ https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-4685:
Assignee: Kai Sasaki

Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
Key: SPARK-4685
URL: https://issues.apache.org/jira/browse/SPARK-4685
Project: Spark
Issue Type: New Feature
Components: Documentation
Reporter: Matei Zaharia
Assignee: Kai Sasaki
Priority: Trivial
Fix For: 1.2.0

Right now they're listed under other packages on the homepage of the JavaDoc docs.
[jira] [Resolved] (SPARK-4575) Documentation for the pipeline features
[ https://issues.apache.org/jira/browse/SPARK-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-4575.
Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3588 [https://github.com/apache/spark/pull/3588]

Documentation for the pipeline features
Key: SPARK-4575
URL: https://issues.apache.org/jira/browse/SPARK-4575
Project: Spark
Issue Type: Improvement
Components: Documentation, ML, MLlib
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Joseph K. Bradley
Fix For: 1.2.0

Add a user guide for the newly added ML pipeline feature.
[jira] [Created] (SPARK-4742) The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
Sasaki Toru created SPARK-4742:

Summary: The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
Key: SPARK-4742
URL: https://issues.apache.org/jira/browse/SPARK-4742
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.2.0
Reporter: Sasaki Toru
Priority: Minor

When I write a Parquet file as an output file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero padded, while RDD#saveAsText does zero padding.
[jira] [Commented] (SPARK-4742) The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
[ https://issues.apache.org/jira/browse/SPARK-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234081#comment-14234081 ]

Apache Spark commented on SPARK-4742:

User 'sasakitoa' has created a pull request for this issue: https://github.com/apache/spark/pull/3602

The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
Key: SPARK-4742
URL: https://issues.apache.org/jira/browse/SPARK-4742
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.2.0
Reporter: Sasaki Toru
Priority: Minor

When I write a Parquet file as an output file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero padded, while RDD#saveAsText does zero padding.
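The motivation for zero padding can be illustrated with a small sketch (the helper is hypothetical; a width of 5 matches the familiar "part-00000" style of saveAsTextFile output): unpadded names sort badly, since "part-10" precedes "part-2" lexicographically.

```python
def part_file_name(split_id, padded=True):
    """Build a part-file name for a partition index,
    zero padded to width 5 unless padded=False."""
    return "part-%05d" % split_id if padded else "part-%d" % split_id
```

With padding, lexicographic and numeric order agree; without it, they diverge as soon as indices reach two digits.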
[jira] [Commented] (SPARK-4494) IDFModel.transform() add support for single vector
[ https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234091#comment-14234091 ]

Apache Spark commented on SPARK-4494:

User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/3603

IDFModel.transform() add support for single vector
Key: SPARK-4494
URL: https://issues.apache.org/jira/browse/SPARK-4494
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.1.1, 1.2.0
Reporter: Jean-Philippe Quemener
Priority: Minor

For now, when using the tf-idf implementation of MLlib, there is no way to map your data back onto e.g. labels or ids other than a hackish workaround with zipping:
{quote}
1. Persist input RDD.
2. Transform it to just vectors and apply IDFModel
3. zip with original RDD
4. transform label and new vector to LabeledPoint
{quote}
Source: [http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression]
I think, as in production a lot of users want to map their data back to some identifier, it would be a good improvement to allow using a single vector with IDFModel.transform().
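The zip workaround quoted above can be sketched in plain Python (the `transform` argument stands in for IDFModel.transform on an RDD of vectors; the helper name is hypothetical): strip the labels, transform the vectors in bulk, then zip the labels back on.

```python
def transform_with_labels(labeled, transform):
    """labeled: list of (label, vector) pairs.
    transform: bulk transformation over a list of vectors."""
    labels = [l for (l, _) in labeled]        # remember the labels/ids
    vectors = [v for (_, v) in labeled]       # step 2: vectors only
    return list(zip(labels, transform(vectors)))  # step 3: zip back
```

Accepting a single vector in IDFModel.transform() would make this whole dance unnecessary.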
[jira] [Commented] (SPARK-4726) NotSerializableException thrown on SystemDefaultHttpClient with stack not related to my functions
[ https://issues.apache.org/jira/browse/SPARK-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234142#comment-14234142 ]

Sean Owen commented on SPARK-4726:

You can use it, you just can't serialize these objects from the driver to the workers. You'll want to examine your code to see if you're accidentally creating a connection or client on the driver but then using it inside functions that are sent to the workers.

NotSerializableException thrown on SystemDefaultHttpClient with stack not related to my functions
Key: SPARK-4726
URL: https://issues.apache.org/jira/browse/SPARK-4726
Project: Spark
Issue Type: Bug
Components: Streaming
Affects Versions: 1.0.1
Reporter: Dmitriy Makarenko

I get this stack trace that doesn't contain any of my functions:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.http.impl.client.SystemDefaultHttpClient
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:771)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:714)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:698)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1198)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

As far as I know, SystemDefaultHttpClient is used inside the SolrJ library that I use, but it is in a separate jar from my project. All of my classes are Serializable.
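The pattern Sean Owen suggests can be sketched in plain Python (FakeClient is a stand-in for the non-serializable SystemDefaultHttpClient; the function names are hypothetical): construct the client inside the per-partition function that runs on the worker, never in driver scope, so nothing non-serializable is captured by the closure.

```python
class FakeClient:
    """Stand-in for a non-serializable HTTP client."""
    def get(self, url):
        return "ok:" + url

def process_partition(urls):
    # The client is created where it is used (on the worker),
    # so it is never shipped inside a serialized closure.
    client = FakeClient()
    return [client.get(u) for u in urls]
```

In real Spark code the equivalent is creating the client inside mapPartitions rather than referencing a driver-side field.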
[jira] [Commented] (SPARK-4734) [Streaming]limit the file Dstream size for each batch
[ https://issues.apache.org/jira/browse/SPARK-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234146#comment-14234146 ]

Sean Owen commented on SPARK-4734:

I don't quite understand this suggestion. In general, if processing time exceeds the batch duration, you simply need a longer batch duration or need to speed up your processing. Lots of small files are a problem in general for shuffle -- although less so for a sort-based shuffle. The basic solution there is: don't design your system to put lots of tiny files on HDFS. Are you suggesting capping the amount of data in each batch? This does not solve either problem. Either you are just running more, smaller batches, or you are dropping data. In any event this amounts to a significant change in semantics. This doesn't sound likely.

[Streaming]limit the file Dstream size for each batch
Key: SPARK-4734
URL: https://issues.apache.org/jira/browse/SPARK-4734
Project: Spark
Issue Type: New Feature
Components: Streaming
Reporter: 宿荣全
Priority: Minor

Streaming scans new files from HDFS and processes those files in each batch. The current streaming implementation has some problems:
1. When the number of new files (and their total size) in some batch is very large, the required processing time becomes very long and may exceed the slide duration, eventually delaying the dispatch of the next batch.
2. When the total size of the file DStream in one batch is very large, shuffling that data multiplies memory occupation, and the app becomes slow or is even terminated by the operating system.
So if we set an upper limit on the input data for each batch to control the batch processing time, the job dispatch delay and the processing delay will be alleviated.
Modification: add a new parameter spark.streaming.segmentSizeThreshold in InputDStream (the input data base class). The size of each batch segment is set via this parameter, either from [spark-defaults.conf] or in source code. All implementations of InputDStream take the corresponding action based on segmentSizeThreshold. This patch modifies FileInputDStream: when new files are found, their names and sizes are put into a queue, and elements are taken from it and packaged into a batch whose total size is at most segmentSizeThreshold. Please see the source for the detailed logic.
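The proposed queue-and-threshold behavior can be sketched in plain Python (semantics inferred from the description above; the helper name and exact tie-breaking are assumptions, not the patch itself): drain queued (name, size) entries into one batch until the threshold would be exceeded, leaving the rest for later batches.

```python
from collections import deque

def take_batch(file_queue, threshold):
    """Pop (name, size) entries from the front of the queue into one
    batch while the accumulated size stays within the threshold."""
    batch, total = [], 0
    while file_queue and total + file_queue[0][1] <= threshold:
        name, size = file_queue.popleft()
        batch.append(name)
        total += size
    return batch
```

Note one sharp edge of this semantics, which Sean Owen's objection touches on: a single file larger than the threshold would never be admitted, so deferred files can pile up instead of the problem going away.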
[jira] [Commented] (SPARK-4735) Spark SQL UDF doesn't support 0 arguments.
[ https://issues.apache.org/jira/browse/SPARK-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234153#comment-14234153 ]

Apache Spark commented on SPARK-4735:

User 'potix2' has created a pull request for this issue: https://github.com/apache/spark/pull/3604

Spark SQL UDF doesn't support 0 arguments.
Key: SPARK-4735
URL: https://issues.apache.org/jira/browse/SPARK-4735
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Cheng Hao
Priority: Minor

To reproduce:

val udf = () => Seq(1, 2, 3)
sqlCtx.registerFunction("myudf", udf)
sqlCtx.sql("select myudf() from tbl limit 1").collect.foreach(println)
[jira] [Created] (SPARK-4743) Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
Ivan Vergiliev created SPARK-4743:

Summary: Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
Key: SPARK-4743
URL: https://issues.apache.org/jira/browse/SPARK-4743
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Ivan Vergiliev

AggregateByKey and foldByKey in PairRDDFunctions both use the closure serializer to serialize and deserialize the initial value. This means that the Java serializer is always used, which can be very expensive if there's a large number of groups. Calling combineByKey manually and using the normal serializer instead of the closure one improved the performance on the dataset I'm testing with by about 30-35%.
I'm not familiar enough with the codebase to be certain that replacing the serializer here is OK, but it works correctly in my tests, and it's only serializing a single value of type U, which should be serializable by the default one since it can be the output of a job. Let me know if I'm missing anything.
[jira] [Commented] (SPARK-4743) Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
[ https://issues.apache.org/jira/browse/SPARK-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234165#comment-14234165 ]

Apache Spark commented on SPARK-4743:

User 'IvanVergiliev' has created a pull request for this issue: https://github.com/apache/spark/pull/3605

Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
Key: SPARK-4743
URL: https://issues.apache.org/jira/browse/SPARK-4743
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Ivan Vergiliev
Labels: performance

AggregateByKey and foldByKey in PairRDDFunctions both use the closure serializer to serialize and deserialize the initial value. This means that the Java serializer is always used, which can be very expensive if there's a large number of groups. Calling combineByKey manually and using the normal serializer instead of the closure one improved the performance on the dataset I'm testing with by about 30-35%.
I'm not familiar enough with the codebase to be certain that replacing the serializer here is OK, but it works correctly in my tests, and it's only serializing a single value of type U, which should be serializable by the default one since it can be the output of a job. Let me know if I'm missing anything.
[jira] [Created] (SPARK-4744) Short Circuit evaluation for AND OR in code gen
Cheng Hao created SPARK-4744:

Summary: Short Circuit evaluation for AND OR in code gen
Key: SPARK-4744
URL: https://issues.apache.org/jira/browse/SPARK-4744
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Cheng Hao
Priority: Minor
[jira] [Commented] (SPARK-4744) Short Circuit evaluation for AND OR in code gen
[ https://issues.apache.org/jira/browse/SPARK-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234182#comment-14234182 ]

Apache Spark commented on SPARK-4744:

User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3606

Short Circuit evaluation for AND OR in code gen
Key: SPARK-4744
URL: https://issues.apache.org/jira/browse/SPARK-4744
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Cheng Hao
Priority: Minor
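The goal of short-circuit evaluation can be illustrated in plain Python (the generated code this issue concerns would use Java's `&&`/`||`; this helper is only an analogy): for `a AND b`, evaluate b only when a is true.

```python
def short_circuit_and(eval_left, eval_right):
    """Evaluate the right operand only if the left is true,
    mimicking `&&` in generated code."""
    return eval_right() if eval_left() else False
```

The saving matters in generated expression code because the skipped operand may itself be an expensive subexpression.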
[jira] [Commented] (SPARK-2188) Support sbt/sbt for Windows
[ https://issues.apache.org/jira/browse/SPARK-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234209#comment-14234209 ]

Masayoshi TSUZUKI commented on SPARK-2188:

We have some bugs reported on JIRA about Windows. When we struggle with them, try to reproduce them, or fix them, we need build tools for Windows. Indeed we already have Maven, but sbt is much better for trial-and-error development, as you know.

Support sbt/sbt for Windows
Key: SPARK-2188
URL: https://issues.apache.org/jira/browse/SPARK-2188
Project: Spark
Issue Type: New Feature
Components: Build
Affects Versions: 1.0.0
Reporter: Pat McDonough

Add the equivalent of sbt/sbt for Windows users.
[jira] [Commented] (SPARK-1953) yarn client mode Application Master memory size is same as driver memory size
[ https://issues.apache.org/jira/browse/SPARK-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234212#comment-14234212 ]

Apache Spark commented on SPARK-1953:

User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/3607

yarn client mode Application Master memory size is same as driver memory size
Key: SPARK-1953
URL: https://issues.apache.org/jira/browse/SPARK-1953
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves

With Spark on YARN in client mode, the application master that gets created to allocate containers gets the same amount of memory as the driver running on the client (the --driver-memory option through spark-submit). This could definitely be more than what is really needed, thus wasting resources. The application master should be very small and require very little memory, since all it's doing is allocating and starting containers. We should allow the memory for the application master to be configured separately from the driver in client mode. We probably need to be careful about how we do this so as not to cause confusion about what the options do in the various modes.
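The separation proposed here later materialized as a dedicated AM setting. An illustrative spark-defaults.conf fragment (the spark.yarn.am.memory name is the eventual Spark option as recalled; relative to this thread it should be treated as an assumption):

```properties
# yarn-client mode: a small, fixed AM allocation, decoupled from the driver
spark.yarn.am.memory   512m
spark.driver.memory    8g
```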
[jira] [Updated] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhang, Liye updated SPARK-4740:
Attachment: Spark-perf Test Report.pdf

Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
Key: SPARK-4740
URL: https://issues.apache.org/jira/browse/SPARK-4740
Project: Spark
Issue Type: Improvement
Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Zhang, Liye
Attachments: Spark-perf Test Report.pdf

When testing the current Spark master (1.3.0-SNAPSHOT) with spark-perf (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transfer service takes much longer than the NIO-based one. The network throughput of Netty is only about half that of NIO. We tested in standalone mode; the data set used for the test is 20 billion records, about 400GB in total. The spark-perf test runs on a 4-node cluster with 10G NICs, 48 CPU cores per node, and 64GB of memory per executor. The number of reduce tasks is set to 1000.
[jira] [Created] (SPARK-4745) get_existing_cluster() doesn't work with additional security groups
Alex DeBrie created SPARK-4745:

Summary: get_existing_cluster() doesn't work with additional security groups
Key: SPARK-4745
URL: https://issues.apache.org/jira/browse/SPARK-4745
Project: Spark
Issue Type: Bug
Components: EC2
Affects Versions: 1.1.0
Reporter: Alex DeBrie

The spark-ec2 script has a flag that allows you to add additional security groups to clusters when you launch. However, the get_existing_cluster() function cycles through active instances and only returns instances whose group_names == cluster_name + "-master" (or + "-slaves"), which is the group created by default. The get_existing_cluster() function is used to login to, stop, and destroy existing clusters, among other actions.
This is a pretty simple fix for which I've already submitted a [pull request|https://github.com/apache/spark/pull/3596]. It checks if cluster_name + "-master" is in the list of groups for each active instance. This means the cluster group can be one among many groups, rather than the sole group for an instance.
[jira] [Commented] (SPARK-4745) get_existing_cluster() doesn't work with additional security groups
[ https://issues.apache.org/jira/browse/SPARK-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234240#comment-14234240 ]

Apache Spark commented on SPARK-4745:

User 'alexdebrie' has created a pull request for this issue: https://github.com/apache/spark/pull/3596

get_existing_cluster() doesn't work with additional security groups
Key: SPARK-4745
URL: https://issues.apache.org/jira/browse/SPARK-4745
Project: Spark
Issue Type: Bug
Components: EC2
Affects Versions: 1.1.0
Reporter: Alex DeBrie

The spark-ec2 script has a flag that allows you to add additional security groups to clusters when you launch. However, the get_existing_cluster() function cycles through active instances and only returns instances whose group_names == cluster_name + "-master" (or + "-slaves"), which is the group created by default. The get_existing_cluster() function is used to login to, stop, and destroy existing clusters, among other actions.
This is a pretty simple fix for which I've already submitted a [pull request|https://github.com/apache/spark/pull/3596]. It checks if cluster_name + "-master" is in the list of groups for each active instance. This means the cluster group can be one among many groups, rather than the sole group for an instance.
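The described fix can be sketched in plain Python (the real change lives in the boto-based spark-ec2 script; this helper name is hypothetical): test membership of the cluster group in the instance's group list, instead of requiring it to be the instance's only group.

```python
def belongs_to_cluster(instance_group_names, cluster_name):
    """Return True if the instance carries the cluster's master or
    slave security group, regardless of any additional groups."""
    return (cluster_name + "-master" in instance_group_names
            or cluster_name + "-slaves" in instance_group_names)
```

Under the old equality check, an instance with any extra security group attached would silently stop matching its own cluster.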
[jira] [Commented] (SPARK-4727) Add dimensional RDDs (time series, spatial)
[ https://issues.apache.org/jira/browse/SPARK-4727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234269#comment-14234269 ]

Jeremy Freeman commented on SPARK-4727:

Great to brainstorm about this RJ! To some extent, we've been doing this over on the [Thunder|http://thefreemanlab.com/thunder/docs/] project. In particular, check out the {{TimeSeries}} and {{Images}} classes [here|https://github.com/freeman-lab/thunder/tree/master/python/thunder/rdds], which are essentially wrappers for specialized RDDs. Our basic abstraction is RDDs of ndarrays (1D for time series, 2D or 3D for images/volumes), with metadata (lazily propagated) for things like dimensionality and time base, coordinates embedded in keys, and useful methods on these objects like the ones you mention (e.g. filtering, Fourier transforms, cross-correlation). We've also worked on transformations between representations, for the common case of sequences of images corresponding to different time points. We haven't worked on custom partition strategies yet; I think that will be most important for image tiles drawn from a much larger image. There's cool work ongoing for that in GeoTrellis, see the [repo|https://github.com/geotrellis/geotrellis/tree/master/spark/src/main] and a [talk|http://spark-summit.org/2014/talk/geotrellis-adding-geospatial-capabilities-to-spark] from Rob.
FWIW, when we started it seemed more appropriate to build this into a specialized library, rather than Spark core. It's also something that benefits from using Python, due to a bevy of existing libraries for temporal and image data (though there are certainly analogs in Java/Scala). But it would be great to probe the community for general interest in these kinds of abstractions and methods.
Add dimensional RDDs (time series, spatial)
Key: SPARK-4727
URL: https://issues.apache.org/jira/browse/SPARK-4727
Project: Spark
Issue Type: Brainstorming
Components: Spark Core
Affects Versions: 1.1.0
Reporter: RJ Nowling

Certain types of data (time series, spatial) can benefit from specialized RDDs. I'd like to open a discussion about this.
For example, time series data should be ordered by time and would benefit from operations like:
* Subsampling (taking every n data points)
* Signal processing (correlations, FFTs, filtering)
* Windowing functions
Spatial data benefits from ordering and partitioning along a 2D or 3D grid. For example, path-finding algorithms can be optimized by only comparing points within a set distance, which can be computed more efficiently by partitioning data into a grid.
Although the operations on time series and spatial data may be different, there is some commonality in the sense of the data having ordered dimensions, and the implementations may overlap.
[jira] [Commented] (SPARK-1010) Update all unit tests to use SparkConf instead of system properties
[ https://issues.apache.org/jira/browse/SPARK-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234272#comment-14234272 ]

liu chang commented on SPARK-1010:

Please assign to me, I will fix it.

Update all unit tests to use SparkConf instead of system properties
Key: SPARK-1010
URL: https://issues.apache.org/jira/browse/SPARK-1010
Project: Spark
Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Patrick Wendell
Assignee: Nirmal
Priority: Minor
Labels: starter
[jira] [Commented] (SPARK-4181) Create separate options to control the client-mode AM resource allocation request
[ https://issues.apache.org/jira/browse/SPARK-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234289#comment-14234289 ] Thomas Graves commented on SPARK-4181: -- What exactly is the change you are proposing here? You reference other jiras that all have specific things to fix. Is this above and beyond those? Create separate options to control the client-mode AM resource allocation request - Key: SPARK-4181 URL: https://issues.apache.org/jira/browse/SPARK-4181 Project: Spark Issue Type: Improvement Components: YARN Reporter: WangTaoTheTonic Priority: Minor I found related discussion in https://github.com/apache/spark/pull/2115, SPARK-1953 and SPARK-1507. And recently I found some inconvenience in configuring properties like logging while we use yarn-client mode. So if no one else does the work, I will try it. Maybe I'll start in a few days and complete it within the next 1 or 2 weeks. As I'm not very familiar with Spark on YARN, any discussion and feedback is welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4181) Create separate options to control the client-mode AM resource allocation request
[ https://issues.apache.org/jira/browse/SPARK-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234305#comment-14234305 ] WangTaoTheTonic commented on SPARK-4181: Maybe I didn't describe it exactly here. What I want to do is pass extraJavaOptions and extraLibraryPath to the AM in yarn-client mode. Create separate options to control the client-mode AM resource allocation request - Key: SPARK-4181 URL: https://issues.apache.org/jira/browse/SPARK-4181 Project: Spark Issue Type: Improvement Components: YARN Reporter: WangTaoTheTonic Priority: Minor I found related discussion in https://github.com/apache/spark/pull/2115, SPARK-1953 and SPARK-1507. And recently I found some inconvenience in configuring properties like logging while we use yarn-client mode. So if no one else does the work, I will try it. Maybe I'll start in a few days and complete it within the next 1 or 2 weeks. As I'm not very familiar with Spark on YARN, any discussion and feedback is welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234307#comment-14234307 ] Yana Kadiyska commented on SPARK-4702: -- Just confirming that https://github.com/apache/spark/pull/3586 does fix the issue. Thanks! Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4181) Create separate options to control the client-mode AM resource allocation request
[ https://issues.apache.org/jira/browse/SPARK-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234333#comment-14234333 ] Thomas Graves commented on SPARK-4181: -- OK. As you discovered, extraJavaOptions (and possibly the others) is being discussed in https://github.com/apache/spark/pull/3409. What is your use case for extraLibraryPath? I couldn't think of one. Create separate options to control the client-mode AM resource allocation request - Key: SPARK-4181 URL: https://issues.apache.org/jira/browse/SPARK-4181 Project: Spark Issue Type: Improvement Components: YARN Reporter: WangTaoTheTonic Priority: Minor I found related discussion in https://github.com/apache/spark/pull/2115, SPARK-1953 and SPARK-1507. And recently I found some inconvenience in configuring properties like logging while we use yarn-client mode. So if no one else does the work, I will try it. Maybe I'll start in a few days and complete it within the next 1 or 2 weeks. As I'm not very familiar with Spark on YARN, any discussion and feedback is welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4298) The spark-submit cannot read Main-Class from Manifest.
[ https://issues.apache.org/jira/browse/SPARK-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234346#comment-14234346 ] Brennon York commented on SPARK-4298: - [~pwendell] could you take a look at this? This is an annoying issue our developers continue to run into, and we would like to see a fix pushed into the next release. Thanks! The spark-submit cannot read Main-Class from Manifest. -- Key: SPARK-4298 URL: https://issues.apache.org/jira/browse/SPARK-4298 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Linux spark-1.1.0-bin-hadoop2.4.tgz java version 1.7.0_72 Java(TM) SE Runtime Environment (build 1.7.0_72-b14) Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode) Reporter: Milan Straka Consider trivial {{test.scala}}: {code:title=test.scala|borderStyle=solid} import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object Main { def main(args: Array[String]) { val sc = new SparkContext() sc.stop() } } {code} When built with {{sbt}} and executed using {{spark-submit target/scala-2.10/test_2.10-1.0.jar}}, I get the following error: {code} Spark assembly has been built with Hive, including Datanucleus jars on classpath Error: Cannot load main class from JAR: file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar Run with --help for usage help or --verbose for debug output {code} When executed using {{spark-submit --class Main target/scala-2.10/test_2.10-1.0.jar}}, it works. 
The jar file has a correct MANIFEST.MF: {code:title=MANIFEST.MF|borderStyle=solid} Manifest-Version: 1.0 Implementation-Vendor: test Implementation-Title: test Implementation-Version: 1.0 Implementation-Vendor-Id: test Specification-Vendor: test Specification-Title: test Specification-Version: 1.0 Main-Class: Main {code} The problem is that in {{org.apache.spark.deploy.SparkSubmitArguments}}, line 127: {code} val jar = new JarFile(primaryResource) {code} the primaryResource has the String value {{file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar}}, which is a URI, but JarFile accepts only a filesystem path. One way to fix this would be: {code} val uri = new URI(primaryResource) val jar = new JarFile(uri.getPath) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
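The root cause is the URI-vs-path distinction: a {{file:}} URI string is not a filesystem path, so a path-taking API rejects it. The Scala fix quoted in the report strips the scheme via {{java.net.URI}}; a quick sketch of the same distinction in Python (purely illustrative, not Spark code):

```python
# A "file:" URI is not a filesystem path; parsing the URI and taking its
# path component recovers something a file API can open. Illustrative only.
from urllib.parse import urlparse

uri = "file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar"
path = urlparse(uri).path
print(path)  # /ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar
```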
[jira] [Commented] (SPARK-4616) SPARK_CONF_DIR is not effective in spark-submit
[ https://issues.apache.org/jira/browse/SPARK-4616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234347#comment-14234347 ] Brennon York commented on SPARK-4616: - [~pwendell] could you review this? Since this addresses a larger problem, I was hoping to get some feedback on this commit. Thanks! SPARK_CONF_DIR is not effective in spark-submit --- Key: SPARK-4616 URL: https://issues.apache.org/jira/browse/SPARK-4616 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: leo.luan SPARK_CONF_DIR is not effective in spark-submit, because of this line in spark-submit: DEFAULT_PROPERTIES_FILE=$SPARK_HOME/conf/spark-defaults.conf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
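The quoted line hard-codes the defaults file under {{$SPARK_HOME/conf}}, so an exported {{SPARK_CONF_DIR}} is silently ignored. The real fix belongs in the spark-submit shell script itself; this small Python model just illustrates the intended precedence (assumed behavior: prefer {{SPARK_CONF_DIR}} when set, fall back to {{$SPARK_HOME/conf}}):

```python
# Model of the intended lookup order for spark-defaults.conf:
# SPARK_CONF_DIR takes precedence, $SPARK_HOME/conf is the fallback.
import os

def default_properties_file(env):
    conf_dir = env.get("SPARK_CONF_DIR") or os.path.join(env["SPARK_HOME"], "conf")
    return os.path.join(conf_dir, "spark-defaults.conf")

print(default_properties_file({"SPARK_HOME": "/opt/spark"}))
# /opt/spark/conf/spark-defaults.conf
print(default_properties_file({"SPARK_HOME": "/opt/spark",
                               "SPARK_CONF_DIR": "/etc/spark/conf"}))
# /etc/spark/conf/spark-defaults.conf
```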
[jira] [Commented] (SPARK-546) Support full outer join and multiple join in a single shuffle
[ https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234383#comment-14234383 ] Thiago Souza commented on SPARK-546: What about #2? Did you file a new ticket? I'm quite interested in this! Support full outer join and multiple join in a single shuffle - Key: SPARK-546 URL: https://issues.apache.org/jira/browse/SPARK-546 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Reporter: Reynold Xin Assignee: Aaron Staple Fix For: 1.2.0 RDD[(K,V)] now supports left/right outer join but not full outer join. Also it'd be nice to provide a way for users to join multiple RDDs on the same key in a single shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4746) integration tests should be separated from faster unit tests
Imran Rashid created SPARK-4746: --- Summary: integration tests should be separated from faster unit tests Key: SPARK-4746 URL: https://issues.apache.org/jira/browse/SPARK-4746 Project: Spark Issue Type: Bug Reporter: Imran Rashid Priority: Trivial Currently there isn't a good way for a developer to skip the longer integration tests. This can slow down local development. See http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html One option is to use scalatest's notion of test tags to tag all integration tests, so they could easily be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
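The tagging idea above is what ScalaTest's test tags provide: mark slow tests, then exclude that tag for fast local runs. A minimal language-neutral model of the mechanism, in Python (the names here are hypothetical, not the ScalaTest API):

```python
# Minimal model of tag-based test selection: run every test whose tag set
# is disjoint from the excluded tags. Names are hypothetical.
def run(tests, exclude=()):
    results = []
    for name, tags, fn in tests:
        if not set(tags) & set(exclude):
            results.append((name, fn()))
    return results

tests = [
    ("fast_unit",  [],              lambda: "ok"),
    ("slow_integ", ["integration"], lambda: "ok"),
]

# A developer's fast local run skips everything tagged "integration":
print(run(tests, exclude=["integration"]))  # [('fast_unit', 'ok')]
```

CI would simply call the runner with no exclusions so the integration tests still execute somewhere.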
[jira] [Commented] (SPARK-4727) Add dimensional RDDs (time series, spatial)
[ https://issues.apache.org/jira/browse/SPARK-4727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234399#comment-14234399 ] RJ Nowling commented on SPARK-4727: --- Thanks, Jeremy! Your work may cover my needs, and if not, it seems like a great place to contribute to! Was there some talk about encouraging people to build Spark libraries and putting together a community list? I'd love to see this sort of work advertised more. Add dimensional RDDs (time series, spatial) - Key: SPARK-4727 URL: https://issues.apache.org/jira/browse/SPARK-4727 Project: Spark Issue Type: Brainstorming Components: Spark Core Affects Versions: 1.1.0 Reporter: RJ Nowling Certain types of data (time series, spatial) can benefit from specialized RDDs. I'd like to open a discussion about this. For example, time series data should be ordered by time and would benefit from operations like: * Subsampling (taking every n data points) * Signal processing (correlations, FFTs, filtering) * Windowing functions Spatial data benefits from ordering and partitioning along a 2D or 3D grid. For example, path finding algorithms can be optimized by only comparing points within a set distance, which can be computed more efficiently by partitioning data into a grid. Although the operations on time series and spatial data may be different, there is some commonality in the sense of the data having ordered dimensions and the implementations may overlap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-546) Support full outer join and multiple join in a single shuffle
[ https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234404#comment-14234404 ] Reynold Xin commented on SPARK-546: --- Actually my experience implementing full join in a single shuffle is that it is fairly complicated and very hard to maintain. Since it is doable entirely in user code and given SparkSQL's SchemaRDD already supports it, I suggest not pulling this into Spark core. Support full outer join and multiple join in a single shuffle - Key: SPARK-546 URL: https://issues.apache.org/jira/browse/SPARK-546 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Reporter: Reynold Xin Assignee: Aaron Staple Fix For: 1.2.0 RDD[(K,V)] now supports left/right outer join but not full outer join. Also it'd be nice to provide a way for users to join multiple RDDs on the same key in a single shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234422#comment-14234422 ] Ryan Williams commented on SPARK-4747: -- [~vanzin] let me know what package you think it should go to and I'll make the change, if you like. Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
Ryan Williams created SPARK-4747: Summary: Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4739) spark.files.userClassPathFirst does not work in local[*] mode
[ https://issues.apache.org/jira/browse/SPARK-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234433#comment-14234433 ] Marcelo Vanzin commented on SPARK-4739: --- BTW my fix for SPARK-2996 (https://github.com/apache/spark/pull/3233) should also fix this. spark.files.userClassPathFirst does not work in local[*] mode - Key: SPARK-4739 URL: https://issues.apache.org/jira/browse/SPARK-4739 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.0 Reporter: Tobias Pfeiffer The parameter spark.files.userClassPathFirst=true does not work when using spark-submit with \-\-master local\[3\]. In particular, even though my application jar file contains netty-3.9.4.Final, the older version from the spark-assembly jar file is loaded (cf. SPARK-4738). When using the same jars with \-\-master yarn-cluster and spark.yarn.user.classpath.first=true (cf. SPARK-2996), it works correctly and my bundled classes are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234434#comment-14234434 ] Marcelo Vanzin commented on SPARK-4747: --- I don't really have a recommendation aside from not the UI package. Maybe the package where all the other job tracking types are declared. Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4683) Add a beeline.cmd to run on Windows
[ https://issues.apache.org/jira/browse/SPARK-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4683. Resolution: Fixed Fix Version/s: 1.2.0 Add a beeline.cmd to run on Windows --- Key: SPARK-4683 URL: https://issues.apache.org/jira/browse/SPARK-4683 Project: Spark Issue Type: New Feature Components: SQL Reporter: Matei Zaharia Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234445#comment-14234445 ] Patrick Wendell commented on SPARK-4747: Because this is an exposed API I'd prefer not to move it - I know many applications that build on this and it would break their code. It is slightly nicer to not nest it under the ui package but IMO it's not worth breaking user applications for this minor clean-up. Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4737) Prevent serialization errors from ever crashing the DAG scheduler
[ https://issues.apache.org/jira/browse/SPARK-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4737: - Affects Version/s: 1.2.0 Prevent serialization errors from ever crashing the DAG scheduler - Key: SPARK-4737 URL: https://issues.apache.org/jira/browse/SPARK-4737 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Patrick Wendell Assignee: Matthew Cheah Priority: Blocker Currently in Spark we assume that when tasks are serialized in the TaskSetManager the serialization cannot fail. We assume this because upstream in the DAGScheduler we attempt to catch any serialization errors by serializing a single partition. However, in some cases this upstream test is not accurate - i.e. an RDD can have one partition that serializes cleanly but not others. To do this in the proper way, we need to catch and propagate the exception at the time of serialization. The tricky bit is making sure it gets propagated in the right way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
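The failure mode described above, where one partition serializes cleanly while another does not, is easy to reproduce with any serializer. A hedged Python sketch using pickle (this only models the check; Spark's actual serializers and scheduler code differ):

```python
# One "partition" can serialize while another cannot, so probing only the
# first partition misses the failure. Illustrative; not Spark's serializer.
import pickle

partitions = [
    [1, 2, 3],            # serializes fine
    [1, 2, lambda x: x],  # lambdas are not picklable
]

def first_bad_partition(parts):
    """Return the index of the first unserializable partition, or None.
    Catching the error here lets the caller fail the job cleanly instead
    of crashing the scheduler."""
    for i, part in enumerate(parts):
        try:
            pickle.dumps(part)
        except Exception:
            return i
    return None

print(first_bad_partition(partitions))  # 1
```

Probing a single partition (as the DAGScheduler did) corresponds to calling `pickle.dumps(partitions[0])`, which succeeds here even though partition 1 would fail later.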
[jira] [Commented] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234448#comment-14234448 ] Marcelo Vanzin commented on SPARK-4747: --- Ah. It's a @DeveloperApi... that makes it trickier to move around. :-/ Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234456#comment-14234456 ] Ryan Williams commented on SPARK-4747: -- OK, feel free to wontfix this then Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234469#comment-14234469 ] Michael Armbrust commented on SPARK-4702: - It did in my testing. Please let us know if you are still having problems. To answer your question above, heterogeneous schemas are not officially supported in either mode. Depending on which file gets picked up when convertMetastoreParquet=true, it may or may not work (assuming you are only adding columns). See [SPARK-3851] for more info. Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234474#comment-14234474 ] Patrick Wendell commented on SPARK-4747: Yeah - okay if you guys don't mind I'll probably close this as wont fix. Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4747. Resolution: Won't Fix Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234479#comment-14234479 ] Patrick Wendell commented on SPARK-4740: Thanks for reporting this. We've run a bunch of tests and never found Netty to be slower than NIO, so this is a helpful piece of feedback. One unique thing about your environment is that you have 48 cores per node. Do you observe the same effect if you limit the parallelism on each node to fewer cores? Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time -- Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Attachments: Spark-perf Test Report.pdf When testing current spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc), the Netty based shuffle transferService takes much longer time than the NIO based shuffle transferService. The network throughput of Netty is only about half of that of NIO. We tested with standalone mode, and the data set we used for the test is 20 billion records with a total size of about 400GB. The spark-perf test is running on a 4-node cluster with 10G NICs, 48 cpu cores per node, and 64GB of memory per executor. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234479#comment-14234479 ] Patrick Wendell edited comment on SPARK-4740 at 12/4/14 7:00 PM: - Thanks for reporting this. We've run a bunch of tests and never found Netty to be slower than NIO, so this is a helpful piece of feedback. One unique thing about your environment is that you have 48 cores per node. Do you observe the same effect if you limit the parallelism on each node to fewer cores? /cc [~adav] [~rxin] was (Author: pwendell): Thanks for reporting this. We've run a bunch of tests and never found netty to be slower than NIO, so this is a helpful piece of feedback. One unique thing about your environment is that you have 48 cores per node. Do you observe the same effect if you limit the parallelism on each node to fewer cores?
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234489#comment-14234489 ] Reynold Xin commented on SPARK-4740: [~adav] Could it be the thread pool size being too small?
[jira] [Commented] (SPARK-4737) Prevent serialization errors from ever crashing the DAG scheduler
[ https://issues.apache.org/jira/browse/SPARK-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234538#comment-14234538 ] Michael Armbrust commented on SPARK-4737: - I think another big problem here is that the DAGScheduler restarts (somewhat silently) and comes back in a bad state. Perhaps if the DAGScheduler crashes we should kill the whole process if we aren't actually resilient to restarts. Prevent serialization errors from ever crashing the DAG scheduler - Key: SPARK-4737 URL: https://issues.apache.org/jira/browse/SPARK-4737 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Patrick Wendell Assignee: Matthew Cheah Priority: Blocker Currently in Spark we assume that when tasks are serialized in the TaskSetManager, the serialization cannot fail. We assume this because upstream in the DAGScheduler we attempt to catch any serialization errors by serializing a single partition. However, in some cases this upstream test is not accurate, i.e. an RDD can have one partition that serializes cleanly but others that do not. To do this properly we need to catch and propagate the exception at the time of serialization. The tricky bit is making sure it gets propagated in the right way.
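The guard being proposed can be illustrated outside of Spark (the helper and error names below are hypothetical, not the DAGScheduler's actual code): serialize every task eagerly and convert any failure into a typed error that the caller can treat as a task failure, instead of letting the exception escape the scheduling loop.

```python
import pickle


class TaskSerializationError(Exception):
    """Raised when a task's payload cannot be serialized."""


def serialize_tasks(tasks):
    # Attempt to serialize every task up front; a single bad partition
    # surfaces as a typed error rather than crashing the caller's loop.
    serialized = []
    for i, task in enumerate(tasks):
        try:
            serialized.append(pickle.dumps(task))
        except Exception as e:
            raise TaskSerializationError(
                "task %d failed to serialize: %s" % (i, e)) from e
    return serialized


good = [(1, "a"), (2, "b")]
bad = [(1, "a"), (2, lambda x: x)]  # lambdas are not picklable by the stdlib pickler

serialize_tasks(good)
try:
    serialize_tasks(bad)
except TaskSerializationError as e:
    print("caught:", e)
```

The point mirrors the description above: testing one partition up front is not enough, because serializability can differ per partition, so the check has to happen where each task is actually serialized.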
[jira] [Commented] (SPARK-4331) SBT Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1
[ https://issues.apache.org/jira/browse/SPARK-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234556#comment-14234556 ] Michael Armbrust commented on SPARK-4331: - I'll add that scalastyle does not run on test code either. SBT Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1 Key: SPARK-4331 URL: https://issues.apache.org/jira/browse/SPARK-4331 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.3.0 Reporter: Kousuke Saruta v0.13.1 and v0.12.0 are not standard directory structures for sbt's scalastyle plugin, so scalastyle doesn't work for sources under those directories.
[jira] [Resolved] (SPARK-4253) Ignore spark.driver.host in yarn-cluster and standalone-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4253. --- Resolution: Fixed Fix Version/s: 1.2.0 1.1.2 Issue resolved by pull request 3112 [https://github.com/apache/spark/pull/3112] Ignore spark.driver.host in yarn-cluster and standalone-cluster mode Key: SPARK-4253 URL: https://issues.apache.org/jira/browse/SPARK-4253 Project: Spark Issue Type: Bug Components: YARN Reporter: WangTaoTheTonic Priority: Minor Fix For: 1.1.2, 1.2.0 Attachments: Cannot assign requested address.txt We don't actually know where the driver will be before it is launched in yarn-cluster mode. If we set the spark.driver.host property, Spark will create an Actor on the hostname or IP as set, which leads to an error. So we should ignore this config item in yarn-cluster mode. As [~joshrosen] pointed out, we should also ignore it in standalone-cluster mode.
[jira] [Updated] (SPARK-4253) Ignore spark.driver.host in yarn-cluster and standalone-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4253: -- Assignee: WangTaoTheTonic
[jira] [Assigned] (SPARK-4731) Spark 1.1.1 launches broken EC2 clusters
[ https://issues.apache.org/jira/browse/SPARK-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reassigned SPARK-4731: Assignee: Andrew Or Spark 1.1.1 launches broken EC2 clusters Key: SPARK-4731 URL: https://issues.apache.org/jira/browse/SPARK-4731 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.1 Environment: Spark 1.1.1 on MacOS X Reporter: Jey Kottalam Assignee: Andrew Or EC2 clusters launched using Spark 1.1.1's `spark-ec2` script with the `-v 1.1.1` flag fail to initialize the master and workers correctly. The `/root/spark` directory contains only the `conf` directory and doesn't have the `bin` and other directories. [~joshrosen] suggested that [spark-ec2 #81](https://github.com/mesos/spark-ec2/pull/81) might have fixed it, but I still see this problem after that was merged. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4731) Spark 1.1.1 launches broken EC2 clusters
[ https://issues.apache.org/jira/browse/SPARK-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234666#comment-14234666 ] Andrew Or commented on SPARK-4731: -- This should work once https://github.com/mesos/spark-ec2/pull/82 is merged.
[jira] [Created] (SPARK-4748) PySpark can't read data in HDFS in YARN mode
Sebastián Ramírez created SPARK-4748: Summary: PySpark can't read data in HDFS in YARN mode Key: SPARK-4748 URL: https://issues.apache.org/jira/browse/SPARK-4748 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.1.1 Environment: Spark 1.1.1 precompiled for Hadoop 2.4, Hortonworks HDP 2.1, CentOS 6.6, (Anaconda 2.1.0 64-bit) Python 2.7.8, Numpy 1.9.0 Reporter: Sebastián Ramírez
Using *PySpark*, I'm unable to read and process data in *HDFS* in *YARN* cluster mode, but I can read data from HDFS in local mode. I have a 6-node cluster with Hortonworks HDP 2.1. The operating system is CentOS 6.6. I have installed Anaconda Python (which includes numpy) on every node for the user yarn.
h5. This works (*PySpark* local reading from HDFS):
When I start the console with:
{code}
IPYTHON=1 /home/hdfs/spark-1.1.1-bin-hadoop2.4/bin/pyspark --master local
{code}
Then I do (that file is in HDFS):
{code}
testdata = sc.textFile('/user/hdfs/testdata.csv')
{code}
And then:
{code}
testdata.first()
{code}
I get my data back:
{code}
u'asdf,qwer,1,M'
{code}
And if I do:
{code}
testdata.count()
{code}
It also works, I get:
{code}
500
{code}
h5. This also works (*Scala* in YARN cluster reading from HDFS):
When I start the console with:
{code}
/home/hdfs/spark-1.1.1-bin-hadoop2.4/bin/spark-shell --master yarn-client --num-executors 6 --executor-cores 2 --executor-memory 2G --driver-memory 2G
{code}
Then I do (that file is in HDFS):
{code}
val testdata = sc.textFile("/user/hdfs/testdata.csv")
{code}
And then:
{code}
testdata.first()
{code}
I get my data back:
{code}
res1: String = asdf,qwer,1,M
{code}
And if I do:
{code}
testdata.count()
{code}
It also works, I get:
{code}
res2: Long = 500
{code}
h5. This doesn't work (*PySpark* in YARN cluster reading from HDFS):
When I start the console with:
{code}
IPYTHON=1 /home/hdfs/spark-1.1.1-bin-hadoop2.4/bin/pyspark --master yarn-client --num-executors 6 --executor-cores 2 --executor-memory 2G --driver-memory 2G
{code}
Then I do (that file is in HDFS):
{code}
testdata = sc.textFile('/user/hdfs/testdata.csv')
{code}
And then:
{code}
testdata.first()
{code}
And I get some *INFO* logs, and then a *WARN*:
{code}
14/12/04 15:26:40 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, node05): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/yarn/local/usercache/hdfs/filecache/44/spark-assembly-1.1.1-hadoop2.4.0.jar/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/yarn/local/usercache/hdfs/filecache/44/spark-assembly-1.1.1-hadoop2.4.0.jar/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/hadoop/yarn/local/usercache/hdfs/filecache/44/spark-assembly-1.1.1-hadoop2.4.0.jar/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/hadoop/yarn/local/usercache/hdfs/filecache/44/spark-assembly-1.1.1-hadoop2.4.0.jar/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/home/hdfs/spark-1.1.1-bin-hadoop2.4/python/pyspark/rdd.py", line 1146, in takeUpToNumLeft
ImportError: No module named next
        org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
        org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
        org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:744)
14/12/04 15:26:40 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, node05, NODE_LOCAL, 1254 bytes)
14/12/04 15:26:40 INFO scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on executor node05: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/hadoop/yarn/local/usercache/hdfs/filecache/44/spark-assembly-1.1.1-hadoop2.4.0.jar/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/yarn/local/usercache/hdfs/filecache/44/spark-assembly-1.1.1-hadoop2.4.0.jar/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File
{code}
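A common cause of this class of worker-side ImportError (an assumption for this report, not a confirmed diagnosis) is that the YARN executors launch a different, older system Python than the driver's Anaconda interpreter. The usual check is to pin the executor interpreter via PYSPARK_PYTHON on every node; the Anaconda path below is illustrative:

```shell
# Hypothetical path: make executors use the same interpreter as the driver.
export PYSPARK_PYTHON=/opt/anaconda/bin/python
IPYTHON=1 /home/hdfs/spark-1.1.1-bin-hadoop2.4/bin/pyspark --master yarn-client
```

If the environment variable is not visible to the YARN containers, the interpreter mismatch persists even though the driver shell works, which matches the local-mode-works/cluster-mode-fails pattern above.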
[jira] [Created] (SPARK-4749) Allow initializing KMeans clusters using a seed
Nate Crosswhite created SPARK-4749: -- Summary: Allow initializing KMeans clusters using a seed Key: SPARK-4749 URL: https://issues.apache.org/jira/browse/SPARK-4749 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.1.0 Reporter: Nate Crosswhite Add an optional seed to MLlib KMeans clustering to allow initial cluster choices to be deterministic. Update the PySpark MLlib interface to also allow an optional seed parameter to be supplied.
[jira] [Commented] (SPARK-4749) Allow initializing KMeans clusters using a seed
[ https://issues.apache.org/jira/browse/SPARK-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234698#comment-14234698 ] Apache Spark commented on SPARK-4749: - User 'nxwhite-str' has created a pull request for this issue: https://github.com/apache/spark/pull/3610
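The requested behavior can be sketched in plain Python (this mirrors the idea only, not MLlib's actual k-means initializer): a seeded RNG makes the sampled initial centers reproducible across runs.

```python
import random


def init_centers(points, k, seed=None):
    # A fixed seed makes the choice of initial centers deterministic;
    # seed=None falls back to nondeterministic initialization.
    rng = random.Random(seed)
    return rng.sample(points, k)


points = [(float(i), float(i % 7)) for i in range(100)]
run1 = init_centers(points, 3, seed=42)
run2 = init_centers(points, 3, seed=42)
assert run1 == run2  # same seed, same initial clusters
```

Exposing `seed` as an optional keyword keeps existing callers unchanged while letting tests and reproducibility-sensitive jobs opt in.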
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234702#comment-14234702 ] Nicholas Chammas commented on SPARK-3431: - I think I'm on to something, but I need some help. I think I understand how to tell SBT to fork JVMs for tests, and I also think I got how to specify how the tests should be grouped in the various forked JVMs. It's not working because I think the forked JVMs are not getting passed all the options they need. Basically, I don't think that the reference to {{javaOptions}} [here in this line|https://github.com/nchammas/spark/blob/ab127b798dbfa9399833d546e627f9651b060918/project/SparkBuild.scala#L429] actually has all the options [defined earlier|https://github.com/nchammas/spark/blob/ab127b798dbfa9399833d546e627f9651b060918/project/SparkBuild.scala#L403-L418]. I don't know much Scala. If anyone could review what I have so far and give me some pointers, that would be great! You can see all the variations I've tried along with the associated output in [the open pull request|https://github.com/apache/spark/pull/3564]. I know we want to get this working with Maven, but I figured getting it to work first with SBT wouldn't be a bad thing. Parallelize execution of tests -- Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run.
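The sbt pattern under discussion can be sketched as a build fragment (a sketch under stated assumptions: setting and type names come from sbt 0.13's standard `Keys` and `Tests` API, and the one-test-per-group split is illustrative, not Spark's actual grouping logic). Forked test JVMs only receive the options explicitly placed in their `ForkOptions`, so the `javaOptions in Test` value has to be threaded into every group:

```scala
// sbt build fragment (sketch): one forked JVM per test, each given the
// full set of test javaOptions rather than the defaults.
testGrouping in Test := (definedTests in Test).value.map { test =>
  new Tests.Group(
    name = test.name,
    tests = Seq(test),
    runPolicy = Tests.SubProcess(
      ForkOptions(runJVMOptions = (javaOptions in Test).value)
    )
  )
}
```

If a group's `ForkOptions` omits the options, the symptom is exactly what the comment describes: forking works, but the forked JVMs are missing system properties and memory settings the tests rely on.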
[jira] [Resolved] (SPARK-4745) get_existing_cluster() doesn't work with additional security groups
[ https://issues.apache.org/jira/browse/SPARK-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4745. --- Resolution: Fixed Fix Version/s: 1.2.1 1.1.2 Issue resolved by pull request 3596 [https://github.com/apache/spark/pull/3596] get_existing_cluster() doesn't work with additional security groups --- Key: SPARK-4745 URL: https://issues.apache.org/jira/browse/SPARK-4745 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Alex DeBrie Fix For: 1.1.2, 1.2.1 The spark-ec2 script has a flag that allows you to add additional security groups to clusters when you launch. However, the get_existing_cluster() function cycles through active instances and only returns instances whose group_names == cluster_name + "-master" (or + "-slaves"), which is the group created by default. The get_existing_cluster() function is used to log in to, stop, and destroy existing clusters, among other actions. This is a pretty simple fix for which I've already submitted a [pull request|https://github.com/apache/spark/pull/3596]. It checks whether cluster_name + "-master" is in the list of groups for each active instance. This means the cluster group can be one among many groups, rather than the sole group for an instance.
[jira] [Updated] (SPARK-4745) get_existing_cluster() doesn't work with additional security groups
[ https://issues.apache.org/jira/browse/SPARK-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4745: -- Assignee: Alex DeBrie
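The fix described for get_existing_cluster() boils down to a membership test instead of an equality test (the function name below is hypothetical; the real change lives in the spark-ec2 script):

```python
def is_cluster_instance(instance_group_names, cluster_name):
    # The proposed check: the cluster's default group only has to be
    # *among* the instance's security groups, not its sole group.
    return (cluster_name + "-master" in instance_group_names
            or cluster_name + "-slaves" in instance_group_names)


# An instance that belongs to extra security groups is still matched:
assert is_cluster_instance(["spark-cluster-master", "extra-sg"], "spark-cluster")
# Instances from other clusters are not:
assert not is_cluster_instance(["other-cluster-master"], "spark-cluster")
```

With the original equality check, any instance launched with additional groups silently drops out of login, stop, and destroy operations.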
[jira] [Closed] (SPARK-4731) Spark 1.1.1 launches broken EC2 clusters
[ https://issues.apache.org/jira/browse/SPARK-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4731. Resolution: Fixed Fix Version/s: 1.1.1 Target Version/s: 1.1.1
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234758#comment-14234758 ] Aaron Davidson commented on SPARK-4740: --- Could you try setting spark.shuffle.io.serverThreads and spark.shuffle.io.clientThreads to 48? We have an artificial default maximum of 8 to limit off-heap memory usage, but it's possible this is not sufficient to saturate a 10 Gb/s link.
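The suggested change, as a spark-defaults.conf fragment (property names are taken from the comment above; whether 48 is the right value for a given cluster is the open question, and more threads mean more off-heap buffer memory):

```
spark.shuffle.io.serverThreads  48
spark.shuffle.io.clientThreads  48
```

The same pair can equally be passed per-job via `--conf` on spark-submit.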
[jira] [Updated] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4459: -- Affects Version/s: (was: 1.0.2) (was: 1.1.0) 1.1.2 1.2.0 JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors Key: SPARK-4459 URL: https://issues.apache.org/jira/browse/SPARK-4459 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0, 1.1.2 Reporter: Alok Saldanha Fix For: 1.1.1, 1.1.2
I believe this issue is essentially the same as SPARK-668. Original error:
{code}
[ERROR] /Users/saldaal1/workspace/JavaSparkSimpleApp/src/main/java/SimpleApp.java:[29,105] no suitable method found for groupBy(org.apache.spark.api.java.function.Function<scala.Tuple2<java.lang.String,java.lang.Long>,java.lang.Long>)
[ERROR] method org.apache.spark.api.java.JavaPairRDD.<K>groupBy(org.apache.spark.api.java.function.Function<scala.Tuple2<K,java.lang.Long>,K>) is not applicable
[ERROR] (inferred type does not conform to equality constraint(s)
{code}
from core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala:
{code}
/**
 * Return an RDD of grouped elements. Each group consists of a key and a sequence of elements
 * mapping to that key.
 */
def groupBy[K](f: JFunction[T, K]): JavaPairRDD[K, JIterable[T]] = {
  implicit val ctagK: ClassTag[K] = fakeClassTag
  implicit val ctagV: ClassTag[JList[T]] = fakeClassTag
  JavaPairRDD.fromRDD(groupByResultToJava(rdd.groupBy(f)(fakeClassTag)))
}
{code}
Then in core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala:
{code}
class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
    (implicit val kClassTag: ClassTag[K], implicit val vClassTag: ClassTag[V])
  extends JavaRDDLike[(K, V), JavaPairRDD[K, V]] {
{code}
The problem is that the type parameter T in JavaRDDLike is Tuple2[K,V], which means the combined signature for groupBy in the JavaPairRDD is
{code}
groupBy[K](f: JFunction[Tuple2[K,V], K])
{code}
which imposes an unfortunate correlation between the Tuple2 and the return type of the grouping function, namely that the return type of the grouping function must be the same as the first type of the JavaPairRDD. If we compare the method signature to flatMap:
{code}
/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U](f: FlatMapFunction[T, U]): JavaRDD[U] = {
  import scala.collection.JavaConverters._
  def fn = (x: T) => f.call(x).asScala
  JavaRDD.fromRDD(rdd.flatMap(fn)(fakeClassTag[U]))(fakeClassTag[U])
}
{code}
we see there should be an easy fix by changing the type parameter of the groupBy function from K to U.
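The suggested fix can be sketched as follows (this follows the proposal in the description; the committed patch may differ): a fresh type parameter decouples the grouping key from the pair RDD's own key type.

```scala
// Sketch: U is independent of T, so for JavaPairRDD[K, V] (where T = (K, V))
// the grouping function may return any key type, not just K.
def groupBy[U](f: JFunction[T, U]): JavaPairRDD[U, JIterable[T]] = {
  implicit val ctagK: ClassTag[U] = fakeClassTag
  implicit val ctagV: ClassTag[JList[T]] = fakeClassTag
  JavaPairRDD.fromRDD(groupByResultToJava(rdd.groupBy(f)(fakeClassTag)))
}
```

This matches the shape of flatMap[U], which already introduces its own type parameter instead of reusing one bound by the enclosing class.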
[jira] [Resolved] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4459. --- Resolution: Fixed Fix Version/s: 1.1.1 1.1.2 Issue resolved by pull request 3327 [https://github.com/apache/spark/pull/3327]
[jira] [Updated] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4459: -- Assignee: Alok Saldanha
[jira] [Updated] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4459: -- Affects Version/s: 1.0.0 JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors Key: SPARK-4459 URL: https://issues.apache.org/jira/browse/SPARK-4459 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.0, 1.2.0, 1.1.2 Reporter: Alok Saldanha Assignee: Alok Saldanha Fix For: 1.1.1, 1.1.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4459: -- Affects Version/s: (was: 1.1.2) (was: 1.2.0) (was: 1.0.0) 1.0.2 1.1.0 JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors Key: SPARK-4459 URL: https://issues.apache.org/jira/browse/SPARK-4459 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0 Reporter: Alok Saldanha Assignee: Alok Saldanha Fix For: 1.1.2, 1.2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4652) Add docs about spark-git-repo option
[ https://issues.apache.org/jira/browse/SPARK-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4652: -- Assignee: Kai Sasaki Add docs about spark-git-repo option Key: SPARK-4652 URL: https://issues.apache.org/jira/browse/SPARK-4652 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Kai Sasaki Assignee: Kai Sasaki Priority: Minor It was a little hard to understand how to use the --spark-git-repo option of the spark-ec2 script. Some additional documentation might be needed to use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4652) Add docs about spark-git-repo option
[ https://issues.apache.org/jira/browse/SPARK-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4652. --- Resolution: Fixed Fix Version/s: 1.2.1 1.1.2 Issue resolved by pull request 3513 [https://github.com/apache/spark/pull/3513] Add docs about spark-git-repo option Key: SPARK-4652 URL: https://issues.apache.org/jira/browse/SPARK-4652 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Kai Sasaki Assignee: Kai Sasaki Priority: Minor Fix For: 1.1.2, 1.2.1 It was a little hard to understand how to use the --spark-git-repo option of the spark-ec2 script. Some additional documentation might be needed to use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234783#comment-14234783 ] Nicholas Chammas commented on SPARK-3431: - As an aside, I expect there to be some work required to let certain tests play nicely with one another. But if we figure out how to specify test groupings and make sure the forked JVMs are configured correctly, refactoring tests where necessary should be very doable. Parallelize execution of tests -- Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
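The grouping idea in the comment above can be sketched generically (an illustrative sketch only, not Spark's actual dev/run-tests or SBT configuration): tests that must not interfere share a group and run serially, while distinct groups, each of which could map to a separate forked JVM, run in parallel.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelTestGroups {
    /** Runs each group's tests serially, but distinct groups concurrently. */
    static List<String> runGrouped(Map<String, List<Runnable>> groups, int parallelism)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        List<String> finished = Collections.synchronizedList(new ArrayList<>());
        groups.forEach((name, tests) -> pool.submit(() -> {
            tests.forEach(Runnable::run); // serial within a group
            finished.add(name);
        }));
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, List<Runnable>> groups = new HashMap<>();
        groups.put("sql", List.of(() -> {}, () -> {}));   // e.g. tests sharing a metastore
        groups.put("core", List.of(() -> {}));
        System.out.println(runGrouped(groups, 2).size()); // prints 2
    }
}
```

The group names and the in-process thread pool are stand-ins; the point is only the serial-within-group, parallel-across-groups discipline the comment describes.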
[jira] [Updated] (SPARK-4136) Under dynamic allocation, cancel outstanding executor requests when pending task queue is empty
[ https://issues.apache.org/jira/browse/SPARK-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4136: - Target Version/s: 1.3.0 (was: 1.2.0) Under dynamic allocation, cancel outstanding executor requests when pending task queue is empty --- Key: SPARK-4136 URL: https://issues.apache.org/jira/browse/SPARK-4136 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
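The policy in the SPARK-4136 title can be stated in a few lines. A hypothetical sketch (field names invented for illustration, not Spark's allocation-manager code): once the pending task queue is empty, any executor requests that have not yet been fulfilled serve no purpose and can be withdrawn.

```java
public class CancelRequestsSketch {
    int pendingTasks;                 // tasks waiting to be scheduled
    int outstandingExecutorRequests;  // requested but not yet granted executors

    /** Number of outstanding requests to cancel under the proposed policy. */
    int requestsToCancel() {
        return pendingTasks == 0 ? outstandingExecutorRequests : 0;
    }

    public static void main(String[] args) {
        CancelRequestsSketch s = new CancelRequestsSketch();
        s.pendingTasks = 0;
        s.outstandingExecutorRequests = 5;
        System.out.println(s.requestsToCancel()); // prints 5: all 5 are wasted
        s.pendingTasks = 3;
        System.out.println(s.requestsToCancel()); // prints 0: requests still useful
    }
}
```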
[jira] [Created] (SPARK-4750) Dynamic allocation - we need to synchronize kills
Andrew Or created SPARK-4750: Summary: Dynamic allocation - we need to synchronize kills Key: SPARK-4750 URL: https://issues.apache.org/jira/browse/SPARK-4750 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or https://github.com/apache/spark/blob/ab8177da2defab1ecd8bc0cd5a21f07be5b8d2c5/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L337 Simple omission on my part. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4750) Dynamic allocation - we need to synchronize kills
[ https://issues.apache.org/jira/browse/SPARK-4750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234838#comment-14234838 ] Apache Spark commented on SPARK-4750: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/3612 Dynamic allocation - we need to synchronize kills - Key: SPARK-4750 URL: https://issues.apache.org/jira/browse/SPARK-4750 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or https://github.com/apache/spark/blob/ab8177da2defab1ecd8bc0cd5a21f07be5b8d2c5/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L337 Simple omission on my part. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
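Unsynchronized kill bookkeeping is a classic check-then-act race, which is presumably why SPARK-4750 calls for synchronization. A minimal sketch (hypothetical class, not the actual CoarseGrainedSchedulerBackend code): without the synchronized keyword, two concurrent kills of the same executor could both pass the membership check and double-count.

```java
import java.util.HashSet;
import java.util.Set;

public class KillSketch {
    private final Set<String> executors = new HashSet<>(Set.of("exec-1", "exec-2"));
    private int killed = 0;

    // synchronized makes the check (contains) and the act (remove) atomic;
    // without it, two threads can both see exec-1 present and both "kill" it.
    public synchronized boolean killExecutor(String id) {
        if (!executors.contains(id)) return false;
        executors.remove(id);
        killed++;
        return true;
    }

    public synchronized int killedCount() { return killed; }

    public static void main(String[] args) throws InterruptedException {
        KillSketch backend = new KillSketch();
        Runnable kill = () -> backend.killExecutor("exec-1");
        Thread a = new Thread(kill), b = new Thread(kill);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(backend.killedCount()); // prints 1: only one kill wins
    }
}
```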
[jira] [Created] (SPARK-4751) Support dynamic allocation for standalone mode
Andrew Or created SPARK-4751: Summary: Support dynamic allocation for standalone mode Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This is equivalent to SPARK-3822 but for standalone mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4752) Classifier based on artificial neural network
Alexander Ulanov created SPARK-4752: --- Summary: Classifier based on artificial neural network Key: SPARK-4752 URL: https://issues.apache.org/jira/browse/SPARK-4752 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: Alexander Ulanov Fix For: 1.3.0 Implement classifier based on artificial neural network (ANN). Requirements: 1) Use the existing artificial neural network implementation https://issues.apache.org/jira/browse/SPARK-2352, https://github.com/apache/spark/pull/1290 2) Extend MLlib ClassificationModel trait, 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training, 4) Be able to return the ANN model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4752) Classifier based on artificial neural network
[ https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234855#comment-14234855 ] Alexander Ulanov edited comment on SPARK-4752 at 12/5/14 12:51 AM: --- The initial implementation can be found here: https://github.com/avulanov/spark/tree/annclassifier. It encodes the class label as a binary vector in the ANN output and selects the class based on biggest output value. The implementation contains unit tests as well. The mentioned code uses the following PR: https://github.com/apache/spark/pull/1290. It is not yet merged into the main branch. I think that I should not make a pull request until then. was (Author: avulanov): The initial implementation can be found here: https://github.com/avulanov/spark/tree/annclassifier. It codes the class label as a binary vector in the ANN output and selects the class based on biggest output value. The implementation contains unit tests as well. The mentioned code uses the following PR: https://github.com/apache/spark/pull/1290. It is not yet merged into the main branch. I think that I should not make a pull request until then. Classifier based on artificial neural network - Key: SPARK-4752 URL: https://issues.apache.org/jira/browse/SPARK-4752 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: Alexander Ulanov Fix For: 1.3.0 Original Estimate: 168h Remaining Estimate: 168h Implement classifier based on artificial neural network (ANN). 
Requirements: 1) Use the existing artificial neural network implementation https://issues.apache.org/jira/browse/SPARK-2352, https://github.com/apache/spark/pull/1290 2) Extend MLlib ClassificationModel trait, 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training, 4) Be able to return the ANN model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4752) Classifier based on artificial neural network
[ https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234855#comment-14234855 ] Alexander Ulanov commented on SPARK-4752: - The initial implementation can be found here: https://github.com/avulanov/spark/tree/annclassifier. It codes the class label as a binary vector in the ANN output and selects the class based on biggest output value. The implementation contains unit tests as well. The mentioned code uses the following PR: https://github.com/apache/spark/pull/1290. It is not yet merged into the main branch. I think that I should not make a pull request until then. Classifier based on artificial neural network - Key: SPARK-4752 URL: https://issues.apache.org/jira/browse/SPARK-4752 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: Alexander Ulanov Fix For: 1.3.0 Original Estimate: 168h Remaining Estimate: 168h Implement classifier based on artificial neural network (ANN). Requirements: 1) Use the existing artificial neural network implementation https://issues.apache.org/jira/browse/SPARK-2352, https://github.com/apache/spark/pull/1290 2) Extend MLlib ClassificationModel trait, 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training, 4) Be able to return the ANN model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
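The encoding scheme the comments above describe, class label as a binary (one-hot) vector on the ANN output and predicted class as the index of the biggest output value, is easy to sketch on its own (illustrative code, not taken from the linked branch):

```java
import java.util.Arrays;

public class OneHotSketch {
    /** Encodes a 0-based class label as a binary vector of length numClasses. */
    static double[] encode(int label, int numClasses) {
        double[] v = new double[numClasses];
        v[label] = 1.0;
        return v;
    }

    /** Selects the class whose output value is biggest (argmax). */
    static int decode(double[] output) {
        int best = 0;
        for (int i = 1; i < output.length; i++)
            if (output[i] > output[best]) best = i;
        return best;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode(2, 4))); // [0.0, 0.0, 1.0, 0.0]
        double[] netOut = {0.1, 0.2, 0.6, 0.1};            // hypothetical ANN output
        System.out.println(decode(netOut));                // prints 2
    }
}
```

Training targets are produced by encode, and decode turns the network's real-valued outputs back into a class index, which is what lets a regression-style ANN act as a classifier.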
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234876#comment-14234876 ] Saisai Shao commented on SPARK-4740: We also tested with a small dataset of about 40GB; Netty's performance is similar to NIO's. My guess is that Netty is not efficient when fetching a large number of shuffle blocks: in our 400GB case, each reduce task needs to fetch about 7000 shuffle blocks, and each shuffle block is only tens of KB in size. We will try increasing the shuffle thread number and test again. Judging from the call stack, all the shuffle clients are busy waiting on epoll_wait; I'm not sure whether this is expected. Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time -- Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Attachments: Spark-perf Test Report.pdf When testing current spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService takes much longer time than NIO based shuffle transferService. The network throughput of Netty is only about half of that of NIO. We tested with standalone mode, and the data set we used for test is 20 billion records, and the total size is about 400GB. Spark-perf test is Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each executor memory is 64GB. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
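The figures in the comment above are internally consistent, as a quick back-of-the-envelope check shows (assuming, for the sketch, that the shuffle data is spread evenly across reducers and blocks):

```java
public class ShuffleArithmetic {
    /** Average shuffle block size in KB, assuming an even spread of data. */
    static double perBlockKB(double totalGB, int reduceTasks, int blocksPerReducer) {
        double perReducerKB = totalGB * 1024 * 1024 / reduceTasks; // GB -> KB
        return perReducerKB / blocksPerReducer;
    }

    public static void main(String[] args) {
        // Figures from the comment: ~400GB shuffled, 1000 reduce tasks,
        // ~7000 blocks fetched per reduce task.
        double kb = perBlockKB(400.0, 1000, 7000);
        System.out.printf("~%.0f KB per shuffle block%n", kb); // ~60 KB: "tens of KB"
    }
}
```

This supports the comment's hypothesis that the workload is dominated by many small fetches rather than a few large transfers.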
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234902#comment-14234902 ] Saisai Shao commented on SPARK-4740: Besides, we also tested with a 24-core WSM CPU; the performance of Netty is still slower than NIO. Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time -- Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Attachments: Spark-perf Test Report.pdf When testing current spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService takes much longer time than NIO based shuffle transferService. The network throughput of Netty is only about half of that of NIO. We tested with standalone mode, and the data set we used for test is 20 billion records, and the total size is about 400GB. Spark-perf test is Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each executor memory is 64GB. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Description: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue was:This is equivalent to SPARK-3822 but for standalone mode. Support dynamic allocation for standalone mode -- Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Affects Version/s: 1.2.0 Support dynamic allocation for standalone mode -- Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Description: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because of the scheduling mechanisms in the standalone Master. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. This means an application could get executors of different sizes (in terms of cores) if we kill and then request executors. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. was: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue Support dynamic allocation for standalone mode -- Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because of the scheduling mechanisms in the standalone Master. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. This means an application could get executors of different sizes (in terms of cores) if we kill and then request executors. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Description: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue the scheduling mechanism in the standalone Master uses different semantics. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. Unfortunately, this means an application could get executors of different sizes (in terms of cores) if: 1) App 1 kills an executor 2) App 2, with spark.cores.max set, grabs a subset of cores on a worker 3) App 1 requests an executor In this case, App 1 will get back an executor of half the number of cores. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. was: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because of the scheduling mechanisms in the standalone Master. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. This means an application could get executors of different sizes (in terms of cores) if we kill and then request executors. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. 
Support dynamic allocation for standalone mode -- Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Description: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because the scheduling mechanism in the standalone Master uses different semantics. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. Unfortunately, this means an application could get executors of different sizes (in terms of cores) if: 1) App 1 kills an executor 2) App 2, with spark.cores.max set, grabs a subset of cores on a worker 3) App 1 requests an executor In this case, App 1 will get back an executor of half the number of cores. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. was: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue the scheduling mechanism in the standalone Master uses different semantics. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. Unfortunately, this means an application could get executors of different sizes (in terms of cores) if: 1) App 1 kills an executor 2) App 2, with spark.cores.max set, grabs a subset of cores on a worker 3) App 1 requests an executor In this case, App 1 will get back an executor of half the number of cores. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. 
Support dynamic allocation for standalone mode -- Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Description: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because the scheduling mechanism in the standalone Master uses different semantics. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. Unfortunately, this means an application could get executors of different sizes (in terms of cores) if: 1) App 1 kills an executor 2) App 2, with spark.cores.max set, grabs a subset of cores on a worker 3) App 1 requests an executor In this case, the new executor that App 1 gets back will be smaller than the rest and can execute fewer tasks in parallel. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. was: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because the scheduling mechanism in the standalone Master uses different semantics. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. Unfortunately, this means an application could get executors of different sizes (in terms of cores) if: 1) App 1 kills an executor 2) App 2, with spark.cores.max set, grabs a subset of cores on a worker 3) App 1 requests an executor In this case, App 1 will get back an executor of half the number of cores. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. 
As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes.
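The three-step scenario in the description can be made concrete with a small simulation. This is a hypothetical sketch of the greedy, core-based allocation behavior described above, not Spark's actual Master code; the `Worker` and `launch_executor` names are illustrative only.

```python
# Sketch (not Spark's Master implementation) of how core-based greedy
# allocation in standalone mode can yield executors of uneven sizes.
# Standalone mode launches at most one executor per worker per application,
# sized by whatever cores happen to be free at request time.

class Worker:
    def __init__(self, name, cores):
        self.name = name
        self.free = cores

def launch_executor(worker, requested):
    """Grant whatever cores are free on the worker, up to the request."""
    granted = min(worker.free, requested)
    worker.free -= granted
    return granted

w = Worker("worker-1", cores=8)

# App 1 initially holds an 8-core executor on worker-1, then kills it.
app1_exec = launch_executor(w, 8)   # granted 8 cores
w.free += app1_exec                 # step 1: App 1 kills its executor

# Step 2: App 2, with spark.cores.max set, grabs a subset of the cores.
app2_exec = launch_executor(w, 4)   # granted 4 cores

# Step 3: App 1 requests an executor again, but only 4 cores remain,
# so its new executor is half the size of the original one.
app1_new = launch_executor(w, 8)
print(app1_new)  # 4
```

Under this model the imbalance follows directly from granting partial core counts rather than rejecting the request, which is the tension the issue describes.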
[jira] [Assigned] (SPARK-4421) Wrong link in spark-standalone.html
[ https://issues.apache.org/jira/browse/SPARK-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-4421: - Assignee: Josh Rosen Wrong link in spark-standalone.html --- Key: SPARK-4421 URL: https://issues.apache.org/jira/browse/SPARK-4421 Project: Spark Issue Type: Bug Components: Documentation Reporter: Masayoshi TSUZUKI Assignee: Josh Rosen Priority: Trivial Fix For: 1.1.2, 1.2.1 The link about building Spark on the documentation page Spark Standalone Mode (spark-standalone.html) is wrong. The link points to {{index.html#building}}, but that anchor only exists up to 0.9. The building guide was moved to another page ({{building-with-maven.html}} in 1.0 and 1.1, and {{building-spark.html}} in 1.2).
[jira] [Resolved] (SPARK-4421) Wrong link in spark-standalone.html
[ https://issues.apache.org/jira/browse/SPARK-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4421. --- Resolution: Fixed Fix Version/s: 1.2.1, 1.1.2 Issue resolved by pull request 3279 [https://github.com/apache/spark/pull/3279]
[jira] [Updated] (SPARK-4421) Wrong link in spark-standalone.html
[ https://issues.apache.org/jira/browse/SPARK-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4421: -- Assignee: Masayoshi TSUZUKI (was: Josh Rosen)
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234937#comment-14234937 ] Zhang, Liye commented on SPARK-4740: We found this issue while running the performance test for [SPARK-2926|https://issues.apache.org/jira/browse/SPARK-2926]. Since [SPARK-2926|https://issues.apache.org/jira/browse/SPARK-2926] takes less time in the reduce phase, the difference between Netty and NIO there is not too large, about 20%. So we tested the master branch, where the difference is more significant, more than 30%. Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time -- Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Attachments: Spark-perf Test Report.pdf When testing the current Spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc.), the Netty based shuffle transferService takes much longer than the NIO based shuffle transferService. The network throughput of Netty is only about half that of NIO. We tested in standalone mode; the data set used for the test is 20 billion records, about 400GB in total. The spark-perf test is running on a 4-node cluster with 10G NIC, 48 CPU cores per node, and 64GB memory per executor. The number of reduce tasks is set to 1000.
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234941#comment-14234941 ] Zhang, Liye commented on SPARK-4740: [~adav], I have tested by setting spark.shuffle.io.serverThreads and spark.shuffle.io.clientThreads to 48; the result does not change, Netty still takes the same 39 minutes for the reduce phase.
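For reference, the settings tried in this comment would look like the following in spark-defaults.conf. This is a sketch of the configuration under discussion, assuming the defaults file is the mechanism used (the keys themselves are the ones named in the comment; the value 48 matches the per-node core count from the issue description):

```
# spark-defaults.conf -- thread-count settings tried in this thread
spark.shuffle.io.serverThreads   48
spark.shuffle.io.clientThreads   48
```

The same keys can equivalently be passed via `--conf` on spark-submit.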
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234960#comment-14234960 ] Reynold Xin commented on SPARK-4740: Can you limit the number of cores to a lower value and see what happens? I.e., try it with 16 threads and see if the problem still exists. Thanks.
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234963#comment-14234963 ] Reynold Xin commented on SPARK-4740: Also, can you take a few more jstacks and paste them here? Thanks.
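Capturing a handful of thread dumps as requested can be scripted. Below is a minimal sketch that shells out to the JDK's `jstack` tool; the `pid` value and the `runner` parameter (injected so the loop is testable without a live JVM) are illustrative, not part of any Spark tooling.

```python
# Sketch: take several jstack thread dumps of an executor JVM, spaced apart.
# Assumes the JDK's `jstack` binary is on PATH and `pid` is the executor's
# process id (placeholder values below).
import subprocess
import time

def jstack_cmd(pid):
    """Build the jstack invocation for one dump (-l includes lock info)."""
    return ["jstack", "-l", str(pid)]

def capture_dumps(pid, count=5, interval_s=2, runner=subprocess.check_output):
    """Take `count` dumps `interval_s` seconds apart; return them as strings."""
    dumps = []
    for i in range(count):
        dumps.append(runner(jstack_cmd(pid)).decode("utf-8", "replace"))
        if i + 1 < count:
            time.sleep(interval_s)
    return dumps

# Example usage with a real executor pid:
# for i, d in enumerate(capture_dumps(12345, count=3)):
#     with open(f"jstack-{i}.txt", "w") as f:
#         f.write(d)
```

Several dumps taken a few seconds apart make it possible to distinguish threads that are persistently blocked from ones caught mid-transition.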
[jira] [Created] (SPARK-4753) Parquet2 does not prune based on OR filters on partition columns
Michael Armbrust created SPARK-4753: --- Summary: Parquet2 does not prune based on OR filters on partition columns Key: SPARK-4753 URL: https://issues.apache.org/jira/browse/SPARK-4753 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust Assignee: Michael Armbrust
[jira] [Updated] (SPARK-4753) Parquet2 does not prune based on OR filters on partition columns
[ https://issues.apache.org/jira/browse/SPARK-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4753: Priority: Blocker (was: Major)
[jira] [Commented] (SPARK-4753) Parquet2 does not prune based on OR filters on partition columns
[ https://issues.apache.org/jira/browse/SPARK-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234973#comment-14234973 ] Apache Spark commented on SPARK-4753: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/3613
[jira] [Updated] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhang, Liye updated SPARK-4740: --- Attachment: TestRunner sort-by-key - Thread dump for executor 1_files (48 Cores per node).zip
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234983#comment-14234983 ] Zhang, Liye commented on SPARK-4740: [~rxin] I attached the thread dump of one executor (48 cores) during the reduce phase; please take a look. I'll try 16 cores later on.
[jira] [Created] (SPARK-4754) ExecutorAllocationManager should not take in SparkContext
Andrew Or created SPARK-4754: Summary: ExecutorAllocationManager should not take in SparkContext Key: SPARK-4754 URL: https://issues.apache.org/jira/browse/SPARK-4754 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or We should refactor ExecutorAllocationManager to not take in a SparkContext. Otherwise, new developers may try to add a lot of unnecessary pointers.
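The refactoring idea here is the classic "depend on a narrow interface, not the whole context" pattern. The sketch below illustrates that shape only; the class and method names are hypothetical and do not reflect Spark's actual implementation of this ticket.

```python
# Hypothetical sketch of the proposed refactor: instead of handing the
# allocation manager the entire SparkContext, pass only the small set of
# operations it actually needs, so developers cannot reach into unrelated
# context state through it.

class ExecutorAllocationClient:
    """Narrow interface: the few operations the manager really uses."""
    def request_executors(self, num: int) -> None: ...
    def kill_executor(self, executor_id: str) -> None: ...

class ExecutorAllocationManager:
    # Depends on the small client interface, not the full context.
    def __init__(self, client: ExecutorAllocationClient, max_executors: int):
        self.client = client
        self.max_executors = max_executors
        self.current = 0

    def scale_up(self, num: int) -> int:
        """Request up to `num` more executors, capped at max_executors."""
        granted = min(num, self.max_executors - self.current)
        if granted > 0:
            self.client.request_executors(granted)
            self.current += granted
        return granted
```

A narrow client interface also makes the manager trivially testable with a fake client, which is one common motivation for this kind of change.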