[jira] [Created] (SPARK-2384) Add tooltips for shuffle write and scheduler delay in UI

2014-07-07 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-2384:
-

 Summary: Add tooltips for shuffle write and scheduler delay in UI
 Key: SPARK-2384
 URL: https://issues.apache.org/jira/browse/SPARK-2384
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor


There are a few common points of confusion in the UI that could be clarified 
with tooltips. We should add tooltips to explain the scheduler delay and the 
shuffle data (in particular, to explain why shuffle read is typically < shuffle 
write, which many people have expressed confusion about).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2382) build error:

2014-07-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053433#comment-14053433
 ] 

Sean Owen commented on SPARK-2382:
--

I can't reproduce this. It is almost certainly a problem with your environment 
or network, possibly temporary. It's not a Spark issue so I think this should 
be closed.

 build error: 
 -

 Key: SPARK-2382
 URL: https://issues.apache.org/jira/browse/SPARK-2382
 Project: Spark
  Issue Type: Question
  Components: Build
Affects Versions: 1.0.0
 Environment: Ubuntu 12.0.4 precise. 
 spark@ubuntu-cdh5-spark:~/spark-1.0.0$ mvn -version
 Apache Maven 3.0.4
 Maven home: /usr/share/maven
 Java version: 1.6.0_31, vendor: Sun Microsystems Inc.
 Java home: /usr/lib/jvm/j2sdk1.6-oracle/jre
 Default locale: en_US, platform encoding: UTF-8
 OS name: linux, version: 3.11.0-15-generic, arch: amd64, family: unix
Reporter: Mukul Jain
  Labels: newbie

 Unable to build: Maven can't download a dependency. I checked my http_proxy and 
 https_proxy settings and they are working fine; other HTTP and HTTPS dependencies 
 were downloaded fine. The build process always gets stuck at this repository, and 
 manually downloading also fails with an exception. 
 [INFO] 
 
 [INFO] Building Spark Project External MQTT 1.0.0
 [INFO] 
 
 Downloading: 
 https://repository.apache.org/content/repositories/releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
 Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector 
 executeWithRetry
 INFO: I/O exception (java.net.ConnectException) caught when processing 
 request: Connection timed out
 Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector 
 executeWithRetry
 INFO: Retrying request



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2386) RowWriteSupport should use the exact types to cast.

2014-07-07 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053537#comment-14053537
 ] 

Takuya Ueshin commented on SPARK-2386:
--

PRed: https://github.com/apache/spark/pull/1315

 RowWriteSupport should use the exact types to cast.
 ---

 Key: SPARK-2386
 URL: https://issues.apache.org/jira/browse/SPARK-2386
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin

 When executing {{saveAsParquetFile}} with a non-primitive type, 
 {{RowWriteSupport}} uses the wrong type {{Int}} for {{ByteType}} and 
 {{ShortType}}.
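
To make the failure mode concrete, here is a minimal, self-contained Scala sketch (not the actual RowWriteSupport code): a ByteType value sits in a Row as a boxed Byte, so casting it straight to Int fails at runtime; the exact type has to be used first and then widened.

{code}
object ExactTypeCastDemo {
  def main(args: Array[String]): Unit = {
    val v: Any = 1.toByte                 // a ByteType value as held in a Row (boxed java.lang.Byte)
    println(v.asInstanceOf[Byte].toInt)   // cast to the exact type, then widen: prints 1
    try {
      println(v.asInstanceOf[Int])        // what an Int cast effectively does for ByteType
    } catch {
      case e: ClassCastException => println("ClassCastException: " + e.getMessage)
    }
  }
}
{code}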



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2387) Remove the stage barrier for better resource utilization

2014-07-07 Thread Rui Li (JIRA)
Rui Li created SPARK-2387:
-

 Summary: Remove the stage barrier for better resource utilization
 Key: SPARK-2387
 URL: https://issues.apache.org/jira/browse/SPARK-2387
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Rui Li


DAGScheduler divides a Spark job into multiple stages according to RDD 
dependencies. Whenever there is a shuffle dependency, DAGScheduler creates a 
shuffle map stage on the map side, and another stage that depends on it.
Currently, the downstream stage cannot start until all the stages it depends on 
have finished. This barrier between stages leads to idle slots while waiting for 
the last few upstream tasks to finish, wasting cluster resources.
We therefore propose to remove the barrier and pre-start the reduce stage once 
there are free slots. This can achieve better resource utilization and improve 
overall job performance, especially when many executors are granted to the 
application.
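
For reference, a minimal word-count example of the stage structure described above (the input path is hypothetical): the reduceByKey introduces the shuffle dependency, so DAGScheduler creates a shuffle map stage and a downstream result stage that today starts only after every map task has finished.

{code}
val words = sc.textFile("hdfs:///tmp/input")   // hypothetical input path
val counts = words
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _)                          // shuffle dependency: stage boundary
counts.count()                                 // triggers the map stage, then the reduce stage
{code}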



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2388) Streaming from multiple different Kafka topics is problematic

2014-07-07 Thread Sergey (JIRA)
Sergey created SPARK-2388:
-

 Summary: Streaming from multiple different Kafka topics is 
problematic
 Key: SPARK-2388
 URL: https://issues.apache.org/jira/browse/SPARK-2388
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Sergey
 Fix For: 1.0.1


The default way of creating a stream from a Kafka source would be:

val stream = KafkaUtils.createStream(ssc, "localhost:2181", "logs", 
Map("retarget" -> 2, "datapair" -> 2))

However, if the two topics - in this case "retarget" and "datapair" - are very 
different, there is no way to set up different filters, mapping functions, etc., 
as they are effectively merged.

Internally, the KafkaInputDStream instance created by this call invokes 
ConsumerConnector.createMessageStreams(), which returns a *map* of KafkaStreams 
keyed by topic. It would be great if this map were exposed somehow, so that the 
aforementioned call

val streamS = KafkaUtils.createStreamS(...)

returned a map of streams.
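
In the meantime, a possible workaround sketch (assuming the Spark 1.0 KafkaUtils.createStream API; ssc is an existing StreamingContext and the transformations are stand-ins): create one stream per topic so each gets its own logic, at the cost of one receiver per call.

{code}
import org.apache.spark.streaming.kafka.KafkaUtils

// One receiver per topic, so each DStream can be transformed independently.
val retargetStream = KafkaUtils.createStream(ssc, "localhost:2181", "logs", Map("retarget" -> 2))
val datapairStream = KafkaUtils.createStream(ssc, "localhost:2181", "logs", Map("datapair" -> 2))

// Topic-specific logic; these transformations are placeholders.
val retargetUpper = retargetStream.map { case (_, msg) => msg.toUpperCase }
val datapairNonEmpty = datapairStream.filter { case (_, msg) => msg.nonEmpty }
{code}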

Regards,
Sergey Malov
Collective Media



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2389) globally shared SparkContext / shared Spark application

2014-07-07 Thread Robert Stupp (JIRA)
Robert Stupp created SPARK-2389:
---

 Summary: globally shared SparkContext / shared Spark application
 Key: SPARK-2389
 URL: https://issues.apache.org/jira/browse/SPARK-2389
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Robert Stupp


The documentation (in the Cluster Mode Overview) states:

bq. Each application gets its own executor processes, which *stay up for the 
duration of the whole application* and run tasks in multiple threads. This has 
the benefit of isolating applications from each other, on both the scheduling 
side (each driver schedules its own tasks) and executor side (tasks from 
different applications run in different JVMs). However, it also means that 
*data cannot be shared* across different Spark applications (instances of 
SparkContext) without writing it to an external storage system.

IMO this is a limitation that should be lifted to let any number of --driver-- 
client processes share executors and share (persistent / cached) data.

This is especially useful if you have a bunch of frontend servers (dumb web app 
servers) that want to use Spark as a _big computing machine_. Most important is 
the fact that Spark is quite good at caching/persisting data in memory / on 
disk, thus removing load from backend data stores.

In other words: it would be really great to let different --driver-- client JVMs 
operate on the same RDDs and benefit from Spark's caching/persistence.

It would, however, require some administration mechanisms to
* start a shared context
* update the executor configuration (# of worker nodes, # of cpus, etc) on the 
fly
* stop a shared context

Even conventional batch MR applications would benefit if run frequently 
against the same data set.

As an implicit requirement, RDD persistence could get a TTL for its 
materialized state.

With such a feature, the overall performance of today's web applications could 
then be increased by adding more web app servers, more Spark nodes, more NoSQL 
nodes, etc.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2384) Add tooltips for shuffle write and scheduler delay in UI

2014-07-07 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053834#comment-14053834
 ] 

Sandy Ryza commented on SPARK-2384:
---

This is a great idea

 Add tooltips for shuffle write and scheduler delay in UI
 

 Key: SPARK-2384
 URL: https://issues.apache.org/jira/browse/SPARK-2384
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor

 There are a few common points of confusion in the UI that could be clarified 
 with tooltips. We should add tooltips to explain the scheduler delay and the 
 shuffle data (in particular, to explain why shuffle read is typically < shuffle 
 write, which many people have expressed confusion about).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2160) error of Decision tree algorithm in Spark MLlib

2014-07-07 Thread Jon Sondag (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053938#comment-14053938
 ] 

Jon Sondag commented on SPARK-2160:
---

https://github.com/apache/spark/pull/1316 (also resolves SPARK-2152)

 error of  Decision tree algorithm  in Spark MLlib 
 --

 Key: SPARK-2160
 URL: https://issues.apache.org/jira/browse/SPARK-2160
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: caoli
  Labels: patch
 Fix For: 1.1.0

   Original Estimate: 4h
  Remaining Estimate: 4h

 There is an error in computing rightNodeAgg in the decision tree algorithm in 
 Spark MLlib: in the function extractLeftRightNodeAggregates(), the binData 
 index used to compute rightNodeAgg is wrong. In DecisionTree.scala, around line 
 980:
  rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
    binData(shift + (2 * (numBins - 2 - splitIndex))) +
      rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))
 The index computed by binData(shift + (2 * (numBins - 2 - splitIndex))) is 
 wrong, so the result of rightNodeAgg contains repeated data across bins.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-2390:
-

 Summary: Files in staging directory cannot be deleted and wastes 
the space of HDFS
 Key: SPARK-2390
 URL: https://issues.apache.org/jira/browse/SPARK-2390
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kousuke Saruta


When running jobs in YARN cluster mode and using the HistoryServer, the files in 
the staging directory cannot be deleted.

The HistoryServer uses the directory where the event log is written, and that 
directory is represented as an instance of o.a.h.f.FileSystem created using 
FileSystem.get.

{code:title=FileLogger.scala}
private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
{code}
{code:title=Utils.getHadoopFileSystem}
def getHadoopFileSystem(path: URI): FileSystem = {
  FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
}
{code}

On the other hand, ApplicationMaster has an instance named fs, which is also 
created using FileSystem.get.

{code:title=ApplicationMaster}
private val fs = FileSystem.get(yarnConf)
{code}

FileSystem.get returns the same cached instance when the URI passed to it 
represents the same file system and the method is called by the same user.

Because of this behavior, when the directory for the event log is on HDFS, the 
fs of ApplicationMaster and the fileSystem of FileLogger are the same instance.

When shutting down ApplicationMaster, fileSystem.close is called in 
FileLogger#stop, which is invoked indirectly by SparkContext#stop.
{code:title=FileLogger.stop}
def stop() {
  hadoopDataStream.foreach(_.close())
  writer.foreach(_.close())
  fileSystem.close()
}
{code}

ApplicationMaster#cleanupStagingDir is also called by a JVM shutdown hook. In 
this method, fs.delete(stagingDirPath) is invoked.

Because fs.delete in ApplicationMaster is called after fileSystem.close in 
FileLogger, fs.delete fails, so the files in the staging directory are not 
deleted.
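
The caching behavior can be seen in isolation with a small sketch against the Hadoop FileSystem API (the HDFS URI is hypothetical): two get calls for the same file system and user return the very same object, so closing one invalidates the other.

{code}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object FsCacheDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val a = FileSystem.get(new URI("hdfs://namenode:8020/"), conf)
    val b = FileSystem.get(new URI("hdfs://namenode:8020/user/spark"), conf)
    println(a eq b)   // true: FileSystem.get served the same cached instance
    a.close()         // from here on, operations through b fail with "Filesystem closed"
  }
}
{code}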




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2391) LIMIT queries ship a whole partition of data

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2391:


Fix Version/s: 1.1.0

 LIMIT queries ship a whole partition of data
 

 Key: SPARK-2391
 URL: https://issues.apache.org/jira/browse/SPARK-2391
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust

 Basically the problem here is that Spark's take() runs jobs using allowLocal 
 = true.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2391) LIMIT queries ship a whole partition of data

2014-07-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2391:
---

 Summary: LIMIT queries ship a whole partition of data
 Key: SPARK-2391
 URL: https://issues.apache.org/jira/browse/SPARK-2391
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust


Basically the problem here is that Spark's take() runs jobs using allowLocal = 
true.
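
For context, here is a rough sketch of the mechanism the description refers to (Spark 1.0-era API; the RDD and the single-partition choice are only illustrative): take() submits incremental jobs through SparkContext.runJob with allowLocal = true and pulls a partition's elements back to the driver.

{code}
// Roughly what take(n) does on its first attempt: evaluate partition 0 only,
// allowing the job to run locally on the driver (allowLocal = true).
val rdd = sc.parallelize(1 to 1000000, 100)
val firstChunk: Array[Array[Int]] = sc.runJob(
  rdd,
  (it: Iterator[Int]) => it.take(10).toArray,   // limit applied per partition
  Seq(0),                                       // only the first partition
  allowLocal = true)
{code}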



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2391) LIMIT queries ship a whole partition of data

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2391:


Target Version/s: 1.1.0
   Fix Version/s: (was: 1.1.0)

 LIMIT queries ship a whole partition of data
 

 Key: SPARK-2391
 URL: https://issues.apache.org/jira/browse/SPARK-2391
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust

 Basically the problem here is that Spark's take() runs jobs using allowLocal 
 = true.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-07-07 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053994#comment-14053994
 ] 

Xiangrui Meng commented on SPARK-1977:
--

I think now I understand when it happens. We use storage level MEMORY_AND_DISK 
for the user/product in/out links, which contain BitSet objects. If the dataset 
is large, these RDDs will be pushed from in-memory storage to on-disk storage, 
and the latter requires serialization. So the easiest way to reproduce this 
error is to change the storage level of inLinks/outLinks to DISK_ONLY and run 
with Kryo.

[~neville] Instead of mapping mutable.BitSet to immutable.BitSet, which 
introduces overhead, we can register mutable.BitSet in our MovieLensALS example 
code and wait for the next Chill release. Does that sound good to you?
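
A minimal sketch of the registration being discussed (the registrator class name and the configuration snippet are illustrative, not the eventual patch):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Registers mutable.BitSet so OutLinkBlock fields serialize correctly under Kryo.
class ALSKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[scala.collection.mutable.BitSet])
  }
}

// The example would then set, before creating the SparkContext:
//   conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//   conf.set("spark.kryo.registrator", classOf[ALSKryoRegistrator].getName)
{code}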

 mutable.BitSet in ALS not serializable with KryoSerializer
 --

 Key: SPARK-1977
 URL: https://issues.apache.org/jira/browse/SPARK-1977
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Minor

 OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
 KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
 register mutable.BitSet.
 Right now we have to register mutable.BitSet manually. A proper fix would be 
 using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
 com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
 scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
   at 

[jira] [Updated] (SPARK-2386) RowWriteSupport should use the exact types to cast.

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2386:


Target Version/s: 1.1.0

 RowWriteSupport should use the exact types to cast.
 ---

 Key: SPARK-2386
 URL: https://issues.apache.org/jira/browse/SPARK-2386
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin

 When execute {{saveAsParquetFile}} with non-primitive type, 
 {{RowWriteSupport}} uses wrong type {{Int}} for {{ByteType}} and 
 {{ShortType}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-07-07 Thread Neville Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053999#comment-14053999
 ] 

Neville Li commented on SPARK-1977:
---

[~mengxr] sounds good to me.

 mutable.BitSet in ALS not serializable with KryoSerializer
 --

 Key: SPARK-1977
 URL: https://issues.apache.org/jira/browse/SPARK-1977
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Minor

 OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
 KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
 register mutable.BitSet.
 Right now we have to register mutable.BitSet manually. A proper fix would be 
 using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
 com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
 scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at scala.Option.foreach(Option.scala:236)
   at 
 

[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-07-07 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054003#comment-14054003
 ] 

Xiangrui Meng commented on SPARK-1977:
--

Do you mind creating a PR registering mutable.BitSet in MovieLensALS.scala and 
closing PR #925? Thanks!

 mutable.BitSet in ALS not serializable with KryoSerializer
 --

 Key: SPARK-1977
 URL: https://issues.apache.org/jira/browse/SPARK-1977
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Minor

 OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
 KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
 register mutable.BitSet.
 Right now we have to register mutable.BitSet manually. A proper fix would be 
 using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
 com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
 scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at scala.Option.foreach(Option.scala:236)
   

[jira] [Updated] (SPARK-2311) Added additional GLMs (Poisson and Gamma) into MLlib

2014-07-07 Thread Xiaokai Wei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaokai Wei updated SPARK-2311:
---

Priority: Major  (was: Minor)

 Added additional GLMs (Poisson and Gamma) into MLlib
 

 Key: SPARK-2311
 URL: https://issues.apache.org/jira/browse/SPARK-2311
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiaokai Wei

 Though GeneralizedLinearModel in MLlib 1.0.0 has some important GLMs such as 
 Logistic Regression, Linear Regression, some other important GLMs like 
 Poisson Regression are still missing. 
 Poisson Regression and Gamma Regression are two widely used models in 
 industry with many applications. This patch added Poisson and Gamma 
 Regression as additional GeneralizedLinearModels.
 https://github.com/apache/spark/pull/1237



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2390:
--

Affects Version/s: 1.0.1

 Files in staging directory cannot be deleted and wastes the space of HDFS
 -

 Key: SPARK-2390
 URL: https://issues.apache.org/jira/browse/SPARK-2390
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.1
Reporter: Kousuke Saruta

 When running jobs in YARN cluster mode and using the HistoryServer, the files 
 in the staging directory cannot be deleted.
 The HistoryServer uses the directory where the event log is written, and that 
 directory is represented as an instance of o.a.h.f.FileSystem created using 
 FileSystem.get.
 {code:title=FileLogger.scala}
 private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
 {code}
 {code:title=Utils.getHadoopFileSystem}
 def getHadoopFileSystem(path: URI): FileSystem = {
   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
 }
 {code}
 On the other hand, ApplicationMaster has an instance named fs, which is also 
 created using FileSystem.get.
 {code:title=ApplicationMaster}
 private val fs = FileSystem.get(yarnConf)
 {code}
 FileSystem.get returns the same cached instance when the URI passed to it 
 represents the same file system and the method is called by the same user.
 Because of this behavior, when the directory for the event log is on HDFS, the 
 fs of ApplicationMaster and the fileSystem of FileLogger are the same instance.
 When shutting down ApplicationMaster, fileSystem.close is called in 
 FileLogger#stop, which is invoked indirectly by SparkContext#stop.
 {code:title=FileLogger.stop}
 def stop() {
   hadoopDataStream.foreach(_.close())
   writer.foreach(_.close())
   fileSystem.close()
 }
 {code}
 ApplicationMaster#cleanupStagingDir is also called by a JVM shutdown hook. In 
 this method, fs.delete(stagingDirPath) is invoked.
 Because fs.delete in ApplicationMaster is called after fileSystem.close in 
 FileLogger, fs.delete fails, so the files in the staging directory are not 
 deleted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054041#comment-14054041
 ] 

Kousuke Saruta commented on SPARK-2390:
---

When I modified the way FileLogger creates its FileSystem instance as follows, 
the symptom didn't appear.

{code}
private val fileSystem = FileSystem.newInstance(new URI(logDir), 
SparkHadoopUtil.get.newConfiguration())
{code}



 Files in staging directory cannot be deleted and wastes the space of HDFS
 -

 Key: SPARK-2390
 URL: https://issues.apache.org/jira/browse/SPARK-2390
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.1
Reporter: Kousuke Saruta

 When running jobs in YARN cluster mode and using the HistoryServer, the files 
 in the staging directory cannot be deleted.
 The HistoryServer uses the directory where the event log is written, and that 
 directory is represented as an instance of o.a.h.f.FileSystem created using 
 FileSystem.get.
 {code:title=FileLogger.scala}
 private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
 {code}
 {code:title=Utils.getHadoopFileSystem}
 def getHadoopFileSystem(path: URI): FileSystem = {
   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
 }
 {code}
 On the other hand, ApplicationMaster has an instance named fs, which is also 
 created using FileSystem.get.
 {code:title=ApplicationMaster}
 private val fs = FileSystem.get(yarnConf)
 {code}
 FileSystem.get returns the same cached instance when the URI passed to it 
 represents the same file system and the method is called by the same user.
 Because of this behavior, when the directory for the event log is on HDFS, the 
 fs of ApplicationMaster and the fileSystem of FileLogger are the same instance.
 When shutting down ApplicationMaster, fileSystem.close is called in 
 FileLogger#stop, which is invoked indirectly by SparkContext#stop.
 {code:title=FileLogger.stop}
 def stop() {
   hadoopDataStream.foreach(_.close())
   writer.foreach(_.close())
   fileSystem.close()
 }
 {code}
 ApplicationMaster#cleanupStagingDir is also called by a JVM shutdown hook. In 
 this method, fs.delete(stagingDirPath) is invoked.
 Because fs.delete in ApplicationMaster is called after fileSystem.close in 
 FileLogger, fs.delete fails, so the files in the staging directory are not 
 deleted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-07-07 Thread Neville Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054043#comment-14054043
 ] 

Neville Li commented on SPARK-1977:
---

There you go:
https://github.com/apache/spark/pull/1319

 mutable.BitSet in ALS not serializable with KryoSerializer
 --

 Key: SPARK-1977
 URL: https://issues.apache.org/jira/browse/SPARK-1977
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Minor

 OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
 KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
 register mutable.BitSet.
 Right now we have to register mutable.BitSet manually. A proper fix would be 
 using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
 com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
 scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at scala.Option.foreach(Option.scala:236)
   at 
 

[jira] [Created] (SPARK-2392) Executors should not start an HTTP server

2014-07-07 Thread Andrew Or (JIRA)
Andrew Or created SPARK-2392:


 Summary: Executors should not start an HTTP server
 Key: SPARK-2392
 URL: https://issues.apache.org/jira/browse/SPARK-2392
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or


In the long run we should separate out the classes used by the driver vs. the 
executors in SparkEnv. For now, we should at least not start an unused HTTP 
server on every executor.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2392) Executors should not start their own HTTP servers

2014-07-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2392:
-

Summary: Executors should not start their own HTTP servers  (was: Executors 
should not start an HTTP server)

 Executors should not start their own HTTP servers
 -

 Key: SPARK-2392
 URL: https://issues.apache.org/jira/browse/SPARK-2392
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or

 In the long run we should separate out the classes used by the driver vs. the 
 executors in SparkEnv. For now, we should at least not start an unused HTTP 
 server on every executor.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054162#comment-14054162
 ] 

Mridul Muralidharan commented on SPARK-2390:


Here, and in a bunch of other places, Spark currently closes the FileSystem 
instance; this is incorrect and should not be done.
The fix would be to remove the fs.close, not to force creation of new instances.
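
A sketch of what that suggestion would mean for the FileLogger.stop quoted in this issue (an assumption about the eventual fix, not a merged change): close only the streams the logger owns and leave the shared FileSystem alone.

{code}
def stop() {
  hadoopDataStream.foreach(_.close())
  writer.foreach(_.close())
  // fileSystem.close() removed: the instance comes from FileSystem.get's cache
  // and is shared with ApplicationMaster, which still needs it at shutdown.
}
{code}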

 Files in staging directory cannot be deleted and wastes the space of HDFS
 -

 Key: SPARK-2390
 URL: https://issues.apache.org/jira/browse/SPARK-2390
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.1
Reporter: Kousuke Saruta

 When running jobs in YARN cluster mode and using the HistoryServer, the files 
 in the staging directory cannot be deleted.
 The HistoryServer uses the directory where the event log is written, and that 
 directory is represented as an instance of o.a.h.f.FileSystem created using 
 FileSystem.get.
 {code:title=FileLogger.scala}
 private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
 {code}
 {code:title=Utils.getHadoopFileSystem}
 def getHadoopFileSystem(path: URI): FileSystem = {
   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
 }
 {code}
 On the other hand, ApplicationMaster has an instance named fs, which is also 
 created using FileSystem.get.
 {code:title=ApplicationMaster}
 private val fs = FileSystem.get(yarnConf)
 {code}
 FileSystem.get returns the same cached instance when the URI passed to it 
 represents the same file system and the method is called by the same user.
 Because of this behavior, when the directory for the event log is on HDFS, the 
 fs of ApplicationMaster and the fileSystem of FileLogger are the same instance.
 When shutting down ApplicationMaster, fileSystem.close is called in 
 FileLogger#stop, which is invoked indirectly by SparkContext#stop.
 {code:title=FileLogger.stop}
 def stop() {
   hadoopDataStream.foreach(_.close())
   writer.foreach(_.close())
   fileSystem.close()
 }
 {code}
 ApplicationMaster#cleanupStagingDir is also called by a JVM shutdown hook. In 
 this method, fs.delete(stagingDirPath) is invoked.
 Because fs.delete in ApplicationMaster is called after fileSystem.close in 
 FileLogger, fs.delete fails, so the files in the staging directory are not 
 deleted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2212) Hash Outer Joins

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2212:


Priority: Minor  (was: Major)

 Hash Outer Joins
 

 Key: SPARK-2212
 URL: https://issues.apache.org/jira/browse/SPARK-2212
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2179) Public API for DataTypes and Schema

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2179:


Priority: Critical  (was: Major)

 Public API for DataTypes and Schema
 ---

 Key: SPARK-2179
 URL: https://issues.apache.org/jira/browse/SPARK-2179
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Critical

 We want something like the following (a rough sketch of the last item follows 
 below):
  * Expose DataType in the SQL package and lock down all the internal details 
 (TypeTags, etc.)
  * Programmatic API for viewing the schema of an RDD as a StructType
  * Method that creates a schema RDD given (RDD[A], StructType, A => Row)
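
A rough sketch of the shape that last item describes (every name here is illustrative, not the final public API; it assumes some applySchema(RDD[Row], StructType) entry point would exist as part of this proposal):

{code}
// Hypothetical helper: build a SchemaRDD from any RDD[A] given an explicit schema
// and a conversion function A => Row.
def createSchemaRDD[A](rdd: RDD[A], schema: StructType, toRow: A => Row): SchemaRDD =
  sqlContext.applySchema(rdd.map(toRow), schema)   // applySchema is assumed, not existing API
{code}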



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2391) LIMIT queries ship a whole partition of data

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2391:


Priority: Blocker  (was: Major)

 LIMIT queries ship a whole partition of data
 

 Key: SPARK-2391
 URL: https://issues.apache.org/jira/browse/SPARK-2391
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker

 Basically the problem here is that Spark's take() runs jobs using allowLocal 
 = true.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2054) Code Generation for Expression Evaluation

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2054:


Priority: Critical  (was: Major)

 Code Generation for Expression Evaluation
 -

 Key: SPARK-2054
 URL: https://issues.apache.org/jira/browse/SPARK-2054
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2097) UDF Support

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2097:


Priority: Critical  (was: Major)

 UDF Support
 ---

 Key: SPARK-2097
 URL: https://issues.apache.org/jira/browse/SPARK-2097
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical

 Right now we only support UDFs that are written against the Hive API or are 
 called directly as expressions in the DSL.  It would be nice to have native 
 support for registering scala/python functions as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2087) Support SQLConf per session

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2087:


Priority: Minor  (was: Major)

 Support SQLConf per session
 ---

 Key: SPARK-2087
 URL: https://issues.apache.org/jira/browse/SPARK-2087
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Zongheng Yang
Priority: Minor

 For things like the SharkServer we should support configuration per thread 
 instead of globally for a context.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2190) Specialized ColumnType for Timestamp

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2190:


Priority: Critical  (was: Major)

 Specialized ColumnType for Timestamp
 

 Key: SPARK-2190
 URL: https://issues.apache.org/jira/browse/SPARK-2190
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Critical

 I'm going to call this a bug since currently it's something like 300X slower 
 than it needs to be.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2205:


Priority: Minor  (was: Major)

 Unnecessary exchange operators in a join on multiple tables with the same 
 join key.
 ---

 Key: SPARK-2205
 URL: https://issues.apache.org/jira/browse/SPARK-2205
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Minor

 {code}
 hql("select * from src x join src y on (x.key=y.key) join src z on 
 (y.key=z.key)")
 SchemaRDD[1] at RDD at SchemaRDD.scala:100
 == Query Plan ==
 Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5]
  HashJoin [key#6], [key#8], BuildRight
   Exchange (HashPartitioning [key#6], 200)
HashJoin [key#4], [key#6], BuildRight
 Exchange (HashPartitioning [key#4], 200)
  HiveTableScan [key#4,value#5], (MetastoreRelation default, src, 
 Some(x)), None
 Exchange (HashPartitioning [key#6], 200)
  HiveTableScan [key#6,value#7], (MetastoreRelation default, src, 
 Some(y)), None
   Exchange (HashPartitioning [key#8], 200)
HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), 
 None
 {code}
 However, this is fine...
 {code}
 hql("select * from src x join src y on (x.key=y.key) join src z on 
 (x.key=z.key)")
 res5: org.apache.spark.sql.SchemaRDD = 
 SchemaRDD[5] at RDD at SchemaRDD.scala:100
 == Query Plan ==
 Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5]
  HashJoin [key#26], [key#30], BuildRight
   HashJoin [key#26], [key#28], BuildRight
Exchange (HashPartitioning [key#26], 200)
 HiveTableScan [key#26,value#27], (MetastoreRelation default, src, 
 Some(x)), None
Exchange (HashPartitioning [key#28], 200)
 HiveTableScan [key#28,value#29], (MetastoreRelation default, src, 
 Some(y)), None
   Exchange (HashPartitioning [key#30], 200)
HiveTableScan [key#30,value#31], (MetastoreRelation default, src, 
 Some(z)), None
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2375) JSON schema inference may not resolve type conflicts correctly for a field inside an array of structs

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2375:


Target Version/s: 1.1.0
   Fix Version/s: (was: 1.1.0)

 JSON schema inference may not resolve type conflicts correctly for a field 
 inside an array of structs
 -

 Key: SPARK-2375
 URL: https://issues.apache.org/jira/browse/SPARK-2375
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.1
Reporter: Yin Huai

 For example, for
 {code}
 {"array": [{"field":214748364700}, {"field":1}]}
 {code}
 the type of "field" is resolved as IntType. However, for
 {code}
 {"array": [{"field":1}, {"field":214748364700}]}
 {code}
 the type of "field" is resolved as LongType.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large

2014-07-07 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054178#comment-14054178
 ] 

Masayoshi TSUZUKI commented on SPARK-2017:
--

Pagination seems better, because with aggregated metrics:
1. we can't identify the skew of tasks across executors.
2. the same problem will appear again when many tasks fail in a certain stage.

In addition, when errors or problems occur in a production environment, we would 
like to see the status of tasks around that time even if those tasks mostly 
succeeded. Although the status of every task is written in the log file, the web 
UI is very useful in the operations phase.

 web ui stage page becomes unresponsive when the number of tasks is large
 

 Key: SPARK-2017
 URL: https://issues.apache.org/jira/browse/SPARK-2017
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
  Labels: starter

 {code}
 sc.parallelize(1 to 1000000, 1000000).count()
 {code}
 The above code creates one million tasks to be executed. The stage detail web 
 ui page takes forever to load (if it ever completes).
 There are again a few different alternatives:
 0. Limit the number of tasks we show.
 1. Pagination
 2. By default only show the aggregate metrics and failed tasks, and hide the 
 successful ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2376) Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2376:


Target Version/s: 1.1.0

 Selecting list values inside nested JSON objects raises 
 java.lang.IllegalArgumentException
 --

 Key: SPARK-2376
 URL: https://issues.apache.org/jira/browse/SPARK-2376
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.1
Reporter: Nicholas Chammas

 Repro script for PySpark, deployed via {{spark-ec2}} at git revision 
 {{9d5ecf8205b924dc8a3c13fed68beb78cc5c7553}}:
 {code}
 from pyspark.sql import SQLContext
 sqlContext = SQLContext(sc)
 raw = sc.parallelize([
 """
 {
 "name": "Nick",
 "history": {
 "countries": []
 }
 }
 """
 ])
 profiles = sqlContext.jsonRDD(raw)
 profiles.registerAsTable("profiles")
 profiles.printSchema()
 sqlContext.sql("SELECT name FROM profiles;").collect() # works fine
 sqlContext.sql("SELECT history FROM profiles;").collect()  # raises exception
 {code}
 Attempting to select the top-level struct that has a nested list value yields 
 the following error:
 {code}
 14/07/06 00:10:26 INFO scheduler.TaskSetManager: Loss was due to 
 net.razorvine.pickle.PickleException: couldn't introspect javabean: 
 java.lang.IllegalArgumentException: wrong number of arguments [duplicate 3]
 14/07/06 00:10:26 ERROR scheduler.TaskSetManager: Task 26.0:15 failed 4 
 times; aborting job
 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 26.0, 
 whose tasks have all completed, from pool 
 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Cancelling stage 26
 14/07/06 00:10:26 INFO scheduler.DAGScheduler: Failed to run collect at 
 stdin:1
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/root/spark/python/pyspark/rdd.py", line 649, in collect
 bytesInJava = self._jrdd.collect().iterator()
   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 
 537, in __call__
   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 
 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o286.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
 26.0:15 failed 4 times, most recent failure: Exception failure in TID 394 on 
 host ip-10-183-59-125.ec2.internal: net.razorvine.pickle.PickleException: 
 couldn't introspect javabean: java.lang.IllegalArgumentException: wrong 
 number of arguments
 net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603)
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:299)
 net.razorvine.pickle.Pickler.save(Pickler.java:125)
 net.razorvine.pickle.Pickler.put_map(Pickler.java:322)
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:286)
 net.razorvine.pickle.Pickler.save(Pickler.java:125)
 net.razorvine.pickle.Pickler.put_map(Pickler.java:322)
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:286)
 net.razorvine.pickle.Pickler.save(Pickler.java:125)
 net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
 net.razorvine.pickle.Pickler.save(Pickler.java:125)
 net.razorvine.pickle.Pickler.dump(Pickler.java:95)
 net.razorvine.pickle.Pickler.dumps(Pickler.java:80)
 
 org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385)
 
 org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 
 org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:317)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1213)
 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1041)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1025)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1023)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at 

[jira] [Created] (SPARK-2393) Simple cost estimation and auto selection of broadcast join

2014-07-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2393:
---

 Summary: Simple cost estimation and auto selection of broadcast 
join
 Key: SPARK-2393
 URL: https://issues.apache.org/jira/browse/SPARK-2393
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery

2014-07-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-2339:


Target Version/s: 1.1.0

 SQL parser in sql-core is case sensitive, but a table alias is converted to 
 lower case when we create Subquery
 --

 Key: SPARK-2339
 URL: https://issues.apache.org/jira/browse/SPARK-2339
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Yin Huai

 Reported by 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
 After we get the table from the catalog, because the table has an alias, we 
 will temporarily insert a Subquery. Then, we convert the table alias to lower 
 case no matter if the parser is case sensitive or not.
 To see the issue ...
 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Person(name: String, age: Int)
 val people = 
 sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
 => Person(p(0), p(1).trim.toInt))
 people.registerAsTable("people")
 sqlContext.sql("select PEOPLE.name from people PEOPLE")
 {code}
 The plan is ...
 {code}
 == Query Plan ==
 Project ['PEOPLE.name]
  ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at 
 basicOperators.scala:176
 {code}
 You can find that PEOPLE.name is not resolved.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery

2014-07-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-2339:


Fix Version/s: (was: 1.1.0)

 SQL parser in sql-core is case sensitive, but a table alias is converted to 
 lower case when we create Subquery
 --

 Key: SPARK-2339
 URL: https://issues.apache.org/jira/browse/SPARK-2339
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Yin Huai

 Reported by 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
 After we get the table from the catalog, because the table has an alias, we 
 will temporarily insert a Subquery. Then, we convert the table alias to lower 
 case no matter if the parser is case sensitive or not.
 To see the issue ...
 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Person(name: String, age: Int)
 val people = 
 sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
 => Person(p(0), p(1).trim.toInt))
 people.registerAsTable("people")
 sqlContext.sql("select PEOPLE.name from people PEOPLE")
 {code}
 The plan is ...
 {code}
 == Query Plan ==
 Project ['PEOPLE.name]
  ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at 
 basicOperators.scala:176
 {code}
 You can find that PEOPLE.name is not resolved.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.

2014-07-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-2205:


Target Version/s: 1.1.0  (was: 1.0.1, 1.1.0)

 Unnecessary exchange operators in a join on multiple tables with the same 
 join key.
 ---

 Key: SPARK-2205
 URL: https://issues.apache.org/jira/browse/SPARK-2205
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Minor

 {code}
 hql("select * from src x join src y on (x.key=y.key) join src z on 
 (y.key=z.key)")
 SchemaRDD[1] at RDD at SchemaRDD.scala:100
 == Query Plan ==
 Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5]
  HashJoin [key#6], [key#8], BuildRight
   Exchange (HashPartitioning [key#6], 200)
HashJoin [key#4], [key#6], BuildRight
 Exchange (HashPartitioning [key#4], 200)
  HiveTableScan [key#4,value#5], (MetastoreRelation default, src, 
 Some(x)), None
 Exchange (HashPartitioning [key#6], 200)
  HiveTableScan [key#6,value#7], (MetastoreRelation default, src, 
 Some(y)), None
   Exchange (HashPartitioning [key#8], 200)
HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), 
 None
 {code}
 However, this is fine...
 {code}
 hql("select * from src x join src y on (x.key=y.key) join src z on 
 (x.key=z.key)")
 res5: org.apache.spark.sql.SchemaRDD = 
 SchemaRDD[5] at RDD at SchemaRDD.scala:100
 == Query Plan ==
 Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5]
  HashJoin [key#26], [key#30], BuildRight
   HashJoin [key#26], [key#28], BuildRight
Exchange (HashPartitioning [key#26], 200)
 HiveTableScan [key#26,value#27], (MetastoreRelation default, src, 
 Some(x)), None
Exchange (HashPartitioning [key#28], 200)
 HiveTableScan [key#28,value#29], (MetastoreRelation default, src, 
 Some(y)), None
   Exchange (HashPartitioning [key#30], 200)
HiveTableScan [key#30,value#31], (MetastoreRelation default, src, 
 Some(z)), None
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery

2014-07-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-2339:


Component/s: SQL

 SQL parser in sql-core is case sensitive, but a table alias is converted to 
 lower case when we create Subquery
 --

 Key: SPARK-2339
 URL: https://issues.apache.org/jira/browse/SPARK-2339
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yin Huai
Assignee: Yin Huai

 Reported by 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
 After we get the table from the catalog, because the table has an alias, we 
 will temporarily insert a Subquery. Then, we convert the table alias to lower 
 case no matter if the parser is case sensitive or not.
 To see the issue ...
 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Person(name: String, age: Int)
 val people = 
 sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
 => Person(p(0), p(1).trim.toInt))
 people.registerAsTable("people")
 sqlContext.sql("select PEOPLE.name from people PEOPLE")
 {code}
 The plan is ...
 {code}
 == Query Plan ==
 Project ['PEOPLE.name]
  ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at 
 basicOperators.scala:176
 {code}
 You can find that PEOPLE.name is not resolved.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2384) Add tooltips for shuffle write and scheduler delay in UI

2014-07-07 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054207#comment-14054207
 ] 

Kay Ousterhout commented on SPARK-2384:
---

https://github.com/apache/spark/pull/1314

 Add tooltips for shuffle write and scheduler delay in UI
 

 Key: SPARK-2384
 URL: https://issues.apache.org/jira/browse/SPARK-2384
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor

 There are a few common points of confusion in the UI that could be clarified 
 with tooltips.  We should add tooltips to explain the scheduler delay and the 
 shuffle data (to explain why shuffle read is typically < shuffle write, as 
 many, many people have expressed confusion about).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters

2014-07-07 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-2394:
---

 Summary: Make it easier to read LZO-compressed files from EC2 
clusters
 Key: SPARK-2394
 URL: https://issues.apache.org/jira/browse/SPARK-2394
 Project: Spark
  Issue Type: Improvement
  Components: EC2, Input/Output
Affects Versions: 1.0.0
Reporter: Nicholas Chammas
Priority: Minor


Amazon hosts [a large Google n-grams data set on 
S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is perfect, 
among other things, for putting together interesting and easily reproducible 
public demos of Spark's capabilities.

The problem is that the data set is compressed using LZO, and it is currently 
more painful than it should be to get your average {{spark-ec2}} cluster to 
read input compressed in this way.

This is what one has to go through to get a Spark cluster created with 
{{spark-ec2}} to read LZO-compressed files:
# Install the latest LZO release, perhaps via {{yum}}.
# Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build it. 
To build {{hadoop-lzo}} you need Maven. 
# Install Maven. For some reason, [you cannot install Maven with 
{{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum],
 so install it manually.
# Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate 
configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
# Make [the appropriate 
calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E]
 to {{sc.newAPIHadoopFile}} (a rough sketch of that call follows this list).
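For concreteness, this is roughly what step 5 boils down to once steps 1-4 are done. It is a sketch only: the input-format class name is the one shipped by {{hadoop-lzo}} (assumed to be on the classpath), the S3 path is elided and purely illustrative, and {{sc}} is the shell's SparkContext.
{code}
import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat  // assumed: provided by twitter/hadoop-lzo

// Read LZO-compressed text through the new Hadoop API; keys are byte offsets,
// values are the text lines themselves.
val ngrams = sc.newAPIHadoopFile(
  "s3n://datasets.elasticmapreduce/ngrams/books/...",  // path elided, illustrative only
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text])
val lines = ngrams.map { case (_, text) => text.toString }
{code}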

This seems like a bit too much work for what we're trying to accomplish.

If we expect this to be a common pattern -- reading LZO-compressed files from a 
{{spark-ec2}} cluster -- it would be great if we could somehow make this less 
painful.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1977.
--

   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1

Issue resolved by pull request 1319
[https://github.com/apache/spark/pull/1319]

 mutable.BitSet in ALS not serializable with KryoSerializer
 --

 Key: SPARK-1977
 URL: https://issues.apache.org/jira/browse/SPARK-1977
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Minor
 Fix For: 1.0.1, 1.1.0


 OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
 KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
 register mutable.BitSet.
 Right now we have to register mutable.BitSet manually. A proper fix would be 
 using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
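 For reference, a minimal sketch of the manual registration workaround mentioned above (the registrator class and its name are illustrative, not part of this ticket; {{spark.kryo.registrator}} is the standard setting used to plug it in):
 {code}
 import com.esotericsoftware.kryo.Kryo
 import org.apache.spark.serializer.KryoRegistrator

 // Illustrative registrator: registers mutable.BitSet so Kryo can handle
 // OutLinkBlock's Array[mutable.BitSet] member.
 class BitSetRegistrator extends KryoRegistrator {
   override def registerClasses(kryo: Kryo) {
     kryo.register(classOf[scala.collection.mutable.BitSet])
   }
 }
 // enable via SparkConf: conf.set("spark.kryo.registrator", "mypackage.BitSetRegistrator")
 {code}
 Without such a registration, the job fails with the stack trace below.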
 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
 com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
 scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at 

[jira] [Closed] (SPARK-2378) Implement functionality to read csv files

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-2378.
---

Resolution: Duplicate

 Implement functionality to read csv files
 -

 Key: SPARK-2378
 URL: https://issues.apache.org/jira/browse/SPARK-2378
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 1.0.0
Reporter: Hossein Falaki
Priority: Minor
 Fix For: 1.0.0


 Similar to jsonFile(), csvFile() could be used to read data into a SchemaRDD 
 (or a normal RDD).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2360) CSV import to SchemaRDDs

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2360:


Description: 
I think the first step is to design the interface that we want to present to 
users.  Mostly this is defining options when importing.  Off the top of my head 
(a rough, purely illustrative sketch of one possible interface follows this list):
- What is the separator?
- Provide column names or infer them from the first row.
- how to handle multiple files with possibly different schemas
- do we have a method to let users specify the datatypes of the columns or are 
they just strings?
- what types of quoting / escaping do we want to support?
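As a strawman only (none of these names exist in Spark today), the smallest possible surface might look like the following; real quoting, escaping, and schema handling are exactly the open questions above:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical csvFile(): returns rows as Array[String], with a delimiter option
// and optional header skipping. No quoting/escaping or type inference yet.
def csvFile(sc: SparkContext, path: String,
            delimiter: String = ",", header: Boolean = false): RDD[Array[String]] = {
  val lines = sc.textFile(path)
  val data =
    if (header) {
      val first = lines.first()   // drop the header line (column names)
      lines.filter(_ != first)
    } else lines
  data.map(_.split(delimiter, -1))
}
{code}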

 CSV import to SchemaRDDs
 

 Key: SPARK-2360
 URL: https://issues.apache.org/jira/browse/SPARK-2360
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Minor

 I think the first step is to design the interface that we want to present to 
 users.  Mostly this is defining options when importing.  Off the top of my 
 head:
 - What is the separator?
 - Provide column names or infer them from the first row.
 - how to handle multiple files with possibly different schemas
 - do we have a method to let users specify the datatypes of the columns or 
 are they just strings?
 - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs

2014-07-07 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054262#comment-14054262
 ] 

Hossein Falaki commented on SPARK-2360:
---

As a point for comparison the interface in some other popular packages are:
__R__:
```
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
```
Where:
header: a logical value indicating whether the file contains the names of the 
variables as its first line.
sep: the field separator character. 
quote: the set of quoting characters. To disable quoting altogether, use ‘quote 
= ""’
dec: the character used in the file for decimal points.
fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

__pandas__:
```
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
```
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

 CSV import to SchemaRDDs
 

 Key: SPARK-2360
 URL: https://issues.apache.org/jira/browse/SPARK-2360
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Minor

 I think the first step is to design the interface that we want to present to 
 users.  Mostly this is defining options when importing.  Off the top of my 
 head:
 - What is the separator?
 - Provide column names or infer them from the first row.
 - how to handle multiple files with possibly different schemas
 - do we have a method to let users specify the datatypes of the columns or 
 are they just strings?
 - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2360) CSV import to SchemaRDDs

2014-07-07 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054262#comment-14054262
 ] 

Hossein Falaki edited comment on SPARK-2360 at 7/7/14 11:03 PM:


As a point for comparison the interface in some other popular packages are:
_R_:
{code}
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
{code}

Where:
header: a logical value indicating whether the file contains the names of the 
variables as its first line.
sep: the field separator character. 
quote: the set of quoting characters. To disable quoting altogether, use ‘quote 
= ""’
dec: the character used in the file for decimal points.
fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

__pandas__:
```
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
```
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html


was (Author: falaki):
As a point for comparison the interface in some other popular packages are:
__R__:
```
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
```
Where:
header: a logical value indicating whether the file contains the names of the 
variables as its first line.
sep: the field separator character. 
quote: the set of quoting characters. To disable quoting altogether, use ‘quote 
= ""’
dec: the character used in the file for decimal points.
fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

__pandas__:
```
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
```
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

 CSV import to SchemaRDDs
 

 Key: SPARK-2360
 URL: https://issues.apache.org/jira/browse/SPARK-2360
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Minor

 I think the first step is to design the interface that we want to present to 
 users.  Mostly this is defining options when importing.  Off the top of my 
 head:
 - What is the separator?
 - Provide column names or infer them from the first row.
 - how to handle multiple files with possibly different schemas
 - do we have a method to let users specify the datatypes of the columns or 
 are they just strings?
 - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2360) CSV import to SchemaRDDs

2014-07-07 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054262#comment-14054262
 ] 

Hossein Falaki edited comment on SPARK-2360 at 7/7/14 11:04 PM:


As a point for comparison the interface in some other popular packages are:
_R_:
{code}
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
{code}

Where:
header: a logical value indicating whether the file contains the names of the 
variables as its first line.
sep: the field separator character. 
quote: the set of quoting characters. To disable quoting altogether, use ‘quote 
= ""’
dec: the character used in the file for decimal points.
fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

_pandas_:
{code}
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
{code}
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html


was (Author: falaki):
As a point for comparison the interface in some other popular packages are:
_R_:
{code}
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
{code}

Where:
header: a logical value indicating whether the file contains the names of the 
variables as its first line.
sep: the field separator character. 
quote: the set of quoting characters. To disable quoting altogether, use ‘quote 
= ""’
dec: the character used in the file for decimal points.
fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

__pandas__:
```
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
```
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

 CSV import to SchemaRDDs
 

 Key: SPARK-2360
 URL: https://issues.apache.org/jira/browse/SPARK-2360
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Minor

 I think the first step is to design the interface that we want to present to 
 users.  Mostly this is defining options when importing.  Off the top of my 
 head:
 - What is the separator?
 - Provide column names or infer them from the first row.
 - how to handle multiple files with possibly different schemas
 - do we have a method to let users specify the datatypes of the columns or 
 are they just strings?
 - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2395) Optimize common like expressions

2014-07-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2395:
---

 Summary: Optimize common like expressions
 Key: SPARK-2395
 URL: https://issues.apache.org/jira/browse/SPARK-2395
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust


LIKE queries that are really just a contains, startsWith or endsWith would be 
much faster if we avoided regular expressions; a sketch of the idea follows.
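A hedged sketch of the idea (not the actual optimizer rule): recognize the three trivial LIKE shapes and evaluate them with plain String operations, falling back to a regular expression only for everything else.
{code}
// Compile a SQL LIKE pattern into a predicate. Note: this sketch ignores the
// '_' single-character wildcard entirely and only special-cases '%'.
def compileLike(pattern: String): String => Boolean = pattern match {
  case p if p.startsWith("%") && p.endsWith("%") && !p.slice(1, p.length - 1).contains("%") =>
    val s = p.slice(1, p.length - 1); (in: String) => in.contains(s)   // '%foo%'
  case p if p.endsWith("%") && !p.dropRight(1).contains("%") =>
    val s = p.dropRight(1); (in: String) => in.startsWith(s)           // 'foo%'
  case p if p.startsWith("%") && !p.drop(1).contains("%") =>
    val s = p.drop(1); (in: String) => in.endsWith(s)                  // '%foo'
  case p =>  // anything else: build the regex once and reuse it
    val regex = p.split("%", -1).map(java.util.regex.Pattern.quote).mkString(".*")
    (in: String) => in.matches(regex)
}
{code}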



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054298#comment-14054298
 ] 

Kousuke Saruta commented on SPARK-2390:
---

Thank you for your comment [~mridulm80].

I noticed the FileSystem is to be closed by a shutdown hook, and the shutdown hook 
deletes the files in the staging directory and then closes the FileSystem.
So, in this case, I also think there is no problem with removing fs.close (a 
minimal sketch of what that would look like follows).
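Concretely, that would amount to something like this ({{hadoopDataStream}} and {{writer}} are the existing FileLogger members quoted below; this is a sketch of the change under discussion, not the merged patch):
{code:title=FileLogger.stop (sketch)}
def stop() {
  hadoopDataStream.foreach(_.close())
  writer.foreach(_.close())
  // fileSystem.close() dropped: the cached FileSystem is shared with
  // ApplicationMaster#cleanupStagingDir and is closed by the shutdown hook instead.
}
{code}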


 Files in staging directory cannot be deleted and wastes the space of HDFS
 -

 Key: SPARK-2390
 URL: https://issues.apache.org/jira/browse/SPARK-2390
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.1
Reporter: Kousuke Saruta

 When running jobs with YARN Cluster mode and using HistoryServer, the files 
 in the Staging Directory cannot be deleted.
 HistoryServer uses the directory where the event log is written, and the 
 directory is represented as an instance of o.a.h.f.FileSystem created via 
 FileSystem.get.
 {code:title=FileLogger.scala}
 private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
 {code}
 {code:title=utils.getHadoopFileSystem}
 def getHadoopFileSystem(path: URI): FileSystem = {
   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
 }
 {code}
 On the other hand, ApplicationMaster has an instance named fs, which is also 
 created via FileSystem.get.
 {code:title=ApplicationMaster}
 private val fs = FileSystem.get(yarnConf)
 {code}
 FileSystem.get returns the same cached instance when the URI passed to the 
 method represents the same file system and the method is called by the same user.
 Because of this behavior, when the directory for the event log is on HDFS, the 
 fs of ApplicationMaster and the fileSystem of FileLogger are the same instance.
 When shutting down ApplicationMaster, fileSystem.close is called in 
 FileLogger#stop, which is invoked by SparkContext#stop indirectly.
 {code:title=FileLogger.stop}
 def stop() {
   hadoopDataStream.foreach(_.close())
   writer.foreach(_.close())
   fileSystem.close()
 }
 {code}
 ApplicationMaster#cleanupStagingDir is also called by a JVM shutdown hook. In 
 this method, fs.delete(stagingDirPath) is invoked. 
 Because fs.delete in ApplicationMaster is called after fileSystem.close in 
 FileLogger, fs.delete fails and the files in the staging directory are never 
 deleted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2339.
-

   Resolution: Fixed
Fix Version/s: 1.0.2
   1.1.0

 SQL parser in sql-core is case sensitive, but a table alias is converted to 
 lower case when we create Subquery
 --

 Key: SPARK-2339
 URL: https://issues.apache.org/jira/browse/SPARK-2339
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yin Huai
Assignee: Yin Huai
 Fix For: 1.1.0, 1.0.2


 Reported by 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
 After we get the table from the catalog, because the table has an alias, we 
 will temporarily insert a Subquery. Then, we convert the table alias to lower 
 case no matter if the parser is case sensitive or not.
 To see the issue ...
 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Person(name: String, age: Int)
 val people = 
 sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
 => Person(p(0), p(1).trim.toInt))
 people.registerAsTable("people")
 sqlContext.sql("select PEOPLE.name from people PEOPLE")
 {code}
 The plan is ...
 {code}
 == Query Plan ==
 Project ['PEOPLE.name]
  ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at 
 basicOperators.scala:176
 {code}
 You can find that PEOPLE.name is not resolved.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2386) RowWriteSupport should use the exact types to cast.

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2386.
-

   Resolution: Fixed
Fix Version/s: 1.0.2
   1.1.0
 Assignee: Takuya Ueshin

 RowWriteSupport should use the exact types to cast.
 ---

 Key: SPARK-2386
 URL: https://issues.apache.org/jira/browse/SPARK-2386
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.1.0, 1.0.2


 When executing {{saveAsParquetFile}} with a non-primitive type, 
 {{RowWriteSupport}} uses the wrong type {{Int}} for {{ByteType}} and 
 {{ShortType}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2375) JSON schema inference may not resolve type conflicts correctly for a field inside an array of structs

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2375.
-

   Resolution: Fixed
Fix Version/s: 1.0.2
   1.1.0

 JSON schema inference may not resolve type conflicts correctly for a field 
 inside an array of structs
 -

 Key: SPARK-2375
 URL: https://issues.apache.org/jira/browse/SPARK-2375
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.1
Reporter: Yin Huai
Assignee: Yin Huai
 Fix For: 1.1.0, 1.0.2


 For example, for
 {code}
 {"array": [{"field":214748364700}, {"field":1}]}
 {code}
 the type of field is resolved as IntType. While, for
 {code}
 {"array": [{"field":1}, {"field":214748364700}]}
 {code}
 the type of field is resolved as LongType.
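 A schema merge that is commutative in the numeric types would avoid this order 
 dependence. A toy sketch of the idea (the type names here are stand-ins modelled 
 as plain case objects, not Spark SQL's actual classes):
 {code}
 sealed trait JsonType
 case object IntType extends JsonType
 case object LongType extends JsonType
 case object DoubleType extends JsonType

 // Merging two observed numeric types should not depend on the order records are seen.
 def mergeNumeric(a: JsonType, b: JsonType): JsonType = (a, b) match {
   case (x, y) if x == y                  => x
   case (DoubleType, _) | (_, DoubleType) => DoubleType  // widest type wins
   case (LongType, _)   | (_, LongType)   => LongType
   case _                                 => IntType
 }

 // mergeNumeric(IntType, LongType) == mergeNumeric(LongType, IntType) == LongType
 {code}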



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054333#comment-14054333
 ] 

Kousuke Saruta commented on SPARK-2390:
---

Pull Requested at https://github.com/apache/spark/pull/1317

 Files in staging directory cannot be deleted and wastes the space of HDFS
 -

 Key: SPARK-2390
 URL: https://issues.apache.org/jira/browse/SPARK-2390
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.1
Reporter: Kousuke Saruta

 When running jobs with YARN Cluster mode and using HistoryServer, the files 
 in the Staging Directory cannot be deleted.
 HistoryServer uses the directory where the event log is written, and the 
 directory is represented as an instance of o.a.h.f.FileSystem created via 
 FileSystem.get.
 {code:title=FileLogger.scala}
 private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
 {code}
 {code:title=utils.getHadoopFileSystem}
 def getHadoopFileSystem(path: URI): FileSystem = {
   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
 }
 {code}
 On the other hand, ApplicationMaster has an instance named fs, which is also 
 created via FileSystem.get.
 {code:title=ApplicationMaster}
 private val fs = FileSystem.get(yarnConf)
 {code}
 FileSystem.get returns the same cached instance when the URI passed to the 
 method represents the same file system and the method is called by the same user.
 Because of this behavior, when the directory for the event log is on HDFS, the 
 fs of ApplicationMaster and the fileSystem of FileLogger are the same instance.
 When shutting down ApplicationMaster, fileSystem.close is called in 
 FileLogger#stop, which is invoked by SparkContext#stop indirectly.
 {code:title=FileLogger.stop}
 def stop() {
   hadoopDataStream.foreach(_.close())
   writer.foreach(_.close())
   fileSystem.close()
 }
 {code}
 ApplicationMaster#cleanupStagingDir is also called by a JVM shutdown hook. In 
 this method, fs.delete(stagingDirPath) is invoked. 
 Because fs.delete in ApplicationMaster is called after fileSystem.close in 
 FileLogger, fs.delete fails and the files in the staging directory are never 
 deleted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054333#comment-14054333
 ] 

Kousuke Saruta edited comment on SPARK-2390 at 7/8/14 12:41 AM:


Pull Requested at https://github.com/apache/spark/pull/1326


was (Author: sarutak):
Pull Requested at https://github.com/apache/spark/pull/1317

 Files in staging directory cannot be deleted and wastes the space of HDFS
 -

 Key: SPARK-2390
 URL: https://issues.apache.org/jira/browse/SPARK-2390
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.1
Reporter: Kousuke Saruta

 When running jobs with YARN Cluster mode and using HistoryServer, the files 
 in the Staging Directory cannot be deleted.
 HistoryServer uses the directory where the event log is written, and the 
 directory is represented as an instance of o.a.h.f.FileSystem created via 
 FileSystem.get.
 {code:title=FileLogger.scala}
 private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
 {code}
 {code:title=utils.getHadoopFileSystem}
 def getHadoopFileSystem(path: URI): FileSystem = {
   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
 }
 {code}
 On the other hand, ApplicationMaster has an instance named fs, which is also 
 created via FileSystem.get.
 {code:title=ApplicationMaster}
 private val fs = FileSystem.get(yarnConf)
 {code}
 FileSystem.get returns the same cached instance when the URI passed to the 
 method represents the same file system and the method is called by the same user.
 Because of this behavior, when the directory for the event log is on HDFS, the 
 fs of ApplicationMaster and the fileSystem of FileLogger are the same instance.
 When shutting down ApplicationMaster, fileSystem.close is called in 
 FileLogger#stop, which is invoked by SparkContext#stop indirectly.
 {code:title=FileLogger.stop}
 def stop() {
   hadoopDataStream.foreach(_.close())
   writer.foreach(_.close())
   fileSystem.close()
 }
 {code}
 ApplicationMaster#cleanupStagingDir is also called by a JVM shutdown hook. In 
 this method, fs.delete(stagingDirPath) is invoked. 
 Because fs.delete in ApplicationMaster is called after fileSystem.close in 
 FileLogger, fs.delete fails and the files in the staging directory are never 
 deleted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2360) CSV import to SchemaRDDs

2014-07-07 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054262#comment-14054262
 ] 

Hossein Falaki edited comment on SPARK-2360 at 7/8/14 12:45 AM:


As a point for comparison the interface in some other popular packages are:
_R_:
{code}
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
{code}

Where:
* header: a logical value indicating whether the file contains the names of the 
variables as its first line.
* sep: the field separator character. 
* quote: the set of quoting characters. To disable quoting altogether, use 
‘quote = ""’
* dec: the character used in the file for decimal points.
* fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

_pandas_:
{code}
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
{code}
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html


was (Author: falaki):
As a point for comparison the interface in some other popular packages are:
_R_:
{code}
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
{code}

Where:
header: a logical value indicating whether the file contains the names of the 
variables as its first line.
sep: the field separator character. 
quote: the set of quoting characters. To disable quoting altogether, use ‘quote 
= ""’
dec: the character used in the file for decimal points.
fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

_pandas_:
{code}
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
{code}
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

 CSV import to SchemaRDDs
 

 Key: SPARK-2360
 URL: https://issues.apache.org/jira/browse/SPARK-2360
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Minor

 I think the first step is to design the interface that we want to present to 
 users.  Mostly this is defining options when importing.  Off the top of my 
 head:
 - What is the separator?
 - Provide column names or infer them from the first row.
 - how to handle multiple files with possibly different schemas
 - do we have a method to let users specify the datatypes of the columns or 
 are they just strings?
 - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2397) Get rid of LocalHiveContext

2014-07-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2397:
---

 Summary: Get rid of LocalHiveContext
 Key: SPARK-2397
 URL: https://issues.apache.org/jira/browse/SPARK-2397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker
 Fix For: 1.1.0


LocalHiveContext is nearly completely redundant with HiveContext.  We should 
consider deprecating it and removing all uses.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2376) Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException

2014-07-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-2376.
-

   Resolution: Fixed
Fix Version/s: 1.0.2
   1.1.0

 Selecting list values inside nested JSON objects raises 
 java.lang.IllegalArgumentException
 --

 Key: SPARK-2376
 URL: https://issues.apache.org/jira/browse/SPARK-2376
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.1
Reporter: Nicholas Chammas
Assignee: Yin Huai
 Fix For: 1.1.0, 1.0.2


 Repro script for PySpark, deployed via {{spark-ec2}} at git revision 
 {{9d5ecf8205b924dc8a3c13fed68beb78cc5c7553}}:
 {code}
 from pyspark.sql import SQLContext
 sqlContext = SQLContext(sc)
 raw = sc.parallelize([
 """
 {
 "name": "Nick",
 "history": {
 "countries": []
 }
 }
 """
 ])
 profiles = sqlContext.jsonRDD(raw)
 profiles.registerAsTable("profiles")
 profiles.printSchema()
 sqlContext.sql("SELECT name FROM profiles;").collect() # works fine
 sqlContext.sql("SELECT history FROM profiles;").collect()  # raises exception
 {code}
 Attempting to select the top-level struct that has a nested list value yields 
 the following error:
 {code}
 14/07/06 00:10:26 INFO scheduler.TaskSetManager: Loss was due to 
 net.razorvine.pickle.PickleException: couldn't introspect javabean: 
 java.lang.IllegalArgumentException: wrong number of arguments [duplicate 3]
 14/07/06 00:10:26 ERROR scheduler.TaskSetManager: Task 26.0:15 failed 4 
 times; aborting job
 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 26.0, 
 whose tasks have all completed, from pool 
 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Cancelling stage 26
 14/07/06 00:10:26 INFO scheduler.DAGScheduler: Failed to run collect at 
 stdin:1
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/root/spark/python/pyspark/rdd.py", line 649, in collect
 bytesInJava = self._jrdd.collect().iterator()
   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 
 537, in __call__
   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 
 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o286.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
 26.0:15 failed 4 times, most recent failure: Exception failure in TID 394 on 
 host ip-10-183-59-125.ec2.internal: net.razorvine.pickle.PickleException: 
 couldn't introspect javabean: java.lang.IllegalArgumentException: wrong 
 number of arguments
 net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603)
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:299)
 net.razorvine.pickle.Pickler.save(Pickler.java:125)
 net.razorvine.pickle.Pickler.put_map(Pickler.java:322)
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:286)
 net.razorvine.pickle.Pickler.save(Pickler.java:125)
 net.razorvine.pickle.Pickler.put_map(Pickler.java:322)
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:286)
 net.razorvine.pickle.Pickler.save(Pickler.java:125)
 net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
 net.razorvine.pickle.Pickler.save(Pickler.java:125)
 net.razorvine.pickle.Pickler.dump(Pickler.java:95)
 net.razorvine.pickle.Pickler.dumps(Pickler.java:80)
 
 org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385)
 
 org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 
 org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:317)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1213)
 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1041)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1025)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1023)
   at 
 

[jira] [Commented] (SPARK-2201) Improve FlumeInputDStream's stability and make it scalable

2014-07-07 Thread llai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054431#comment-14054431
 ] 

llai commented on SPARK-2201:
-

Is it a good idea to add a dependency on ZooKeeper?

 Improve FlumeInputDStream's stability and make it scalable
 --

 Key: SPARK-2201
 URL: https://issues.apache.org/jira/browse/SPARK-2201
 Project: Spark
  Issue Type: Improvement
Reporter: sunshangchun

 Currently:
 FlumeUtils.createStream(ssc, "localhost", port); 
 This means that only one Flume receiver can work with FlumeInputDStream, so 
 the solution is not scalable. 
 I use ZooKeeper to solve this problem.
 Spark Flume receivers register themselves to a ZooKeeper path when started, and 
 a Flume agent gets the physical hosts and pushes events to them.
 Some work needs to be done here: 
 1. Receivers create temporary nodes in ZooKeeper; listeners just watch those nodes.
 2. When Spark FlumeReceivers start, they acquire a physical host 
 (the local IP and an idle port) and register themselves with ZooKeeper.
 3. A new Flume sink. In its appendEvents method, it gets the physical hosts 
 and pushes data to them in a round-robin manner (a tiny sketch of that 
 selection follows).
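 For illustration only (nothing here exists in the codebase), the round-robin 
 selection in step 3 could be as simple as:
 {code}
 import java.util.concurrent.atomic.AtomicLong

 // Cycle through the receiver endpoints read from ZooKeeper.
 class RoundRobin[A](targets: IndexedSeq[A]) {
   require(targets.nonEmpty, "need at least one registered receiver")
   private val counter = new AtomicLong(0)
   def next(): A = targets((counter.getAndIncrement() % targets.size).toInt)
 }

 // e.g. val pick = new RoundRobin(Vector("host1:4141", "host2:4141")); pick.next()
 {code}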



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-07 Thread Nishkam Ravi (JIRA)
Nishkam Ravi created SPARK-2398:
---

 Summary: Trouble running Spark 1.0 on Yarn 
 Key: SPARK-2398
 URL: https://issues.apache.org/jira/browse/SPARK-2398
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Nishkam Ravi


Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. 

For example: SparkPageRank when run in standalone mode goes through without any 
errors (tested for up to 30GB input dataset on a 6-node cluster).  Also runs 
fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn cluster 
mode) as the input data size is increased. Confirmed for 16GB input dataset.

The same workload runs fine with Spark 0.9 in both standalone and yarn cluster 
mode (for up to 30 GB input dataset on a 6-node cluster).

Commandline used:

(/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn 
--deploy-mode cluster --properties-file pagerank.conf  --driver-memory 30g 
--driver-cores 16 --num-executors 5 --class 
org.apache.spark.examples.SparkPageRank 
/opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar
 pagerank_in $NUM_ITER)

pagerank.conf:

spark.master                              spark://c1704.halxg.cloudera.com:7077
spark.home                                /opt/cloudera/parcels/CDH/lib/spark
spark.executor.memory                     32g
spark.default.parallelism                 118
spark.cores.max                           96
spark.storage.memoryFraction              0.6
spark.shuffle.memoryFraction              0.3
spark.shuffle.compress                    true
spark.shuffle.spill.compress              true
spark.broadcast.compress                  true
spark.rdd.compress                        false
spark.io.compression.codec                org.apache.spark.io.LZFCompressionCodec
spark.io.compression.snappy.block.size    32768
spark.reducer.maxMbInFlight               48
spark.local.dir                           /var/lib/jenkins/workspace/tmp
spark.driver.memory                       30g
spark.executor.cores                      16
spark.locality.wait                       6000
spark.executor.instances                  5

UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions:

14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection 
to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
java.nio.channels.AsynchronousCloseException
at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
at 
org.apache.spark.network.SendingConnection.write(Connection.scala:361)
at 
org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
at 
org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
at 
org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
at 
org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
at 
org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
at 
org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
---

14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to 
a1016.halxg.cloudera.com/10.20.184.116:54105
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 

[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-07 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054434#comment-14054434
 ] 

Guoqiang Li commented on SPARK-2398:


Seems to be related to 
[SPARK-1930|https://issues.apache.org/jira/browse/SPARK-1930].
Can you post the yarn node manager log?

 Trouble running Spark 1.0 on Yarn 
 --

 Key: SPARK-2398
 URL: https://issues.apache.org/jira/browse/SPARK-2398
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Nishkam Ravi

 Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. 
 For example: SparkPageRank when run in standalone mode goes through without 
 any errors (tested for up to 30GB input dataset on a 6-node cluster).  Also 
 runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn 
 cluster mode) as the input data size is increased. Confirmed for 16GB input 
 dataset.
 The same workload runs fine with Spark 0.9 in both standalone and yarn 
 cluster mode (for up to 30 GB input dataset on a 6-node cluster).
 Commandline used:
 (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn 
 --deploy-mode cluster --properties-file pagerank.conf  --driver-memory 30g 
 --driver-cores 16 --num-executors 5 --class 
 org.apache.spark.examples.SparkPageRank 
 /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar
  pagerank_in $NUM_ITER)
 pagerank.conf:
 spark.master                              spark://c1704.halxg.cloudera.com:7077
 spark.home                                /opt/cloudera/parcels/CDH/lib/spark
 spark.executor.memory                     32g
 spark.default.parallelism                 118
 spark.cores.max                           96
 spark.storage.memoryFraction              0.6
 spark.shuffle.memoryFraction              0.3
 spark.shuffle.compress                    true
 spark.shuffle.spill.compress              true
 spark.broadcast.compress                  true
 spark.rdd.compress                        false
 spark.io.compression.codec                org.apache.spark.io.LZFCompressionCodec
 spark.io.compression.snappy.block.size    32768
 spark.reducer.maxMbInFlight               48
 spark.local.dir                           /var/lib/jenkins/workspace/tmp
 spark.driver.memory                       30g
 spark.executor.cores                      16
 spark.locality.wait                       6000
 spark.executor.instances                  5
 UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions:
 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection 
 to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
 java.nio.channels.AsynchronousCloseException
 at 
 java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
 at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
 at 
 org.apache.spark.network.SendingConnection.write(Connection.scala:361)
 at 
 org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
 at 
 org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
 at java.io.FilterInputStream.close(FilterInputStream.java:181)
 at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
 at 
 org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
 at 
 org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
 at 
 org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 ---
 14/07/07 

[jira] [Commented] (SPARK-2201) Improve FlumeInputDStream's stability and make it scalable

2014-07-07 Thread sunshangchun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054442#comment-14054442
 ] 

sunshangchun commented on SPARK-2201:
-

I don't think this is a problem.
It is an external module and has no effect on the Spark core module.
Besides, the Spark core module already uses ZooKeeper to elect the leader master.

Thanks 

 Improve FlumeInputDStream's stability and make it scalable
 --

 Key: SPARK-2201
 URL: https://issues.apache.org/jira/browse/SPARK-2201
 Project: Spark
  Issue Type: Improvement
Reporter: sunshangchun

 Currently:
 FlumeUtils.createStream(ssc, localhost, port);
 This means that only one Flume receiver can work with FlumeInputDStream, so 
 the solution is not scalable.
 I use ZooKeeper to solve this problem:
 Spark Flume receivers register themselves under a ZooKeeper path when they 
 start, and a Flume agent looks up the physical hosts and pushes events to them.
 Some work needs to be done here (a rough registration sketch follows this list):
 1. Receivers create ephemeral nodes in ZooKeeper; listeners simply watch those nodes.
 2. When Spark FlumeReceivers start, they acquire a physical host 
 (the local IP and an idle port) and register it with ZooKeeper.
 3. A new Flume sink: in its appendEvents method, it looks up the physical hosts 
 and pushes data to them in a round-robin manner.
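
A minimal sketch of that registration and round-robin lookup, assuming the plain ZooKeeper client API; the path /spark/flume-receivers, the object name, and the node layout are illustrative, not the actual patch:

import java.util.concurrent.atomic.AtomicLong
import scala.collection.JavaConverters._
import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}

// Hypothetical registry layout; the real patch may choose different paths.
object FlumeReceiverRegistry {
  private val BasePath = "/spark/flume-receivers"
  private val counter = new AtomicLong(0)

  // Called by a FlumeReceiver once it has bound host:port.
  // (Ignores the race where the parent node is created concurrently.)
  def register(zk: ZooKeeper, host: String, port: Int): String = {
    if (zk.exists(BasePath, false) == null) {
      zk.create(BasePath, Array.empty[Byte],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
    }
    // Ephemeral node disappears automatically if the receiver dies.
    zk.create(s"$BasePath/receiver-", s"$host:$port".getBytes("UTF-8"),
      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL)
  }

  // Called by the Flume sink: pick the next live receiver round-robin.
  def nextReceiver(zk: ZooKeeper): Option[String] = {
    val children = zk.getChildren(BasePath, false).asScala.sorted
    if (children.isEmpty) None
    else {
      val child = children((counter.getAndIncrement() % children.size).toInt)
      Some(new String(zk.getData(s"$BasePath/$child", false, null), "UTF-8"))
    }
  }
}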



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2235) Spark SQL basicOperator add Intersect operator

2014-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2235.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: (was: Michael Armbrust)

  Spark SQL basicOperator add Intersect operator
 ---

 Key: SPARK-2235
 URL: https://issues.apache.org/jira/browse/SPARK-2235
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yanjie Gao
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2181) The keys for sorting the columns of Executor page in SparkUI are incorrect

2014-07-07 Thread Shuo Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054455#comment-14054455
 ] 

Shuo Xiang commented on SPARK-2181:
---

Hi, I've merged the latest version, but the Memory Usage column inside the 
RDD Storage Info tab is still not sorted correctly: it is sorted as a string, 
not by the value.
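
For illustration only (not the Spark UI code itself), the difference between the two orderings in Scala:

// Lexicographic vs. numeric ordering of memory sizes.
val sizes = Seq("9.0 MB", "10.0 MB", "100.0 MB")

// String sort puts "10.0 MB" and "100.0 MB" before "9.0 MB".
println(sizes.sorted)                            // List(10.0 MB, 100.0 MB, 9.0 MB)

// Value sort: parse the number first, then compare.
println(sizes.sortBy(_.split(" ")(0).toDouble))  // List(9.0 MB, 10.0 MB, 100.0 MB)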



 The keys for sorting the columns of Executor page in SparkUI are incorrect
 --

 Key: SPARK-2181
 URL: https://issues.apache.org/jira/browse/SPARK-2181
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shuo Xiang
Assignee: Guoqiang Li
Priority: Minor
 Fix For: 1.1.0, 1.0.2


 Under the Executor page of SparkUI, each column is sorted alphabetically 
 (after clicking). However, it should be sorted by the value, not the string.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-2303) Poisson regression model for count data

2014-07-07 Thread Gang Bai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Bai closed SPARK-2303.
---

Resolution: Fixed

 Poisson regression model for count data
 ---

 Key: SPARK-2303
 URL: https://issues.apache.org/jira/browse/SPARK-2303
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Gang Bai

 Modeling count data is of great importance in solving real-world statistical 
 problems. Currently mllib.regression provides models mostly for numeric data, 
 i.e., fitting curves with various regularizations on the resulting weights, but 
 it still lacks support for modeling count data.
 A very basic model for this is Poisson regression. Following the patterns 
 in mllib and reusing its components, we address the parameter estimation for 
 Poisson regression in a maximum-likelihood manner. In detail, to add Poisson 
 regression to mllib.regression, we need to (a rough gradient sketch follows 
 this list):
  # Add the gradient of the negative log-likelihood into 
 mllib/optimization/Gradients.scala.
  # Add the implementations of PoissonRegressionModel, which extends 
 GeneralizedLinearModel with RegressionModel. Here we need the implementation 
 of the predict method.
  # Add the implementations of the generalized linear algorithm class. Here we 
 can use either LBFGS or GradientDescent as the optimizer. So we implement 
 both as class PoissonRegressionWithSGD and class PoissonRegressionWithLBFGS 
 respectively.
  # Add the companion object PoissonRegressionWithSGD and 
 PoissonRegressionWithLBFGS as drivers.
  # Test suites
  ## Test the gradient computation.
  ## Test the regression method using generated data, which requires a 
 PoissonRegressionDataGenerator.
  ## Test the regression method using a real-world data set.
  # Add the documents.
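
A rough standalone sketch of step 1, the gradient of the Poisson negative log-likelihood with a log link (lambda = exp(w . x)); it uses plain arrays rather than MLlib's Vector type, and the names are illustrative, not the actual mllib API:

// Gradient and loss of the Poisson negative log-likelihood for one example,
// dropping the constant log(y!) term.
object PoissonGradientSketch {

  private def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum

  // Returns (gradient, loss) for a single labeled example (x, y).
  def compute(x: Array[Double], y: Double, w: Array[Double]): (Array[Double], Double) = {
    val eta = dot(w, x)                     // linear predictor w . x
    val lambda = math.exp(eta)              // predicted Poisson rate
    val gradient = x.map(_ * (lambda - y))  // d/dw [ exp(w . x) - y * (w . x) ]
    val loss = lambda - y * eta             // negative log-likelihood up to a constant
    (gradient, loss)
  }
}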



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2303) Poisson regression model for count data

2014-07-07 Thread Gang Bai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054470#comment-14054470
 ] 

Gang Bai commented on SPARK-2303:
-

This change has been merged into another JIRA issue, SPARK-2311. Closing this one.

 Poisson regression model for count data
 ---

 Key: SPARK-2303
 URL: https://issues.apache.org/jira/browse/SPARK-2303
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Gang Bai

 Modeling count data is of great importance in solving real-world statistical 
 problems. Currently mllib.regression provides models mostly for numeric data, 
 i.e., fitting curves with various regularizations on the resulting weights, but 
 it still lacks support for modeling count data.
 A very basic model for this is Poisson regression. Following the patterns 
 in mllib and reusing its components, we address the parameter estimation for 
 Poisson regression in a maximum-likelihood manner. In detail, to add Poisson 
 regression to mllib.regression, we need to:
  # Add the gradient of the negative log-likelihood into 
 mllib/optimization/Gradients.scala.
  # Add the implementations of PoissonRegressionModel, which extends 
 GeneralizedLinearModel with RegressionModel. Here we need the implementation 
 of the predict method.
  # Add the implementations of the generalized linear algorithm class. Here we 
 can use either LBFGS or GradientDescent as the optimizer. So we implement 
 both as class PoissonRegressionWithSGD and class PoissonRegressionWithLBFGS 
 respectively.
  # Add the companion object PoissonRegressionWithSGD and 
 PoissonRegressionWithLBFGS as drivers.
  # Test suites
  ## Test the gradient computation.
  ## Test the regression method using generated data, which requires a 
 PoissonRegressionDataGenerator.
  ## Test the regression method using a real-world data set.
  # Add the documents.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2215) Multi-way join

2014-07-07 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054491#comment-14054491
 ] 

Yin Huai commented on SPARK-2215:
-

Instead of implementing a multi-way join operator, how about we just rely on 
AddExchange to insert Exchange operators properly and avoid unnecessary data 
movement?
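
As an illustration of the query shape in question (the table and column names are hypothetical, and this assumes a SQLContext named sqlContext with the three tables already registered): every join uses the same key, so a single partitioning by that key could serve all of them.

// Hypothetical schema: users, orders and clicks all carry a user_id column.
val joined = sqlContext.sql("""
  SELECT u.user_id, o.order_id, c.click_id
  FROM users u
  JOIN orders o ON u.user_id = o.user_id
  JOIN clicks c ON u.user_id = c.user_id
""")
// With all joins keyed on user_id, a planner that inserts Exchange operators
// only when the child partitioning does not already match could shuffle each
// input by user_id once and reuse that partitioning for both joins.
joined.collect().foreach(println)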

 Multi-way join
 --

 Key: SPARK-2215
 URL: https://issues.apache.org/jira/browse/SPARK-2215
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Priority: Minor

 Support the multi-way join (multiple table joins) in a single reduce stage if 
 they have the same join keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2399) Add support for LZ4 compression

2014-07-07 Thread Greg Bowyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Bowyer updated SPARK-2399:
---

Attachment: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch

 Add support for LZ4 compression
 ---

 Key: SPARK-2399
 URL: https://issues.apache.org/jira/browse/SPARK-2399
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Greg Bowyer
  Labels: compression, lz4
 Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch


 LZ4 is a compression codec of the same ideas as googles snappy, but has some 
 advantages:
 * It is faster than snappy with a similar compression ration
 * The implementation is Apache licensed and not GPL
 It has shown promise in both the lucene and hadoop communities, and it looks 
 like its a really easy add to spark io compression.
 Attached is a patch that does this
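
Not the attached patch itself, but a rough sketch of what such a codec could look like against Spark's CompressionCodec trait using the lz4-java (net.jpountz.lz4) library; the class name and the block-size property are illustrative:

import java.io.{InputStream, OutputStream}
import net.jpountz.lz4.{LZ4BlockInputStream, LZ4BlockOutputStream}
import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Sketch only; assumes the lz4-java artifact is on the classpath.
class LZ4CompressionCodecSketch(conf: SparkConf) extends CompressionCodec {

  override def compressedOutputStream(s: OutputStream): OutputStream = {
    // Block size made configurable via a hypothetical property.
    val blockSize = conf.getInt("spark.io.compression.lz4.block.size", 32768)
    new LZ4BlockOutputStream(s, blockSize)
  }

  override def compressedInputStream(s: InputStream): InputStream =
    new LZ4BlockInputStream(s)
}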



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2399) Add support for LZ4 compression

2014-07-07 Thread Greg Bowyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Bowyer updated SPARK-2399:
---

Description: 
LZ4 is a compression codec built on the same ideas as Google's Snappy, but it has some 
advantages:
* It is faster than Snappy with a similar compression ratio
* The implementation is Apache licensed and not GPL

It has shown promise in both the Lucene and Hadoop communities, and it looks 
like a really easy addition to Spark's IO compression.

Attached is a patch that does this.

  was:
LZ4 is a compression codec of the same ideas as googles snappy, but has some 
advantages:
* It is faster than snappy with a similar compression ration
* The implementation is Apache licensed and not GPL

It has shown promise in both the lucene and hadoop communities, and it looks 
like its a really easy add to spark io compression.

Attached is a patch that does this


 Add support for LZ4 compression
 ---

 Key: SPARK-2399
 URL: https://issues.apache.org/jira/browse/SPARK-2399
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Greg Bowyer
  Labels: compression, lz4
 Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch


 LZ4 is a compression codec built on the same ideas as Google's Snappy, but it has some 
 advantages:
 * It is faster than Snappy with a similar compression ratio
 * The implementation is Apache licensed and not GPL
 It has shown promise in both the Lucene and Hadoop communities, and it looks 
 like a really easy addition to Spark's IO compression.
 Attached is a patch that does this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2400) config spark.yarn.max.executor.failures is not explained accurately

2014-07-07 Thread Chen Chao (JIRA)
Chen Chao created SPARK-2400:


 Summary: config spark.yarn.max.executor.failures is not explained 
accurately
 Key: SPARK-2400
 URL: https://issues.apache.org/jira/browse/SPARK-2400
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Chen Chao
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2400) config spark.yarn.max.executor.failures is not explained accurately

2014-07-07 Thread Chen Chao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054540#comment-14054540
 ] 

Chen Chao commented on SPARK-2400:
--

It should be "numExecutors * 2, with a minimum of 3", rather than "2 * numExecutors" as currently documented.
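
As a sketch of that intended default (a paraphrase, not the actual YARN module source; the executor-count lookup here is illustrative):

import org.apache.spark.SparkConf

// Allow twice the number of executors to fail, but never fewer than 3.
val sparkConf = new SparkConf()
val numExecutors = sparkConf.getInt("spark.executor.instances", 2)
val maxExecutorFailures =
  sparkConf.getInt("spark.yarn.max.executor.failures", math.max(numExecutors * 2, 3))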


 config spark.yarn.max.executor.failures is not explained accurately
 ---

 Key: SPARK-2400
 URL: https://issues.apache.org/jira/browse/SPARK-2400
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Chen Chao
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.2#6252)