[jira] [Created] (SPARK-2384) Add tooltips for shuffle write and scheduler delay in UI
Kay Ousterhout created SPARK-2384: - Summary: Add tooltips for shuffle write and scheduler delay in UI Key: SPARK-2384 URL: https://issues.apache.org/jira/browse/SPARK-2384 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor There are a few common points of confusion in the UI that could be clarified with tooltips. We should add tooltips to explain the scheduler delay and the shuffle data (to explain why shuffle read is typically shuffle write, as many many people have expressed confusion about). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2382) build error:
[ https://issues.apache.org/jira/browse/SPARK-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053433#comment-14053433 ] Sean Owen commented on SPARK-2382: -- I can't reproduce this. It is almost certainly a problem with your environment or network, possibly temporary. It's not a Spark issue so I think this should be closed. build error: - Key: SPARK-2382 URL: https://issues.apache.org/jira/browse/SPARK-2382 Project: Spark Issue Type: Question Components: Build Affects Versions: 1.0.0 Environment: Ubuntu 12.0.4 precise. spark@ubuntu-cdh5-spark:~/spark-1.0.0$ mvn -version Apache Maven 3.0.4 Maven home: /usr/share/maven Java version: 1.6.0_31, vendor: Sun Microsystems Inc. Java home: /usr/lib/jvm/j2sdk1.6-oracle/jre Default locale: en_US, platform encoding: UTF-8 OS name: linux, version: 3.11.0-15-generic, arch: amd64, family: unix Reporter: Mukul Jain Labels: newbie Unable to build. maven can't download dependency .. checked my http_proxy and https_proxy setting they are working fine. Other http and https dependencies were downloaded fine.. build process gets stuck always at this repo. manually down loading also fails and receive an exception. [INFO] [INFO] Building Spark Project External MQTT 1.0.0 [INFO] Downloading: https://repository.apache.org/content/repositories/releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: I/O exception (java.net.ConnectException) caught when processing request: Connection timed out Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: Retrying request -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2386) RowWriteSupport should use the exact types to cast.
[ https://issues.apache.org/jira/browse/SPARK-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053537#comment-14053537 ] Takuya Ueshin commented on SPARK-2386: -- PRed: https://github.com/apache/spark/pull/1315 RowWriteSupport should use the exact types to cast. --- Key: SPARK-2386 URL: https://issues.apache.org/jira/browse/SPARK-2386 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin When execute {{saveAsParquetFile}} with non-primitive type, {{RowWriteSupport}} uses wrong type {{Int}} for {{ByteType}} and {{ShortType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2387) Remove the stage barrier for better resource utilization
Rui Li created SPARK-2387: - Summary: Remove the stage barrier for better resource utilization Key: SPARK-2387 URL: https://issues.apache.org/jira/browse/SPARK-2387 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Rui Li DAGScheduler divides a Spark job into multiple stages according to RDD dependencies. Whenever there’s a shuffle dependency, DAGScheduler creates a shuffle map stage on the map side, and another stage depending on that stage. Currently, the downstream stage cannot start until all its depended stages have finished. This barrier between stages leads to idle slots when waiting for the last few upstream tasks to finish and thus wasting cluster resources. Therefore we propose to remove the barrier and pre-start the reduce stage once there're free slots. This can achieve better resource utilization and improve the overall job performance, especially when there're lots of executors granted to the application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2388) Streaming from multiple different Kafka topics is problematic
Sergey created SPARK-2388: - Summary: Streaming from multiple different Kafka topics is problematic Key: SPARK-2388 URL: https://issues.apache.org/jira/browse/SPARK-2388 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Sergey Fix For: 1.0.1 Default way of creating stream out of Kafka source would be as val stream = KafkaUtils.createStream(ssc,localhost:2181,logs, Map(retarget - 2,datapair - 2)) However, if two topics - in this case retarget and datapair - are very different, there is no way to set up different filter, mapping functions, etc), as they are effectively merged. However, instance of KafkaInputDStream, created with this call internally calls ConsumerConnector.createMessageStream() which returns *map* of KafkaStreams, keyed by topic. It would be great if this map would be exposed somehow, so aforementioned call val streamS = KafkaUtils.createStreamS(...) returned map of streams. Regards, Sergey Malov Collective Media -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2389) globally shared SparkContext / shared Spark application
Robert Stupp created SPARK-2389: --- Summary: globally shared SparkContext / shared Spark application Key: SPARK-2389 URL: https://issues.apache.org/jira/browse/SPARK-2389 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Robert Stupp The documentation (in Cluster Mode Overview) cites: bq. Each application gets its own executor processes, which *stay up for the duration of the whole application* and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that *data cannot be shared* across different Spark applications (instances of SparkContext) without writing it to an external storage system. IMO this is a limitation that should be lifted to support any number of --driver-- client processes to share executors and to share (persistent / cached) data. This is especially useful if you have a bunch of frontend servers (dump web app servers) that want to use Spark as a _big computing machine_. Most important is the fact that Spark is quite good in caching/persisting data in memory / on disk thus removing load from backend data stores. Means: it would be really great to let different --driver-- client JVMs operate on the same RDDs and benefit from Spark's caching/persistence. It would however introduce some administration mechanisms to * start a shared context * update the executor configuration (# of worker nodes, # of cpus, etc) on the fly * stop a shared context Even conventional batch MR applications would benefit if ran fequently against the same data set. As an implicit requirement, RDD persistence could get a TTL for its materialized state. With such a feature the overall performance of today's web applications could then be increased by adding more web app servers, more spark nodes, more nosql nodes etc -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2384) Add tooltips for shuffle write and scheduler delay in UI
[ https://issues.apache.org/jira/browse/SPARK-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053834#comment-14053834 ] Sandy Ryza commented on SPARK-2384: --- This is a great idea Add tooltips for shuffle write and scheduler delay in UI Key: SPARK-2384 URL: https://issues.apache.org/jira/browse/SPARK-2384 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor There are a few common points of confusion in the UI that could be clarified with tooltips. We should add tooltips to explain the scheduler delay and the shuffle data (to explain why shuffle read is typically shuffle write, as many many people have expressed confusion about). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2160) error of Decision tree algorithm in Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053938#comment-14053938 ] Jon Sondag commented on SPARK-2160: --- https://github.com/apache/spark/pull/1316 (also resolves SPARK-2152) error of Decision tree algorithm in Spark MLlib -- Key: SPARK-2160 URL: https://issues.apache.org/jira/browse/SPARK-2160 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: caoli Labels: patch Fix For: 1.1.0 Original Estimate: 4h Remaining Estimate: 4h the error of comput rightNodeAgg about Decision tree algorithm in Spark MLlib , in the function extractLeftRightNodeAggregates() ,when compute rightNodeAgg used bindata index is error. in the DecisionTree.scala file about Line980: rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) = binData(shift + (2 * (numBins - 2 - splitIndex))) + rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex)) the binData(shift + (2 * (numBins - 2 - splitIndex))) index compute is error, so the result of rightNodeAgg include repeated data about bins -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
Kousuke Saruta created SPARK-2390: - Summary: Files in staging directory cannot be deleted and wastes the space of HDFS Key: SPARK-2390 URL: https://issues.apache.org/jira/browse/SPARK-2390 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kousuke Saruta When running jobs with YARN Cluster mode and using HistoryServer, the files in the Staging Directory cannot be deleted. HistoryServer uses directory where event log is written, and the directory is represented as a instance of o.a.h.f.FileSystem created by using FileSystem.get. {code:title=FileLogger.scala} private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) {code} {code:title=utils.getHadoopFileSystem} def getHadoopFileSystem(path: URI): FileSystem = { FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) } {code} On the other hand, ApplicationMaster has a instance named fs, which also created by using FileSystem.get. {code:title=ApplicationMaster} private val fs = FileSystem.get(yarnConf) {code} FileSystem.get returns cached same instance when URI passed to the method represents same file system and the method is called by same user. Because of the behavior, when the directory for event log is on HDFS, fs of ApplicationMaster and fileSystem of FileLogger is same instance. When shutting down ApplicationMaster, fileSystem.close is called in FileLogger#stop, which is invoked by SparkContext#stop indirectly. {code:title=FileLogger.stop} def stop() { hadoopDataStream.foreach(_.close()) writer.foreach(_.close()) fileSystem.close() } {code} And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In this method, fs.delete(stagingDirPath) is invoked. Because fs.delete in ApplicationMaster is called after fileSystem.close in FileLogger, fs.delete fails and results not deleting files in the staging directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2391) LIMIT queries ship a whole partition of data
[ https://issues.apache.org/jira/browse/SPARK-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2391: Fix Version/s: 1.1.0 LIMIT queries ship a whole partition of data Key: SPARK-2391 URL: https://issues.apache.org/jira/browse/SPARK-2391 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Basically the problem here is that Spark's take() runs jobs using allowLocal = true. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2391) LIMIT queries ship a whole partition of data
Michael Armbrust created SPARK-2391: --- Summary: LIMIT queries ship a whole partition of data Key: SPARK-2391 URL: https://issues.apache.org/jira/browse/SPARK-2391 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Basically the problem here is that Spark's take() runs jobs using allowLocal = true. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2391) LIMIT queries ship a whole partition of data
[ https://issues.apache.org/jira/browse/SPARK-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2391: Target Version/s: 1.1.0 Fix Version/s: (was: 1.1.0) LIMIT queries ship a whole partition of data Key: SPARK-2391 URL: https://issues.apache.org/jira/browse/SPARK-2391 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Basically the problem here is that Spark's take() runs jobs using allowLocal = true. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053994#comment-14053994 ] Xiangrui Meng commented on SPARK-1977: -- I think now I understand when it happens. We use storage level MEMORY_AND_DISK for user/product in/out links, which contains BitSet objects. If the dataset is large, these RDDs will be pushed from in memory storage to on disk storage, where the latter requires serialization. So the easiest way to re-produce this error is changing the storage level of inLinks/outLinks to DISK_ONLY and run with kryo. [~neville] Instead of mapping mutable.BitSet to immutable.BitSet, which introduces overhead, we can register mutable.BitSet in our MovieLensALS example code and wait for the next Chill release. Does it sound good to you? mutable.BitSet in ALS not serializable with KryoSerializer -- Key: SPARK-1977 URL: https://issues.apache.org/jira/browse/SPARK-1977 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Neville Li Priority: Minor OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member. KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't register mutable.BitSet. Right now we have to register mutable.BitSet manually. A proper fix would be using immutable.BitSet in ALS or register mutable.BitSet in upstream chill. {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet Serialization trace: shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at
[jira] [Updated] (SPARK-2386) RowWriteSupport should use the exact types to cast.
[ https://issues.apache.org/jira/browse/SPARK-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2386: Target Version/s: 1.1.0 RowWriteSupport should use the exact types to cast. --- Key: SPARK-2386 URL: https://issues.apache.org/jira/browse/SPARK-2386 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin When execute {{saveAsParquetFile}} with non-primitive type, {{RowWriteSupport}} uses wrong type {{Int}} for {{ByteType}} and {{ShortType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053999#comment-14053999 ] Neville Li commented on SPARK-1977: --- [~mengxr] sounds good to me. mutable.BitSet in ALS not serializable with KryoSerializer -- Key: SPARK-1977 URL: https://issues.apache.org/jira/browse/SPARK-1977 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Neville Li Priority: Minor OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member. KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't register mutable.BitSet. Right now we have to register mutable.BitSet manually. A proper fix would be using immutable.BitSet in ALS or register mutable.BitSet in upstream chill. {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet Serialization trace: shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at
[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054003#comment-14054003 ] Xiangrui Meng commented on SPARK-1977: -- Do you mind creating a PR registering mutable.BitSet in MovieLensALS.scala and close PR #925? Thanks! mutable.BitSet in ALS not serializable with KryoSerializer -- Key: SPARK-1977 URL: https://issues.apache.org/jira/browse/SPARK-1977 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Neville Li Priority: Minor OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member. KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't register mutable.BitSet. Right now we have to register mutable.BitSet manually. A proper fix would be using immutable.BitSet in ALS or register mutable.BitSet in upstream chill. {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet Serialization trace: shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236)
[jira] [Updated] (SPARK-2311) Added additional GLMs (Poisson and Gamma) into MLlib
[ https://issues.apache.org/jira/browse/SPARK-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaokai Wei updated SPARK-2311: --- Priority: Major (was: Minor) Added additional GLMs (Poisson and Gamma) into MLlib Key: SPARK-2311 URL: https://issues.apache.org/jira/browse/SPARK-2311 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiaokai Wei Though GeneralizedLinearModel in MLlib 1.0.0 has some important GLMs such as Logistic Regression, Linear Regression, some other important GLMs like Poisson Regression are still missing. Poisson Regression and Gamma Regression are two widely used models in industry with many applications. This patch added Poisson and Gamma Regression as additional GeneralizedLinearModels. https://github.com/apache/spark/pull/1237 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
[ https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2390: -- Affects Version/s: 1.0.1 Files in staging directory cannot be deleted and wastes the space of HDFS - Key: SPARK-2390 URL: https://issues.apache.org/jira/browse/SPARK-2390 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Kousuke Saruta When running jobs with YARN Cluster mode and using HistoryServer, the files in the Staging Directory cannot be deleted. HistoryServer uses directory where event log is written, and the directory is represented as a instance of o.a.h.f.FileSystem created by using FileSystem.get. {code:title=FileLogger.scala} private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) {code} {code:title=utils.getHadoopFileSystem} def getHadoopFileSystem(path: URI): FileSystem = { FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) } {code} On the other hand, ApplicationMaster has a instance named fs, which also created by using FileSystem.get. {code:title=ApplicationMaster} private val fs = FileSystem.get(yarnConf) {code} FileSystem.get returns cached same instance when URI passed to the method represents same file system and the method is called by same user. Because of the behavior, when the directory for event log is on HDFS, fs of ApplicationMaster and fileSystem of FileLogger is same instance. When shutting down ApplicationMaster, fileSystem.close is called in FileLogger#stop, which is invoked by SparkContext#stop indirectly. {code:title=FileLogger.stop} def stop() { hadoopDataStream.foreach(_.close()) writer.foreach(_.close()) fileSystem.close() } {code} And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In this method, fs.delete(stagingDirPath) is invoked. Because fs.delete in ApplicationMaster is called after fileSystem.close in FileLogger, fs.delete fails and results not deleting files in the staging directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
[ https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054041#comment-14054041 ] Kousuke Saruta commented on SPARK-2390: --- I modified the way to create a instance of FileSystem in FileLogger as follows, the symptom didn't appear. {code} private val fileSystem = FileSystem.newInstance(new URI(logDir), SparkHadoopUtil.get.newConfiguration()) {code} Files in staging directory cannot be deleted and wastes the space of HDFS - Key: SPARK-2390 URL: https://issues.apache.org/jira/browse/SPARK-2390 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Kousuke Saruta When running jobs with YARN Cluster mode and using HistoryServer, the files in the Staging Directory cannot be deleted. HistoryServer uses directory where event log is written, and the directory is represented as a instance of o.a.h.f.FileSystem created by using FileSystem.get. {code:title=FileLogger.scala} private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) {code} {code:title=utils.getHadoopFileSystem} def getHadoopFileSystem(path: URI): FileSystem = { FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) } {code} On the other hand, ApplicationMaster has a instance named fs, which also created by using FileSystem.get. {code:title=ApplicationMaster} private val fs = FileSystem.get(yarnConf) {code} FileSystem.get returns cached same instance when URI passed to the method represents same file system and the method is called by same user. Because of the behavior, when the directory for event log is on HDFS, fs of ApplicationMaster and fileSystem of FileLogger is same instance. When shutting down ApplicationMaster, fileSystem.close is called in FileLogger#stop, which is invoked by SparkContext#stop indirectly. {code:title=FileLogger.stop} def stop() { hadoopDataStream.foreach(_.close()) writer.foreach(_.close()) fileSystem.close() } {code} And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In this method, fs.delete(stagingDirPath) is invoked. Because fs.delete in ApplicationMaster is called after fileSystem.close in FileLogger, fs.delete fails and results not deleting files in the staging directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054043#comment-14054043 ] Neville Li commented on SPARK-1977: --- There you go: https://github.com/apache/spark/pull/1319 mutable.BitSet in ALS not serializable with KryoSerializer -- Key: SPARK-1977 URL: https://issues.apache.org/jira/browse/SPARK-1977 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Neville Li Priority: Minor OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member. KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't register mutable.BitSet. Right now we have to register mutable.BitSet manually. A proper fix would be using immutable.BitSet in ALS or register mutable.BitSet in upstream chill. {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet Serialization trace: shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at
[jira] [Created] (SPARK-2392) Executors should not start an HTTP server
Andrew Or created SPARK-2392: Summary: Executors should not start an HTTP server Key: SPARK-2392 URL: https://issues.apache.org/jira/browse/SPARK-2392 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or In the long run we should separate out classes used by the driver vs executors in SparkEnv. For now, we should at least not start an unused HTTP on every executor. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2392) Executors should not start their own HTTP servers
[ https://issues.apache.org/jira/browse/SPARK-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2392: - Summary: Executors should not start their own HTTP servers (was: Executors should not start an HTTP server) Executors should not start their own HTTP servers - Key: SPARK-2392 URL: https://issues.apache.org/jira/browse/SPARK-2392 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or In the long run we should separate out classes used by the driver vs executors in SparkEnv. For now, we should at least not start an unused HTTP on every executor. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
[ https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054162#comment-14054162 ] Mridul Muralidharan commented on SPARK-2390: Here, and a bunch of other places, spark currently closes the Filesystem instance : this is incorrect, and should not be done. The fix would be to remove the fs.close; not force creation of new instances. Files in staging directory cannot be deleted and wastes the space of HDFS - Key: SPARK-2390 URL: https://issues.apache.org/jira/browse/SPARK-2390 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Kousuke Saruta When running jobs with YARN Cluster mode and using HistoryServer, the files in the Staging Directory cannot be deleted. HistoryServer uses directory where event log is written, and the directory is represented as a instance of o.a.h.f.FileSystem created by using FileSystem.get. {code:title=FileLogger.scala} private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) {code} {code:title=utils.getHadoopFileSystem} def getHadoopFileSystem(path: URI): FileSystem = { FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) } {code} On the other hand, ApplicationMaster has a instance named fs, which also created by using FileSystem.get. {code:title=ApplicationMaster} private val fs = FileSystem.get(yarnConf) {code} FileSystem.get returns cached same instance when URI passed to the method represents same file system and the method is called by same user. Because of the behavior, when the directory for event log is on HDFS, fs of ApplicationMaster and fileSystem of FileLogger is same instance. When shutting down ApplicationMaster, fileSystem.close is called in FileLogger#stop, which is invoked by SparkContext#stop indirectly. {code:title=FileLogger.stop} def stop() { hadoopDataStream.foreach(_.close()) writer.foreach(_.close()) fileSystem.close() } {code} And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In this method, fs.delete(stagingDirPath) is invoked. Because fs.delete in ApplicationMaster is called after fileSystem.close in FileLogger, fs.delete fails and results not deleting files in the staging directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2212) Hash Outer Joins
[ https://issues.apache.org/jira/browse/SPARK-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2212: Priority: Minor (was: Major) Hash Outer Joins Key: SPARK-2212 URL: https://issues.apache.org/jira/browse/SPARK-2212 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2179) Public API for DataTypes and Schema
[ https://issues.apache.org/jira/browse/SPARK-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2179: Priority: Critical (was: Major) Public API for DataTypes and Schema --- Key: SPARK-2179 URL: https://issues.apache.org/jira/browse/SPARK-2179 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Yin Huai Priority: Critical We want something like the following: * Expose DataType in the SQL package and lock down all the internal details (TypeTags, etc) * Programatic API for viewing the schema of an RDD as a StructType * Method that creates a schema RDD given (RDD[A], StructType, A = Row) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2391) LIMIT queries ship a whole partition of data
[ https://issues.apache.org/jira/browse/SPARK-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2391: Priority: Blocker (was: Major) LIMIT queries ship a whole partition of data Key: SPARK-2391 URL: https://issues.apache.org/jira/browse/SPARK-2391 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Basically the problem here is that Spark's take() runs jobs using allowLocal = true. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2054) Code Generation for Expression Evaluation
[ https://issues.apache.org/jira/browse/SPARK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2054: Priority: Critical (was: Major) Code Generation for Expression Evaluation - Key: SPARK-2054 URL: https://issues.apache.org/jira/browse/SPARK-2054 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2097) UDF Support
[ https://issues.apache.org/jira/browse/SPARK-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2097: Priority: Critical (was: Major) UDF Support --- Key: SPARK-2097 URL: https://issues.apache.org/jira/browse/SPARK-2097 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Right now we only support UDFs that are written against the Hive API or are called directly as expressions in the DSL. It would be nice to have native support for registering scala/python functions as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2087) Support SQLConf per session
[ https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2087: Priority: Minor (was: Major) Support SQLConf per session --- Key: SPARK-2087 URL: https://issues.apache.org/jira/browse/SPARK-2087 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Priority: Minor For things like the SharkServer we should support configuration per thread instead of globally for a context. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2190) Specialized ColumnType for Timestamp
[ https://issues.apache.org/jira/browse/SPARK-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2190: Priority: Critical (was: Major) Specialized ColumnType for Timestamp Key: SPARK-2190 URL: https://issues.apache.org/jira/browse/SPARK-2190 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical I'm going to call this a bug since currently its like 300X slower than it needs to be. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.
[ https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2205: Priority: Minor (was: Major) Unnecessary exchange operators in a join on multiple tables with the same join key. --- Key: SPARK-2205 URL: https://issues.apache.org/jira/browse/SPARK-2205 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Minor {code} hql(select * from src x join src y on (x.key=y.key) join src z on (y.key=z.key)) SchemaRDD[1] at RDD at SchemaRDD.scala:100 == Query Plan == Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5] HashJoin [key#6], [key#8], BuildRight Exchange (HashPartitioning [key#6], 200) HashJoin [key#4], [key#6], BuildRight Exchange (HashPartitioning [key#4], 200) HiveTableScan [key#4,value#5], (MetastoreRelation default, src, Some(x)), None Exchange (HashPartitioning [key#6], 200) HiveTableScan [key#6,value#7], (MetastoreRelation default, src, Some(y)), None Exchange (HashPartitioning [key#8], 200) HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), None {code} However, this is fine... {code} hql(select * from src x join src y on (x.key=y.key) join src z on (x.key=z.key)) res5: org.apache.spark.sql.SchemaRDD = SchemaRDD[5] at RDD at SchemaRDD.scala:100 == Query Plan == Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5] HashJoin [key#26], [key#30], BuildRight HashJoin [key#26], [key#28], BuildRight Exchange (HashPartitioning [key#26], 200) HiveTableScan [key#26,value#27], (MetastoreRelation default, src, Some(x)), None Exchange (HashPartitioning [key#28], 200) HiveTableScan [key#28,value#29], (MetastoreRelation default, src, Some(y)), None Exchange (HashPartitioning [key#30], 200) HiveTableScan [key#30,value#31], (MetastoreRelation default, src, Some(z)), None {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2375) JSON schema inference may not resolve type conflicts correctly for a field inside an array of structs
[ https://issues.apache.org/jira/browse/SPARK-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2375: Target Version/s: 1.1.0 Fix Version/s: (was: 1.1.0) JSON schema inference may not resolve type conflicts correctly for a field inside an array of structs - Key: SPARK-2375 URL: https://issues.apache.org/jira/browse/SPARK-2375 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1 Reporter: Yin Huai For example, for {code} {array: [{field:214748364700}, {field:1}]} {code} the type of field is resolved as IntType. While, for {code} {array: [{field:1}, {field:214748364700}]} {code} the type of field is resolved as LongType. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054178#comment-14054178 ] Masayoshi TSUZUKI commented on SPARK-2017: -- Pagination seems to be better because with aggregated metrics, 1. we can't identify the skew of tasks between the executors. 2. the same problem will appear again when many tasks fail in a certain stage. In addition, when some errors or problems occur under the production environment, we would like to see the status of tasks near the time even if those tasks mostly succeeded. Although every status of tasks is written in the log file, web ui is very useful in operation phase. web ui stage page becomes unresponsive when the number of tasks is large Key: SPARK-2017 URL: https://issues.apache.org/jira/browse/SPARK-2017 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Labels: starter {code} sc.parallelize(1 to 100, 100).count() {code} The above code creates one million tasks to be executed. The stage detail web ui page takes forever to load (if it ever completes). There are again a few different alternatives: 0. Limit the number of tasks we show. 1. Pagination 2. By default only show the aggregate metrics and failed tasks, and hide the successful ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2376) Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2376: Target Version/s: 1.1.0 Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException -- Key: SPARK-2376 URL: https://issues.apache.org/jira/browse/SPARK-2376 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1 Reporter: Nicholas Chammas Repro script for PySpark, deployed via {{spark-ec2}} at git revision {{9d5ecf8205b924dc8a3c13fed68beb78cc5c7553}}: {code} from pyspark.sql import SQLContext sqlContext = SQLContext(sc) raw = sc.parallelize([ { name: Nick, history: { countries: [] } } ]) profiles = sqlContext.jsonRDD(raw) profiles.registerAsTable(profiles) profiles.printSchema() sqlContext.sql(SELECT name FROM profiles;).collect() # works fine sqlContext.sql(SELECT history FROM profiles;).collect() # raises exception {code} Attempting to select the top-level struct that has a nested list value yields the following error: {code} 14/07/06 00:10:26 INFO scheduler.TaskSetManager: Loss was due to net.razorvine.pickle.PickleException: couldn't introspect javabean: java.lang.IllegalArgumentException: wrong number of arguments [duplicate 3] 14/07/06 00:10:26 ERROR scheduler.TaskSetManager: Task 26.0:15 failed 4 times; aborting job 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 26.0, whose tasks have all completed, from pool 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Cancelling stage 26 14/07/06 00:10:26 INFO scheduler.DAGScheduler: Failed to run collect at stdin:1 Traceback (most recent call last): File stdin, line 1, in module File /root/spark/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o286.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26.0:15 failed 4 times, most recent failure: Exception failure in TID 394 on host ip-10-183-59-125.ec2.internal: net.razorvine.pickle.PickleException: couldn't introspect javabean: java.lang.IllegalArgumentException: wrong number of arguments net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603) net.razorvine.pickle.Pickler.dispatch(Pickler.java:299) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.put_map(Pickler.java:322) net.razorvine.pickle.Pickler.dispatch(Pickler.java:286) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.put_map(Pickler.java:322) net.razorvine.pickle.Pickler.dispatch(Pickler.java:286) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392) net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.dump(Pickler.java:95) net.razorvine.pickle.Pickler.dumps(Pickler.java:80) org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385) org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:317) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1213) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1041) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1025) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1023) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at
[jira] [Created] (SPARK-2393) Simple cost estimation and auto selection of broadcast join
Michael Armbrust created SPARK-2393: --- Summary: Simple cost estimation and auto selection of broadcast join Key: SPARK-2393 URL: https://issues.apache.org/jira/browse/SPARK-2393 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
[ https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2339: Target Version/s: 1.1.0 SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery -- Key: SPARK-2339 URL: https://issues.apache.org/jira/browse/SPARK-2339 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Yin Huai Reported by http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html After we get the table from the catalog, because the table has an alias, we will temporarily insert a Subquery. Then, we convert the table alias to lower case no matter if the parser is case sensitive or not. To see the issue ... {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Person(name: String, age: Int) val people = sc.textFile(examples/src/main/resources/people.txt).map(_.split(,)).map(p = Person(p(0), p(1).trim.toInt)) people.registerAsTable(people) sqlContext.sql(select PEOPLE.name from people PEOPLE) {code} The plan is ... {code} == Query Plan == Project ['PEOPLE.name] ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:176 {code} You can find that PEOPLE.name is not resolved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
[ https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2339: Fix Version/s: (was: 1.1.0) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery -- Key: SPARK-2339 URL: https://issues.apache.org/jira/browse/SPARK-2339 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Yin Huai Reported by http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html After we get the table from the catalog, because the table has an alias, we will temporarily insert a Subquery. Then, we convert the table alias to lower case no matter if the parser is case sensitive or not. To see the issue ... {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Person(name: String, age: Int) val people = sc.textFile(examples/src/main/resources/people.txt).map(_.split(,)).map(p = Person(p(0), p(1).trim.toInt)) people.registerAsTable(people) sqlContext.sql(select PEOPLE.name from people PEOPLE) {code} The plan is ... {code} == Query Plan == Project ['PEOPLE.name] ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:176 {code} You can find that PEOPLE.name is not resolved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.
[ https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2205: Target Version/s: 1.1.0 (was: 1.0.1, 1.1.0) Unnecessary exchange operators in a join on multiple tables with the same join key. --- Key: SPARK-2205 URL: https://issues.apache.org/jira/browse/SPARK-2205 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Minor {code} hql(select * from src x join src y on (x.key=y.key) join src z on (y.key=z.key)) SchemaRDD[1] at RDD at SchemaRDD.scala:100 == Query Plan == Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5] HashJoin [key#6], [key#8], BuildRight Exchange (HashPartitioning [key#6], 200) HashJoin [key#4], [key#6], BuildRight Exchange (HashPartitioning [key#4], 200) HiveTableScan [key#4,value#5], (MetastoreRelation default, src, Some(x)), None Exchange (HashPartitioning [key#6], 200) HiveTableScan [key#6,value#7], (MetastoreRelation default, src, Some(y)), None Exchange (HashPartitioning [key#8], 200) HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), None {code} However, this is fine... {code} hql(select * from src x join src y on (x.key=y.key) join src z on (x.key=z.key)) res5: org.apache.spark.sql.SchemaRDD = SchemaRDD[5] at RDD at SchemaRDD.scala:100 == Query Plan == Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5] HashJoin [key#26], [key#30], BuildRight HashJoin [key#26], [key#28], BuildRight Exchange (HashPartitioning [key#26], 200) HiveTableScan [key#26,value#27], (MetastoreRelation default, src, Some(x)), None Exchange (HashPartitioning [key#28], 200) HiveTableScan [key#28,value#29], (MetastoreRelation default, src, Some(y)), None Exchange (HashPartitioning [key#30], 200) HiveTableScan [key#30,value#31], (MetastoreRelation default, src, Some(z)), None {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
[ https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2339: Component/s: SQL SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery -- Key: SPARK-2339 URL: https://issues.apache.org/jira/browse/SPARK-2339 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Yin Huai Assignee: Yin Huai Reported by http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html After we get the table from the catalog, because the table has an alias, we will temporarily insert a Subquery. Then, we convert the table alias to lower case no matter if the parser is case sensitive or not. To see the issue ... {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Person(name: String, age: Int) val people = sc.textFile(examples/src/main/resources/people.txt).map(_.split(,)).map(p = Person(p(0), p(1).trim.toInt)) people.registerAsTable(people) sqlContext.sql(select PEOPLE.name from people PEOPLE) {code} The plan is ... {code} == Query Plan == Project ['PEOPLE.name] ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:176 {code} You can find that PEOPLE.name is not resolved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2384) Add tooltips for shuffle write and scheduler delay in UI
[ https://issues.apache.org/jira/browse/SPARK-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054207#comment-14054207 ] Kay Ousterhout commented on SPARK-2384: --- https://github.com/apache/spark/pull/1314 Add tooltips for shuffle write and scheduler delay in UI Key: SPARK-2384 URL: https://issues.apache.org/jira/browse/SPARK-2384 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor There are a few common points of confusion in the UI that could be clarified with tooltips. We should add tooltips to explain the scheduler delay and the shuffle data (to explain why shuffle read is typically shuffle write, as many many people have expressed confusion about). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters
Nicholas Chammas created SPARK-2394: --- Summary: Make it easier to read LZO-compressed files from EC2 clusters Key: SPARK-2394 URL: https://issues.apache.org/jira/browse/SPARK-2394 Project: Spark Issue Type: Improvement Components: EC2, Input/Output Affects Versions: 1.0.0 Reporter: Nicholas Chammas Priority: Minor Amazon hosts [a large Google n-grams data set on S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is perfect, among other things, for putting together interesting and easily reproducible public demos of Spark's capabilities. The problem is that the data set is compressed using LZO, and it is currently more painful than it should be to get your average {{spark-ec2}} cluster to read input compressed in this way. This is what one has to go through to get a Spark cluster created with {{spark-ec2}} to read LZO-compressed files: # Install the latest LZO release, perhaps via {{yum}}. # Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build it. To build {{hadoop-lzo}} you need Maven. # Install Maven. For some reason, [you cannot install Maven with {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum], so install it manually. # Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E]. # Make [the appropriate calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E] to {{sc.newAPIHadoopFile}}. This seems like a bit too much work for what we're trying to accomplish. If we expect this to be a common pattern -- reading LZO-compressed files from a {{spark-ec2}} cluster -- it would be great if we could somehow make this less painful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1977. -- Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Issue resolved by pull request 1319 [https://github.com/apache/spark/pull/1319] mutable.BitSet in ALS not serializable with KryoSerializer -- Key: SPARK-1977 URL: https://issues.apache.org/jira/browse/SPARK-1977 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Neville Li Priority: Minor Fix For: 1.0.1, 1.1.0 OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member. KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't register mutable.BitSet. Right now we have to register mutable.BitSet manually. A proper fix would be using immutable.BitSet in ALS or register mutable.BitSet in upstream chill. {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet Serialization trace: shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at
[jira] [Closed] (SPARK-2378) Implement functionality to read csv files
[ https://issues.apache.org/jira/browse/SPARK-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-2378. --- Resolution: Duplicate Implement functionality to read csv files - Key: SPARK-2378 URL: https://issues.apache.org/jira/browse/SPARK-2378 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 1.0.0 Reporter: Hossein Falaki Priority: Minor Fix For: 1.0.0 Similar to jsonFile(), csvFile() could be used to read data into a SchemaRDD (or a normal RDD). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2360: Description: I think the first step it to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - how to handle multiple files with possibly different schemas - do we have a method to let users specify the datatypes of the columns or are they just strings? - what types of quoting / escaping do we want to support? CSV import to SchemaRDDs Key: SPARK-2360 URL: https://issues.apache.org/jira/browse/SPARK-2360 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Minor I think the first step it to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - how to handle multiple files with possibly different schemas - do we have a method to let users specify the datatypes of the columns or are they just strings? - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054262#comment-14054262 ] Hossein Falaki commented on SPARK-2360: --- As a point for comparison the interface in some other popular packages are: __R__: ``` read.csv(filePath, header = TRUE, sep = ,, quote = \, dec = ., fill = TRUE, comment.char = , ...) ``` Where: header: a logical value indicating whether the file contains the names of the variables as its first line. sep: the field separator character. quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ’ dec: the character used in the file for decimal points. fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. __pandas__: ``` pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) ``` The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html CSV import to SchemaRDDs Key: SPARK-2360 URL: https://issues.apache.org/jira/browse/SPARK-2360 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Minor I think the first step it to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - how to handle multiple files with possibly different schemas - do we have a method to let users specify the datatypes of the columns or are they just strings? - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054262#comment-14054262 ] Hossein Falaki edited comment on SPARK-2360 at 7/7/14 11:03 PM: As a point for comparison the interface in some other popular packages are: _R_: {code} read.csv(filePath, header = TRUE, sep = ,, quote = \, dec = ., fill = TRUE, comment.char = , ...) {code} Where: header: a logical value indicating whether the file contains the names of the variables as its first line. sep: the field separator character. quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ’ dec: the character used in the file for decimal points. fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. __pandas__: ``` pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) ``` The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html was (Author: falaki): As a point for comparison the interface in some other popular packages are: __R__: ``` read.csv(filePath, header = TRUE, sep = ,, quote = \, dec = ., fill = TRUE, comment.char = , ...) ``` Where: header: a logical value indicating whether the file contains the names of the variables as its first line. sep: the field separator character. quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ’ dec: the character used in the file for decimal points. fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. __pandas__: ``` pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) ``` The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html CSV import to SchemaRDDs Key: SPARK-2360 URL: https://issues.apache.org/jira/browse/SPARK-2360 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Minor I think the first step it to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - how to handle multiple files with possibly different schemas - do we have a method to let users specify the datatypes of the columns or are they just strings? - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054262#comment-14054262 ] Hossein Falaki edited comment on SPARK-2360 at 7/7/14 11:04 PM: As a point for comparison the interface in some other popular packages are: _R_: {code} read.csv(filePath, header = TRUE, sep = ,, quote = \, dec = ., fill = TRUE, comment.char = , ...) {code} Where: header: a logical value indicating whether the file contains the names of the variables as its first line. sep: the field separator character. quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ’ dec: the character used in the file for decimal points. fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. _pandas_: {code} pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) {code} The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html was (Author: falaki): As a point for comparison the interface in some other popular packages are: _R_: {code} read.csv(filePath, header = TRUE, sep = ,, quote = \, dec = ., fill = TRUE, comment.char = , ...) {code} Where: header: a logical value indicating whether the file contains the names of the variables as its first line. sep: the field separator character. quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ’ dec: the character used in the file for decimal points. fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. __pandas__: ``` pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) ``` The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html CSV import to SchemaRDDs Key: SPARK-2360 URL: https://issues.apache.org/jira/browse/SPARK-2360 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Minor I think the first step it to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - how to handle multiple files with possibly different schemas - do we have a method to let users specify the datatypes of the columns or are they just strings? - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2395) Optimize common like expressions
Michael Armbrust created SPARK-2395: --- Summary: Optimize common like expressions Key: SPARK-2395 URL: https://issues.apache.org/jira/browse/SPARK-2395 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust LIKE queries that are really just a contains, startsWith or endsWith would be much faster if we avoided regular expressions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
[ https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054298#comment-14054298 ] Kousuke Saruta commented on SPARK-2390: --- Thank you for your comment [~mridulm80]. I noticed FileSystem is to be closed by shutdown hook, and the shutdown hook deletes files in the staging directory and then close FileSystem. So, in this case, I also think there is no problem to remove fs.close. Files in staging directory cannot be deleted and wastes the space of HDFS - Key: SPARK-2390 URL: https://issues.apache.org/jira/browse/SPARK-2390 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Kousuke Saruta When running jobs with YARN Cluster mode and using HistoryServer, the files in the Staging Directory cannot be deleted. HistoryServer uses directory where event log is written, and the directory is represented as a instance of o.a.h.f.FileSystem created by using FileSystem.get. {code:title=FileLogger.scala} private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) {code} {code:title=utils.getHadoopFileSystem} def getHadoopFileSystem(path: URI): FileSystem = { FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) } {code} On the other hand, ApplicationMaster has a instance named fs, which also created by using FileSystem.get. {code:title=ApplicationMaster} private val fs = FileSystem.get(yarnConf) {code} FileSystem.get returns cached same instance when URI passed to the method represents same file system and the method is called by same user. Because of the behavior, when the directory for event log is on HDFS, fs of ApplicationMaster and fileSystem of FileLogger is same instance. When shutting down ApplicationMaster, fileSystem.close is called in FileLogger#stop, which is invoked by SparkContext#stop indirectly. {code:title=FileLogger.stop} def stop() { hadoopDataStream.foreach(_.close()) writer.foreach(_.close()) fileSystem.close() } {code} And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In this method, fs.delete(stagingDirPath) is invoked. Because fs.delete in ApplicationMaster is called after fileSystem.close in FileLogger, fs.delete fails and results not deleting files in the staging directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
[ https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2339. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery -- Key: SPARK-2339 URL: https://issues.apache.org/jira/browse/SPARK-2339 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.1.0, 1.0.2 Reported by http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html After we get the table from the catalog, because the table has an alias, we will temporarily insert a Subquery. Then, we convert the table alias to lower case no matter if the parser is case sensitive or not. To see the issue ... {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Person(name: String, age: Int) val people = sc.textFile(examples/src/main/resources/people.txt).map(_.split(,)).map(p = Person(p(0), p(1).trim.toInt)) people.registerAsTable(people) sqlContext.sql(select PEOPLE.name from people PEOPLE) {code} The plan is ... {code} == Query Plan == Project ['PEOPLE.name] ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:176 {code} You can find that PEOPLE.name is not resolved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2386) RowWriteSupport should use the exact types to cast.
[ https://issues.apache.org/jira/browse/SPARK-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2386. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 Assignee: Takuya Ueshin RowWriteSupport should use the exact types to cast. --- Key: SPARK-2386 URL: https://issues.apache.org/jira/browse/SPARK-2386 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0, 1.0.2 When execute {{saveAsParquetFile}} with non-primitive type, {{RowWriteSupport}} uses wrong type {{Int}} for {{ByteType}} and {{ShortType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2375) JSON schema inference may not resolve type conflicts correctly for a field inside an array of structs
[ https://issues.apache.org/jira/browse/SPARK-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2375. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 JSON schema inference may not resolve type conflicts correctly for a field inside an array of structs - Key: SPARK-2375 URL: https://issues.apache.org/jira/browse/SPARK-2375 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1 Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.1.0, 1.0.2 For example, for {code} {array: [{field:214748364700}, {field:1}]} {code} the type of field is resolved as IntType. While, for {code} {array: [{field:1}, {field:214748364700}]} {code} the type of field is resolved as LongType. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
[ https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054333#comment-14054333 ] Kousuke Saruta commented on SPARK-2390: --- Pull Requested at https://github.com/apache/spark/pull/1317 Files in staging directory cannot be deleted and wastes the space of HDFS - Key: SPARK-2390 URL: https://issues.apache.org/jira/browse/SPARK-2390 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Kousuke Saruta When running jobs with YARN Cluster mode and using HistoryServer, the files in the Staging Directory cannot be deleted. HistoryServer uses directory where event log is written, and the directory is represented as a instance of o.a.h.f.FileSystem created by using FileSystem.get. {code:title=FileLogger.scala} private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) {code} {code:title=utils.getHadoopFileSystem} def getHadoopFileSystem(path: URI): FileSystem = { FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) } {code} On the other hand, ApplicationMaster has a instance named fs, which also created by using FileSystem.get. {code:title=ApplicationMaster} private val fs = FileSystem.get(yarnConf) {code} FileSystem.get returns cached same instance when URI passed to the method represents same file system and the method is called by same user. Because of the behavior, when the directory for event log is on HDFS, fs of ApplicationMaster and fileSystem of FileLogger is same instance. When shutting down ApplicationMaster, fileSystem.close is called in FileLogger#stop, which is invoked by SparkContext#stop indirectly. {code:title=FileLogger.stop} def stop() { hadoopDataStream.foreach(_.close()) writer.foreach(_.close()) fileSystem.close() } {code} And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In this method, fs.delete(stagingDirPath) is invoked. Because fs.delete in ApplicationMaster is called after fileSystem.close in FileLogger, fs.delete fails and results not deleting files in the staging directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
[ https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054333#comment-14054333 ] Kousuke Saruta edited comment on SPARK-2390 at 7/8/14 12:41 AM: Pull Requested at https://github.com/apache/spark/pull/1326 was (Author: sarutak): Pull Requested at https://github.com/apache/spark/pull/1317 Files in staging directory cannot be deleted and wastes the space of HDFS - Key: SPARK-2390 URL: https://issues.apache.org/jira/browse/SPARK-2390 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Kousuke Saruta When running jobs with YARN Cluster mode and using HistoryServer, the files in the Staging Directory cannot be deleted. HistoryServer uses directory where event log is written, and the directory is represented as a instance of o.a.h.f.FileSystem created by using FileSystem.get. {code:title=FileLogger.scala} private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) {code} {code:title=utils.getHadoopFileSystem} def getHadoopFileSystem(path: URI): FileSystem = { FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) } {code} On the other hand, ApplicationMaster has a instance named fs, which also created by using FileSystem.get. {code:title=ApplicationMaster} private val fs = FileSystem.get(yarnConf) {code} FileSystem.get returns cached same instance when URI passed to the method represents same file system and the method is called by same user. Because of the behavior, when the directory for event log is on HDFS, fs of ApplicationMaster and fileSystem of FileLogger is same instance. When shutting down ApplicationMaster, fileSystem.close is called in FileLogger#stop, which is invoked by SparkContext#stop indirectly. {code:title=FileLogger.stop} def stop() { hadoopDataStream.foreach(_.close()) writer.foreach(_.close()) fileSystem.close() } {code} And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In this method, fs.delete(stagingDirPath) is invoked. Because fs.delete in ApplicationMaster is called after fileSystem.close in FileLogger, fs.delete fails and results not deleting files in the staging directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054262#comment-14054262 ] Hossein Falaki edited comment on SPARK-2360 at 7/8/14 12:45 AM: As a point for comparison the interface in some other popular packages are: _R_: {code} read.csv(filePath, header = TRUE, sep = ,, quote = \, dec = ., fill = TRUE, comment.char = , ...) {code} Where: * header: a logical value indicating whether the file contains the names of the variables as its first line. * sep: the field separator character. * quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ’ * dec: the character used in the file for decimal points. * fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. _pandas_: {code} pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) {code} The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html was (Author: falaki): As a point for comparison the interface in some other popular packages are: _R_: {code} read.csv(filePath, header = TRUE, sep = ,, quote = \, dec = ., fill = TRUE, comment.char = , ...) {code} Where: header: a logical value indicating whether the file contains the names of the variables as its first line. sep: the field separator character. quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ’ dec: the character used in the file for decimal points. fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. _pandas_: {code} pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) {code} The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html CSV import to SchemaRDDs Key: SPARK-2360 URL: https://issues.apache.org/jira/browse/SPARK-2360 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Minor I think the first step it to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - how to handle multiple files with possibly different schemas - do we have a method to let users specify the datatypes of the columns or are they just strings? - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2397) Get rid of LocalHiveContext
Michael Armbrust created SPARK-2397: --- Summary: Get rid of LocalHiveContext Key: SPARK-2397 URL: https://issues.apache.org/jira/browse/SPARK-2397 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Priority: Blocker Fix For: 1.1.0 HiveLocalContext is nearly completely redundant with HiveContext. We should consider deprecating it and removing all uses. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2376) Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-2376. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException -- Key: SPARK-2376 URL: https://issues.apache.org/jira/browse/SPARK-2376 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1 Reporter: Nicholas Chammas Assignee: Yin Huai Fix For: 1.1.0, 1.0.2 Repro script for PySpark, deployed via {{spark-ec2}} at git revision {{9d5ecf8205b924dc8a3c13fed68beb78cc5c7553}}: {code} from pyspark.sql import SQLContext sqlContext = SQLContext(sc) raw = sc.parallelize([ { name: Nick, history: { countries: [] } } ]) profiles = sqlContext.jsonRDD(raw) profiles.registerAsTable(profiles) profiles.printSchema() sqlContext.sql(SELECT name FROM profiles;).collect() # works fine sqlContext.sql(SELECT history FROM profiles;).collect() # raises exception {code} Attempting to select the top-level struct that has a nested list value yields the following error: {code} 14/07/06 00:10:26 INFO scheduler.TaskSetManager: Loss was due to net.razorvine.pickle.PickleException: couldn't introspect javabean: java.lang.IllegalArgumentException: wrong number of arguments [duplicate 3] 14/07/06 00:10:26 ERROR scheduler.TaskSetManager: Task 26.0:15 failed 4 times; aborting job 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 26.0, whose tasks have all completed, from pool 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Cancelling stage 26 14/07/06 00:10:26 INFO scheduler.DAGScheduler: Failed to run collect at stdin:1 Traceback (most recent call last): File stdin, line 1, in module File /root/spark/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o286.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26.0:15 failed 4 times, most recent failure: Exception failure in TID 394 on host ip-10-183-59-125.ec2.internal: net.razorvine.pickle.PickleException: couldn't introspect javabean: java.lang.IllegalArgumentException: wrong number of arguments net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603) net.razorvine.pickle.Pickler.dispatch(Pickler.java:299) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.put_map(Pickler.java:322) net.razorvine.pickle.Pickler.dispatch(Pickler.java:286) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.put_map(Pickler.java:322) net.razorvine.pickle.Pickler.dispatch(Pickler.java:286) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392) net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.dump(Pickler.java:95) net.razorvine.pickle.Pickler.dumps(Pickler.java:80) org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385) org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:317) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1213) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1041) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1025) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1023) at
[jira] [Commented] (SPARK-2201) Improve FlumeInputDStream's stability and make it scalable
[ https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054431#comment-14054431 ] llai commented on SPARK-2201: - is it a good idea to add dependency of zookeeper? Improve FlumeInputDStream's stability and make it scalable -- Key: SPARK-2201 URL: https://issues.apache.org/jira/browse/SPARK-2201 Project: Spark Issue Type: Improvement Reporter: sunshangchun Currently: FlumeUtils.createStream(ssc, localhost, port); This means that only one flume receiver can work with FlumeInputDStream .so the solution is not scalable. I use a zookeeper to solve this problem. Spark flume receivers register themselves to a zk path when started, and a flume agent get physical hosts and push events to them. Some works need to be done here: 1.receiver create tmp node in zk, listeners just watch those tmp nodes. 2. when spark FlumeReceivers started, they acquire a physical host (localhost's ip and an idle port) and register itself to zookeeper. 3. A new flume sink. In the method of appendEvents, they get physical hosts and push data to them in a round-robin manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2398) Trouble running Spark 1.0 on Yarn
Nishkam Ravi created SPARK-2398: --- Summary: Trouble running Spark 1.0 on Yarn Key: SPARK-2398 URL: https://issues.apache.org/jira/browse/SPARK-2398 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Nishkam Ravi Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. For example: SparkPageRank when run in standalone mode goes through without any errors (tested for up to 30GB input dataset on a 6-node cluster). Also runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn cluster mode) as the input data size is increased. Confirmed for 16GB input dataset. The same workload runs fine with Spark 0.9 in both standalone and yarn cluster mode (for up to 30 GB input dataset on a 6-node cluster). Commandline used: (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn --deploy-mode cluster --properties-file pagerank.conf --driver-memory 30g --driver-cores 16 --num-executors 5 --class org.apache.spark.examples.SparkPageRank /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar pagerank_in $NUM_ITER) pagerank.conf: spark.masterspark://c1704.halxg.cloudera.com:7077 spark.home /opt/cloudera/parcels/CDH/lib/spark spark.executor.memory 32g spark.default.parallelism 118 spark.cores.max 96 spark.storage.memoryFraction0.6 spark.shuffle.memoryFraction0.3 spark.shuffle.compress true spark.shuffle.spill.compresstrue spark.broadcast.compresstrue spark.rdd.compress false spark.io.compression.codec org.apache.spark.io.LZFCompressionCodec spark.io.compression.snappy.block.size 32768 spark.reducer.maxMbInFlight 48 spark.local.dir /var/lib/jenkins/workspace/tmp spark.driver.memory 30g spark.executor.cores16 spark.locality.wait 6000 spark.executor.instances5 UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions: 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105) java.nio.channels.AsynchronousCloseException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496) at org.apache.spark.network.SendingConnection.write(Connection.scala:361) at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) --- 14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to a1016.halxg.cloudera.com/10.20.184.116:54105 java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at
[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054434#comment-14054434 ] Guoqiang Li commented on SPARK-2398: Seems to be related to [SPARK-1930|https://issues.apache.org/jira/browse/SPARK-1930]. Can you post the yarn node manager log? Trouble running Spark 1.0 on Yarn -- Key: SPARK-2398 URL: https://issues.apache.org/jira/browse/SPARK-2398 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Nishkam Ravi Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. For example: SparkPageRank when run in standalone mode goes through without any errors (tested for up to 30GB input dataset on a 6-node cluster). Also runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn cluster mode) as the input data size is increased. Confirmed for 16GB input dataset. The same workload runs fine with Spark 0.9 in both standalone and yarn cluster mode (for up to 30 GB input dataset on a 6-node cluster). Commandline used: (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn --deploy-mode cluster --properties-file pagerank.conf --driver-memory 30g --driver-cores 16 --num-executors 5 --class org.apache.spark.examples.SparkPageRank /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar pagerank_in $NUM_ITER) pagerank.conf: spark.masterspark://c1704.halxg.cloudera.com:7077 spark.home /opt/cloudera/parcels/CDH/lib/spark spark.executor.memory 32g spark.default.parallelism 118 spark.cores.max 96 spark.storage.memoryFraction0.6 spark.shuffle.memoryFraction0.3 spark.shuffle.compress true spark.shuffle.spill.compresstrue spark.broadcast.compresstrue spark.rdd.compress false spark.io.compression.codec org.apache.spark.io.LZFCompressionCodec spark.io.compression.snappy.block.size 32768 spark.reducer.maxMbInFlight 48 spark.local.dir /var/lib/jenkins/workspace/tmp spark.driver.memory 30g spark.executor.cores16 spark.locality.wait 6000 spark.executor.instances5 UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions: 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105) java.nio.channels.AsynchronousCloseException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496) at org.apache.spark.network.SendingConnection.write(Connection.scala:361) at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) --- 14/07/07
[jira] [Commented] (SPARK-2201) Improve FlumeInputDStream's stability and make it scalable
[ https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054442#comment-14054442 ] sunshangchun commented on SPARK-2201: - I don't like it's a problem. It's a external module and take no effect on spark core module. Again, spark core module has already used zookeeper to select leader master. Thanks Improve FlumeInputDStream's stability and make it scalable -- Key: SPARK-2201 URL: https://issues.apache.org/jira/browse/SPARK-2201 Project: Spark Issue Type: Improvement Reporter: sunshangchun Currently: FlumeUtils.createStream(ssc, localhost, port); This means that only one flume receiver can work with FlumeInputDStream .so the solution is not scalable. I use a zookeeper to solve this problem. Spark flume receivers register themselves to a zk path when started, and a flume agent get physical hosts and push events to them. Some works need to be done here: 1.receiver create tmp node in zk, listeners just watch those tmp nodes. 2. when spark FlumeReceivers started, they acquire a physical host (localhost's ip and an idle port) and register itself to zookeeper. 3. A new flume sink. In the method of appendEvents, they get physical hosts and push data to them in a round-robin manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2235) Spark SQL basicOperator add Intersect operator
[ https://issues.apache.org/jira/browse/SPARK-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2235. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: (was: Michael Armbrust) Spark SQL basicOperator add Intersect operator --- Key: SPARK-2235 URL: https://issues.apache.org/jira/browse/SPARK-2235 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yanjie Gao Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2181) The keys for sorting the columns of Executor page in SparkUI are incorrect
[ https://issues.apache.org/jira/browse/SPARK-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054455#comment-14054455 ] Shuo Xiang commented on SPARK-2181: --- Hi, I've merged the latest version but the column of Memory Usage inside the RDD storage Info tab is not correct. It is still sorted by the string not the value. The keys for sorting the columns of Executor page in SparkUI are incorrect -- Key: SPARK-2181 URL: https://issues.apache.org/jira/browse/SPARK-2181 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shuo Xiang Assignee: Guoqiang Li Priority: Minor Fix For: 1.1.0, 1.0.2 Under the Executor page of SparkUI, each column is sorted alphabetically (after clicking). However, it should be sorted by the value, not the string. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2303) Poisson regression model for count data
[ https://issues.apache.org/jira/browse/SPARK-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Bai closed SPARK-2303. --- Resolution: Fixed Poisson regression model for count data --- Key: SPARK-2303 URL: https://issues.apache.org/jira/browse/SPARK-2303 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Gang Bai Modeling count data is of great importance in solving real-world statistic problems. Currently mllib.regression provides models mostly for numeric data, i.e fitting curves with various regularization on resulted weights, but still lacks the support of count data modeling. A very basic model for this is the Poisson regression. Following the patterns in mllib and reusing the components, we address the parameter estimation for Poisson regression in a maximum likelihood manner. In detail, to add Poisson regression to mllib.regression, we need to: # Add the gradient of the negative log-likelihood into mllib/optimization/Gradients.scala. # Add the implementations of PoissonRegressionModel, which extends GeneralizedLinearModel with RegressionModel. Here we need the implementation of the predict method. # Add the implementations of the generalized linear algorithm class. Here we can use either LBFGS or GradientDescent as the optimizer. So we implement both as class PoissonRegressionWithSGD and class PoissonRegressionWithLBFGS respectively. # Add the companion object PoissonRegressionWithSGD and PoissonRegressionWithLBFGS as drivers. # Test suites ## Test the gradient computation. ## Test the regression method using generated data, which requires a PoissonRegressionDataGenerator. ## Test the regression method using a real-world data set. # Add the documents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2303) Poisson regression model for count data
[ https://issues.apache.org/jira/browse/SPARK-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054470#comment-14054470 ] Gang Bai commented on SPARK-2303: - This change has been merged into another JIRA SPARK-2311. Closing this one. Poisson regression model for count data --- Key: SPARK-2303 URL: https://issues.apache.org/jira/browse/SPARK-2303 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Gang Bai Modeling count data is of great importance in solving real-world statistic problems. Currently mllib.regression provides models mostly for numeric data, i.e fitting curves with various regularization on resulted weights, but still lacks the support of count data modeling. A very basic model for this is the Poisson regression. Following the patterns in mllib and reusing the components, we address the parameter estimation for Poisson regression in a maximum likelihood manner. In detail, to add Poisson regression to mllib.regression, we need to: # Add the gradient of the negative log-likelihood into mllib/optimization/Gradients.scala. # Add the implementations of PoissonRegressionModel, which extends GeneralizedLinearModel with RegressionModel. Here we need the implementation of the predict method. # Add the implementations of the generalized linear algorithm class. Here we can use either LBFGS or GradientDescent as the optimizer. So we implement both as class PoissonRegressionWithSGD and class PoissonRegressionWithLBFGS respectively. # Add the companion object PoissonRegressionWithSGD and PoissonRegressionWithLBFGS as drivers. # Test suites ## Test the gradient computation. ## Test the regression method using generated data, which requires a PoissonRegressionDataGenerator. ## Test the regression method using a real-world data set. # Add the documents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2215) Multi-way join
[ https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054491#comment-14054491 ] Yin Huai commented on SPARK-2215: - Instead of implementing a multi-way join operator, how about we just rely on AddExchange to properly insert Exchange Operator to avoid unnecessary data movements? Multi-way join -- Key: SPARK-2215 URL: https://issues.apache.org/jira/browse/SPARK-2215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Priority: Minor Support the multi-way join (multiple table joins) in a single reduce stage if they have the same join keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2399) Add support for LZ4 compression
[ https://issues.apache.org/jira/browse/SPARK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Bowyer updated SPARK-2399: --- Attachment: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch Add support for LZ4 compression --- Key: SPARK-2399 URL: https://issues.apache.org/jira/browse/SPARK-2399 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Greg Bowyer Labels: compression, lz4 Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch LZ4 is a compression codec of the same ideas as googles snappy, but has some advantages: * It is faster than snappy with a similar compression ration * The implementation is Apache licensed and not GPL It has shown promise in both the lucene and hadoop communities, and it looks like its a really easy add to spark io compression. Attached is a patch that does this -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2399) Add support for LZ4 compression
[ https://issues.apache.org/jira/browse/SPARK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Bowyer updated SPARK-2399: --- Description: LZ4 is a compression codec of the same ideas as googles snappy, but has some advantages: * It is faster than snappy with a similar compression ratio * The implementation is Apache licensed and not GPL It has shown promise in both the lucene and hadoop communities, and it looks like its a really easy add to spark io compression. Attached is a patch that does this was: LZ4 is a compression codec of the same ideas as googles snappy, but has some advantages: * It is faster than snappy with a similar compression ration * The implementation is Apache licensed and not GPL It has shown promise in both the lucene and hadoop communities, and it looks like its a really easy add to spark io compression. Attached is a patch that does this Add support for LZ4 compression --- Key: SPARK-2399 URL: https://issues.apache.org/jira/browse/SPARK-2399 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Greg Bowyer Labels: compression, lz4 Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch LZ4 is a compression codec of the same ideas as googles snappy, but has some advantages: * It is faster than snappy with a similar compression ratio * The implementation is Apache licensed and not GPL It has shown promise in both the lucene and hadoop communities, and it looks like its a really easy add to spark io compression. Attached is a patch that does this -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2400) config spark.yarn.max.executor.failures is not explained accurately
Chen Chao created SPARK-2400: Summary: config spark.yarn.max.executor.failures is not explained accurately Key: SPARK-2400 URL: https://issues.apache.org/jira/browse/SPARK-2400 Project: Spark Issue Type: Bug Components: YARN Reporter: Chen Chao Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2400) config spark.yarn.max.executor.failures is not explained accurately
[ https://issues.apache.org/jira/browse/SPARK-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14054540#comment-14054540 ] Chen Chao commented on SPARK-2400: -- it should be numExecutors * 2, with minimum of 3 rather than 2*numExecutors. config spark.yarn.max.executor.failures is not explained accurately --- Key: SPARK-2400 URL: https://issues.apache.org/jira/browse/SPARK-2400 Project: Spark Issue Type: Bug Components: YARN Reporter: Chen Chao Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)