[jira] [Commented] (SPARK-5212) Add support of schema-less transformation
[ https://issues.apache.org/jira/browse/SPARK-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274845#comment-14274845 ] Apache Spark commented on SPARK-5212: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4014 > Add support of schema-less transformation > - > > Key: SPARK-5212 > URL: https://issues.apache.org/jira/browse/SPARK-5212 > Project: Spark > Issue Type: Improvement >Reporter: Liang-Chi Hsieh > > According to Hive's language manual, the AS clause should be optional in > transform > (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform) > syntax. This pr adds the support for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5212) Add support of schema-less transformation
Liang-Chi Hsieh created SPARK-5212: -- Summary: Add support of schema-less transformation Key: SPARK-5212 URL: https://issues.apache.org/jira/browse/SPARK-5212 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh According to Hive's language manual, the AS clause should be optional in transform (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform) syntax. This pr adds the support for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
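For context, the Hive manual linked above describes what a TRANSFORM without an AS clause produces: each output line of the script is split at the first tab into a key column and a value column. The following is only a minimal Python sketch of that splitting rule, not the code in the pull request; the function name is hypothetical, and the handling of a line with no tab is an assumption.

```python
def split_schemaless_row(line):
    """Split one line of script output into (key, value) at the first tab."""
    key, sep, value = line.rstrip("\n").partition("\t")
    # Assumption: a line without any tab becomes the key, with no value.
    return (key, value if sep else None)


# Everything after the first tab stays in the value, including later tabs.
row = split_schemaless_row("a\tb\tc\n")
```

With an explicit AS clause the column list comes from the query instead, so this rule only matters in the schema-less case.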
[jira] [Updated] (SPARK-5212) Add support of schema-less transformation
[ https://issues.apache.org/jira/browse/SPARK-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-5212: --- Issue Type: Improvement (was: Bug)
[jira] [Commented] (SPARK-5207) StandardScalerModel mean and variance re-use
[ https://issues.apache.org/jira/browse/SPARK-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274802#comment-14274802 ] DB Tsai commented on SPARK-5207: [~mengxr]'s idea sounds great for me. Specifically, let's have mean and variance as required variables in the constructor, and have withMean = false, and withStd = true as default variables. Add another two methods to change withMean and withStd. Thanks. > StandardScalerModel mean and variance re-use > > > Key: SPARK-5207 > URL: https://issues.apache.org/jira/browse/SPARK-5207 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Octavian Geagla >Assignee: Octavian Geagla > > From this discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html > Changing constructor to public would be a simple change, but a discussion is > needed to determine what args necessary for this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
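The constructor shape proposed in the comment above could look roughly like the following. This is a plain-Python sketch of the idea, not MLlib's API: the class body, the setter names, and the choice to leave zero-variance features unscaled are all assumptions made for illustration.

```python
import math


class StandardScalerModel:
    """Hypothetical model: mean and variance are required, flags have defaults."""

    def __init__(self, mean, variance, with_mean=False, with_std=True):
        self.mean = list(mean)
        self.variance = list(variance)
        self.with_mean = with_mean
        self.with_std = with_std

    def set_with_mean(self, flag):
        self.with_mean = flag
        return self  # returning self allows chained calls

    def set_with_std(self, flag):
        self.with_std = flag
        return self

    def transform(self, vector):
        out = []
        for x, m, v in zip(vector, self.mean, self.variance):
            if self.with_mean:
                x -= m
            # Assumption: features with zero variance are left unscaled.
            if self.with_std and v > 0:
                x /= math.sqrt(v)
            out.append(x)
        return out


# Re-use statistics computed elsewhere instead of fitting again.
model = StandardScalerModel(mean=[1.0, 2.0], variance=[4.0, 9.0])
scaled = model.transform([3.0, 8.0])
centered = model.set_with_mean(True).transform([3.0, 8.0])
```

Making the two statistics required keeps the model usable without a fitting step, which is the re-use case this issue asks for.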
[jira] [Resolved] (SPARK-5138) pyspark unable to infer schema of namedtuple
[ https://issues.apache.org/jira/browse/SPARK-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-5138.
Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3978
[https://github.com/apache/spark/pull/3978]

> pyspark unable to infer schema of namedtuple
>
> Key: SPARK-5138
> URL: https://issues.apache.org/jira/browse/SPARK-5138
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.2.0
> Reporter: Gabe Mulley
> Priority: Trivial
> Fix For: 1.3.0
>
> When attempting to infer the schema of an RDD that contains namedtuples,
> pyspark fails to identify the records as namedtuples, resulting in it raising
> an error.
> Example:
> {noformat}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from collections import namedtuple
> import os
> sc = SparkContext()
> rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))
> TextLine = namedtuple('TextLine', 'line length')
> tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
> tuple_rdd.take(5) # This works
> sqlc = SQLContext(sc)
> # The following line raises an error
> schema_rdd = sqlc.inferSchema(tuple_rdd)
> {noformat}
> The error raised is:
> {noformat}
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
>     process()
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
>     vs = list(itertools.islice(iterator, batch))
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in takeUpToNumLeft
>     yield next(iterator)
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in convert_struct
>     raise ValueError("unexpected tuple: %s" % obj)
> TypeError: not all arguments converted during string formatting
> {noformat}
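The traceback in SPARK-5138 ends in a TypeError even though the failing line tries to raise a ValueError. That follows from how Python's % formatting treats a tuple on its right-hand side, and a namedtuple is a tuple. The snippet below is a standalone illustration with made-up data; it is not the fix applied in pull request 3978.

```python
from collections import namedtuple

TextLine = namedtuple("TextLine", "line length")
obj = TextLine(line="# Apache Spark", length=14)

# A tuple on the right of % is unpacked as the argument list, so two fields
# are matched against a single %s and the formatting itself fails before
# any ValueError can be constructed.
try:
    message = "unexpected tuple: %s" % obj
except TypeError as exc:
    message = "TypeError: %s" % exc

# Wrapping the value in a one-element tuple formats the record as a whole.
safe_message = "unexpected tuple: %s" % (obj,)
```

This explains the misleading error type only; the schema inference problem itself is the separate issue the report describes.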
[jira] [Resolved] (SPARK-4999) No need to put WAL-backed block into block manager by default
[ https://issues.apache.org/jira/browse/SPARK-4999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-4999. -- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 > No need to put WAL-backed block into block manager by default > - > > Key: SPARK-4999 > URL: https://issues.apache.org/jira/browse/SPARK-4999 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.0 >Reporter: Saisai Shao > Fix For: 1.3.0, 1.2.1 > > > Currently, a WAL-backed block is read out from HDFS and put into BlockManager > with storage level MEMORY_ONLY_SER by default. Since a WAL-backed block is > already fault-tolerant, there is no need to put it into BlockManager again by default.
[jira] [Commented] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
[ https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274685#comment-14274685 ] Saisai Shao commented on SPARK-5147: I'm working on this, the major part of work is done besides a small bug, I will figure out the problem and submit a PR. > write ahead logs from streaming receiver are not purged because > cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called > -- > > Key: SPARK-5147 > URL: https://issues.apache.org/jira/browse/SPARK-5147 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.0 >Reporter: Max Xu >Priority: Blocker > > Hi all, > We are running a Spark streaming application with ReliableKafkaReceiver. We > have "spark.streaming.receiver.writeAheadLog.enable" set to true so write > ahead logs (WALs) for received data are created under receivedData/streamId > folder in the checkpoint directory. > However, old WALs are never purged by time. receivedBlockMetadata and > checkpoint files are purged correctly though. I went through the code, > WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is > responsible for cleaning up the old blocks. It has method cleanupOldBlocks, > which is never called by any class. ReceiverSupervisorImpl class holds a > WriteAheadLogBasedBlockHandler instance. However, it only calls storeBlock > method to create WALs but never calls cleanupOldBlocks method to purge old > WALs. > The size of the WAL folder increases constantly on HDFS. This is preventing > us from running the ReliableKafkaReceiver 24x7. Can somebody please take a > look. > Thanks, > Max -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
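The report above comes down to a store path that is exercised and a cleanup path that is never invoked. The sketch below models that with in-memory stand-ins; the class names, the retention parameter, and the choice to clean up on every stored block are assumptions for illustration, not the design of the eventual Spark fix.

```python
class WriteAheadLogHandler:
    """In-memory stand-in for a WAL-backed block handler (hypothetical)."""

    def __init__(self):
        self.segments = {}  # write time in seconds -> list of block ids

    def store_block(self, now, block_id):
        self.segments.setdefault(now, []).append(block_id)

    def cleanup_old_blocks(self, threshold_time):
        """Drop every segment written strictly before threshold_time."""
        for written_at in [t for t in self.segments if t < threshold_time]:
            del self.segments[written_at]


class Supervisor:
    """Hypothetical owner of the handler that also triggers cleanup."""

    def __init__(self, handler, retention):
        self.handler = handler
        self.retention = retention

    def on_block(self, now, block_id):
        self.handler.store_block(now, block_id)
        # Without this call the segments dictionary grows without bound,
        # which is the behaviour described in the report.
        self.handler.cleanup_old_blocks(now - self.retention)


handler = WriteAheadLogHandler()
supervisor = Supervisor(handler, retention=10)
for t in range(0, 35, 5):
    supervisor.on_block(t, "block-%d" % t)
remaining = sorted(handler.segments)
```

Only segments inside the retention window survive, so storage stays bounded however long the receiver runs.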
[jira] [Updated] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
[ https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5147: - Target Version/s: 1.3.0, 1.2.1
[jira] [Commented] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
[ https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274683#comment-14274683 ] Tathagata Das commented on SPARK-5147: -- I think this is a critical bug. This should be fixed ASAP. Can you come up with a fix?
[jira] [Updated] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
[ https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5147: - Priority: Blocker (was: Major)
[jira] [Commented] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274681#comment-14274681 ] Tathagata Das commented on SPARK-5206: -- Interesting observation! Can this be solved just by explicitly referencing the Accumulator object in the beginning of your program? If that works, then we can add this reference to Accumulator in the StreamingContext object to make sure it is automatically called. > Accumulators are not re-registered during recovering from checkpoint > > > Key: SPARK-5206 > URL: https://issues.apache.org/jira/browse/SPARK-5206 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0 >Reporter: vincent ye > > I got exception as following while my streaming application restarts from > crash from checkpoit: > 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR > scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, > 4) > java.util.NoSuchElementException: key not found: 1 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:58) > at scala.collection.mutable.HashMap.apply(HashMap.scala:64) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > I guess that an Accumulator is registered to a singleton Accumulators in Line > 58 of org.apache.spark.Accumulable: > Accumulators.register(this, true) > This code need to be executed in the driver once. But when the application is > recovered from checkpoint. It won't be executed in the driver. So when the > driver process it at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938), > It can't find the Accumulator because it's not re-register during the > recovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail
[ https://issues.apache.org/jira/browse/SPARK-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar resolved SPARK-5164. - Resolution: Duplicate Duplicates and has similar findings to SPARK-1825. > YARN | Spark job submits from windows machine to a linux YARN cluster fail > -- > > Key: SPARK-5164 > URL: https://issues.apache.org/jira/browse/SPARK-5164 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 > Environment: Spark submit from Windows 7 > YARN cluster on CentOS 6.5 >Reporter: Aniket Bhatnagar > > While submitting spark jobs from a windows machine to a linux YARN cluster, > the jobs fail because of the following reasons: > 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, > etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead > of linux's syntax ($JAVA_HOME, $PWD, etc). > 2. Paths in launch environment are delimited by semi-colon instead of colon. > This is because of usage of File.pathSeparator in YarnSparkHadoopUtil. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
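Both failure causes listed above come from rendering paths and variables in the submitting machine's syntax rather than the cluster's. Python's standard library exposes both flavours side by side, so the difference can be shown on any host. The helper names below are hypothetical and this is not how YarnSparkHadoopUtil is implemented.

```python
import ntpath
import posixpath


def classpath_for(entries, target):
    """Join classpath entries with the separator of the target platform module."""
    return target.pathsep.join(entries)


def env_reference(name, target):
    """Render an environment variable reference in the target platform's syntax."""
    return "%" + name + "%" if target is ntpath else "$" + name


entries = ["app.jar", "lib/*"]
windows_style = classpath_for(entries, ntpath)     # what a Windows client emits
linux_style = classpath_for(entries, posixpath)    # what a Linux container expects
java_home = env_reference("JAVA_HOME", posixpath)
```

The point of passing the target explicitly is that the separator is chosen by where the command will run, not by where it was built.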
[jira] [Updated] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4924: - Target Version/s: 1.3.0 > Factor out code to launch Spark applications into a separate library > > > Key: SPARK-4924 > URL: https://issues.apache.org/jira/browse/SPARK-4924 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Attachments: spark-launcher.txt > > > One of the questions we run into rather commonly is "how to start a Spark > application from my Java/Scala program?". There currently isn't a good answer > to that: > - Instantiating SparkContext has limitations (e.g., you can only have one > active context at the moment, plus you lose the ability to submit apps in > cluster mode) > - Calling SparkSubmit directly is doable but you lose a lot of the logic > handled by the shell scripts > - Calling the shell script directly is doable, but sort of ugly from an API > point of view. > I think it would be nice to have a small library that handles that for users. > On top of that, this library could be used by Spark itself to replace a lot > of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4859) Refactor LiveListenerBus and StreamingListenerBus
[ https://issues.apache.org/jira/browse/SPARK-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-4859:
Description: [#4006|https://github.com/apache/spark/pull/4006] refactors LiveListenerBus and StreamingListenerBus and extracts the common code into a parent class, ListenerBus. It also includes the bug fixes from [#3710|https://github.com/apache/spark/pull/3710]:
1. Fix the race condition on queueFullErrorMessageLogged in LiveListenerBus and StreamingListenerBus to avoid outputting the queue-full error log multiple times.
2. Make sure the SHUTDOWN message is delivered to listenerThread, so that listenerThread is always able to exit.
3. Log the error from a listener rather than crashing listenerThread in StreamingListenerBus.
While fixing the above bugs, we found it better to make LiveListenerBus and StreamingListenerBus behave the same way, which would leave a lot of duplicated code in the two classes. Therefore, I extracted their common code into ListenerBus as a parent class: LiveListenerBus and StreamingListenerBus only need to extend ListenerBus and implement onPostEvent (how to process an event) and onDropEvent (what to do when dropping an event).
was: Fix the race condition of `queueFullErrorMessageLogged`. Log the error from listener rather than crashing `listenerThread`.
Summary: Refactor LiveListenerBus and StreamingListenerBus (was: Improve StreamingListenerBus)
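The structure described in this update, a parent class that owns the queue and thread while subclasses supply two hooks, can be sketched in Python as follows. This is an analogue written for illustration, not a translation of the Scala in the pull requests: the sentinel object, the lock-guarded flag, and the errors list are assumptions about one reasonable way to get the three listed behaviours.

```python
import queue
import threading

_SHUTDOWN = object()  # hypothetical sentinel standing in for the SHUTDOWN message


class ListenerBus:
    """Parent class: owns the bounded queue and the listener thread."""

    def __init__(self, capacity):
        self._events = queue.Queue(maxsize=capacity)
        self._listeners = []
        self.errors = []
        self._thread = threading.Thread(target=self._run, daemon=True)

    def add_listener(self, listener):
        self._listeners.append(listener)

    def start(self):
        self._thread.start()

    def post(self, event):
        try:
            self._events.put_nowait(event)
        except queue.Full:
            self.on_drop_event(event)

    def stop(self):
        # Blocking put: unlike post(), the shutdown marker is never dropped,
        # so the listener thread is always able to exit.
        self._events.put(_SHUTDOWN)
        self._thread.join()

    def _run(self):
        while True:
            event = self._events.get()
            if event is _SHUTDOWN:
                return
            for listener in self._listeners:
                try:
                    self.on_post_event(listener, event)
                except Exception as exc:
                    # Record the failure instead of letting the thread die.
                    self.errors.append(exc)

    def on_post_event(self, listener, event):
        raise NotImplementedError

    def on_drop_event(self, event):
        raise NotImplementedError


class LoggingBus(ListenerBus):
    """Subclass: only the two hooks are implemented here."""

    def __init__(self, capacity):
        super().__init__(capacity)
        self._lock = threading.Lock()
        self._queue_full_logged = False
        self.messages = []

    def on_post_event(self, listener, event):
        listener(event)

    def on_drop_event(self, event):
        # Check and set the flag under one lock so only one caller logs.
        with self._lock:
            if self._queue_full_logged:
                return
            self._queue_full_logged = True
        self.messages.append("queue full, dropping events")


def boom(event):
    raise ValueError(event)


bus = LoggingBus(capacity=2)
received = []
bus.add_listener(received.append)
bus.add_listener(boom)
for i in range(5):
    bus.post(i)  # the queue holds two events, the other three are dropped
bus.start()
bus.stop()
```

Keeping the queue, the thread, and the error handling in the parent is what lets both buses share one behaviour while differing only in the hooks.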
[jira] [Closed] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets
[ https://issues.apache.org/jira/browse/SPARK-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-5056. Resolution: Won't Fix > Implementing Clara k-medoids clustering algorithm for large datasets > > > Key: SPARK-5056 > URL: https://issues.apache.org/jira/browse/SPARK-5056 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tomislav Milinovic >Priority: Minor > Labels: features > > There is a specific k-medoids clustering algorithm for large datasets. The > algorithm is called Clara in R, and is fully described in chapter 3 of > Finding Groups in Data: An Introduction to Cluster Analysis, by Kaufman, L. > and Rousseeuw, P.J. (1990). > The algorithm considers sub-datasets of fixed size (sampsize) such that the > time and storage requirements become linear in n rather than quadratic. Each > sub-dataset is partitioned into k clusters using the same algorithm as in > Partitioning Around Medoids (PAM).
[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets
[ https://issues.apache.org/jira/browse/SPARK-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274568#comment-14274568 ] Xiangrui Meng commented on SPARK-5056: -- This is along the same lines as our discussion in SPARK-4510. If we choose a sample, is there any theoretical guarantee on the convergence? If we have 1 billion instances, what sample size would be proper? The original paper https://lirias.kuleuven.be/handle/123456789/426399, if I found the correct one, hasn't received many citations. In general, I think this algorithm is out of MLlib's scope. If someone is interested in implementing this algorithm, it would be best maintained outside Spark as a 3rd-party package. I'm going to mark it as "Won't Fix", but feel free to reopen it if there are things I missed.
[jira] [Created] (SPARK-5211) Restore HiveMetastoreTypes.toDataType
Yin Huai created SPARK-5211: --- Summary: Restore HiveMetastoreTypes.toDataType Key: SPARK-5211 URL: https://issues.apache.org/jira/browse/SPARK-5211 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai It was a public API. Since developers are using it, we need to get it back. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5211) Restore HiveMetastoreTypes.toDataType
[ https://issues.apache.org/jira/browse/SPARK-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5211: Priority: Critical (was: Major)
[jira] [Created] (SPARK-5210) Support log rolling in EventLogger
Josh Rosen created SPARK-5210: - Summary: Support log rolling in EventLogger Key: SPARK-5210 URL: https://issues.apache.org/jira/browse/SPARK-5210 Project: Spark Issue Type: New Feature Components: Spark Core, Web UI Reporter: Josh Rosen Assignee: Josh Rosen For long-running Spark applications (e.g. running for days / weeks), the Spark event log may grow to be very large. As a result, it would be useful if EventLoggingListener supported log file rolling / rotation. Adding this feature will involve changes to the HistoryServer in order to be able to load event logs from a sequence of files instead of a single file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274535#comment-14274535 ] Nicholas Chammas commented on SPARK-3821: - That's correct. All those paths are just relative to the folder containing {{spark-packer.json}}. > Develop an automated way of creating Spark images (AMI, Docker, and others) > --- > > Key: SPARK-3821 > URL: https://issues.apache.org/jira/browse/SPARK-3821 > Project: Spark > Issue Type: Improvement > Components: Build, EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas > Attachments: packer-proposal.html > > > Right now the creation of Spark AMIs or Docker containers is done manually. > With tools like [Packer|http://www.packer.io/], we should be able to automate > this work, and do so in such a way that multiple types of machine images can > be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5209) Jobs fail with "unexpected value" exception in certain environments
[ https://issues.apache.org/jira/browse/SPARK-5209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sven Krasser updated SPARK-5209: Attachment: spark-defaults.conf repro.py gen_test_data.py exec_log.txt driver_log.txt > Jobs fail with "unexpected value" exception in certain environments > --- > > Key: SPARK-5209 > URL: https://issues.apache.org/jira/browse/SPARK-5209 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 > Environment: Amazon Elastic Map Reduce >Reporter: Sven Krasser > Attachments: driver_log.txt, exec_log.txt, gen_test_data.py, > repro.py, spark-defaults.conf > > > Jobs fail consistently and reproducibly with exceptions of the following type > in PySpark using Spark 1.2.0: > {noformat} > 2015-01-13 00:14:05,898 ERROR [Executor task launch worker-1] > executor.Executor (Logging.scala:logError(96)) - Exception in task 27.0 in > stage 0.0 (TID 28) > org.apache.spark.SparkException: PairwiseRDD: unexpected value: > List([B@4c09f3e0) > {noformat} > The issue appeared the first time in Spark 1.2.0 and is sensitive to the > environment (configuration, cluster size), i.e. some changes to the > environment will cause the error to not occur. > The following steps yield a reproduction on Amazon Elastic Map Reduce. Launch > an EMR cluster with the following parameters (this will bootstrap Spark 1.2.0 > onto it): > {code} > aws emr create-cluster --region us-west-1 --no-auto-terminate \ >--ec2-attributes KeyName=your-key-here,SubnetId=your-subnet-here \ >--bootstrap-actions > Path=s3://support.elasticmapreduce/spark/install-spark,Args='["-g","-v","1.2.0.a"]' > \ >--ami-version 3.3 --instance-groups > InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \ >InstanceGroupType=CORE,InstanceCount=3,InstanceType=r3.xlarge --name > "Spark Issue Repro" \ >--visible-to-all-users --applications Name=Ganglia > {code} > Next, copy the attached {{spark-defaults.conf}} to {{~/spark/conf/}}. 
> Run {{~/spark/bin/spark-submit gen_test_data.py}} to generate a test data set > on HDFS. Then lastly run {{~/spark/bin/spark-submit repro.py}} to reproduce > the error. > Driver and executor logs are attached. For reference, a spark-user thread on > the topic is here: > http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3cc5a80834-8f1c-4c0a-89f9-e04d3f1c4...@gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4959) Attributes are case sensitive when using a select query from a projection
[ https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274524#comment-14274524 ] Apache Spark commented on SPARK-4959: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/4013 > Attributes are case sensitive when using a select query from a projection > - > > Key: SPARK-4959 > URL: https://issues.apache.org/jira/browse/SPARK-4959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Andy Konwinski >Priority: Critical > > Per [~marmbrus], see this line of code, where we should be using an attribute > map > > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147 > To reproduce, i ran the following in the Spark shell: > {code} > import sqlContext._ > sql("drop table if exists test") > sql("create table test (col1 string)") > sql("""insert into table test select "hi" from prejoined limit 1""") > val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) :: > "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil > sqlContext.table("test").select(projection:_*).registerTempTable("test2") > # This succeeds. 
> sql("select CaseSensitiveColName from test2").first() > # This fails with java.util.NoSuchElementException: key not found: > casesensitivecolname#23046 > sql("select casesensitivecolname from test2").first() > {code} > The full stack trace printed for the final command that is failing: > {code} > java.util.NoSuchElementException: key not found: casesensitivecolname#23046 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at > org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at > org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:57) > at > org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221) > at > org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221) > at > org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378) > at > org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) > at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) > at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446) > at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108) > at org.apache.spark.rdd.RDD.first(RDD.scala:1093) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --
[jira] [Created] (SPARK-5209) Jobs fail with "unexpected value" exception in certain environments
Sven Krasser created SPARK-5209: --- Summary: Jobs fail with "unexpected value" exception in certain environments Key: SPARK-5209 URL: https://issues.apache.org/jira/browse/SPARK-5209 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Amazon Elastic Map Reduce Reporter: Sven Krasser Jobs fail consistently and reproducibly with exceptions of the following type in PySpark using Spark 1.2.0: {noformat} 2015-01-13 00:14:05,898 ERROR [Executor task launch worker-1] executor.Executor (Logging.scala:logError(96)) - Exception in task 27.0 in stage 0.0 (TID 28) org.apache.spark.SparkException: PairwiseRDD: unexpected value: List([B@4c09f3e0) {noformat} The issue appeared the first time in Spark 1.2.0 and is sensitive to the environment (configuration, cluster size), i.e. some changes to the environment will cause the error to not occur. The following steps yield a reproduction on Amazon Elastic Map Reduce. Launch an EMR cluster with the following parameters (this will bootstrap Spark 1.2.0 onto it): {code} aws emr create-cluster --region us-west-1 --no-auto-terminate \ --ec2-attributes KeyName=your-key-here,SubnetId=your-subnet-here \ --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args='["-g","-v","1.2.0.a"]' \ --ami-version 3.3 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \ InstanceGroupType=CORE,InstanceCount=3,InstanceType=r3.xlarge --name "Spark Issue Repro" \ --visible-to-all-users --applications Name=Ganglia {code} Next, copy the attached {{spark-defaults.conf}} to {{~/spark/conf/}}. Run {{~/spark/bin/spark-submit gen_test_data.py}} to generate a test data set on HDFS. Then lastly run {{~/spark/bin/spark-submit repro.py}} to reproduce the error. Driver and executor logs are attached. 
For reference, a spark-user thread on the topic is here: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3cc5a80834-8f1c-4c0a-89f9-e04d3f1c4...@gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3433) Mima false-positives with @DeveloperAPI and @Experimental annotations
[ https://issues.apache.org/jira/browse/SPARK-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274507#comment-14274507 ] Apache Spark commented on SPARK-3433: - User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/2285 > Mima false-positives with @DeveloperAPI and @Experimental annotations > - > > Key: SPARK-3433 > URL: https://issues.apache.org/jira/browse/SPARK-3433 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.1.0, 1.2.0 >Reporter: Josh Rosen >Assignee: Prashant Sharma >Priority: Minor > Fix For: 1.2.0, 1.1.2 > > > In https://github.com/apache/spark/pull/2315, I found two cases where > {{@DeveloperAPI}} and {{@Experimental}} annotations didn't prevent > false-positive warnings from Mima. To reproduce this problem, run dev/mima > as of > https://github.com/JoshRosen/spark/commit/ec90e21947b615d4ef94a3a54cfd646924ccaf7c. > The spurious warnings are listed at the top of > https://gist.github.com/JoshRosen/5d8df835516dc367389d. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3433) Mima false-positives with @DeveloperAPI and @Experimental annotations
[ https://issues.apache.org/jira/browse/SPARK-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3433: -- Affects Version/s: 1.1.0 Fix Version/s: 1.1.2 I've backported this to {{branch-1.1}} in order to fix a MiMa false-positive in that branch. > Mima false-positives with @DeveloperAPI and @Experimental annotations > - > > Key: SPARK-3433 > URL: https://issues.apache.org/jira/browse/SPARK-3433 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.1.0, 1.2.0 >Reporter: Josh Rosen >Assignee: Prashant Sharma >Priority: Minor > Fix For: 1.2.0, 1.1.2 > > > In https://github.com/apache/spark/pull/2315, I found two cases where > {{@DeveloperAPI}} and {{@Experimental}} annotations didn't prevent > false-positive warnings from Mima. To reproduce this problem, run dev/mima > as of > https://github.com/JoshRosen/spark/commit/ec90e21947b615d4ef94a3a54cfd646924ccaf7c. > The spurious warnings are listed at the top of > https://gist.github.com/JoshRosen/5d8df835516dc367389d. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5208) Add more documentation to Netty-based configs
[ https://issues.apache.org/jira/browse/SPARK-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274489#comment-14274489 ] Apache Spark commented on SPARK-5208: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/4012 > Add more documentation to Netty-based configs > -- > > Key: SPARK-5208 > URL: https://issues.apache.org/jira/browse/SPARK-5208 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > SPARK-4864 added some documentation about Netty-based configs, but I think we > need more. I think the following configs can be useful for performance tuning: > * spark.shuffle.io.mode > * spark.shuffle.io.backLog > * spark.shuffle.io.receiveBuffer > * spark.shuffle.io.sendBuffer
[jira] [Created] (SPARK-5208) Add more documentation to Netty-based configs
Kousuke Saruta created SPARK-5208: - Summary: Add more documentation to Netty-based configs Key: SPARK-5208 URL: https://issues.apache.org/jira/browse/SPARK-5208 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.3.0 Reporter: Kousuke Saruta SPARK-4864 added some documentation about Netty-based configs, but I think we need more. I think the following configs can be useful for performance tuning: * spark.shuffle.io.mode * spark.shuffle.io.backLog * spark.shuffle.io.receiveBuffer * spark.shuffle.io.sendBuffer
[jira] [Updated] (SPARK-5208) Add more documentation to Netty-based configs
[ https://issues.apache.org/jira/browse/SPARK-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-5208: -- Issue Type: Improvement (was: Bug) > Add more documentation to Netty-based configs > -- > > Key: SPARK-5208 > URL: https://issues.apache.org/jira/browse/SPARK-5208 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > SPARK-4864 added some documentation about Netty-based configs, but I think we > need more. I think the following configs can be useful for performance tuning: > * spark.shuffle.io.mode > * spark.shuffle.io.backLog > * spark.shuffle.io.receiveBuffer > * spark.shuffle.io.sendBuffer
[jira] [Updated] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4924: - Affects Version/s: (was: 1.2.0) 1.0.0 > Factor out code to launch Spark applications into a separate library > > > Key: SPARK-4924 > URL: https://issues.apache.org/jira/browse/SPARK-4924 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Attachments: spark-launcher.txt > > > One of the questions we run into rather commonly is "how to start a Spark > application from my Java/Scala program?". There currently isn't a good answer > to that: > - Instantiating SparkContext has limitations (e.g., you can only have one > active context at the moment, plus you lose the ability to submit apps in > cluster mode) > - Calling SparkSubmit directly is doable but you lose a lot of the logic > handled by the shell scripts > - Calling the shell script directly is doable, but sort of ugly from an API > point of view. > I think it would be nice to have a small library that handles that for users. > On top of that, this library could be used by Spark itself to replace a lot > of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4924: - Affects Version/s: 1.2.0 > Factor out code to launch Spark applications into a separate library > > > Key: SPARK-4924 > URL: https://issues.apache.org/jira/browse/SPARK-4924 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Attachments: spark-launcher.txt > > > One of the questions we run into rather commonly is "how to start a Spark > application from my Java/Scala program?". There currently isn't a good answer > to that: > - Instantiating SparkContext has limitations (e.g., you can only have one > active context at the moment, plus you lose the ability to submit apps in > cluster mode) > - Calling SparkSubmit directly is doable but you lose a lot of the logic > handled by the shell scripts > - Calling the shell script directly is doable, but sort of ugly from an API > point of view. > I think it would be nice to have a small library that handles that for users. > On top of that, this library could be used by Spark itself to replace a lot > of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
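Editor's note on SPARK-4924: the third option in the ticket (calling the shell script directly) can be sketched roughly as below. This is only an illustration of why that route feels awkward as an API, not the proposed launcher library. The helper names build_spark_submit_command and launch are hypothetical; the only Spark-specific details assumed are the bin/spark-submit script and its --master, --deploy-mode and --conf options.

```python
import os
import subprocess


def build_spark_submit_command(spark_home, app, master=None, deploy_mode=None,
                               conf=None, app_args=()):
    """Assemble an argument list for bin/spark-submit (hypothetical helper)."""
    cmd = [os.path.join(spark_home, "bin", "spark-submit")]
    if master:
        cmd += ["--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    # Emit one --conf key=value pair per setting, in a stable (sorted) order.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", "{}={}".format(key, value)]
    cmd.append(app)          # the application file comes after all options
    cmd.extend(app_args)     # anything after it is passed to the application
    return cmd


def launch(cmd):
    """Start the application as a child process and return the Popen handle."""
    return subprocess.Popen(cmd)
```

The caller still has to track the child process, parse its output and handle exit codes by hand, which is presumably the kind of boilerplate a dedicated launcher library would absorb.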
[jira] [Commented] (SPARK-5053) Test maintenance branches on Jenkins using SBT
[ https://issues.apache.org/jira/browse/SPARK-5053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274472#comment-14274472 ] Josh Rosen commented on SPARK-5053: --- I fixed the {{branch-1.1}} PySpark issue in https://github.com/apache/spark/pull/4011 and now have to fix a MiMa issue. > Test maintenance branches on Jenkins using SBT > -- > > Key: SPARK-5053 > URL: https://issues.apache.org/jira/browse/SPARK-5053 > Project: Spark > Issue Type: New Feature > Components: Project Infra >Reporter: Josh Rosen >Priority: Blocker > > We need to create Jenkins jobs to test maintenance branches using SBT. The > current Maven jobs for backport branches do not run the same checks that the > pull request builder / SBT builds do (e.g. MiMa checks, PySpark, RAT, etc.) > which means that cherry-picking backports can silently break things and we'll > only discover it once PRs that are explicitly opened against those branches > fail tests; this long delay between introducing test failures and detecting > them is a huge productivity issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution
[ https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3910. --- Resolution: Fixed Target Version/s: 1.2.0, 1.1.2 (was: 1.2.0) I backported Davies' 1.2 fix to branch-1.1, so I think we can mark this issue as resolved: https://github.com/apache/spark/pull/4011 > ./python/pyspark/mllib/classification.py doctests fails with module name > pollution > -- > > Key: SPARK-3910 > URL: https://issues.apache.org/jira/browse/SPARK-3910 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 1.2.0 > Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, > Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, > argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, > pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, > unittest2==0.5.1, wsgiref==0.1.2 >Reporter: Tomohiko K. > Labels: pyspark, testing > > In ./python/run-tests script, we run the doctests in > ./pyspark/mllib/classification.py. > The output is as following: > {noformat} > $ ./python/run-tests > ... > Running test: pyspark/mllib/classification.py > Traceback (most recent call last): > File "pyspark/mllib/classification.py", line 20, in > import numpy > File > "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py", > line 170, in > from . 
import add_newdocs > File > "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py", > line 13, in > from numpy.lib import add_newdoc > File > "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py", > line 8, in > from .type_check import * > File > "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py", > line 11, in > import numpy.core.numeric as _nx > File > "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py", > line 46, in > from numpy.testing import Tester > File > "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py", > line 13, in > from .utils import * > File > "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py", > line 15, in > from tempfile import mkdtemp > File > "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py", > line 34, in > from random import Random as _Random > File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", > line 24, in > from pyspark.rdd import RDD > File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line > 51, in > from pyspark.context import SparkContext > File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line > 22, in > from tempfile import NamedTemporaryFile > ImportError: cannot import name NamedTemporaryFile > 0.07 real 0.04 user 0.02 sys > Had test failures; see logs. > {noformat} > The problem is a cyclic import of tempfile module. > The cause of it is that pyspark.mllib.random module exists in the directory > where pyspark.mllib.classification module exists. > classification module imports numpy module, and then numpy module imports > tempfile module from its inside. 
> Now the first entry of sys.path is the directory "./python/pyspark/mllib" (where > the executed file "classification.py" exists), so the tempfile module imports > the pyspark.mllib.random module (not the standard library "random" module). > Finally, the import chain reaches tempfile again, and a cyclic import is formed. > Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile > → (cyclic import!!) > Furthermore, the stat module is in the standard library, and a pyspark.mllib.stat > module also exists. This may also be troublesome. > commit: 0e8203f4fb721158fb27897680da476174d24c4b > A fundamental solution is to avoid using module names used by the standard > library (currently "random" and "stat"). > A difficulty of this solution is that it requires renaming pyspark.mllib.random and > pyspark.mllib.stat, which may already be in use.
[jira] [Updated] (SPARK-4348) pyspark.mllib.random conflicts with random module
[ https://issues.apache.org/jira/browse/SPARK-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4348: -- Fix Version/s: 1.1.0 I've also fixed this in 1.1.2 by backporting the 1.2 patch: https://github.com/apache/spark/pull/4011 > pyspark.mllib.random conflicts with random module > - > > Key: SPARK-4348 > URL: https://issues.apache.org/jira/browse/SPARK-4348 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.1.0, 1.2.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.1.0, 1.2.0 > > > There are conflicts in two cases: > 1. The random module is used by pyspark.mllib.feature; if the first part of > sys.path is not '', then the hack in pyspark/__init__.py will fail to fix the > conflict. > 2. When running tests in mllib/xxx.py, the '' should be popped out before importing > anything, or it will fail. > The first one is not fully fixed for users; it will introduce problems in some > cases, such as: > {code} > >>> import sys > >>> sys.path.insert(0, PATH_OF_MODULE) > >>> import pyspark > >>> # use Word2Vec will fail > {code} > I'd like to rename mllib/random.py as mllib/_random.py, then in > mllib/__init__.py > {code} > import pyspark.mllib._random as random > {code} > cc [~mengxr] [~dorx]
[jira] [Resolved] (SPARK-5049) ParquetTableScan always prepends the values of partition columns in output rows irrespective of the order of the partition columns in the original SELECT query
[ https://issues.apache.org/jira/browse/SPARK-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5049. - Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Issue resolved by pull request 3990 [https://github.com/apache/spark/pull/3990] > ParquetTableScan always prepends the values of partition columns in output > rows irrespective of the order of the partition columns in the original > SELECT query > --- > > Key: SPARK-5049 > URL: https://issues.apache.org/jira/browse/SPARK-5049 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0, 1.2.0 >Reporter: Rahul Aggarwal > Fix For: 1.3.0, 1.2.1 > > > This happens when ParquetTableScan is being used by turning on > spark.sql.hive.convertMetastoreParquet > For example: > spark-sql> set spark.sql.hive.convertMetastoreParquet=true; > spark-sql> create table table1(a int , b int) partitioned by (p1 string, p2 > int) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS > INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT > 'parquet.hive.DeprecatedParquetOutputFormat'; > spark-sql> insert into table table1 partition(p1='January',p2=1) select key, > 10 from src; > spark-sql> select a, b, p1, p2 from table1 limit 10; > January 1 484 10 > January 1 484 10 > January 1 484 10 > January 1 484 10 > January 1 484 10 > January 1 484 10 > January 1 484 10 > January 1 484 10 > January 1 484 10 > January 1 484 10 > The correct output should be > 484 10 January 1 > 484 10 January 1 > 484 10 January 1 > 484 10 January 1 > 484 10 January 1 > 484 10 January 1 > 484 10 January 1 > 484 10 January 1 > 484 10 January 1 > 484 10 January 1 > This also leads to schema mismatch if the query is run using HiveContext and > the result is a SchemaRDD. 
> For example : > scala> import org.apache.spark.sql.hive._ > scala> val hc = new HiveContext(sc) > scala> hc.setConf("spark.sql.hive.convertMetastoreParquet", "true") > scala> val res = hc.sql("select a, b, p1, p2 from table1 limit 10") > scala> res.collect > res2: Array[org.apache.spark.sql.Row] = Array([January,1,238,10], > [January,1,86,10], [January,1,311,10], [January,1,27,10], [January,1,165,10], > [January,1,409,10], [January,1,255,10], [January,1,278,10], > [January,1,98,10], [January,1,484,10]) > scala> res.schema > res5: org.apache.spark.sql.StructType = > StructType(ArrayBuffer(StructField(a,IntegerType,true), > StructField(b,IntegerType,true), StructField(p1,StringType,true), > StructField(p2,IntegerType,true))) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
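Editor's note on SPARK-5049: the mismatch above is between the order in which the scan produces values (partition columns first) and the order the SELECT asked for. A minimal sketch of the remapping that the expected output implies is shown below. It is written in plain Python for illustration and is not the code from pull request 3990; reorder_row and its parameter names are hypothetical.

```python
def reorder_row(row, produced_columns, requested_columns):
    """Rearrange one row from the scan's column order into the query's order."""
    # Map each column name to its position in the row as produced by the scan.
    index_of = {name: i for i, name in enumerate(produced_columns)}
    return tuple(row[index_of[name]] for name in requested_columns)


if __name__ == "__main__":
    # The scan in the report emits partition columns (p1, p2) before a and b.
    produced = ["p1", "p2", "a", "b"]
    requested = ["a", "b", "p1", "p2"]
    print(reorder_row(("January", 1, 484, 10), produced, requested))
```

Applying the same mapping to every row would also keep the data aligned with the schema reported by res.schema in the HiveContext example.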
[jira] [Updated] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-1239: -- Assignee: (was: Josh Rosen) > Don't fetch all map output statuses at each reducer during shuffles > --- > > Key: SPARK-1239 > URL: https://issues.apache.org/jira/browse/SPARK-1239 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Patrick Wendell > > Instead we should modify the way we fetch map output statuses to take both a > mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4348) pyspark.mllib.random conflicts with random module
[ https://issues.apache.org/jira/browse/SPARK-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274333#comment-14274333 ] Apache Spark commented on SPARK-4348: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/4011 > pyspark.mllib.random conflicts with random module > - > > Key: SPARK-4348 > URL: https://issues.apache.org/jira/browse/SPARK-4348 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.1.0, 1.2.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.2.0 > > > There are conflicts in two cases: > 1. The random module is used by pyspark.mllib.feature; if the first part of > sys.path is not '', then the hack in pyspark/__init__.py will fail to fix the > conflict. > 2. When running tests in mllib/xxx.py, the '' should be popped out before importing > anything, or it will fail. > The first one is not fully fixed for users; it will introduce problems in some > cases, such as: > {code} > >>> import sys > >>> sys.path.insert(0, PATH_OF_MODULE) > >>> import pyspark > >>> # use Word2Vec will fail > {code} > I'd like to rename mllib/random.py as mllib/_random.py, then in > mllib/__init__.py > {code} > import pyspark.mllib._random as random > {code} > cc [~mengxr] [~dorx]
[jira] [Commented] (SPARK-5207) StandardScalerModel mean and variance re-use
[ https://issues.apache.org/jira/browse/SPARK-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274335#comment-14274335 ] Xiangrui Meng commented on SPARK-5207: -- [~ogeagla] I've assigned this ticket to you. Currently the constructor takes withMean, withStd, mean, and std. We may want to consider whether to change the ordering of the parameters or provide auxiliary constructors. For example, we can have StandardScalerModel(mean, std) and then make withMean and withStd configurable via the setters setWithMean and setWithStd. This is just one option. [~dbtsai] implemented this feature, so he may want to add more. > StandardScalerModel mean and variance re-use > > > Key: SPARK-5207 > URL: https://issues.apache.org/jira/browse/SPARK-5207 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Octavian Geagla >Assignee: Octavian Geagla > > From this discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html > Changing constructor to public would be a simple change, but a discussion is > needed to determine what args are necessary for this change.
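Editor's note on SPARK-5207: the option floated in the comment, a constructor that takes only mean and std plus setters for the two flags, could look roughly like the sketch below. It is plain Python for illustration, not the MLlib class. The name StandardScalerSketch, the snake_case setter names, the default flag values (centering off, scaling on) and the choice to map zero-variance features to 0.0 are all assumptions made for this example.

```python
class StandardScalerSketch:
    """Hypothetical model built from a precomputed mean and std per feature."""

    def __init__(self, mean, std):
        self.mean = list(mean)
        self.std = list(std)
        # Assumed defaults: scale to unit std, but do not center.
        self.with_mean = False
        self.with_std = True

    def set_with_mean(self, flag):
        self.with_mean = bool(flag)
        return self  # returning self allows chained setter calls

    def set_with_std(self, flag):
        self.with_std = bool(flag)
        return self

    def transform(self, vector):
        """Apply the configured centering and scaling to one dense vector."""
        out = []
        for x, m, s in zip(vector, self.mean, self.std):
            if self.with_mean:
                x = x - m
            if self.with_std:
                # Assumption: a feature with zero std is mapped to 0.0.
                x = x / s if s != 0.0 else 0.0
            out.append(x)
        return out
```

With this shape, means and standard deviations computed in one job could be stored and reused in another, for example StandardScalerSketch(mean, std).set_with_mean(True).transform(v), without refitting.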
[jira] [Commented] (SPARK-4821) pyspark.mllib.rand docs not generated correctly
[ https://issues.apache.org/jira/browse/SPARK-4821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274334#comment-14274334 ] Apache Spark commented on SPARK-4821: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/4011 > pyspark.mllib.rand docs not generated correctly > --- > > Key: SPARK-4821 > URL: https://issues.apache.org/jira/browse/SPARK-4821 > Project: Spark > Issue Type: Bug > Components: Documentation, MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 1.3.0, 1.2.1 > > > spark/python/docs/pyspark.mllib.rst needs to be updated to reflect the change > in package names from pyspark.mllib.random to .rand > Otherwise, the Python API docs are empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5207) StandardScalerModel mean and variance re-use
[ https://issues.apache.org/jira/browse/SPARK-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5207: - Target Version/s: 1.3.0 > StandardScalerModel mean and variance re-use > > > Key: SPARK-5207 > URL: https://issues.apache.org/jira/browse/SPARK-5207 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Octavian Geagla > > From this discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html > Changing constructor to public would be a simple change, but a discussion is > needed to determine what args are necessary for this change.
[jira] [Updated] (SPARK-5207) StandardScalerModel mean and variance re-use
[ https://issues.apache.org/jira/browse/SPARK-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5207: - Issue Type: Improvement (was: Wish) > StandardScalerModel mean and variance re-use > > > Key: SPARK-5207 > URL: https://issues.apache.org/jira/browse/SPARK-5207 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Octavian Geagla > > From this discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html > Changing the constructor to public would be a simple change, but a discussion is > needed to determine which args are necessary for this change.
[jira] [Updated] (SPARK-5207) StandardScalerModel mean and variance re-use
[ https://issues.apache.org/jira/browse/SPARK-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5207: - Assignee: Octavian Geagla > StandardScalerModel mean and variance re-use > > > Key: SPARK-5207 > URL: https://issues.apache.org/jira/browse/SPARK-5207 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Octavian Geagla >Assignee: Octavian Geagla > > From this discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html > Changing the constructor to public would be a simple change, but a discussion is > needed to determine which args are necessary for this change.
[jira] [Closed] (SPARK-4667) Spillable can request more than twice its current memory from pool
[ https://issues.apache.org/jira/browse/SPARK-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams closed SPARK-4667. Resolution: Not a Problem > Spillable can request more than twice its current memory from pool > -- > > Key: SPARK-4667 > URL: https://issues.apache.org/jira/browse/SPARK-4667 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Ryan Williams > > [Spillable|https://github.com/apache/spark/blob/0eb4a7fb0fa1fa56677488cbd74eb39e65317621/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L78] > has a comment that says "{{Claim up to double our current memory from the > shuffle memory pool}}", but then it proceeds to request {{2 * currentMemory - > myMemoryThreshold}}, which can more than double its current memory amount. > The requested amount (or the comment) should be changed.
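The arithmetic behind the "Not a Problem" resolution can be checked with a small sketch. It assumes that whatever the pool grants is added to the existing threshold and that the request is granted in full; the function name and the numbers are illustrative only and are not Spark's API.

```python
def threshold_after_request(current_memory, my_memory_threshold):
    """Return (requested, new_threshold), assuming the request is fully granted.

    Assumption: the granted amount is added on top of the existing threshold.
    """
    requested = 2 * current_memory - my_memory_threshold
    new_threshold = my_memory_threshold + requested
    return requested, new_threshold


# Illustrative numbers: 100 units in use, threshold currently at 60.
requested, new_threshold = threshold_after_request(100, 60)
print(requested, new_threshold)  # 140 200
```

Under these assumptions the request itself (140) is larger than the current memory (100), but the resulting threshold (200) is exactly double the current memory, which matches the wording of the comment quoted in the issue.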
[jira] [Comment Edited] (SPARK-4004) add akka-persistence based recovery mechanism for Master (maybe Worker)
[ https://issues.apache.org/jira/browse/SPARK-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274299#comment-14274299 ] Nan Zhu edited comment on SPARK-4004 at 1/12/15 10:30 PM: -- I'm closing the PR because I saw some discussion in https://github.com/apache/spark/pull/3825 stating that we would introduce fewer Akka features, to make it easier to replace Akka with Spark's RPC framework was (Author: codingcat): I'm closing the PR because I saw some discussion in https://github.com/apache/spark/pull/3825 stating that we would introduce fewer Akka features, to make it easier to replace Akka with Spark's own RPC framework > add akka-persistence based recovery mechanism for Master (maybe Worker) > --- > > Key: SPARK-4004 > URL: https://issues.apache.org/jira/browse/SPARK-4004 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Nan Zhu > > Since we have upgraded the Akka version to 2.3.x, > we can use features that are helpful in many applications; > e.g. by using persistence we can add an akka-persistence recovery mechanism to > Master (maybe also Worker, but I'm not sure if we have many things to recover > from that). > This would offer better performance and more flexibility than the current file-based > persistence engine.
[jira] [Closed] (SPARK-4004) add akka-persistence based recovery mechanism for Master (maybe Worker)
[ https://issues.apache.org/jira/browse/SPARK-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nan Zhu closed SPARK-4004. -- Resolution: Won't Fix I'm closing the PR because I saw some discussion in https://github.com/apache/spark/pull/3825 stating that we would introduce fewer Akka features, to make it easier to replace Akka with Spark's own RPC framework > add akka-persistence based recovery mechanism for Master (maybe Worker) > --- > > Key: SPARK-4004 > URL: https://issues.apache.org/jira/browse/SPARK-4004 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Nan Zhu > > Since we have upgraded the Akka version to 2.3.x, > we can use features that are helpful in many applications; > e.g. by using persistence we can add an akka-persistence recovery mechanism to > Master (maybe also Worker, but I'm not sure if we have many things to recover > from that). > This would offer better performance and more flexibility than the current file-based > persistence engine.
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274268#comment-14274268 ] Chip Senkbeil commented on SPARK-4923: -- Okay, I'll do that and update this JIRA once I've submitted the pull request. > Maven build should keep publishing spark-repl > - > > Key: SPARK-4923 > URL: https://issues.apache.org/jira/browse/SPARK-4923 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.2.0 >Reporter: Peng Cheng >Priority: Critical > Labels: shell > Attachments: > SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Spark-repl installation and deployment have been discontinued (see > SPARK-3452), but it is in the dependency list of a few projects that extend > its initialization process. > Please remove the 'skip' setting in spark-repl and make it an 'official' API > to encourage more platforms to integrate with it.
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274263#comment-14274263 ] Patrick Wendell commented on SPARK-4923: [~senkwich] definitely prefer GitHub. > Maven build should keep publishing spark-repl > - > > Key: SPARK-4923 > URL: https://issues.apache.org/jira/browse/SPARK-4923 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.2.0 >Reporter: Peng Cheng >Priority: Critical > Labels: shell > Attachments: > SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Spark-repl installation and deployment have been discontinued (see > SPARK-3452), but it is in the dependency list of a few projects that extend > its initialization process. > Please remove the 'skip' setting in spark-repl and make it an 'official' API > to encourage more platforms to integrate with it.
[jira] [Comment Edited] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274253#comment-14274253 ] Chip Senkbeil edited comment on SPARK-4923 at 1/12/15 10:07 PM: [~pwendell], I can definitely do that. Would you prefer a patch in the same form as the one attached? Or would it be better to create a pull request on GitHub for this with the changes? was (Author: senkwich): [~pwendell], I can definitely do that. Would you prefer a patch in the same form as the one attached? Or would it be better to create a pull request for this? > Maven build should keep publishing spark-repl > - > > Key: SPARK-4923 > URL: https://issues.apache.org/jira/browse/SPARK-4923 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.2.0 >Reporter: Peng Cheng >Priority: Critical > Labels: shell > Attachments: > SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Spark-repl installation and deployment have been discontinued (see > SPARK-3452), but it is in the dependency list of a few projects that extend > its initialization process. > Please remove the 'skip' setting in spark-repl and make it an 'official' API > to encourage more platforms to integrate with it.
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274253#comment-14274253 ] Chip Senkbeil commented on SPARK-4923: -- [~pwendell], I can definitely do that. Would you prefer a patch in the same form as the one attached? Or would it be better to create a pull request for this? > Maven build should keep publishing spark-repl > - > > Key: SPARK-4923 > URL: https://issues.apache.org/jira/browse/SPARK-4923 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.2.0 >Reporter: Peng Cheng >Priority: Critical > Labels: shell > Attachments: > SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Spark-repl installation and deployment have been discontinued (see > SPARK-3452), but it is in the dependency list of a few projects that extend > its initialization process. > Please remove the 'skip' setting in spark-repl and make it an 'official' API > to encourage more platforms to integrate with it.
[jira] [Comment Edited] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274239#comment-14274239 ] Patrick Wendell edited comment on SPARK-4923 at 1/12/15 9:58 PM: - Hey All, Sorry this has caused a disruption. As I said in the earlier comment, if anyone on these projects can submit a patch that locks down the visibility in that package and opens up the things that are specifically needed, I'm fine to keep publishing it (and to do so retroactively for 1.2). We just need to look closely at what we are exposing because this package currently violates Spark's API policy. Because the Scala repl does not itself offer any kind of API stability, it will be hard for Spark to do the same. But I think it's fine to just annotate and expose unstable APIs here, provided projects understand the implications of depending on them. [~senkwich] - since you guys are probably the heaviest users, would you be willing to take a crack at this? Basically start by making everything private and then go and unlock things that you need as Developer APIs. - Patrick was (Author: pwendell): Hey All, Sorry this has caused a disruption. As I said in the earlier comment, if anyone on these projects can submit a patch that locks down the visibility in that package and opens up the things that are specifically needed, I'm fine to keep publishing it (and to do so retroactively for 1.2). We just need to look closely at what we are exposing because this package currently violates Spark's API policy. Because the Scala repl does not itself offer any kind of API stability, it will be hard for Spark to do the same. But I think it's fine to just annotate and expose unstable APIs here, provided projects understand the implications of depending on them. Chi - since you guys are probably the heaviest users, would you be willing to take a crack at this? Basically start by making everything private and then go and unlock things that you need as Developer APIs. 
- Patrick > Maven build should keep publishing spark-repl > - > > Key: SPARK-4923 > URL: https://issues.apache.org/jira/browse/SPARK-4923 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.2.0 >Reporter: Peng Cheng >Priority: Critical > Labels: shell > Attachments: > SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Spark-repl installation and deployment have been discontinued (see > SPARK-3452), but it is in the dependency list of a few projects that extend > its initialization process. > Please remove the 'skip' setting in spark-repl and make it an 'official' API > to encourage more platforms to integrate with it.
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274239#comment-14274239 ] Patrick Wendell commented on SPARK-4923: Hey All, Sorry this has caused a disruption. As I said in the earlier comment, if anyone on these projects can submit a patch that locks down the visibility in that package and opens up the things that are specifically needed, I'm fine to keep publishing it (and to do so retroactively for 1.2). We just need to look closely at what we are exposing because this package currently violates Spark's API policy. Because the Scala repl does not itself offer any kind of API stability, it will be hard for Spark to do the same. But I think it's fine to just annotate and expose unstable APIs here, provided projects understand the implications of depending on them. Chi - since you guys are probably the heaviest users, would you be willing to take a crack at this? Basically start by making everything private and then go and unlock things that you need as Developer APIs. - Patrick > Maven build should keep publishing spark-repl > - > > Key: SPARK-4923 > URL: https://issues.apache.org/jira/browse/SPARK-4923 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.2.0 >Reporter: Peng Cheng >Priority: Critical > Labels: shell > Attachments: > SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Spark-repl installation and deployment have been discontinued (see > SPARK-3452), but it is in the dependency list of a few projects that extend > its initialization process. > Please remove the 'skip' setting in spark-repl and make it an 'official' API > to encourage more platforms to integrate with it.
[jira] [Commented] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274227#comment-14274227 ] Apache Spark commented on SPARK-4296: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4010 > Throw "Expression not in GROUP BY" when using same expression in group by > clause and select clause > --- > > Key: SPARK-4296 > URL: https://issues.apache.org/jira/browse/SPARK-4296 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0, 1.1.1, 1.2.0 >Reporter: Shixiong Zhu >Assignee: Cheng Lian >Priority: Blocker > > When the input data has a complex structure, using the same expression in the group > by clause and the select clause throws "Expression not in GROUP BY". > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Birthday(date: String) > case class Person(name: String, birthday: Birthday) > val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), > Person("Jim", Birthday("1980-02-28")))) > people.registerTempTable("people") > val year = sqlContext.sql("select count(*), upper(birthday.date) from people > group by upper(birthday.date)") > year.collect > {code} > Here is the plan of year: > {code:java} > SchemaRDD[3] at RDD at SchemaRDD.scala:105 > == Query Plan == > == Physical Plan == > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression > not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree: > Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date > AS date#9) AS c1#3] > Subquery people > LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at > ExistingRDD.scala:36 > {code} > The bug is in the equality test between `Upper(birthday#1.date)` and > `Upper(birthday#1.date AS date#9)`. > Maybe Spark SQL needs a mechanism to compare Alias expressions with non-Alias > expressions. 
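The kind of alias-insensitive comparison suggested at the end of the report could look roughly like the sketch below. This is a toy model, not Catalyst: the `Expr` class, the `strip_aliases` helper and the node names are all hypothetical and exist only to show why a plain structural equality test fails while a comparison after removing aliases succeeds.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Expr:
    """Hypothetical expression node: an operator, its children, an optional alias name."""
    op: str
    children: Tuple["Expr", ...] = ()
    name: str = ""


def strip_aliases(expr):
    """Return a copy of expr in which every Alias node is replaced by its child."""
    if expr.op == "Alias":
        return strip_aliases(expr.children[0])
    return Expr(expr.op, tuple(strip_aliases(c) for c in expr.children), expr.name)


def semantically_equal(a, b):
    """Compare two expressions while ignoring any aliases inside them."""
    return strip_aliases(a) == strip_aliases(b)


# Models Upper(birthday#1.date) versus Upper(birthday#1.date AS date#9).
field = Expr("GetField", name="birthday#1.date")
grouping = Expr("Upper", (field,))
selected = Expr("Upper", (Expr("Alias", (field,), name="date#9"),))

print(grouping == selected)                    # False
print(semantically_equal(grouping, selected))  # True
```

The design choice here is to normalize both sides before comparing rather than to special-case aliases inside the equality test itself, which keeps ordinary equality strict.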
[jira] [Resolved] (SPARK-5172) spark-examples-***.jar shades a wrong Hadoop distribution
[ https://issues.apache.org/jira/browse/SPARK-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5172. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Sean Owen > spark-examples-***.jar shades a wrong Hadoop distribution > - > > Key: SPARK-5172 > URL: https://issues.apache.org/jira/browse/SPARK-5172 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Shixiong Zhu >Assignee: Sean Owen >Priority: Minor > Fix For: 1.3.0 > > > Steps to check it: > 1. Download "spark-1.2.0-bin-hadoop2.4.tgz" from > http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz > 2. unzip `spark-examples-1.2.0-hadoop2.4.0.jar`. > 3. There is a file called `org/apache/hadoop/package-info.class` in the jar. > It doesn't exist in hadoop 2.4. > 4. Run "javap -classpath . -private -c -v org.apache.hadoop.package-info" > {code} > Compiled from "package-info.java" > interface org.apache.hadoop.package-info > SourceFile: "package-info.java" > RuntimeVisibleAnnotations: length = 0x24 >00 01 00 06 00 06 00 07 73 00 08 00 09 73 00 0A >00 0B 73 00 0C 00 0D 73 00 0E 00 0F 73 00 10 00 >11 73 00 12 > minor version: 0 > major version: 50 > Constant pool: > const #1 = Asciz org/apache/hadoop/package-info; > const #2 = class #1; // "org/apache/hadoop/package-info" > const #3 = Asciz java/lang/Object; > const #4 = class #3; // java/lang/Object > const #5 = Asciz package-info.java; > const #6 = Asciz Lorg/apache/hadoop/HadoopVersionAnnotation;; > const #7 = Asciz version; > const #8 = Asciz 1.2.1; > const #9 = Asciz revision; > const #10 = Asciz 1503152; > const #11 = Asciz user; > const #12 = Asciz mattf; > const #13 = Asciz date; > const #14 = Asciz Wed Jul 24 13:39:35 PDT 2013; > const #15 = Asciz url; > const #16 = Asciz > https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2; > const #17 = Asciz srcChecksum; > const #18 = Asciz 6923c86528809c4e7e6f493b6b413a9a; > const #19 = Asciz SourceFile; > const 
#20 = Asciz RuntimeVisibleAnnotations; > { > } > {code} > The version is {{1.2.1}}. > This happens because of a wrong hbase version setting in the examples project. Here is > a part of the dependency tree when running "mvn -Pyarn -Phadoop-2.4 > -Dhadoop.version=2.4.0 -pl examples dependency:tree" > {noformat} > [INFO] +- org.apache.hbase:hbase-testing-util:jar:0.98.7-hadoop1:compile > [INFO] | +- > org.apache.hbase:hbase-common:test-jar:tests:0.98.7-hadoop1:compile > [INFO] | +- > org.apache.hbase:hbase-server:test-jar:tests:0.98.7-hadoop1:compile > [INFO] | | +- com.sun.jersey:jersey-core:jar:1.8:compile > [INFO] | | +- com.sun.jersey:jersey-json:jar:1.8:compile > [INFO] | | | +- org.codehaus.jettison:jettison:jar:1.1:compile > [INFO] | | | +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile > [INFO] | | | \- org.codehaus.jackson:jackson-xc:jar:1.7.1:compile > [INFO] | | \- com.sun.jersey:jersey-server:jar:1.8:compile > [INFO] | | \- asm:asm:jar:3.3.1:test > [INFO] | +- org.apache.hbase:hbase-hadoop1-compat:jar:0.98.7-hadoop1:compile > [INFO] | +- > org.apache.hbase:hbase-hadoop1-compat:test-jar:tests:0.98.7-hadoop1:compile > [INFO] | +- org.apache.hadoop:hadoop-core:jar:1.2.1:compile > [INFO] | | +- xmlenc:xmlenc:jar:0.52:compile > [INFO] | | +- commons-configuration:commons-configuration:jar:1.6:compile > [INFO] | | | +- commons-digester:commons-digester:jar:1.8:compile > [INFO] | | | | \- commons-beanutils:commons-beanutils:jar:1.7.0:compile > [INFO] | | | \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile > [INFO] | | \- commons-el:commons-el:jar:1.0:compile > [INFO] | +- org.apache.hadoop:hadoop-test:jar:1.2.1:compile > [INFO] | | +- org.apache.ftpserver:ftplet-api:jar:1.0.0:compile > [INFO] | | +- org.apache.mina:mina-core:jar:2.0.0-M5:compile > [INFO] | | +- org.apache.ftpserver:ftpserver-core:jar:1.0.0:compile > [INFO] | | \- org.apache.ftpserver:ftpserver-deprecated:jar:1.0.0-M2:compile > [INFO] | +- > 
com.github.stephenc.findbugs:findbugs-annotations:jar:1.3.9-1:compile > [INFO] | \- junit:junit:jar:4.10:test > [INFO] | \- org.hamcrest:hamcrest-core:jar:1.1:test > {noformat} > If I run `mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -pl examples -am > dependency:tree -Dhbase.profile=hadoop2`, the dependency tree is correct.
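Steps 2 and 3 of the report (unpack the jar and look for `org/apache/hadoop/package-info.class`) can also be done without extracting anything, since a jar is a zip archive. The helper below is a minimal sketch using Python's standard `zipfile` module; the function name `jar_contains` is an assumption made for illustration, and the demo builds a tiny in-memory archive rather than reading the real Spark jar.

```python
import io
import zipfile


def jar_contains(jar, entry):
    """Return True if the jar (a path or a file-like object) has an entry with this exact name."""
    with zipfile.ZipFile(jar) as archive:
        return entry in archive.namelist()


# Self-contained demo: build a small in-memory archive that mimics the suspicious entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("org/apache/hadoop/package-info.class", b"")
buf.seek(0)

print(jar_contains(buf, "org/apache/hadoop/package-info.class"))  # True
```

Against a real download, the first argument would be the path to `spark-examples-1.2.0-hadoop2.4.0.jar` wherever the tarball was extracted; inspecting the version recorded inside the class file still needs `javap` as in step 4.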
[jira] [Resolved] (SPARK-5078) Allow setting Akka host name from env vars
[ https://issues.apache.org/jira/browse/SPARK-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5078. Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 > Allow setting Akka host name from env vars > -- > > Key: SPARK-5078 > URL: https://issues.apache.org/jira/browse/SPARK-5078 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Critical > Fix For: 1.3.0, 1.2.1 > > > Currently, Spark lets you set the IP address using SPARK_LOCAL_IP, but this > is given to Akka only after a reverse DNS lookup. This makes it difficult > to run Spark in Docker. You can already change the hostname that is used > programmatically, but it would be nice to be able to do this with an > environment variable.
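The behaviour requested here, an environment variable that takes precedence over the reverse DNS lookup, can be sketched as follows. This is only an illustration of the lookup order, not Spark's implementation: the variable name `EXAMPLE_LOCAL_HOSTNAME` and the function `resolve_hostname` are hypothetical.

```python
import os
import socket


def resolve_hostname(ip, env=os.environ, var="EXAMPLE_LOCAL_HOSTNAME"):
    """Pick the hostname to advertise for the given IP address.

    An explicit override from the environment wins; otherwise fall back to a
    reverse DNS lookup, and finally to the raw IP if the lookup fails.
    """
    override = env.get(var)
    if override:
        return override
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip


# With the (hypothetical) variable set, no DNS lookup is performed at all.
print(resolve_hostname("172.17.0.2", env={"EXAMPLE_LOCAL_HOSTNAME": "spark-master"}))
```

Skipping the lookup entirely when the override is present is the point of the sketch: inside a container the reverse lookup may return a name that other hosts cannot resolve.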
[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274050#comment-14274050 ] Reynold Xin commented on SPARK-5097: [~mohitjaggi] thanks for commenting. The implementation is actually pretty minor (it is mostly about finalizing the API). It would be great if you can review the design doc and chime in, and later on also review my initial pull request. Once the first pull request is in, I'm sure we will have more splittable tasks. > Adding data frame APIs to SchemaRDD > --- > > Key: SPARK-5097 > URL: https://issues.apache.org/jira/browse/SPARK-5097 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf > > > SchemaRDD, through its DSL, already provides common data frame > functionalities. However, the DSL was originally created for constructing > test cases without much end-user usability and API stability consideration. > This design doc proposes a set of API changes for Scala and Python to make > the SchemaRDD DSL API more usable and stable.
[jira] [Updated] (SPARK-5063) Raise more helpful errors when RDD actions or transformations are called inside of transformations
[ https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5063: -- Description: Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow: - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534 - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399 - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674 (those are just a sample of the ones that I've answered personally; there are many others). I think we can detect these errors by adding logic to {{RDD}} to check whether {{sc}} is null (e.g. turn {{sc}} into a getter function); we can use this to add a better error message. In PySpark, these errors manifest themselves slightly differently. Attempting to nest RDDs or perform actions inside of transformations results in pickle-time errors: {code} rdd1 = sc.parallelize(range(100)) rdd2 = sc.parallelize(range(100)) rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)]) {code} produces {code} [...] File "/Users/joshrosen/anaconda/lib/python2.7/pickle.py", line 306, in save rv = reduce(self.proto) File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. 
Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} We get the same error when attempting to broadcast an RDD in PySpark. For Python, improved error reporting could be as simple as overriding the {{getnewargs}} method to throw a more useful UnsupportedOperation exception with a more helpful error message. was: Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow: - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534 - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399 - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674 (those are just a sample of the ones that I've answered personally; there are many others). I think we can detect these errors by adding logic to {{RDD}} to check whether {{sc}} is null (e.g. turn {{sc}} into a getter function); we can use this to add a better error message. 
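The Python-side idea in the updated description, failing early with a readable message when such an object is pickled, can be sketched in plain Python. The class below is a hypothetical stand-in, not PySpark's RDD, and the exception type and message are assumptions; the sketch relies only on the fact that Python 3's default pickle protocols call `__getnewargs__` when it is defined.

```python
import pickle


class DriverOnlyHandle:
    """Hypothetical stand-in for an object (such as an RDD) that must stay on the driver."""

    def __getnewargs__(self):
        # Called by pickle (protocol 2 and later) while serializing an instance;
        # raising here replaces an obscure failure with a clear message.
        raise RuntimeError(
            "This object cannot be serialized; it appears to be referenced "
            "from inside a transformation or broadcast."
        )


try:
    pickle.dumps(DriverOnlyHandle())
except RuntimeError as err:
    print("pickling refused:", err)
```

The benefit of hooking the serialization step is that the user sees the explanation at the point where the closure is shipped, rather than a gateway error about a missing method.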
> Raise more helpful errors when RDD actions or transformations are called > inside of transformations > -- > > Key: SPARK-5063 > URL: https://issues.apache.org/jira/browse/SPARK-5063 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > Spark does not support nested RDDs or performing Spark actions inside of > transformations; this usually leads to NullPointerExceptions (see SPARK-718 > as one example). The confusing NPE is one of the most common sources of > Spark questions on StackOverflow: > - > https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534 > - > https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399 > - > https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674 > (those are just a sample of the ones that I've answered personally; there are > many others). > I think we can detect these errors by adding logic to {{RDD}} to check > whether {{sc}} is null (e.
[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274019#comment-14274019 ] Josh Rosen commented on SPARK-4879: --- I think that part of the reproduction issues that I had might have been due to {{attemptId}} returning a unique task attempt ID rather than the attempt number, meaning that only the _first_ run of that test in the REPL would be capable of uncovering the bug. See https://github.com/apache/spark/pull/3849 / SPARK-4014 for more context. I'm going to try to merge that patch today, which will let me write a reliable regression test. > Missing output partitions after job completes with speculative execution > > > Key: SPARK-4879 > URL: https://issues.apache.org/jira/browse/SPARK-4879 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 1.0.2, 1.1.1, 1.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Attachments: speculation.txt, speculation2.txt > > > When speculative execution is enabled ({{spark.speculation=true}}), jobs that > save output files may report that they have completed successfully even > though some output partitions written by speculative tasks may be missing. > h3. Reproduction > This symptom was reported to me by a Spark user and I've been doing my own > investigation to try to come up with an in-house reproduction. > I'm still working on a reliable local reproduction for this issue, which is a > little tricky because Spark won't schedule speculated tasks on the same host > as the original task, so you need an actual (or containerized) multi-host > cluster to test speculation. 
Here's a simple reproduction of some of the > symptoms on EC2, which can be run in {{spark-shell}} with {{--conf > spark.speculation=true}}: > {code} > // Rig a job such that all but one of the tasks complete instantly > // and one task runs for 20 seconds on its first attempt and instantly > // on its second attempt: > val numTasks = 100 > sc.parallelize(1 to numTasks, > numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) => > if (ctx.partitionId == 0) { // If this is the one task that should run > really slow > if (ctx.attemptId == 0) { // If this is the first attempt, run slow > Thread.sleep(20 * 1000) > } > } > iter > }.map(x => (x, x)).saveAsTextFile("/test4") > {code} > When I run this, I end up with a job that completes quickly (due to > speculation) but reports failures from the speculated task: > {code} > [...] > 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage > 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal > (100/100) > 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at > :22) finished in 0.856 s > 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at > :22, took 0.885438374 s > 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event > for 70.1 in stage 3.0 because task 70 has already completed successfully > scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in > stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): > java.io.IOException: Failed to save output of task: > attempt_201412110141_0003_m_49_413 > > org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160) > > org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) > > org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132) > org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) > > 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991) > > org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > org.apache.spark.scheduler.Task.run(Task.scala:54) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} > One interesting thing to note about this stack trace: if we look at > {{FileOutputCommitter.java:160}} > ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]), > this point in the execution seems to cor
[jira] [Created] (SPARK-5207) StandardScalerModel mean and variance re-use
Octavian Geagla created SPARK-5207: -- Summary: StandardScalerModel mean and variance re-use Key: SPARK-5207 URL: https://issues.apache.org/jira/browse/SPARK-5207 Project: Spark Issue Type: Wish Components: MLlib Reporter: Octavian Geagla From this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html Changing the constructor to public would be a simple change, but a discussion is needed to determine which args are necessary for this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273994#comment-14273994 ] Mohit Jaggi commented on SPARK-5097: Hi, This is Mohit Jaggi, author of https://github.com/AyasdiOpenSource/bigdf Matei had suggested integrating bigdf with SchemaRDD and I was planning on doing that soon. I would love to contribute to this item. Most of the constructs mentioned in the design document already exist in bigdf. Mohit. > Adding data frame APIs to SchemaRDD > --- > > Key: SPARK-5097 > URL: https://issues.apache.org/jira/browse/SPARK-5097 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf > > > SchemaRDD, through its DSL, already provides common data frame > functionalities. However, the DSL was originally created for constructing > test cases without much end-user usability and API stability consideration. > This design doc proposes a set of API changes for Scala and Python to make > the SchemaRDD DSL API more usable and stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2584) Do not mutate block storage level on the UI
[ https://issues.apache.org/jira/browse/SPARK-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273877#comment-14273877 ] Ilya Ganelin edited comment on SPARK-2584 at 1/12/15 7:08 PM: -- Understood, I am able to recreate this issue in 1.1. I'll work on a fix to clarify what's going on. Thanks. was (Author: ilganeli): Understood, I was looking at the UI for Spark 1.1 and did not see the block storage level represented as MEMORY_AND_DISK or DISK_ONLY. It's now presented as Memory Deserialized or Disk Deserialized. I'll attempt to recreate this problem in the newer version of Spark but wanted to know if you've seen it since 1.0.1. > Do not mutate block storage level on the UI > --- > > Key: SPARK-2584 > URL: https://issues.apache.org/jira/browse/SPARK-2584 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.0.1 >Reporter: Andrew Or > > If a block is stored MEMORY_AND_DISK and we drop it from memory, it becomes > DISK_ONLY on the UI. We should preserve the original storage level proposed > by the user, in addition to the change in actual storage level.
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273957#comment-14273957 ] Shivaram Venkataraman commented on SPARK-3821: -- Thanks [~nchammas] for the benchmarks. This is looking good. Just curious about one thing in the spark-packer.json file: where does the `create_image.sh` in https://github.com/nchammas/spark-ec2/blob/273c8c518fbc6e86e0fb4410efbe77a4d4e4ff5b/packer/spark-packer.json#L66 come from? Is it the same file as in the current spark-ec2 repo? > Develop an automated way of creating Spark images (AMI, Docker, and others) > --- > > Key: SPARK-3821 > URL: https://issues.apache.org/jira/browse/SPARK-3821 > Project: Spark > Issue Type: Improvement > Components: Build, EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas > Attachments: packer-proposal.html > > > Right now the creation of Spark AMIs or Docker containers is done manually. > With tools like [Packer|http://www.packer.io/], we should be able to automate > this work, and do so in such a way that multiple types of machine images can > be created from a single template.
[jira] [Resolved] (SPARK-5102) CompressedMapStatus needs to be registered with Kryo
[ https://issues.apache.org/jira/browse/SPARK-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5102. Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Fixed by: https://github.com/apache/spark/pull/4007 > CompressedMapStatus needs to be registered with Kryo > > > Key: SPARK-5102 > URL: https://issues.apache.org/jira/browse/SPARK-5102 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.0 >Reporter: Daniel Darabos >Assignee: Lianhui Wang >Priority: Minor > Fix For: 1.3.0, 1.2.1 > > > After upgrading from Spark 1.1.0 to 1.2.0 I got this exception: > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Class is > not registered: org.apache.spark.scheduler.CompressedMapStatus > Note: To register this class use: > kryo.register(org.apache.spark.scheduler.CompressedMapStatus.class); > at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79) > at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565) > at > org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > I had to register {{org.apache.spark.scheduler.CompressedMapStatus}} with > Kryo. I think this should be done in > {{spark/serializer/KryoSerializer.scala}}, unless instances of this class are > not expected to be sent over the wire. (Maybe I'm doing something wrong?) 
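The workaround the reporter describes (registering the class with Kryo from application code) might look roughly like the sketch below. It assumes Spark 1.2 or later, where `SparkConf.registerKryoClasses` is available, and it looks the class up by name on the assumption that `CompressedMapStatus` is not part of Spark's public API. The application name is a placeholder, and this is a user-side workaround, not the fix that was merged.

```scala
import org.apache.spark.SparkConf

// Sketch of a user-side workaround, assuming Spark 1.2+.
// The class is resolved by name because it is assumed to be internal to Spark.
val conf = new SparkConf()
  .setAppName("kryo-registration-example") // placeholder name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array[Class[_]](
    Class.forName("org.apache.spark.scheduler.CompressedMapStatus")
  ))
```

Once Spark itself registers the class, a registration like this should no longer be needed.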
[jira] [Updated] (SPARK-5102) CompressedMapStatus needs to be registered with Kryo
[ https://issues.apache.org/jira/browse/SPARK-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5102: --- Target Version/s: 1.2.1 Assignee: Lianhui Wang > CompressedMapStatus needs to be registered with Kryo > > > Key: SPARK-5102 > URL: https://issues.apache.org/jira/browse/SPARK-5102 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.0 >Reporter: Daniel Darabos >Assignee: Lianhui Wang >Priority: Minor > > After upgrading from Spark 1.1.0 to 1.2.0 I got this exception: > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Class is > not registered: org.apache.spark.scheduler.CompressedMapStatus > Note: To register this class use: > kryo.register(org.apache.spark.scheduler.CompressedMapStatus.class); > at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79) > at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565) > at > org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > I had to register {{org.apache.spark.scheduler.CompressedMapStatus}} with > Kryo. I think this should be done in > {{spark/serializer/KryoSerializer.scala}}, unless instances of this class are > not expected to be sent over the wire. (Maybe I'm doing something wrong?) 
[jira] [Resolved] (SPARK-5200) Disable web UI in Hive Thriftserver tests
[ https://issues.apache.org/jira/browse/SPARK-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5200. --- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 1.1.2 Issue resolved by pull request 3998 [https://github.com/apache/spark/pull/3998] > Disable web UI in Hive Thriftserver tests > - > > Key: SPARK-5200 > URL: https://issues.apache.org/jira/browse/SPARK-5200 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Labels: flaky-test > Fix For: 1.1.2, 1.3.0, 1.2.1 > > > In our unit tests, we should disable the Spark Web UI when starting the Hive > Thriftserver, since port contention during this test has been a cause of test > failures on Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vincent ye updated SPARK-5206: -- Description: I got the following exception when my streaming application restarted from a checkpoint after a crash: 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 4) java.util.NoSuchElementException: key not found: 1 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) I guess that an Accumulator is registered with the singleton Accumulators object at line 58 of org.apache.spark.Accumulable: Accumulators.register(this, true) This code needs to be executed once in the driver, but when the application is recovered from a checkpoint it is not executed in the driver. So when the driver processes the task completion at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938), it can't find the Accumulator because it was not re-registered during the recovery. was: I got the following exception when my streaming application restarted from a checkpoint after a crash: 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 4) java.util.NoSuchElementException: key not found: 1 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at
akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Accumulators are not re-registered during recovering from checkpoint > > >
[jira] [Commented] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273927#comment-14273927 ] vincent ye commented on SPARK-5206: --- I guess that an Accumulator is registered with the singleton Accumulators object at line 58 of org.apache.spark.Accumulable: Accumulators.register(this, true) This code needs to be executed once in the driver, but when the application is recovered from a checkpoint it is not executed in the driver. So when the driver processes the task completion at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938), it can't find the Accumulator because it was not re-registered during the recovery. > Accumulators are not re-registered during recovering from checkpoint > > > Key: SPARK-5206 > URL: https://issues.apache.org/jira/browse/SPARK-5206 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0 >Reporter: vincent ye > > I got the following exception when my streaming application restarted from a > checkpoint after a crash: > 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR > scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, > 4) > java.util.NoSuchElementException: key not found: 1 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:58) > at scala.collection.mutable.HashMap.apply(HashMap.scala:64) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) > at
scala.collection.mutable.HashMap.foreach(HashMap.scala:98) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
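The mechanism guessed at in this comment can be modelled outside Spark. The sketch below is a toy, not Spark's code: `ToyRegistry` and `ToyAccumulator` are hypothetical names, and it only assumes a registry keyed by id that is populated from the constructor. If an object is restored without its constructor running again (Java deserialization does not re-run constructors of serializable classes), a plain map lookup by id fails with the same kind of `NoSuchElementException` seen in the stack trace.

```scala
import scala.collection.mutable

// Toy stand-in for a driver-side singleton registry (hypothetical, not Spark's Accumulators object).
object ToyRegistry {
  private val originals = mutable.HashMap.empty[Long, ToyAccumulator]

  def register(acc: ToyAccumulator): Unit = {
    originals.update(acc.id, acc)
  }

  // Simulates a fresh driver process whose registry starts out empty.
  def clear(): Unit = originals.clear()

  // Plain map lookup: throws NoSuchElementException for ids that were never registered.
  def add(id: Long, delta: Int): Unit = {
    originals(id).value += delta
  }

  // Defensive variant: skips the update and reports false when the id is unknown.
  def addIfRegistered(id: Long, delta: Int): Boolean =
    originals.get(id) match {
      case Some(acc) =>
        acc.value += delta
        true
      case None =>
        false
    }
}

class ToyAccumulator(val id: Long) {
  var value: Int = 0
  // Registration happens only when the constructor runs.
  ToyRegistry.register(this)
}
```

In this toy model, calling `ToyRegistry.clear()` and then `ToyRegistry.add(1L, 1)` reproduces the lookup failure. Whether the real fix should re-register on recovery or tolerate unknown ids is a design question for the issue itself.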
[jira] [Commented] (SPARK-3450) Enable specifying the --jars CLI option multiple times
[ https://issues.apache.org/jira/browse/SPARK-3450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273924#comment-14273924 ] Marcelo Vanzin commented on SPARK-3450: --- [~pwendell] if your only concern is complicating the parsing, this is probably a one-line change in SparkSubmitArguments.scala. It wouldn't complicate anything. > Enable specifying the --jars CLI option multiple times > --- > > Key: SPARK-3450 > URL: https://issues.apache.org/jira/browse/SPARK-3450 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.0.2 >Reporter: wolfgang hoschek > > spark-submit should support specifying the --jars option multiple times, e.g. > --jars foo.jar,bar.jar --jars baz.jar,oops.jar should be equivalent to --jars > foo.jar,bar.jar,baz.jar,oops.jar > This would allow using wrapper scripts that simplify usage for enterprise > customers along the following lines: > {code} > my-spark-submit.sh: > jars= > for i in /opt/myapp/*.jar; do > if [ -n "$jars" ] > then > jars="$jars," > fi > jars="$jars$i" > done > spark-submit --jars "$jars" "$@" > {code} > Example usage: > {code} > my-spark-submit.sh --jars myUserDefinedFunction.jar > {code} > The relevant enhancement code might go into SparkSubmitArguments.
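The requested behaviour (repeated `--jars` values treated as one comma-separated list) can be sketched in plain Scala. `JarsOption`, `merge`, and `collectJars` are hypothetical names used only for illustration; this is not the actual parsing code in SparkSubmitArguments.

```scala
// Hypothetical helper illustrating the proposed semantics; not Spark's parser.
object JarsOption {
  /** Appends a newly parsed --jars value to whatever has been collected so far. */
  def merge(existing: Option[String], value: String): Option[String] = {
    val parts = existing.toSeq.flatMap(_.split(",")) ++ value.split(",")
    val cleaned = parts.map(_.trim).filter(_.nonEmpty)
    if (cleaned.isEmpty) None else Some(cleaned.mkString(","))
  }

  /** Walks an argument list and folds every "--jars <value>" pair into one value. */
  def collectJars(args: Seq[String]): Option[String] =
    args.sliding(2).foldLeft(Option.empty[String]) {
      case (acc, Seq("--jars", value)) => merge(acc, value)
      case (acc, _) => acc
    }
}
```

With this sketch, `--jars foo.jar,bar.jar --jars baz.jar,oops.jar` yields the same value as `--jars foo.jar,bar.jar,baz.jar,oops.jar`, which is the equivalence the issue asks for.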
[jira] [Created] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint
vincent ye created SPARK-5206: - Summary: Accumulators are not re-registered during recovering from checkpoint Key: SPARK-5206 URL: https://issues.apache.org/jira/browse/SPARK-5206 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: vincent ye I got the following exception when my streaming application restarted from a checkpoint after a crash: 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 4) java.util.NoSuchElementException: key not found: 1 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4859) Improve StreamingListenerBus
[ https://issues.apache.org/jira/browse/SPARK-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4859: - Priority: Major (was: Minor) > Improve StreamingListenerBus > > > Key: SPARK-4859 > URL: https://issues.apache.org/jira/browse/SPARK-4859 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Shixiong Zhu > > Fix the race condition of `queueFullErrorMessageLogged`. > Log the error from listener rather than crashing `listenerThread`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4859) Improve StreamingListenerBus
[ https://issues.apache.org/jira/browse/SPARK-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4859: - Target Version/s: 1.3.0 > Improve StreamingListenerBus > > > Key: SPARK-4859 > URL: https://issues.apache.org/jira/browse/SPARK-4859 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Shixiong Zhu > > Fix the race condition of `queueFullErrorMessageLogged`. > Log the error from listener rather than crashing `listenerThread`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4859) Improve StreamingListenerBus
[ https://issues.apache.org/jira/browse/SPARK-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4859: - Affects Version/s: 1.0.0 > Improve StreamingListenerBus > > > Key: SPARK-4859 > URL: https://issues.apache.org/jira/browse/SPARK-4859 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Shixiong Zhu >Priority: Minor > > Fix the race condition of `queueFullErrorMessageLogged`. > Log the error from listener rather than crashing `listenerThread`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
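The two points in the SPARK-4859 description (a race on `queueFullErrorMessageLogged`, and a listener error crashing `listenerThread`) can be illustrated with a simplified bus. This is a sketch under assumptions, not the real StreamingListenerBus: `ToyListenerBus`, its string events, and the `log` callback are hypothetical. It uses an atomic compare-and-set so the queue-full message is logged at most once, and it catches non-fatal listener exceptions so that one failing listener does not stop delivery to the others.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.util.control.NonFatal

// Hypothetical, simplified bus; not the real StreamingListenerBus.
class ToyListenerBus(listeners: Seq[String => Unit], log: String => Unit) {
  private val queueFullErrorMessageLogged = new AtomicBoolean(false)

  /** Logs the queue-full error at most once, even when called from several threads. */
  def onQueueFull(): Unit = {
    if (queueFullErrorMessageLogged.compareAndSet(false, true)) {
      log("Dropping events because the queue is full")
    }
  }

  /** Delivers an event to every listener; a failing listener is logged and skipped. */
  def post(event: String): Unit = {
    listeners.foreach { listener =>
      try {
        listener(event)
      } catch {
        case NonFatal(e) => log(s"Listener threw an exception: ${e.getMessage}")
      }
    }
  }
}
```

A plain `var` checked and then set in two steps allows two threads to both see `false` and log twice; `compareAndSet` makes the check and the update a single atomic step.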
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273885#comment-14273885 ] Reynold Xin commented on SPARK-5124: 1. Let's put that outside of this PR (either leave it as an actor for now and follow up to change it to a loop, or submit a separate PR to change it to a loop before we merge the actor PR). 2. Yes - you don't necessarily need an alternative implementation, but making sure the current API design can indeed support alternative implementations is a good idea. > Standardize internal RPC interface > -- > > Key: SPARK-5124 > URL: https://issues.apache.org/jira/browse/SPARK-5124 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Reynold Xin >Assignee: Shixiong Zhu > Attachments: Pluggable RPC - draft 1.pdf > > > In Spark we use Akka as the RPC layer. It would be great if we can > standardize the internal RPC interface to facilitate testing. This will also > provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2909) Indexing for SparseVector in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273880#comment-14273880 ] Manoj Kumar commented on SPARK-2909: [~josephkb] Sorry for spamming your inbox, but just a heads up that I'm working on this. I will most likely submit a Pull Request by tomorrow. > Indexing for SparseVector in pyspark > > > Key: SPARK-2909 > URL: https://issues.apache.org/jira/browse/SPARK-2909 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > SparseVector in pyspark does not currently support indexing, except by > examining the internal representation. Though indexing is a pricey operation, > it would be useful for, e.g., iterating through a dataset (RDD[LabeledPoint]) > and operating on a single feature.
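Indexing into a sparse vector usually comes down to a binary search over the stored indices. The sketch below shows the idea in Scala for consistency with the other code in this archive, although the issue itself concerns pyspark. `ToySparseVector` is a hypothetical class, not the MLlib or pyspark `SparseVector`, and it assumes the indices are stored in ascending order.

```scala
import java.util.Arrays

// Hypothetical minimal sparse vector; not the MLlib or pyspark class.
final case class ToySparseVector(size: Int, indices: Array[Int], values: Array[Double]) {
  require(indices.length == values.length, "indices and values must have the same length")

  /** Returns the value at position i, or 0.0 when i is not stored. Assumes sorted indices. */
  def apply(i: Int): Double = {
    if (i < 0 || i >= size) {
      throw new IndexOutOfBoundsException(s"index $i out of range [0, $size)")
    }
    // binarySearch returns a non-negative position only when the key is present.
    val pos = Arrays.binarySearch(indices, i)
    if (pos >= 0) values(pos) else 0.0
  }
}
```

Each lookup costs O(log k) for k stored entries, which is the price the description alludes to when it calls indexing a costly operation compared with a dense array.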
[jira] [Commented] (SPARK-2584) Do not mutate block storage level on the UI
[ https://issues.apache.org/jira/browse/SPARK-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273877#comment-14273877 ] Ilya Ganelin commented on SPARK-2584: - Understood, I was looking at the UI for Spark 1.1 and did not see the block storage level represented as MEMORY_AND_DISK or DISK_ONLY. It's now presented as Memory Deserialized or Disk Deserialized. I'll attempt to recreate this problem in the newer version of Spark but wanted to know if you've seen it since 1.0.1. > Do not mutate block storage level on the UI > --- > > Key: SPARK-2584 > URL: https://issues.apache.org/jira/browse/SPARK-2584 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.0.1 >Reporter: Andrew Or > > If a block is stored MEMORY_AND_DISK and we drop it from memory, it becomes > DISK_ONLY on the UI. We should preserve the original storage level proposed > by the user, in addition to the change in actual storage level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2584) Do not mutate block storage level on the UI
[ https://issues.apache.org/jira/browse/SPARK-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273871#comment-14273871 ] Andrew Or commented on SPARK-2584: -- When the in-memory cache is full, the RDD will be automatically dropped from memory to disk without the user explicitly calling anything. This is what I mean by drop it from memory. > Do not mutate block storage level on the UI > --- > > Key: SPARK-2584 > URL: https://issues.apache.org/jira/browse/SPARK-2584 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.0.1 >Reporter: Andrew Or > > If a block is stored MEMORY_AND_DISK and we drop it from memory, it becomes > DISK_ONLY on the UI. We should preserve the original storage level proposed > by the user, in addition to the change in actual storage level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2584) Do not mutate block storage level on the UI
[ https://issues.apache.org/jira/browse/SPARK-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273762#comment-14273762 ] Ilya Ganelin commented on SPARK-2584: - Hi Andrew, question about this. When you say "we drop it from memory" what mechanism are you talking about? It's illegal to change the persistence level of an already persisted RDD and if you call unpersist() it's dropped from both memory and disk storage. How would an RDD be "dropped" from memory? > Do not mutate block storage level on the UI > --- > > Key: SPARK-2584 > URL: https://issues.apache.org/jira/browse/SPARK-2584 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.0.1 >Reporter: Andrew Or > > If a block is stored MEMORY_AND_DISK and we drop it from memory, it becomes > DISK_ONLY on the UI. We should preserve the original storage level proposed > by the user, in addition to the change in actual storage level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2584) Do not mutate block storage level on the UI
[ https://issues.apache.org/jira/browse/SPARK-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273762#comment-14273762 ] Ilya Ganelin edited comment on SPARK-2584 at 1/12/15 4:47 PM: -- Hi Andrew, question about this. When you say "we drop it from memory" what mechanism are you talking about? It's illegal to change the persistence level of an already persisted RDD and if you call unpersist() it's dropped from both memory and disk storage. How would an RDD be "dropped" from memory? I'm just trying to reproduce the issue before creating a fix. was (Author: ilganeli): Hi Andrew, question about this. When you say "we drop it from memory" what mechanism are you talking about? It's illegal to change the persistence level of an already persisted RDD and if you call unpersist() it's dropped from both memory and disk storage. How would an RDD be "dropped" from memory? > Do not mutate block storage level on the UI > --- > > Key: SPARK-2584 > URL: https://issues.apache.org/jira/browse/SPARK-2584 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.0.1 >Reporter: Andrew Or > > If a block is stored MEMORY_AND_DISK and we drop it from memory, it becomes > DISK_ONLY on the UI. We should preserve the original storage level proposed > by the user, in addition to the change in actual storage level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5205) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI
[ https://issues.apache.org/jira/browse/SPARK-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273759#comment-14273759 ]

Apache Spark commented on SPARK-5205:
-
User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/4008

> Inconsistent behaviour between Streaming job and others, when click kill link in WebUI
> --
>
> Key: SPARK-5205
> URL: https://issues.apache.org/jira/browse/SPARK-5205
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Reporter: uncleGen
>
> The "kill" link is used to kill a stage in a job. It works for every kind of Spark job except Spark Streaming. To be specific, we can kill only the stage that runs the "Receiver", but not the "Receivers" themselves. The stage can be killed and removed from the UI, but the receivers stay alive and keep receiving data. I think this is counterintuitive. IMHO, killing the "receiver" stage should kill the "receivers" and stop them from receiving data.
[jira] [Created] (SPARK-5205) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI
uncleGen created SPARK-5205:
---
Summary: Inconsistent behaviour between Streaming job and others, when click kill link in WebUI
Key: SPARK-5205
URL: https://issues.apache.org/jira/browse/SPARK-5205
Project: Spark
Issue Type: Bug
Components: Streaming
Reporter: uncleGen

The "kill" link is used to kill a stage in a job. It works for every kind of Spark job except Spark Streaming. To be specific, we can kill only the stage that runs the "Receiver", but not the "Receivers" themselves. The stage can be killed and removed from the UI, but the receivers stay alive and keep receiving data. I think this is counterintuitive. IMHO, killing the "receiver" stage should kill the "receivers" and stop them from receiving data.
[jira] [Commented] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail
[ https://issues.apache.org/jira/browse/SPARK-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273631#comment-14273631 ]

Kousuke Saruta commented on SPARK-5164:
---
This ticket is a duplicate of SPARK-1825, right?

> YARN | Spark job submits from windows machine to a linux YARN cluster fail
> --
>
> Key: SPARK-5164
> URL: https://issues.apache.org/jira/browse/SPARK-5164
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.2.0
> Environment: Spark submit from Windows 7
>              YARN cluster on CentOS 6.5
> Reporter: Aniket Bhatnagar
>
> When Spark jobs are submitted from a Windows machine to a Linux YARN cluster, the jobs fail for the following reasons:
> 1. Commands and the classpath contain environment variables (like JAVA_HOME, PWD, etc.), but these are written in Windows syntax (%JAVA_HOME%, %PWD%, etc.) instead of Linux syntax ($JAVA_HOME, $PWD, etc.).
> 2. Paths in the launch environment are delimited by semicolons instead of colons. This is because YarnSparkHadoopUtil uses File.pathSeparator.
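The two failure causes listed in the description can be sketched as follows. This is only an illustration of the idea, not the fix applied in Spark: `to_unix_env_syntax` and `join_classpath` are hypothetical helpers, and the sketch assumes the launch command should always be written for the cluster's operating system rather than the client's.

```python
import re
from typing import Iterable

# Matches Windows-style variable references such as %JAVA_HOME%.
_WINDOWS_VAR = re.compile(r"%([A-Za-z_][A-Za-z0-9_]*)%")


def to_unix_env_syntax(command: str) -> str:
    """Rewrite %VAR% references as $VAR so a Linux shell can expand them."""
    return _WINDOWS_VAR.sub(r"$\1", command)


def join_classpath(entries: Iterable[str], target_separator: str = ":") -> str:
    """Join entries with the cluster's separator, not the client's os.pathsep."""
    return target_separator.join(entries)
```

For example, `to_unix_env_syntax("%JAVA_HOME%/bin/java")` yields `$JAVA_HOME/bin/java`, and `join_classpath` uses a colon regardless of the platform the client runs on.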
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273609#comment-14273609 ]

Oleg Zhurakousky commented on SPARK-3561:
-
Thanks Patrick. I 100% agree that Spark is _NOT just an API_, and in fact in our current efforts we are using much more of Spark than its user-facing API. But here is the thing: the reasons for extending the execution environment could be many, and indeed _RDD_ is a great extension point, just as _SparkContext_ is. However, both are less than ideal, since they would require constant code modification, forcing _re-compilation and re-packaging_ of an application every time one wants to delegate to an alternative execution environment (regardless of the reasons). But since we all seem to agree (based on previous comments) that _SparkContext_ is the right API-based extension point to address such extension requirements, why not allow it to be extended via configuration as well? It is merely a convenience without any harm, no different than a configuration-based “driver” model (e.g., JDBC).

> Allow for pluggable execution contexts in Spark
> ---
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 1.1.0
> Reporter: Oleg Zhurakousky
> Labels: features
> Attachments: SPARK-3561.pdf
>
> Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@Experimental) not exposed to end users of Spark.
> The trait will define 6 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> * persist
> * unpersist
> Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as "execution-context:foo.bar.MyJobExecutionContext" with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation.
> An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext.
> Please see the attached design doc for more details.
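The configuration-based selection proposed above (a master URL of the form "execution-context:foo.bar.MyJobExecutionContext") can be sketched in a few lines. This is an illustration under assumptions, not the design in the attached document: the Python class `DefaultExecutionContext`, its `run_job` method and the function `load_execution_context` are stand-ins invented here, and only the URL prefix comes from the proposal.

```python
import importlib

# Prefix taken from the proposal text.
PREFIX = "execution-context:"


class DefaultExecutionContext:
    """Stand-in for the default behaviour (hypothetical)."""

    def run_job(self, func, data):
        # Trivial local execution, just to make the sketch observable.
        return [func(x) for x in data]


def load_execution_context(master_url: str):
    """Return an instance of the class named in the master URL, or the default."""
    if not master_url.startswith(PREFIX):
        return DefaultExecutionContext()
    qualified_name = master_url[len(PREFIX):]
    # Split "package.module.ClassName" into module path and class name.
    module_name, _, class_name = qualified_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()
```

The application code stays unchanged; only the configured URL decides which implementation is instantiated, which is the "driver model" convenience argued for in the comment.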
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273565#comment-14273565 ] Travis Galoppo commented on SPARK-5019: --- [~lewuathe] Are you still interested in working on this ticket? SPARK-5018 is now complete. > Update GMM API to use MultivariateGaussian > -- > > Key: SPARK-5019 > URL: https://issues.apache.org/jira/browse/SPARK-5019 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Blocker > > The GaussianMixtureModel API should expose MultivariateGaussian instances > instead of the means and covariances. This should be fixed as soon as > possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5102) CompressedMapStatus needs to be registered with Kryo
[ https://issues.apache.org/jira/browse/SPARK-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273564#comment-14273564 ]

Apache Spark commented on SPARK-5102:
-
User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/4007

> CompressedMapStatus needs to be registered with Kryo
>
> Key: SPARK-5102
> URL: https://issues.apache.org/jira/browse/SPARK-5102
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.2.0
> Reporter: Daniel Darabos
> Priority: Minor
>
> After upgrading from Spark 1.1.0 to 1.2.0 I got this exception:
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.scheduler.CompressedMapStatus
> Note: To register this class use: kryo.register(org.apache.spark.scheduler.CompressedMapStatus.class);
> at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
> at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
> at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> I had to register {{org.apache.spark.scheduler.CompressedMapStatus}} with Kryo. I think this should be done in {{spark/serializer/KryoSerializer.scala}}, unless instances of this class are not expected to be sent over the wire. (Maybe I'm doing something wrong?)
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273561#comment-14273561 ]

Meethu Mathew commented on SPARK-5012:
--
I added a new class GaussianMixtureModel in clustering.py with a predict method, and I am trying to pass a list of more than one dimension to the function _py2java, but I get the exception 'list' object has no attribute '_get_object_id'. When I pass a tuple such as (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799])), the exception is 'numpy.ndarray' object has no attribute '_get_object_id'. Can you help me solve this? My aim is to call predictsoft() in GaussianMixtureModel.scala from clustering.py, passing the values of weight, mean and sigma.

> Python API for Gaussian Mixture Model
> -
>
> Key: SPARK-5012
> URL: https://issues.apache.org/jira/browse/SPARK-5012
> Project: Spark
> Issue Type: New Feature
> Components: MLlib, PySpark
> Reporter: Xiangrui Meng
> Assignee: Meethu Mathew
>
> Add Python API for the Scala implementation of GMM.
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273560#comment-14273560 ]

Shixiong Zhu commented on SPARK-5124:
-
{quote}
1. Let's not rely on the property of local actor not passing messages through a socket for local actor speedup. Conceptually, there is no reason to tie local actor implementation to RPC. DAGScheduler's actor used to be a simple queue & event loop (before it was turned into an actor for no good reason). We can restore it to that.
{quote}
OK. I will change the DAGScheduler actor to a simple event loop.
{quote}
2. Have you thought about how the fate sharing stuff would work with alternative RPC implementations?
{quote}
Just to make sure we are thinking about the same thing: do you mean how to deliver DisassociatedEvent notifications in an alternative RPC implementation? If so, I am thinking about how to extract that from the RPC layer, but I have not started on it yet.

> Standardize internal RPC interface
> --
>
> Key: SPARK-5124
> URL: https://issues.apache.org/jira/browse/SPARK-5124
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Reynold Xin
> Assignee: Shixiong Zhu
> Attachments: Pluggable RPC - draft 1.pdf
>
> In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future.
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273551#comment-14273551 ]

Valeriy Avanesov commented on SPARK-1405:
-
[~josephkb], I've read your proposal and I suggest considering Stochastic Gradient Langevin Dynamics [1]. It was shown to be ~100 times faster than Gibbs sampling [2]. However, I'm not sure whether it can be implemented in terms of RDDs.
[1] http://papers.nips.cc/paper/4883-stochastic-gradient-riemannian-langevin-dynamics-on-the-probability-simplex.pdf
[2] http://www.ics.uci.edu/~sungjia/icml2014_dist_v0.2.pdf

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xusen Yin
> Assignee: Guoqiang Li
> Priority: Critical
> Labels: features
> Attachments: performance_comparison.png
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation step (imported from Lucene), and a Gibbs sampling core.
[jira] [Commented] (SPARK-4859) Improve StreamingListenerBus
[ https://issues.apache.org/jira/browse/SPARK-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273535#comment-14273535 ] Apache Spark commented on SPARK-4859: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/4006 > Improve StreamingListenerBus > > > Key: SPARK-4859 > URL: https://issues.apache.org/jira/browse/SPARK-4859 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Priority: Minor > > Fix the race condition of `queueFullErrorMessageLogged`. > Log the error from listener rather than crashing `listenerThread`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
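The two fixes named in the description can be sketched generically. This is not the code in the linked pull request: `ListenerBus`, `post` and `dispatch_one` are hypothetical names, and the sketch only assumes that (a) the "queue full" flag must be checked and set atomically so the error is logged once, and (b) an exception thrown by one listener should be logged instead of terminating the dispatching thread.

```python
import logging
import queue
import threading

log = logging.getLogger("listener_bus")


class ListenerBus:
    def __init__(self, capacity: int = 10000):
        self._events = queue.Queue(maxsize=capacity)
        self._listeners = []
        self._lock = threading.Lock()
        self._queue_full_logged = False  # guarded by _lock

    def add_listener(self, listener) -> None:
        self._listeners.append(listener)

    def post(self, event) -> bool:
        """Enqueue an event; log the 'queue full' error at most once."""
        try:
            self._events.put_nowait(event)
            return True
        except queue.Full:
            with self._lock:  # check-and-set is atomic under the lock
                first_time = not self._queue_full_logged
                self._queue_full_logged = True
            if first_time:
                log.error("Dropping event: listener queue is full")
            return False

    def dispatch_one(self) -> None:
        """Deliver one event; a failing listener is logged, not propagated."""
        event = self._events.get()
        for listener in self._listeners:
            try:
                listener(event)
            except Exception:
                log.exception("Listener %r failed on event %r", listener, event)
```

A dispatching thread would call `dispatch_one` in a loop; because listener errors are caught per listener, one faulty listener does not prevent the others from seeing the event.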
[jira] [Closed] (SPARK-5204) Column case need to be consistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-5204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shengli closed SPARK-5204. -- Resolution: Not a Problem > Column case need to be consistent with Hive > --- > > Key: SPARK-5204 > URL: https://issues.apache.org/jira/browse/SPARK-5204 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: shengli >Priority: Minor > Fix For: 1.3.0 > > Original Estimate: 3h > Remaining Estimate: 3h > > Column case need to be consistent with Hive > Hive0.13 -> lower case -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5204) Column case need to be consistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-5204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273417#comment-14273417 ] Apache Spark commented on SPARK-5204: - User 'OopsOutOfMemory' has created a pull request for this issue: https://github.com/apache/spark/pull/4005 > Column case need to be consistent with Hive > --- > > Key: SPARK-5204 > URL: https://issues.apache.org/jira/browse/SPARK-5204 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: shengli >Priority: Minor > Fix For: 1.3.0 > > Original Estimate: 3h > Remaining Estimate: 3h > > Column case need to be consistent with Hive > Hive0.13 -> lower case -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5204) Column case need to be consistent with Hive
shengli created SPARK-5204: -- Summary: Column case need to be consistent with Hive Key: SPARK-5204 URL: https://issues.apache.org/jira/browse/SPARK-5204 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: shengli Priority: Minor Fix For: 1.3.0 Column case need to be consistent with Hive Hive0.13 -> lower case -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5203) union with different decimal type report error
[ https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273404#comment-14273404 ]

Apache Spark commented on SPARK-5203:
-
User 'guowei2' has created a pull request for this issue: https://github.com/apache/spark/pull/4004

> union with different decimal type report error
> --
>
> Key: SPARK-5203
> URL: https://issues.apache.org/jira/browse/SPARK-5203
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: guowei
>
> cases like this
> create table test (a decimal(10,1));
> select a from test union all select a*2 from test;
> 15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union all select a*2 from test]
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
> 'Project [*]
>  'Subquery _u1
>   'Union
>    Project [a#1]
>     MetastoreRelation default, test, None
>    Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), DecimalType())), DecimalType(21,1)) AS _c0#0]
>     MetastoreRelation default, test, None
> at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
> at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
> at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
> at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
> at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410)
> at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410)
> at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411)
> at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411)
> at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412)
> at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412)
> at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417)
> at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415)
> at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421)
> at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369)
> at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
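The plan in the trace shows the two branches of the union with different types: `a` is decimal(10,1), while `a*2` is typed DecimalType(21,1). The sketch below is a worked example under two assumptions, neither confirmed by this thread: that multiplication follows the Hive-style rule (precision p1 + p2 + 1, scale s1 + s2), which does reproduce the (21,1) seen in the trace, and that a union can be resolved by widening both sides to a common decimal type. The function names are hypothetical, and any maximum-precision cap is ignored.

```python
from typing import Tuple

# A decimal type is represented as (precision, scale).
PrecisionScale = Tuple[int, int]


def multiply_type(left: PrecisionScale, right: PrecisionScale) -> PrecisionScale:
    """Result type of left * right: precision p1 + p2 + 1, scale s1 + s2."""
    (p1, s1), (p2, s2) = left, right
    return (p1 + p2 + 1, s1 + s2)


def widen_for_union(left: PrecisionScale, right: PrecisionScale) -> PrecisionScale:
    """Smallest decimal type that can hold every value of both inputs."""
    (p1, s1), (p2, s2) = left, right
    scale = max(s1, s2)                     # keep all fractional digits
    integer_digits = max(p1 - s1, p2 - s2)  # keep all integer digits
    return (integer_digits + scale, scale)
```

Here `multiply_type((10, 1), (10, 0))` gives `(21, 1)`, matching the cast in the plan, and `widen_for_union((10, 1), (21, 1))` gives `(21, 1)` as a type both branches could be cast to.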
[jira] [Created] (SPARK-5203) union with different decimal type report error
guowei created SPARK-5203:
-
Summary: union with different decimal type report error
Key: SPARK-5203
URL: https://issues.apache.org/jira/browse/SPARK-5203
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: guowei

cases like this
create table test (a decimal(10,1));
select a from test union all select a*2 from test;
15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union all select a*2 from test]
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
'Project [*]
 'Subquery _u1
  'Union
   Project [a#1]
    MetastoreRelation default, test, None
   Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), DecimalType())), DecimalType(21,1)) AS _c0#0]
    MetastoreRelation default, test, None
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369)
at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)