[jira] [Closed] (SPARK-5232) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/SPARK-5232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang closed SPARK-5232. -- Resolution: Invalid Wrong project. CombineFileInputFormatShim#getDirIndices is expensive - Key: SPARK-5232 URL: https://issues.apache.org/jira/browse/SPARK-5232 Project: Spark Issue Type: Improvement Reporter: Jimmy Xiang [~lirui] found out that we spent quite some time on CombineFileInputFormatShim#getDirIndices. Looked into it and it seems to me we should be able to get rid of this method completely if we can enhance CombineFileInputFormatShim a little. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5123) Stabilize Spark SQL data type API
[ https://issues.apache.org/jira/browse/SPARK-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5123. Resolution: Fixed Fix Version/s: 1.3.0 Stabilize Spark SQL data type API - Key: SPARK-5123 URL: https://issues.apache.org/jira/browse/SPARK-5123 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 Having two versions of the data type APIs (one for Java, one for Scala) requires downstream libraries to also have two versions of the APIs if the library wants to support both Java and Scala. I took a look at the Scala version of the data type APIs - it can actually work out pretty well for Java out of the box. The proposal is to move Spark SQL data type definitions from org.apache.spark.sql.catalyst.types into org.apache.spark.sql.types, and make the existing Scala type API usable in Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5220) keepPushingBlocks in BlockGenerator terminated when an exception occurs, which causes the block pushing thread to terminate and blocks receiver
[ https://issues.apache.org/jira/browse/SPARK-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276350#comment-14276350 ] Saisai Shao commented on SPARK-5220: Hi Max, as I said in the mail, this is expected behavior of the receiver and block generator because of the locking mechanism in BlockGenerator. The receiver blocks on the lock when adding data into BlockGenerator, while BlockGenerator waits for the pushing thread to put data into HDFS and the BlockManager. Given the mismatched speeds of the two, this blocking is expected, as I understand it. keepPushingBlocks in BlockGenerator terminated when an exception occurs, which causes the block pushing thread to terminate and blocks receiver - Key: SPARK-5220 URL: https://issues.apache.org/jira/browse/SPARK-5220 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Max Xu I am running a Spark streaming application with ReliableKafkaReceiver. It uses BlockGenerator to push blocks to BlockManager. However, writing WALs to HDFS may time out, which causes keepPushingBlocks in BlockGenerator to terminate. 15/01/12 19:07:06 ERROR receiver.BlockGenerator: Error in block pushing thread java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:176) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:160) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:126) at org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:124) at org.apache.spark.streaming.kafka.ReliableKafkaReceiver.org$apache$spark$streaming$kafka$ReliableKafkaReceiver$$storeBlockAndCommitOffset(ReliableKafkaReceiver.scala:207) at org.apache.spark.streaming.kafka.ReliableKafkaReceiver$GeneratedBlockHandler.onPushBlock(ReliableKafkaReceiver.scala:275) at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:181) at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:154) at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:86) The block pushing thread is then gone, and no subsequent blocks can be pushed into BlockManager. In turn, this blocks the receiver from receiving new data. So when my app is running and the TimeoutException happens, the ReliableKafkaReceiver stays in ACTIVE status but doesn't do anything at all. The application effectively hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
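As a hedged illustration of the locking/back-pressure behavior described in the comment above (a toy sketch, not the actual BlockGenerator code; the queue size and names are assumptions): a single pushing thread drains a bounded queue, so when that thread dies on an exception nothing drains the queue and the receiver eventually blocks instead of failing, which matches the "ACTIVE but idle" symptom.
{code}
import java.util.concurrent.ArrayBlockingQueue

// Toy model of the receiver / block-pushing hand-off described in the comment.
class ToyBlockGenerator(store: Array[Byte] => Unit) {
  private val blocksForPushing = new ArrayBlockingQueue[Array[Byte]](10)

  // Called from the receiver: put() blocks forever once the queue is full and nobody drains it.
  def addBlock(block: Array[Byte]): Unit = blocksForPushing.put(block)

  // The single consumer; an exception (e.g. a WAL write timeout) terminates it and
  // nothing restarts it, so addBlock eventually stalls the receiver.
  private val pushingThread = new Thread("toy-block-pushing-thread") {
    override def run(): Unit = {
      try {
        while (true) store(blocksForPushing.take())
      } catch {
        case _: Exception => // thread dies silently; the queue is never drained again
      }
    }
  }
  pushingThread.start()
}
{code}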
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276380#comment-14276380 ] RJ Nowling commented on SPARK-4894: --- Hi @lmcguire, Always happy to have more help! :) I started looking through the Spark NB functions but I haven't started writing code yet. The docs for NB mention that using binary features will cause the multinomial NB to act like Bernoulli NB. I don't believe the documentation is correct, at least when smoothing is used, since P(0) != 1 - P(1). I was planning on comparing the sklearn implementation with the Spark implementation and showing that the docs were wrong. Once that is verified, I think the changes will be very small to add a Bernoulli mode controlled by a flag in the constructor. I won't get to this until next week, though. If you have time now and want to tackle this, I'd be happy to hand it over to you and review any patches. (I'm not a committer, though -- [~mengxr] would have to sign off.) Otherwise, if you want to wait until I have a patch and test it, that could work, too. What do you think? Add Bernoulli-variant of Naive Bayes Key: SPARK-4894 URL: https://issues.apache.org/jira/browse/SPARK-4894 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.1 Reporter: RJ Nowling MLlib only supports the multinomial variant of Naive Bayes. The Bernoulli version of Naive Bayes is more useful for situations where the features are binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
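A hedged sketch of the distinction discussed in the comment above (illustrative scoring functions, not MLlib code): with smoothing, feeding binarized features to the multinomial model is not equivalent to Bernoulli NB, because Bernoulli scoring also charges for absent features via log(1 - P(feature | class)), whereas multinomial scoring ignores them.
{code}
// theta(j) = P(feature j present | class); logTheta / log1mTheta are its log and log-complement.
def bernoulliLogLikelihood(x: Array[Double], logTheta: Array[Double], log1mTheta: Array[Double]): Double = {
  var ll = 0.0
  var j = 0
  while (j < x.length) {
    ll += (if (x(j) > 0.0) logTheta(j) else log1mTheta(j)) // absent features still contribute
    j += 1
  }
  ll
}

// Multinomial scoring on the same binary vector: absent features contribute nothing,
// so once smoothing shifts the estimates, P(absent) is no longer 1 - P(present).
def multinomialLogLikelihood(x: Array[Double], logTheta: Array[Double]): Double =
  x.zip(logTheta).map { case (xj, lt) => xj * lt }.sum
{code}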
[jira] [Updated] (SPARK-1805) Error launching cluster when master and slaves machines are of different visualization types
[ https://issues.apache.org/jira/browse/SPARK-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-1805: Issue Type: Bug (was: Improvement) Error launching cluster when master and slaves machines are of different visualization types Key: SPARK-1805 URL: https://issues.apache.org/jira/browse/SPARK-1805 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 0.9.0, 0.9.1, 1.0.0 Reporter: Han JU Priority: Minor In current EC2 script, the AMI image object is loaded only once. This is ok when master and slave machines are of the same visualization type (pvm or hvm). But this won't work if, say, master is pvm and slaves are hvm since the AMI is not compatible between these two kinds of visualization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1805) Error launching cluster when master and slave machines are of different virtualization types
[ https://issues.apache.org/jira/browse/SPARK-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-1805: Description: In the current EC2 script, the AMI image object is loaded only once. This is OK when the master and slave machines are of the same virtualization type (pv or hvm). But this won't work if, say, the master is pv and the slaves are hvm since the AMI is not compatible across these two kinds of virtualization. (was: In current EC2 script, the AMI image object is loaded only once. This is ok when master and slave machines are of the same visualization type (pvm or hvm). But this won't work if, say, master is pvm and slaves are hvm since the AMI is not compatible between these two kinds of visualization. ) Target Version/s: 1.3.0 Summary: Error launching cluster when master and slave machines are of different virtualization types (was: Error launching cluster when master and slaves machines are of different visualization types) Error launching cluster when master and slave machines are of different virtualization types Key: SPARK-1805 URL: https://issues.apache.org/jira/browse/SPARK-1805 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.1.1, 1.2.0 Reporter: Han JU Priority: Minor In the current EC2 script, the AMI image object is loaded only once. This is OK when the master and slave machines are of the same virtualization type (pv or hvm). But this won't work if, say, the master is pv and the slaves are hvm since the AMI is not compatible across these two kinds of virtualization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
Alex Baretta created SPARK-5235: --- Summary: java.io.NotSerializableException: org.apache.spark.sql.SQLConf Key: SPARK-5235 URL: https://issues.apache.org/jira/browse/SPARK-5235 Project: Spark Issue Type: Bug Reporter: Alex Baretta The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. Exception in thread Thread-43 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
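For context, a hedged sketch of the usual fix pattern for this class of error (the class and field names below are hypothetical, and this is not necessarily what the eventual patch does): mark the non-serializable member @transient and rebuild it lazily after deserialization.
{code}
// Hypothetical example of the @transient + lazy val pattern for a non-serializable member.
class ExampleContext extends Serializable {
  // Skipped during Java serialization; re-created on first use after deserialization.
  @transient protected lazy val settings: java.util.Properties = new java.util.Properties()
}
{code}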
[jira] [Commented] (SPARK-3678) Yarn app name reported in RM is different between cluster and client mode
[ https://issues.apache.org/jira/browse/SPARK-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276429#comment-14276429 ] WangTaoTheTonic commented on SPARK-3678: In SparkHdfsLR there is {quote}val sparkConf = new SparkConf().setAppName("SparkHdfsLR"){quote}. In client mode, the registration with YARN happens in YarnClientSchedulerBackend, which runs after the setAppName call above. In cluster mode, the registration happens in yarn.Client, which runs before setAppName. So it is the registration order that makes the difference. Yarn app name reported in RM is different between cluster and client mode - Key: SPARK-3678 URL: https://issues.apache.org/jira/browse/SPARK-3678 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves If you launch an application in yarn cluster mode, the name of the application in the ResourceManager generally shows up as the full name org.apache.spark.examples.SparkHdfsLR. If you start the same app in client mode, it shows up as SparkHdfsLR. We should be consistent between them. I haven't looked at it in detail; perhaps it's only the examples, but I think I've seen this with customer apps also. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
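To make the ordering concrete (based only on the example quoted in the comment above; nothing else is assumed), the short name only takes effect if it is set before the application registers with YARN, which in client mode happens when the SparkContext is constructed:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Client mode: YARN registration happens inside the SparkContext constructor, i.e.
// after setAppName, so the RM shows "SparkHdfsLR". Cluster mode: the app is registered
// by yarn.Client before this user code runs, so the RM falls back to the main class
// name, org.apache.spark.examples.SparkHdfsLR.
val sparkConf = new SparkConf().setAppName("SparkHdfsLR")
val sc = new SparkContext(sparkConf)
{code}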
[jira] [Created] (SPARK-5234) examples for ml don't have sparkContext.stop
yuhao yang created SPARK-5234: - Summary: examples for ml don't have sparkContext.stop Key: SPARK-5234 URL: https://issues.apache.org/jira/browse/SPARK-5234 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Environment: all Reporter: yuhao yang Priority: Trivial Fix For: 1.3.0 Not sure why sc.stop() is not in the org.apache.spark.examples.ml {CrossValidatorExample, SimpleParamsExample, SimpleTextClassificationPipeline}. I can prepare a PR if it's not intentional to omit the call to stop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
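A minimal sketch of what the proposed change might look like in one of those examples (the surrounding structure is assumed; the point is only the missing sc.stop() call):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object SimpleParamsExampleSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleParamsExample")
    val sc = new SparkContext(conf)
    try {
      // ... build and run the pipeline, as the existing example already does ...
    } finally {
      sc.stop() // the call the ml examples are currently missing
    }
  }
}
{code}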
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276471#comment-14276471 ] Nicholas Chammas commented on SPARK-3821: - [~shivaram] Are we ready to open a PR against {{mesos/spark-ec2}} and start a review discussion there? Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276436#comment-14276436 ] Florian Verhein commented on SPARK-3185: I'm also getting this, though with Server IPC version 9 now that I'm using hadoop 2.4.1 (modification of the various hadoop init.sh scripts). I'm also using spark 1.2.0. My understanding is that spark-1.2.0-bin-hadoop2.4.tgz is built against hadoop 2.4 and tachyon 0.4.1. But I suspect the tachyon 0.4.1 that is installed in the spark-ec2 scripts is built against hadoop 1... Does this mean building tachyon against hadoop 2.4.1 would fix this? SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER --- Key: SPARK-3185 URL: https://issues.apache.org/jira/browse/SPARK-3185 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Environment: Amazon Linux AMI [ec2-user@ip-172-30-1-145 ~]$ uname -a Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/ The build I used (and MD5 verified): [ec2-user@ip-172-30-1-145 ~]$ wget http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz Reporter: Jeremy Chambers {code} org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 {code} When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon exception is thrown when Formatting JOURNAL_FOLDER. No exception occurs when I launch on Hadoop 1. Launch used: {code} ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch sparkProd {code} {code} log snippet Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/ Exception in thread main java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73) at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53) at tachyon.UnderFileSystem.get(UnderFileSystem.java:53) at tachyon.Format.main(Format.java:54) Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at org.apache.hadoop.ipc.Client.call(Client.java:1070) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69) ... 
3 more Killed 0 processes Killed 0 processes ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes ---end snippet--- {code} *I don't have this problem when I launch without the --hadoop-major-version=2 (which defaults to Hadoop 1.x).* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276505#comment-14276505 ] Shivaram Venkataraman commented on SPARK-3821: -- [~nchammas] Yes -- That sounds good Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276507#comment-14276507 ] Apache Spark commented on SPARK-5235: - User 'alexbaretta' has created a pull request for this issue: https://github.com/apache/spark/pull/4031 java.io.NotSerializableException: org.apache.spark.sql.SQLConf -- Key: SPARK-5235 URL: https://issues.apache.org/jira/browse/SPARK-5235 Project: Spark Issue Type: Bug Reporter: Alex Baretta The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. Exception in thread Thread-43 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5233) Error replay of WAL when recovered from driver failure
[ https://issues.apache.org/jira/browse/SPARK-5233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-5233: --- Description: Spark Streaming will write all the event into WAL for driver recovery, the sequence in the WAL may be like this: {code} BlockAdditionEvent --- BlockAdditionEvent --- BlockAdditionEvent --- BatchAllocationEvent --- BatchCleanupEvent --- BlockAdditionEvent --- BlockAdditionEvent --- 'Driver Down Time' --- BlockAdditionEvent --- BlockAdditionEvent --- BatchAllocationEvent {code} When driver recovered from failure, it will replay all the existed metadata WAL to get the right status, in this situation, two BatchAdditionEvent before down will put into received block queue. After driver started, new incoming blocking will also put into this queue and a follow-up BlockAllocationEvent will allocate an allocatedBlocks with queue draining out. So old, not this batch's data will also mix into this batch, here is the partial log: {code} 15/01/13 17:19:10 INFO KafkaInputDStream: block store result for batch 142114075 ms 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,46704,480) 197757 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,47188,480) 197758 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,47672,480) 197759 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,48156,480) 197760 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,48640,480) 197761 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,49124,480) 197762 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,0,44184) 197763 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,44188,58536) 197764 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,102728,60168) 197765 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,162900,64584) 197766 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,227488,51240) {code} The old log /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 is obviously far older than current batch 
interval, and will fetch again to add to process. This issue is subtle, because in the previous code we never delete the old received data WAL. This will lead to unwanted result as I know. Basically because we miss some BlockAllocationEvent when recovered from failure. I think we need to correctly replay and insert all the events correctly. was: Spark Streaming will write all the event into WAL for driver recovery, the sequence in the WAL may be like this: {code} BlockAdditionEvent --- BlockAdditionEvent --- BlockAdditionEvent --- BatchAllocationEvent --- BatchCleanupEvent --- BlockAdditionEvent --- BlockAdditionEvent --- 'Driver Down Time' --- BlockAdditionEvent --- BlockAdditionEvent --- BatchAllocationEvent {code} When driver recovered from failure, it will replay all the existed metadata WAL to get the right status, in this situation, two BatchAdditionEvent before down will put into received block queue. After driver started, new incoming blocking will also put into this queue and a follow-up BlockAllocationEvent will allocate an allocatedBlocks with queue draining out. So old, not this
[jira] [Created] (SPARK-5236) parquet.io.ParquetDecodingException: Can not read value at 0 in block 0
Alex Baretta created SPARK-5236: --- Summary: parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 Key: SPARK-5236 URL: https://issues.apache.org/jira/browse/SPARK-5236 Project: Spark Issue Type: Bug Reporter: Alex Baretta 15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241) at org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375) at org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434) at parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237) at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353) at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194) ... 27 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5233) Error replay of WAL when recovered from driver failure
[ https://issues.apache.org/jira/browse/SPARK-5233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276513#comment-14276513 ] Apache Spark commented on SPARK-5233: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/4032 Error replay of WAL when recovered from driver failue - Key: SPARK-5233 URL: https://issues.apache.org/jira/browse/SPARK-5233 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Saisai Shao Spark Streaming will write all the event into WAL for driver recovery, the sequence in the WAL may be like this: {code} BlockAdditionEvent --- BlockAdditionEvent --- BlockAdditionEvent --- BatchAllocationEvent --- BatchCleanupEvent --- BlockAdditionEvent --- BlockAdditionEvent --- 'Driver Down Time' --- BlockAdditionEvent --- BlockAdditionEvent --- BatchAllocationEvent {code} When driver recovered from failure, it will replay all the existed metadata WAL to get the right status, in this situation, two BatchAdditionEvent before down will put into received block queue. After driver started, new incoming blocking will also put into this queue and a follow-up BlockAllocationEvent will allocate an allocatedBlocks with queue draining out. So old, not this batch's data will also mix into this batch, here is the partial log: {code} 15/01/13 17:19:10 INFO KafkaInputDStream: block store result for batch 142114075 ms 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,46704,480) 197757 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,47188,480) 197758 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,47672,480) 197759 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,48156,480) 197760 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,48640,480) 197761 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,49124,480) 197762 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,0,44184) 197763 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,44188,58536) 197764 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,102728,60168) 197765 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,162900,64584) 197766 
15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,227488,51240) {code} The old log /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 is obviously far older than current batch interval, and will fetch again to add to process. This issue is subtle, because in the previous code we never delete the old received data WAL. This will lead to unwanted result as I know. Basically because we miss some BlockAllocationEvent when recovered from failure. I think we need to correctly replay and insert all the events correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Created] (SPARK-5237) UDTF doesn't work on Spark SQL
Yi Zhou created SPARK-5237: -- Summary: UDTF doesn't work on Spark SQL Key: SPARK-5237 URL: https://issues.apache.org/jira/browse/SPARK-5237 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5233) Error replay of WAL when recovered from driver failure
[ https://issues.apache.org/jira/browse/SPARK-5233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-5233: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-5238 Error replay of WAL when recovered from driver failue - Key: SPARK-5233 URL: https://issues.apache.org/jira/browse/SPARK-5233 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.2.0 Reporter: Saisai Shao Spark Streaming will write all the event into WAL for driver recovery, the sequence in the WAL may be like this: {code} BlockAdditionEvent --- BlockAdditionEvent --- BlockAdditionEvent --- BatchAllocationEvent --- BatchCleanupEvent --- BlockAdditionEvent --- BlockAdditionEvent --- 'Driver Down Time' --- BlockAdditionEvent --- BlockAdditionEvent --- BatchAllocationEvent {code} When driver recovered from failure, it will replay all the existed metadata WAL to get the right status, in this situation, two BatchAdditionEvent before down will put into received block queue. After driver started, new incoming blocking will also put into this queue and a follow-up BlockAllocationEvent will allocate an allocatedBlocks with queue draining out. So old, not this batch's data will also mix into this batch, here is the partial log: {code} 15/01/13 17:19:10 INFO KafkaInputDStream: block store result for batch 142114075 ms 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,46704,480) 197757 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,47188,480) 197758 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,47672,480) 197759 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,48156,480) 197760 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,48640,480) 197761 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,49124,480) 197762 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,0,44184) 197763 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,44188,58536) 197764 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,102728,60168) 197765 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,162900,64584) 197766 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: 
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,227488,51240) {code} The old log /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 is obviously far older than current batch interval, and will fetch again to add to process. This issue is subtle, because in the previous code we never delete the old received data WAL. This will lead to unwanted result as I know. Basically because we miss some BlockAllocationEvent when recovered from failure. I think we need to correctly replay and insert all the events correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
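A hedged toy model of the queue behavior the SPARK-5233 description walks through (illustrative only, not the actual ReceivedBlockTracker code): replayed BlockAdditionEvents land in the same unallocated queue as fresh blocks, and the next BatchAllocationEvent drains the whole queue, so pre-failure blocks end up mixed into the first post-recovery batch.
{code}
import scala.collection.mutable

// Toy model: one queue of unallocated block IDs, drained wholesale per batch.
class ToyBlockTracker {
  private val unallocated = mutable.Queue[String]()

  // Applied both for fresh blocks and for BlockAdditionEvents replayed from the WAL.
  def onBlockAdded(blockId: String): Unit = unallocated.enqueue(blockId)

  // BatchAllocationEvent semantics: drain everything queued so far into this batch,
  // old and new alike, which is the mixing shown in the log excerpt above.
  def onBatchAllocated(batchTime: Long): Seq[String] = {
    val allocated = unallocated.toList
    unallocated.clear()
    allocated
  }
}
{code}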
[jira] [Updated] (SPARK-5237) UDTF doesn't work on Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-5237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Zhou updated SPARK-5237: --- Description: Hive query with UDTF don't work on Spark SQL 15/01/14 13:23:50 INFO ParseDriver: Parse Completed 15/01/14 13:23:50 WARN HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore. 15/01/14 13:23:50 INFO ParseDriver: Parsing command: INSERT INTO TABLE q10_spark_RUN_QUERY_0_result SELECT extract_sentiment(pr_item_sk,pr_review_content) AS (pr_item_sk, review_sentence, sentiment, sentiment_word) FROM product_reviews 15/01/14 13:23:50 INFO ParseDriver: Parse Completed 15/01/14 13:23:50 ERROR SparkSQLDriver: Failed in [ INSERT INTO TABLE ${hiveconf:RESULT_TABLE} SELECT extract_sentiment(pr_item_sk,pr_review_content) AS (pr_item_sk, review_sentence, sentiment, sentiment_word) FROM product_reviews ] java.lang.RuntimeException: Unsupported language features in query: INSERT INTO TABLE q10_spark_RUN_QUERY_0_result SELECT extract_sentiment(pr_item_sk,pr_review_content) AS (pr_item_sk, review_sentence, sentiment, sentiment_word) FROM product_reviews TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME product_reviews TOK_INSERT TOK_INSERT_INTO TOK_TAB TOK_TABNAME q10_spark_RUN_QUERY_0_result TOK_SELECT TOK_SELEXPR TOK_FUNCTION extract_sentiment TOK_TABLE_OR_COL pr_item_sk TOK_TABLE_OR_COL pr_review_content pr_item_sk review_sentence sentiment sentiment_word scala.NotImplementedError: No parse rules for: TOK_SELEXPR TOK_FUNCTION extract_sentiment TOK_TABLE_OR_COL pr_item_sk TOK_TABLE_OR_COL pr_review_content pr_item_sk review_sentence sentiment sentiment_word org.apache.spark.sql.hive.HiveQl$.selExprNodeToExpr(HiveQl.scala:862) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:251) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:133) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:133) at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at
[jira] [Updated] (SPARK-5238) Improve the robustness of Spark Streaming WAL mechanism
[ https://issues.apache.org/jira/browse/SPARK-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-5238: --- Description: Several issues have been identified in Spark Streaming's WAL mechanism; this is an umbrella ticket for all the related issues. Improve the robustness of Spark Streaming WAL mechanism --- Key: SPARK-5238 URL: https://issues.apache.org/jira/browse/SPARK-5238 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Saisai Shao Several issues have been identified in Spark Streaming's WAL mechanism; this is an umbrella ticket for all the related issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5238) Improve the robustness of Spark Streaming WAL mechanism
Saisai Shao created SPARK-5238: -- Summary: Improve the robustness of Spark Streaming WAL mechanism Key: SPARK-5238 URL: https://issues.apache.org/jira/browse/SPARK-5238 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Saisai Shao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
[ https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-5147: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-5238 write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called -- Key: SPARK-5147 URL: https://issues.apache.org/jira/browse/SPARK-5147 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.2.0 Reporter: Max Xu Priority: Blocker Hi all, We are running a Spark streaming application with ReliableKafkaReceiver. We have spark.streaming.receiver.writeAheadLog.enable set to true so write ahead logs (WALs) for received data are created under receivedData/streamId folder in the checkpoint directory. However, old WALs are never purged by time. receivedBlockMetadata and checkpoint files are purged correctly though. I went through the code, WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is responsible for cleaning up the old blocks. It has method cleanupOldBlocks, which is never called by any class. ReceiverSupervisorImpl class holds a WriteAheadLogBasedBlockHandler instance. However, it only calls storeBlock method to create WALs but never calls cleanupOldBlocks method to purge old WALs. The size of the WAL folder increases constantly on HDFS. This is preventing us from running the ReliableKafkaReceiver 24x7. Can somebody please take a look. Thanks, Max -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
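A hedged sketch of the missing wiring described in SPARK-5147 above (the trait and call site are hypothetical; only the storeBlock and cleanupOldBlocks method names come from the report): whatever clears old batch metadata should also tell the block handler to drop WAL segments older than a threshold time.
{code}
// Hypothetical shape of the handler; the two method names are taken from the report.
trait ReceivedBlockHandlerLike {
  def storeBlock(blockId: String, bytes: Array[Byte]): Unit
  def cleanupOldBlocks(threshTime: Long): Unit // exists today, but nothing ever calls it
}

// The kind of call site the report says is missing: invoke cleanup when old batches expire.
def onOldBatchesCleared(handler: ReceivedBlockHandlerLike, threshTime: Long): Unit = {
  handler.cleanupOldBlocks(threshTime)
}
{code}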
[jira] [Updated] (SPARK-5142) Possibly data may be ruined in Spark Streaming's WAL mechanism.
[ https://issues.apache.org/jira/browse/SPARK-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-5142: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-5238 Possibly data may be ruined in Spark Streaming's WAL mechanism. --- Key: SPARK-5142 URL: https://issues.apache.org/jira/browse/SPARK-5142 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.2.0 Reporter: Saisai Shao Currently, Spark Streaming's WAL manager retries writes to HDFS when a failure occurs. Because there is no transactional guarantee, previously partially written data is not rolled back and the retried data is appended after it; this corrupts the file and makes WriteAheadLogReader fail when reading the data back. First, I think this problem is hard to fix because HDFS supports neither a truncate operation (HDFS-3107) nor random writes at a specific offset. Second, when such a write exception occurs, I think it is better not to retry, since retrying corrupts the file and breaks reads. Sorry if I have misunderstood anything. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
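A hedged sketch of the failure mode described in SPARK-5142 (illustrative, not the actual WAL manager code): since HDFS offers neither truncate nor positional overwrite, retrying by appending leaves the earlier partial record in place, and a sequential reader later trips over it.
{code}
import java.io.{IOException, OutputStream}

// Naive retry-by-append, which is exactly what the report warns against.
def writeRecordWithRetry(out: OutputStream, record: Array[Byte], maxTries: Int = 3): Unit = {
  var tries = 0
  var done = false
  while (!done && tries < maxTries) {
    try {
      out.write(record) // a failure part-way through leaves partial bytes in the file
      out.flush()
      done = true
    } catch {
      case _: IOException =>
        tries += 1 // the retry then appends after those partial bytes; the log is now corrupt
    }
  }
}
{code}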
[jira] [Created] (SPARK-5239) JdbcRDD throws java.lang.AbstractMethodError: oracle.jdbc.driver.xxxxxx.isClosed()Z
Gankun Luo created SPARK-5239: - Summary: JdbcRDD throws java.lang.AbstractMethodError: oracle.jdbc.driver.xx.isClosed()Z Key: SPARK-5239 URL: https://issues.apache.org/jira/browse/SPARK-5239 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.1.1 Environment: centos6.4 + ojdbc14 Reporter: Gankun Luo Priority: Minor I tried to use JdbcRDD against a table in an Oracle database, but it failed. My test code is as follows:
{code}
import java.sql.DriverManager
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.SparkConf

object JdbcRDD4Oracle {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("JdbcRDD4Oracle").setMaster("local[2]"))
    val rdd = new JdbcRDD(sc, () => getConnection(), getSQL(), 12987, 13055, 3,
      r => { (r.getObject("HISTORY_ID"), r.getObject("APPROVE_OPINION")) })
    println(rdd.collect.toList)
    sc.stop()
  }

  def getConnection() = {
    Class.forName("oracle.jdbc.driver.OracleDriver").newInstance()
    DriverManager.getConnection("jdbc:oracle:thin:@hadoop000:1521/ORCL", "scott", "tiger")
  }

  def getSQL() = {
    "select HISTORY_ID, APPROVE_OPINION from CI_APPROVE_HISTORY WHERE HISTORY_ID >= ? AND HISTORY_ID <= ?"
  }
}
{code}
Running the example, I get the following exception: {code} 09:56:48,302 [Executor task launch worker-0] ERROR Logging$class : Error in TaskCompletionListener java.lang.AbstractMethodError: oracle.jdbc.driver.OracleResultSetImpl.isClosed()Z at org.apache.spark.rdd.JdbcRDD$$anon$1.close(JdbcRDD.scala:99) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:71) at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:71) at org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:85) at org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:110) at org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:108) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.markTaskCompleted(TaskContext.scala:108) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:64) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 09:56:48,302 [Executor task launch worker-1] ERROR Logging$class : Error in TaskCompletionListener {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
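Background on the error itself: ojdbc14 predates JDBC 4, so ResultSet.isClosed() is not implemented by the driver and resolves to an AbstractMethodError at runtime, which is what the TaskCompletionListener hits while closing the result set. Below is a hedged sketch of a defensive close that avoids isClosed() (one possible workaround, not necessarily what any eventual fix does):
{code}
import java.sql.ResultSet

// Close a ResultSet without calling the JDBC-4-only isClosed(), which old drivers
// such as ojdbc14 do not implement (hence the AbstractMethodError in the trace).
def closeQuietly(rs: ResultSet): Unit = {
  if (rs != null) {
    try rs.close()
    catch { case _: Exception => () } // best-effort cleanup on task completion
  }
}
{code}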
[jira] [Commented] (SPARK-5239) JdbcRDD throws java.lang.AbstractMethodError: oracle.jdbc.driver.xxxxxx.isClosed()Z
[ https://issues.apache.org/jira/browse/SPARK-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276540#comment-14276540 ] Apache Spark commented on SPARK-5239: - User 'luogankun' has created a pull request for this issue: https://github.com/apache/spark/pull/4033 JdbcRDD throws java.lang.AbstractMethodError: oracle.jdbc.driver.xx.isClosed()Z - Key: SPARK-5239 URL: https://issues.apache.org/jira/browse/SPARK-5239 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0 Environment: centos6.4 + ojdbc14 Reporter: Gankun Luo Priority: Minor I try use JdbcRDD to operate the table of Oracle database, but failed. My test code as follows: {code} import java.sql.DriverManager import org.apache.spark.SparkContext import org.apache.spark.rdd.JdbcRDD import org.apache.spark.SparkConf object JdbcRDD4Oracle { def main(args: Array[String]) { val sc = new SparkContext(new SparkConf().setAppName(JdbcRDD4Oracle).setMaster(local[2])) val rdd = new JdbcRDD(sc, () = getConnection, getSQL, 12987, 13055, 3, r = { (r.getObject(HISTORY_ID), r.getObject(APPROVE_OPINION)) }) println(rdd.collect.toList) sc.stop() } def getConnection() = { Class.forName(oracle.jdbc.driver.OracleDriver).newInstance() DriverManager.getConnection(jdbc:oracle:thin:@hadoop000:1521/ORCL, scott, tiger) } def getSQL() = { select HISTORY_ID,APPROVE_OPINION from CI_APPROVE_HISTORY WHERE HISTORY_ID =? AND HISTORY_ID =? } } {code} Run the example, I get the following exception: {code} 09:56:48,302 [Executor task launch worker-0] ERROR Logging$class : Error in TaskCompletionListener java.lang.AbstractMethodError: oracle.jdbc.driver.OracleResultSetImpl.isClosed()Z at org.apache.spark.rdd.JdbcRDD$$anon$1.close(JdbcRDD.scala:99) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:71) at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:71) at org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:85) at org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:110) at org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:108) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.markTaskCompleted(TaskContext.scala:108) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:64) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 09:56:48,302 [Executor task launch worker-1] ERROR Logging$class : Error in TaskCompletionListener {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
[ https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Baretta updated SPARK-5236: Summary: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt (was: parquet.io.ParquetDecodingException: Can not read value at 0 in block 0) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt - Key: SPARK-5236 URL: https://issues.apache.org/jira/browse/SPARK-5236 Project: Spark Issue Type: Bug Reporter: Alex Baretta 15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241) at org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375) at org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434) at parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237) at 
parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353) at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194) ... 27 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276558#comment-14276558 ] Apache Spark commented on SPARK-4923: - User 'rcsenkbeil' has created a pull request for this issue: https://github.com/apache/spark/pull/4034 Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Attachments: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it's in the dependency list of a few projects that extend its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5240) Adding `createDataSourceTable` interface to Catalog
wangfei created SPARK-5240: -- Summary: Adding `createDataSourceTable` interface to Catalog Key: SPARK-5240 URL: https://issues.apache.org/jira/browse/SPARK-5240 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei Adding `createDataSourceTable` interface to Catalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276562#comment-14276562 ] Chip Senkbeil commented on SPARK-4923: -- As the nice bot has stated, I created a pull request for this issue. I detailed why I marked each method/field public and provided Scaladocs for each of them to make the exposure of the REPL API a little nicer. As stated in the pull request, I only tackled Scala 2.10 for now as the Scala 2.11 version did not appear to be ready, although I could easily be mistaken. I merely glanced at SparkIMain and noticed that it did not have the class server declaration to ship the compiled class files, nor was it - or any of the other classes - in the org.apache.spark.repl package. Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Attachments: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it's in the dependency list of a few projects that extend its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5006) spark.port.maxRetries doesn't work
[ https://issues.apache.org/jira/browse/SPARK-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5006. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: WangTaoTheTonic Target Version/s: 1.3.0 spark.port.maxRetries doesn't work -- Key: SPARK-5006 URL: https://issues.apache.org/jira/browse/SPARK-5006 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.0 Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Fix For: 1.3.0 We normally configure spark.port.maxRetries in a properties file or in SparkConf, but Utils.scala reads it from SparkEnv's conf. SparkEnv is an object whose env needs to be set after the JVM is launched, and Utils.scala is also an object, so in most cases portMaxRetries gets the default value of 16. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
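A minimal sketch of the direction the report points at, assuming the setting is read from an explicitly passed SparkConf instead of from SparkEnv (which may not be initialized when a utility object is first loaded); this is illustrative and not necessarily the merged fix:
{code}
import org.apache.spark.SparkConf

object PortUtils {
  // Fall back to the documented default of 16 when spark.port.maxRetries is unset.
  def portMaxRetries(conf: SparkConf): Int =
    conf.getInt("spark.port.maxRetries", 16)
}

// Usage (illustrative): val retries = PortUtils.portMaxRetries(sc.getConf)
{code}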
[jira] [Commented] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters
[ https://issues.apache.org/jira/browse/SPARK-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275600#comment-14275600 ] Apache Spark commented on SPARK-3288: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/4020 All fields in TaskMetrics should be private and use getters/setters --- Key: SPARK-3288 URL: https://issues.apache.org/jira/browse/SPARK-3288 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Wendell Assignee: Dale Richardson Labels: starter This is particularly bad because we expose this as a developer API. Technically a library could create a TaskMetrics object and then change the values inside of it and pass it onto someone else. It can be written pretty compactly like below:
{code}
/**
 * Number of bytes written for the shuffle by this task
 */
@volatile private var _shuffleBytesWritten: Long = _
def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value
def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value
def shuffleBytesWritten = _shuffleBytesWritten
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
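For reference, a self-contained, hedged version of the pattern quoted in the issue; the class name is illustrative and this is not the actual TaskMetrics source:
{code}
class ExampleTaskMetrics {
  /** Number of bytes written for the shuffle by this task. */
  @volatile private var _shuffleBytesWritten: Long = _

  def shuffleBytesWritten: Long = _shuffleBytesWritten
  def incrementShuffleBytesWritten(value: Long): Unit = _shuffleBytesWritten += value
  def decrementShuffleBytesWritten(value: Long): Unit = _shuffleBytesWritten -= value
}
{code}
Callers then go through the accessors (e.g. metrics.incrementShuffleBytesWritten(bytes)) instead of mutating a public field directly.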
[jira] [Created] (SPARK-5223) Use pickle instead of MapConvert and ListConvert in MLlib Python API
Davies Liu created SPARK-5223: - Summary: Use pickle instead of MapConvert and ListConvert in MLlib Python API Key: SPARK-5223 URL: https://issues.apache.org/jira/browse/SPARK-5223 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Davies Liu Priority: Critical It will introduce problems if an object in a dict/list/tuple cannot be supported by py4j, such as Vector. Also, pickle may have better performance for larger objects (fewer RPCs). In cases where an object in a dict/list cannot be pickled (such as JavaObject), we should still use MapConvert/ListConvert. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4697) System properties should override environment variables
[ https://issues.apache.org/jira/browse/SPARK-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4697: - Affects Version/s: 1.0.0 System properties should override environment variables --- Key: SPARK-4697 URL: https://issues.apache.org/jira/browse/SPARK-4697 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: WangTaoTheTonic I found that some arguments in the yarn module take environment variables before system properties, while the latter override the former in the core module. This should be changed in org.apache.spark.deploy.yarn.ClientArguments and org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5219) Race condition in TaskSchedulerImpl and TaskSetManager
[ https://issues.apache.org/jira/browse/SPARK-5219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5219: - Assignee: Shixiong Zhu Race condition in TaskSchedulerImpl and TaskSetManager -- Key: SPARK-5219 URL: https://issues.apache.org/jira/browse/SPARK-5219 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu TaskSchedulerImpl.handleTaskGettingResult, TaskSetManager.canFetchMoreResults and TaskSetManager.abort will access variables which are used in multiple threads, but they don't use a correct lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5219) Race condition in TaskSchedulerImpl and TaskSetManager
[ https://issues.apache.org/jira/browse/SPARK-5219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5219: - Affects Version/s: 1.2.0 Race condition in TaskSchedulerImpl and TaskSetManager -- Key: SPARK-5219 URL: https://issues.apache.org/jira/browse/SPARK-5219 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Shixiong Zhu TaskSchedulerImpl.handleTaskGettingResult, TaskSetManager.canFetchMoreResults and TaskSetManager.abort will access variables which are used in multiple threads, but they don't use a correct lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
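As a hedged, generic illustration of the kind of change such a race usually needs, the sketch below guards all reads and writes of shared state with a single lock; the class and fields are made up for the example and are not the actual TaskSetManager code:
{code}
class FetchedResultTracker {
  private val lock = new Object
  private var totalResultSize: Long = 0L
  private var aborted: Boolean = false

  def recordResult(bytes: Long): Unit = lock.synchronized {
    totalResultSize += bytes
  }

  def canFetchMore(limit: Long): Boolean = lock.synchronized {
    !aborted && totalResultSize <= limit
  }

  def abort(): Unit = lock.synchronized {
    aborted = true
  }
}
{code}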
[jira] [Commented] (SPARK-3885) Provide mechanism to remove accumulators once they are no longer used
[ https://issues.apache.org/jira/browse/SPARK-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275607#comment-14275607 ] Apache Spark commented on SPARK-3885: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/4021 Provide mechanism to remove accumulators once they are no longer used - Key: SPARK-3885 URL: https://issues.apache.org/jira/browse/SPARK-3885 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: Josh Rosen Spark does not currently provide any mechanism to delete accumulators after they are no longer used. This can lead to OOMs for long-lived SparkContexts that create many large accumulators. Part of the problem is that accumulators are registered in a global {{Accumulators}} registry. Maybe the fix would be as simple as using weak references in the Accumulators registry so that accumulators can be GC'd once they can no longer be used. In the meantime, here's a workaround that users can try: Accumulators have a public setValue() method that can be called (only by the driver) to change an accumulator’s value. You might be able to use this to reset accumulators’ values to smaller objects (e.g. the “zero” object of whatever your accumulator type is, or ‘null’ if you’re sure that the accumulator will never be accessed again). This issue was originally reported by [~nkronenfeld] on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-Accumulator-question-td8709.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
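A small sketch of the workaround described in the issue, assuming a Spark 1.x API: accumulate into a collection, use the result, then reset the accumulator to its zero value from the driver so the global registry no longer pins the large payload. This shows the workaround only, not the proposed weak-reference fix.
{code}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorResetWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AccumulatorReset").setMaster("local[2]"))

    val bigAcc = sc.accumulableCollection(ArrayBuffer[String]())
    sc.parallelize(1 to 1000).foreach(i => bigAcc += s"record-$i")

    println(bigAcc.value.size)   // use the accumulated result on the driver...

    // ...then drop the large payload; only the driver may call setValue.
    bigAcc.setValue(ArrayBuffer[String]())

    sc.stop()
  }
}
{code}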
[jira] [Created] (SPARK-5222) YARN client and cluster modes have different app name behaviors
Andrew Or created SPARK-5222: Summary: YARN client and cluster modes have different app name behaviors Key: SPARK-5222 URL: https://issues.apache.org/jira/browse/SPARK-5222 Project: Spark Issue Type: Bug Reporter: Andrew Or Assignee: WangTaoTheTonic The behavior is summarized in a table produced by [~WangTaoTheTonic] here: https://github.com/apache/spark/pull/3557 SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. This results in the strange behavior where the app name changes if the user runs the same application but uses a different deploy mode from before. We should make sure the app name behavior is consistent across deploy modes regardless of what variable or config is set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5222) YARN client and cluster modes have different app name behaviors
[ https://issues.apache.org/jira/browse/SPARK-5222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5222: - Description: The behavior is summarized in a table produced by [~WangTaoTheTonic] here: https://github.com/apache/spark/pull/3557 SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. This results in the strange behavior where the app name changes if the user runs the same application but uses a different deploy mode from before. We should make sure the app name behavior is consistent across deploy modes regardless of what variable or config is set. Additionally, it should be noted that because spark.app.name is required of all applications, the setting of SPARK_YARN_APP_NAME will not take effect unless we handle it preemptively in Spark submit. was: The behavior is summarized in a table produced by [~WangTaoTheTonic] here: https://github.com/apache/spark/pull/3557 SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. This results in the strange behavior where the app name changes if the user runs the same application but uses a different deploy mode from before. We should make sure the app name behavior is consistent across deploy modes regardless of what variable or config is set. YARN client and cluster modes have different app name behaviors --- Key: SPARK-5222 URL: https://issues.apache.org/jira/browse/SPARK-5222 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: WangTaoTheTonic The behavior is summarized in a table produced by [~WangTaoTheTonic] here: https://github.com/apache/spark/pull/3557 SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. This results in the strange behavior where the app name changes if the user runs the same application but uses a different deploy mode from before. We should make sure the app name behavior is consistent across deploy modes regardless of what variable or config is set. Additionally, it should be noted that because spark.app.name is required of all applications, the setting of SPARK_YARN_APP_NAME will not take effect unless we handle it preemptively in Spark submit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
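One hedged sketch of what consistent resolution could look like, purely as an illustration of the precedence being argued for (spark.app.name first, then the environment variable, then a default); this is not what Spark submit currently does:
{code}
import org.apache.spark.SparkConf

object AppNameResolution {
  def resolveAppName(conf: SparkConf): String =
    conf.getOption("spark.app.name")
      .orElse(sys.env.get("SPARK_YARN_APP_NAME"))
      .getOrElse("Spark application")
}
{code}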
[jira] [Commented] (SPARK-5223) Use pickle instead of MapConvert and ListConvert in MLlib Python API
[ https://issues.apache.org/jira/browse/SPARK-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275671#comment-14275671 ] Apache Spark commented on SPARK-5223: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4023 Use pickle instead of MapConvert and ListConvert in MLlib Python API Key: SPARK-5223 URL: https://issues.apache.org/jira/browse/SPARK-5223 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Davies Liu Priority: Critical It will introduce problems if an object in a dict/list/tuple cannot be supported by py4j, such as Vector. Also, pickle may have better performance for larger objects (fewer RPCs). In cases where an object in a dict/list cannot be pickled (such as JavaObject), we should still use MapConvert/ListConvert. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4697) System properties should override environment variables
[ https://issues.apache.org/jira/browse/SPARK-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4697: - Assignee: WangTaoTheTonic System properties should override environment variables --- Key: SPARK-4697 URL: https://issues.apache.org/jira/browse/SPARK-4697 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic I found that some arguments in the yarn module take environment variables before system properties, while the latter override the former in the core module. This should be changed in org.apache.spark.deploy.yarn.ClientArguments and org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-733) Add documentation on use of accumulators in lazy transformation
[ https://issues.apache.org/jira/browse/SPARK-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275608#comment-14275608 ] Apache Spark commented on SPARK-733: User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/4022 Add documentation on use of accumulators in lazy transformation --- Key: SPARK-733 URL: https://issues.apache.org/jira/browse/SPARK-733 Project: Spark Issue Type: Bug Components: Documentation Reporter: Josh Rosen Accumulator updates are side-effects of RDD computations. Unlike RDDs, accumulators do not carry lineage that would allow them to be computed when their values are accessed on the master. This can lead to confusion when accumulators are used in lazy transformations like `map`:
{code}
val acc = sc.accumulator(0)
data.map { x => acc += x; f(x) }
// Here, acc is still 0 because no action has caused the `map` to be computed.
{code}
As far as I can tell, our documentation only includes examples of using accumulators in `foreach`, for which this problem does not occur. This pattern of using accumulators in map() occurs in Bagel and other Spark code found in the wild. It might be nice to document this behavior in the accumulators section of the Spark programming guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
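A small runnable illustration of the behavior described in the issue, assuming a local Spark 1.x setup (the numbers are only for the example): the accumulator stays at zero until an action forces the lazy map to run.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorLaziness {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AccumulatorLaziness").setMaster("local[2]"))
    val acc = sc.accumulator(0)
    val data = sc.parallelize(1 to 100)

    val mapped = data.map { x => acc += x; x * 2 }
    println(acc.value)   // still 0: no action has forced the map to run yet

    mapped.count()       // an action triggers the computation
    println(acc.value)   // now reflects the updates (5050 in this example)

    sc.stop()
  }
}
{code}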
[jira] [Updated] (SPARK-5222) YARN client and cluster modes have different app name behaviors
[ https://issues.apache.org/jira/browse/SPARK-5222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5222: - Affects Version/s: 1.0.0 YARN client and cluster modes have different app name behaviors --- Key: SPARK-5222 URL: https://issues.apache.org/jira/browse/SPARK-5222 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: WangTaoTheTonic The behavior is summarized in a table produced by [~WangTaoTheTonic] here: https://github.com/apache/spark/pull/3557 SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. This results in the strange behavior where the app name changes if the user runs the same application but uses a different deploy mode from before. We should make sure the app name behavior is consistent across deploy modes regardless of what variable or config is set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5008) Persistent HDFS does not recognize EBS Volumes
[ https://issues.apache.org/jira/browse/SPARK-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275586#comment-14275586 ] Brad Willard commented on SPARK-5008: - [~nchammas] I went ahead and created a cluster with this ./spark-ec2 -v 1.2.0 --wait 235 -k ... --copy-aws-credentials --hadoop-major-version 1 -z us-east-1c -s 2 -m c1.medium -t c1.medium launch spark-hdfs-bug --ebs-vol-size 10 --ebs-vol-type gp2 --ebs-vol-num 1 I updated core-site.xml and switched /vol to /vol0, ran copy-dir, and restarted via stop-all.sh and start-all.sh. That brings it up in a broken state. However, if I then modify core-site.xml back to /vol on the master and restart, it works correctly. So that's a partial solution. I assume this is because the master node doesn't get an EBS volume. Persistent HDFS does not recognize EBS Volumes -- Key: SPARK-5008 URL: https://issues.apache.org/jira/browse/SPARK-5008 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.0 Environment: 8 Node Cluster Generated from 1.2.0 spark-ec2 script. -m c3.2xlarge -t c3.8xlarge --ebs-vol-size 300 --ebs-vol-type gp2 --ebs-vol-num 1 Reporter: Brad Willard The cluster is built with correctly sized EBS volumes. It creates the volume at /dev/xvds and it is mounted to /vol0. However, when you start persistent HDFS with the start-all script, it starts but isn't correctly configured to use the EBS volume. I'm assuming some symlinks or expected mounts are not correctly configured. This has worked flawlessly on all previous versions of Spark. I have a stupid workaround of installing pssh and mucking with it by mounting it to /vol, which worked; however, it does not work between restarts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5222) YARN client and cluster modes have different app name behaviors
[ https://issues.apache.org/jira/browse/SPARK-5222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5222: - Component/s: YARN YARN client and cluster modes have different app name behaviors --- Key: SPARK-5222 URL: https://issues.apache.org/jira/browse/SPARK-5222 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: WangTaoTheTonic The behavior is summarized in a table produced by [~WangTaoTheTonic] here: https://github.com/apache/spark/pull/3557 SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. This results in the strange behavior where the app name changes if the user runs the same application but uses a different deploy mode from before. We should make sure the app name behavior is consistent across deploy modes regardless of what variable or config is set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4955: - Summary: Dynamic allocation doesn't work in YARN cluster mode (was: Executor does not get killed after configured interval.) Dynamic allocation doesn't work in YARN cluster mode Key: SPARK-4955 URL: https://issues.apache.org/jira/browse/SPARK-4955 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Chengxiang Li With executor dynamic scaling enabled, in yarn-cluster mode, after a query finishes and the spark.dynamicAllocation.executorIdleTimeout interval passes, the number of executors is not reduced to the configured minimum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4955: - Priority: Critical (was: Major) Dynamic allocation doesn't work in YARN cluster mode Key: SPARK-4955 URL: https://issues.apache.org/jira/browse/SPARK-4955 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Chengxiang Li Priority: Critical With executor dynamic scaling enabled, in yarn-cluster mode, after a query finishes and the spark.dynamicAllocation.executorIdleTimeout interval passes, the number of executors is not reduced to the configured minimum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4955: - Assignee: Lianhui Wang Dynamic allocation doesn't work in YARN cluster mode Key: SPARK-4955 URL: https://issues.apache.org/jira/browse/SPARK-4955 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Chengxiang Li Assignee: Lianhui Wang Priority: Critical With executor dynamic scaling enabled, in yarn-cluster mode, after a query finishes and the spark.dynamicAllocation.executorIdleTimeout interval passes, the number of executors is not reduced to the configured minimum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5242) ec2/spark_ec2.py launch does not work with VPC if no public DNS or IP is available
Vladimir Grigor created SPARK-5242: -- Summary: ec2/spark_ec2.py launch does not work with VPC if no public DNS or IP is available Key: SPARK-5242 URL: https://issues.apache.org/jira/browse/SPARK-5242 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor How to reproduce: a user starting a cluster in a VPC has to wait forever:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 --subnet-id=subnet-2571dd4d --zone=eu-west-1a launch SparkByScript
Setting up security groups...
Searching for existing cluster SparkByScript...
Spark AMI: ami-1ae0166d
Launching instances...
Launched 1 slaves in eu-west-1a, regid = r-e70c5502
Launched master in eu-west-1a, regid = r-bf0f565a
Waiting for cluster to enter 'ssh-ready' state..{forever}
{code}
The problem is that the current code wrongly assumes that a VPC instance has a public_dns_name or a public ip_address, when it is actually more common for a VPC instance to have only a private_ip_address. The bug is already fixed in my fork; I am going to submit a pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276572#comment-14276572 ] Florian Verhein commented on SPARK-3821: Thanks [~nchammas], that makes sense. Created #SPARK-5241. I'm not sure about the pre-built scenario, but am guessing e.g. http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-hadoop2.4.tgz != http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-cdh4.tgz. So perhaps the intent is that the spark-ec2 scripts only support cdh distributions... Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
[ https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276581#comment-14276581 ] Apache Spark commented on SPARK-5147: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/4037 write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called -- Key: SPARK-5147 URL: https://issues.apache.org/jira/browse/SPARK-5147 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.2.0 Reporter: Max Xu Priority: Blocker Hi all, We are running a Spark streaming application with ReliableKafkaReceiver. We have spark.streaming.receiver.writeAheadLog.enable set to true so write ahead logs (WALs) for received data are created under receivedData/streamId folder in the checkpoint directory. However, old WALs are never purged by time. receivedBlockMetadata and checkpoint files are purged correctly though. I went through the code, WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is responsible for cleaning up the old blocks. It has method cleanupOldBlocks, which is never called by any class. ReceiverSupervisorImpl class holds a WriteAheadLogBasedBlockHandler instance. However, it only calls storeBlock method to create WALs but never calls cleanupOldBlocks method to purge old WALs. The size of the WAL folder increases constantly on HDFS. This is preventing us from running the ReliableKafkaReceiver 24x7. Can somebody please take a look. Thanks, Max -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster
yuhao yang created SPARK-5243: - Summary: Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster Key: SPARK-5243 URL: https://issues.apache.org/jira/browse/SPARK-5243 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Minor Spark will hang when calling spark-submit under these conditions: 1. the cluster has only one worker, 2. driver memory + executor memory > worker memory, 3. deploy-mode = cluster. This usually happens to beginners during development. There should be some exit mechanism, or at least a warning message, in the output of spark-submit. I am preparing a PR for this case, and I would like to know your opinions on whether a fix is needed and what the better fix options are. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
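A hedged sketch of the kind of pre-submit check the report asks for, not the actual patch; the method and parameter names are made up for the illustration:
{code}
object SubmitMemoryCheck {
  // Warn when the requested driver plus executor memory cannot fit on a single worker,
  // since in cluster deploy mode the application may then never be scheduled.
  def checkMemoryFits(driverMemMb: Long, executorMemMb: Long, workerMemMb: Long): Unit = {
    if (driverMemMb + executorMemMb > workerMemMb) {
      System.err.println(
        s"Warning: driver memory ($driverMemMb MB) + executor memory ($executorMemMb MB) " +
        s"exceeds worker memory ($workerMemMb MB); the application may hang waiting for resources.")
    }
  }
}
{code}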
[jira] [Commented] (SPARK-5242) ec2/spark_ec2.py launch does not work with VPC if no public DNS or IP is available
[ https://issues.apache.org/jira/browse/SPARK-5242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276590#comment-14276590 ] Apache Spark commented on SPARK-5242: - User 'voukka' has created a pull request for this issue: https://github.com/apache/spark/pull/4038 ec2/spark_ec2.py launch does not work with VPC if no public DNS or IP is available --- Key: SPARK-5242 URL: https://issues.apache.org/jira/browse/SPARK-5242 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor Labels: easyfix How to reproduce: a user starting a cluster in a VPC has to wait forever:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 --subnet-id=subnet-2571dd4d --zone=eu-west-1a launch SparkByScript
Setting up security groups...
Searching for existing cluster SparkByScript...
Spark AMI: ami-1ae0166d
Launching instances...
Launched 1 slaves in eu-west-1a, regid = r-e70c5502
Launched master in eu-west-1a, regid = r-bf0f565a
Waiting for cluster to enter 'ssh-ready' state..{forever}
{code}
The problem is that the current code wrongly assumes that a VPC instance has a public_dns_name or a public ip_address, when it is actually more common for a VPC instance to have only a private_ip_address. The bug is already fixed in my fork; I am going to submit a pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
[ https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276594#comment-14276594 ] Apache Spark commented on SPARK-5236: - User 'alexbaretta' has created a pull request for this issue: https://github.com/apache/spark/pull/4039 java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt - Key: SPARK-5236 URL: https://issues.apache.org/jira/browse/SPARK-5236 Project: Spark Issue Type: Bug Reporter: Alex Baretta 15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241) at org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375) at org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434) at parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237) at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353) at 
parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194) ... 27 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5242) ec2/spark_ec2.py launch does not work with VPC if no public DNS or IP is available
[ https://issues.apache.org/jira/browse/SPARK-5242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276591#comment-14276591 ] Vladimir Grigor commented on SPARK-5242: This bug is fixed in https://github.com/apache/spark/pull/4038 ec2/spark_ec2.py launch does not work with VPC if no public DNS or IP is available --- Key: SPARK-5242 URL: https://issues.apache.org/jira/browse/SPARK-5242 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor Labels: easyfix How to reproduce: a user starting a cluster in a VPC has to wait forever:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 --subnet-id=subnet-2571dd4d --zone=eu-west-1a launch SparkByScript
Setting up security groups...
Searching for existing cluster SparkByScript...
Spark AMI: ami-1ae0166d
Launching instances...
Launched 1 slaves in eu-west-1a, regid = r-e70c5502
Launched master in eu-west-1a, regid = r-bf0f565a
Waiting for cluster to enter 'ssh-ready' state..{forever}
{code}
The problem is that the current code wrongly assumes that a VPC instance has a public_dns_name or a public ip_address, when it is actually more common for a VPC instance to have only a private_ip_address. The bug is already fixed in my fork; I am going to submit a pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5241) spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) dependencies correctly
Florian Verhein created SPARK-5241: -- Summary: spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) dependencies correctly Key: SPARK-5241 URL: https://issues.apache.org/jira/browse/SPARK-5241 Project: Spark Issue Type: Bug Components: Build, EC2 Reporter: Florian Verhein spark-ec2/spark/init.sh doesn't completely handle the hadoop dependencies. This may also be an issue for the tachyon dependencies. Related: tachyon appears to require builds against the right version of hadoop as well (probably causes this: SPARK-3185). Applies to the spark build from git checkout in spark/init.sh (I suspect this should also be changed to use mvn, as that's the reference build according to the docs?). May apply to pre-built spark in spark/init.sh as well, but I'm not sure about this. E.g. I thought that the hadoop2.4 and cdh4.2 builds of spark are different. Also note that hadoop native is built from hadoop 2.4.1 on the AMI, and this is used regardless of HADOOP_MAJOR_VERSION in the *-hdfs modules. Tachyon is hard coded to 0.4.1 (which is probably built against hadoop1.x?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5240) Adding `createDataSourceTable` interface to Catalog
[ https://issues.apache.org/jira/browse/SPARK-5240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276566#comment-14276566 ] Apache Spark commented on SPARK-5240: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/4036 Adding `createDataSourceTable` interface to Catalog --- Key: SPARK-5240 URL: https://issues.apache.org/jira/browse/SPARK-5240 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei Adding `createDataSourceTable` interface to Catalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode
[ https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276607#comment-14276607 ] Josh Devins commented on SPARK-5095: Nice one, gonna try and test it this week. Support launching multiple mesos executors in coarse grained mesos mode --- Key: SPARK-5095 URL: https://issues.apache.org/jira/browse/SPARK-5095 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to launch multiple spark executors. However, this becomes a problem when the JVM process launched is larger than an ideal size (30gb is the recommended value from databricks), which causes GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for spark to use, and these resources are still under the configured limit. This is also applicable when users want to specify the number of executors to be launched on each node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5224) parallelize list/ndarray is really slow
Davies Liu created SPARK-5224: - Summary: parallelize list/ndarray is really slow Key: SPARK-5224 URL: https://issues.apache.org/jira/browse/SPARK-5224 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Blocker The default batchSize changed to 0 (batching based on the size of the objects), but parallelize() still uses BatchedSerializer with batchSize=1. Also, BatchedSerializer does not work well with list and numpy.ndarray. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5224) parallelize list/ndarray is really slow
[ https://issues.apache.org/jira/browse/SPARK-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275711#comment-14275711 ] Apache Spark commented on SPARK-5224: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4024 parallelize list/ndarray is really slow --- Key: SPARK-5224 URL: https://issues.apache.org/jira/browse/SPARK-5224 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Blocker The default batchSize changed to 0 (batching based on the size of the objects), but parallelize() still uses BatchedSerializer with batchSize=1. Also, BatchedSerializer does not work well with list and numpy.ndarray. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5225) Support coalesced Input Metrics from different sources
Kostas Sakellis created SPARK-5225: -- Summary: Support coalesced Input Metrics from different sources Key: SPARK-5225 URL: https://issues.apache.org/jira/browse/SPARK-5225 Project: Spark Issue Type: Bug Reporter: Kostas Sakellis Currently, if a task reads data from more than one block and it is from different read methods, we ignore the second read method's bytes. For example:
{noformat}
    CoalescedRDD
         |
       Task1
      /  |  \
hadoop hadoop cached
{noformat}
If Task1 starts reading from the hadoop blocks first, then the input metrics for Task1 will only contain input metrics from the hadoop blocks and ignore the input metrics from cached blocks. We need to change the way we collect input metrics so that it is not a single value but rather a collection of input metrics for a task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2909) Indexing for SparseVector in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275741#comment-14275741 ] Apache Spark commented on SPARK-2909: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/4025 Indexing for SparseVector in pyspark Key: SPARK-2909 URL: https://issues.apache.org/jira/browse/SPARK-2909 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Joseph K. Bradley Priority: Minor SparseVector in pyspark does not currently support indexing, except by examining the internal representation. Though indexing is a pricy operation, it would be useful for, e.g., iterating through a dataset (RDD[LabeledPoint]) and operating on a single feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272291#comment-14272291 ] Zach Fry edited comment on SPARK-4879 at 1/13/15 7:53 PM: -- Hey Josh, I was able to reproduce the missing file using the speculation settings in my previous comment: {code} scala 15/01/09 18:33:28 WARN scheduler.TaskSetManager: Lost task 42.1 in stage 0.0 (TID 113, redacted-03): java.io.IOException: Failed to save output of task: attempt_201501091833__m_42_113 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160) org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132) org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) 15/01/09 18:33:47 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, redacted-03): java.io.IOException: The temporary job-output directory hdfs://redacted-01:8020/test2/_temporary doesn't exist! org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250) org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:240) org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116) org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:89) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:980) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} Notice here that there are only 99 part files and part-00042 is missing (as seen in the stacktrace above) {code} $ hadoop fs -ls /test2 | grep part | wc -l 99 $ hadoop fs -ls /test2 | grep part-0004 -rw-r--r-- 3 redacted supergroup 8 2015-01-09 18:33 /test2/part-00040 -rw-r--r-- 3 redacted supergroup 8 2015-01-09 18:33 /test2/part-00041 -rw-r--r-- 3 redacted supergroup 8 2015-01-09 18:33 /test2/part-00043 -rw-r--r-- 3 redacted supergroup 8 2015-01-09 18:33 /test2/part-00044 -rw-r--r-- 3 redacted supergroup 8 2015-01-09 18:33 /test2/part-00045 -rw-r--r-- 3 redacted supergroup 8 2015-01-09 18:33 /test2/part-00046 -rw-r--r-- 3 redacted supergroup 8 2015-01-09 18:33 /test2/part-00047 -rw-r--r-- 3 redacted supergroup 8 2015-01-09 18:33 /test2/part-00048 -rw-r--r-- 3 redacted supergroup 8 2015-01-09 18:33 /test2/part-00049 {code} was (Author: zfry): Hey Josh, I was able to reproduce the missing file using the speculation settings in my previous comment: {code} scala 15/01/09 18:33:28 WARN scheduler.TaskSetManager: Lost task 
42.1 in stage 0.0 (TID 113, redacted-03): java.io.IOException: Failed to save output of task: attempt_201501091833__m_42_113 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160) org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132) org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[jira] [Updated] (SPARK-5225) Support coalesced Input Metrics from different sources
[ https://issues.apache.org/jira/browse/SPARK-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kostas Sakellis updated SPARK-5225: --- Description: Currently, if a task reads data from more than one block and it is from different read methods, we ignore the second read method's bytes. For example:
{noformat}
    CoalescedRDD
         |
       Task1
      /  |  \
hadoop hadoop cached
{noformat}
If Task1 starts reading from the hadoop blocks first, then the input metrics for Task1 will only contain input metrics from the hadoop blocks and ignore the input metrics from cached blocks. We need to change the way we collect input metrics so that it is not a single value but rather a collection of input metrics for a task. was: Currently, If task reads data from more than one block and it is from different read methods we ignore the second read method bytes. For example: CoalescedRDD | Task1 / |\ / | \ hadoop hadoop cached if Task1 starts reading from the hadoop blocks first, then the input metrics for Task1 will only contain input metrics from the hadoop blocks and ignre the input metrics from cached blocks. We need to change the way we collect input metrics so that it is not a single value but rather a collection of input metrics for a task. Support coalesced Input Metrics from different sources - Key: SPARK-5225 URL: https://issues.apache.org/jira/browse/SPARK-5225 Project: Spark Issue Type: Bug Reporter: Kostas Sakellis Currently, if a task reads data from more than one block and it is from different read methods, we ignore the second read method's bytes. For example:
{noformat}
    CoalescedRDD
         |
       Task1
      /  |  \
hadoop hadoop cached
{noformat}
If Task1 starts reading from the hadoop blocks first, then the input metrics for Task1 will only contain input metrics from the hadoop blocks and ignore the input metrics from cached blocks. We need to change the way we collect input metrics so that it is not a single value but rather a collection of input metrics for a task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
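A hedged sketch of what per-source input metrics could look like, with made-up names rather than Spark's actual InputMetrics API: track bytes read per read method instead of keeping a single value.
{code}
import scala.collection.mutable

sealed trait ReadMethod
case object HadoopRead extends ReadMethod
case object MemoryRead extends ReadMethod

class TaskInputMetrics {
  // Bytes read per read method; a missing method defaults to 0.
  private val bytesByMethod = mutable.Map.empty[ReadMethod, Long].withDefaultValue(0L)

  def addBytesRead(method: ReadMethod, bytes: Long): Unit =
    bytesByMethod(method) += bytes

  def totalBytesRead: Long = bytesByMethod.values.sum
}
{code}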
[jira] [Commented] (SPARK-5211) Restore HiveMetastoreTypes.toDataType
[ https://issues.apache.org/jira/browse/SPARK-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275846#comment-14275846 ] Apache Spark commented on SPARK-5211: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4026 Restore HiveMetastoreTypes.toDataType - Key: SPARK-5211 URL: https://issues.apache.org/jira/browse/SPARK-5211 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Critical It was a public API. Since developers are using it, we need to get it back. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-3185: Description: {code} org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 {code} When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon exception is thrown when Formatting JOURNAL_FOLDER. No exception occurs when I launch on Hadoop 1. Launch used: {code} ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch sparkProd {code} {code} log snippet Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/ Exception in thread main java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73) at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53) at tachyon.UnderFileSystem.get(UnderFileSystem.java:53) at tachyon.Format.main(Format.java:54) Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at org.apache.hadoop.ipc.Client.call(Client.java:1070) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69) ... 3 more Killed 0 processes Killed 0 processes ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes ---end snippet--- {code} *I don't have this problem when I launch without the --hadoop-major-version=2 (which defaults to Hadoop 1.x).* was: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon exception is thrown when Formatting JOURNAL_FOLDER. No exception occurs when I launch on Hadoop 1. 
Launch used: ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch sparkProd log snippet Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/ Exception in thread main java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73) at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53) at tachyon.UnderFileSystem.get(UnderFileSystem.java:53) at tachyon.Format.main(Format.java:54) Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at org.apache.hadoop.ipc.Client.call(Client.java:1070) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276263#comment-14276263 ] Florian Verhein commented on SPARK-3821: This is great stuff! It'll also help serve as some documentation for AMI requirements when using the spark-ec2 scripts. Re the above, I think everything in create_image.sh can be refactored to packer (+ duplicate removal - e.g. root login). I've attempted to do this in a fork of [~nchammas]'s work, but my use case is a bit different in that I need to go from a fresh centos6 minimal (rather than an amazon linux AMI) and then add other things. Possibly related to AMI generation in general: I've noticed that the version dependencies in the spark-ec2 scripts are broken. I suspect this will need to be handled in both the image and the setup. For example: - It looks like Spark needs to be built with the right hadoop profile to work, but this isn't adhered to. This applies when spark is built from a git checkout or from an existing build. This is likely also the case with Tachyon too. Probably the cause of https://issues.apache.org/jira/browse/SPARK-3185 - The hadoop native libs are built on the image using 2.4.1, but then copied into whatever hadoop build is downloaded in the ephemeral-hdfs and persistent-hdfs scripts. I suspect that could cause issues too. Since building hadoop is very time consuming, it's something you'd wan't on the image - hence creating a dependency. - The version dependencies for other things like ganglia aren't documented (I believe this is installed on the image but duplicated again in spark-ec2/ganglia). I've found that the ganglia config doesn't work for me (but recall I'm using a different base AMI, so I'll likely get a different ganglia version). I have a sneaky suspicion that the hadoop configs in spark-ec2 won't work across the hadoop versions either (but, fingers crossed!). Re the above, I might try keeping the entire hadoop build (from the image creation) for the hdfs setup. Sorry for the sidetrack, but struggling though all this so hoping it might ring a bell for someone. p.s. With the image automation, it might also be worth considering putting more on the image as an option (esp for people happy to build their own AMIs). For example, I see no reason why the module init.sh scripts can't be run from packer in order to speed start-up times of the cluster :) Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5167) Move Row into sql package and make it usable for Java
[ https://issues.apache.org/jira/browse/SPARK-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276327#comment-14276327 ] Apache Spark commented on SPARK-5167: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4030 Move Row into sql package and make it usable for Java - Key: SPARK-5167 URL: https://issues.apache.org/jira/browse/SPARK-5167 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin This will help us eliminate the duplicated Java code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276380#comment-14276380 ] RJ Nowling edited comment on SPARK-4894 at 1/14/15 2:06 AM: Hi [~lmcguire] Always happy to have more help! :) I started looking through the Spark NB functions but I haven't started writing code yet. The docs for NB mention that using binary features will cause the multinomial NB to act like Bernoulli NB. I don't believe the documentation is correct, at least when smoothing is used since P(0) != 1 - P(1).I was planning on comparing the sklearn implementation with the Spark implementation and showing that the docs were wrong. Once verified, I think the changes will be very small to add a Bernoulli mode controlled by a flag in the constructor. I won't get to this until next week, though. If you have time now and want to tackle this, I'd be happy to hand it over to you and review any patches. (I'm not a committer, though -- [~mengxr] would have to sign off.)Otherwise, if you want to wait until I have a patch and test it, that could work, too. What do you think? was (Author: rnowling): Hi @lmcguire, Always happy to have more help! :) I started looking through the Spark NB functions but I haven't started writing code yet. The docs for NB mention that using binary features will cause the multinomial NB to act like Bernoulli NB. I don't believe the documentation is correct, at least when smoothing is used since P(0) != 1 - P(1).I was planning on comparing the sklearn implementation with the Spark implementation and showing that the docs were wrong. Once verified, I think the changes will be very small to add a Bernoulli mode controlled by a flag in the constructor. I won't get to this until next week, though. If you have time now and want to tackle this, I'd be happy to hand it over to you and review any patches. (I'm not a committer, though -- [~mengxr] would have to sign off.)Otherwise, if you want to wait until I have a patch and test it, that could work, too. What do you think? Add Bernoulli-variant of Naive Bayes Key: SPARK-4894 URL: https://issues.apache.org/jira/browse/SPARK-4894 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.1 Reporter: RJ Nowling MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli version of Naive Bayes is more useful for situations where the features are binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
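For reference, the distinction being discussed can be written out with the standard textbook formulations; this is general background, not the MLlib implementation.
{noformat}
Multinomial NB with Laplace smoothing (N_cj = count of feature j in class c,
N_c = total feature count in class c, |V| = vocabulary size):

    \hat{\theta}_{cj} = \frac{N_{cj} + 1}{N_c + |V|}

and an absent feature (x_j = 0) contributes nothing to the likelihood.
Bernoulli NB models absence explicitly (D_cj = documents in class c containing
feature j, D_c = documents in class c):

    \hat{p}_{cj} = \frac{D_{cj} + 1}{D_c + 2}, \qquad
    P(x_j \mid c) = \hat{p}_{cj}^{\,x_j} (1 - \hat{p}_{cj})^{1 - x_j}

so the probability assigned to x_j = 0 is 1 - \hat{p}_{cj}, which is not what
binarized multinomial NB computes -- matching the P(0) != 1 - P(1) point above.
{noformat}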
[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276278#comment-14276278 ] Cheng Lian commented on SPARK-4296: --- Yeah, I think whenever we use expressions that are not {{NamedExpression}}s in GROUP BY, this issue may be triggered, because an intermediate alias is introduced during the analysis phase. That's why I tried to fix all similar aliases in PR #3910 (but failed). Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0 Reporter: Shixiong Zhu Assignee: Cheng Lian Priority: Blocker When the input data has a complex structure, using the same expression in the group by clause and the select clause will throw "Expression not in GROUP BY". {code:java} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Birthday(date: String) case class Person(name: String, birthday: Birthday) val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28")))) people.registerTempTable("people") val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)") year.collect {code} Here is the plan of year: {code:java} SchemaRDD[3] at RDD at SchemaRDD.scala:105 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree: Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3] Subquery people LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36 {code} The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare Alias expressions and non-Alias expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
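One possible workaround, offered only as an untested sketch, is to project the grouped expression into a named column first, so that GROUP BY refers to an attribute rather than a raw expression and no intermediate alias needs to be introduced during analysis:
{code}
// Hypothetical workaround sketch: name the expression before grouping on it.
val upperDates = sqlContext.sql("SELECT upper(birthday.date) AS d FROM people")
upperDates.registerTempTable("people_upper")

sqlContext.sql("SELECT count(*), d FROM people_upper GROUP BY d").collect()
{code}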
[jira] [Created] (SPARK-5233) Error replay of WAL when recovered from driver failure
Saisai Shao created SPARK-5233: -- Summary: Error replay of WAL when recovered from driver failue Key: SPARK-5233 URL: https://issues.apache.org/jira/browse/SPARK-5233 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Saisai Shao Spark Streaming will write all the event into WAL for driver recovery, the sequence in the WAL may be like this: {code} BlockAdditionEvent --- BlockAdditionEvent --- BlockAdditionEvent --- BatchAllocationEvent --- BatchCleanupEvent --- BlockAdditionEvent --- BlockAdditionEvent --- 'Driver Down Time' --- BlockAdditionEvent --- BlockAdditionEvent --- BatchAllocationEvent {code} When driver recovered from failure, it will replay all the existed metadata WAL to get the right status, in this situation, two BatchAdditionEvent before down will put into received block queue. After driver started, new incoming blocking will also put into this queue and a follow-up BlockAllocationEvent will allocate an allocatedBlocks with queue draining out. So old, not this batch's data will also mix into this batch, here is the partial log: {code} 15/01/13 17:19:10 INFO KafkaInputDStream: block store result for batch 142114075 ms 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,46704,480) 197757 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,47188,480) 197758 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,47672,480) 197759 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,48156,480) 197760 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,48640,480) 197761 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 53201,49124,480) 197762 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,0,44184) 197763 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,44188,58536) 197764 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,102728,60168) 197765 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,162900,64584) 197766 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408 07074,227488,51240) {code} The old log 
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406 is obviously far older than the current batch interval, yet it is fetched again and added to the batch being processed. This issue is easy to miss, because the previous code never deleted the old received-data WAL. As far as I can tell it leads to incorrect results, essentially because we lose some BatchAllocationEvents when recovering from failure. I think we need to replay and apply all the events in the correct order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
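A rough sketch of the replay ordering the last paragraph argues for, using hypothetical event types that mirror the names in the description (this is not the actual ReceivedBlockTracker code): each BatchAllocationEvent should drain only the blocks added before it in the log, so pre-failure blocks are never re-assigned to the first batch after recovery.
{code}
sealed trait TrackerEvent
case class BlockAdditionEvent(blockId: String) extends TrackerEvent
case class BatchAllocationEvent(batchTime: Long) extends TrackerEvent
case class BatchCleanupEvent(batchTimes: Seq[Long]) extends TrackerEvent

// Replay events strictly in log order; allocation drains only earlier blocks.
def replay(events: Seq[TrackerEvent]): Map[Long, Seq[String]] = {
  val pending = scala.collection.mutable.Queue[String]()
  val allocated = scala.collection.mutable.Map[Long, Seq[String]]()
  events.foreach {
    case BlockAdditionEvent(id)     => pending.enqueue(id)
    case BatchAllocationEvent(time) => allocated(time) = pending.dequeueAll(_ => true)
    case BatchCleanupEvent(times)   => times.foreach(allocated.remove)
  }
  allocated.toMap
}
{code}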
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276411#comment-14276411 ] Nicholas Chammas commented on SPARK-3821: - Hi [~florianverhein] and thanks for chiming in! {quote} Re the above, I think everything in create_image.sh can be refactored to packer (+ duplicate removal - e.g. root login). {quote} Definitely. I'm hoping to make as few changes as possible to the existing {{create_image.sh}} script to reduce the review burden, but after this initial proposal is accepted it makes sense to refactor these scripts. There is some related work proposed in [SPARK-5189]. Some of the things you call out regarding version mismatches and whatnot sound like they might merit their own JIRA issues. For example: {quote} It looks like Spark needs to be built with the right hadoop profile to work, but this isn't adhered to. {quote} I haven't tested this out, but from the Spark init script, it looks like the correct version of Spark is used in [the pre-built scenario|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/init.sh#L109]. Not so in the [build-from-git scenario|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/init.sh#L21], so nice catch. Could you file a JIRA issue for that? {quote} For example, I see no reason why the module init.sh scripts can't be run from packer in order to speed start-up times of the cluster {quote} Regarding this and other ideas regarding pre-baking more on the images, [that's how this proposal started, actually|https://github.com/nchammas/spark-ec2/blob/9c28878694171ba085a10acd4405c702397d28ce/packer/README.md#base-vs-spark-pre-installed] (here's the [original Packer template|https://github.com/nchammas/spark-ec2/blob/9c28878694171ba085a10acd4405c702397d28ce/packer/spark-packer.json#L118-L133]). We decided to rip that out to reduce the complexity of the initial proposal and make it easier to specify different versions of Spark and Hadoop at launch time. Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5213) Support the SQL Parser Registry
[ https://issues.apache.org/jira/browse/SPARK-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274908#comment-14274908 ] Apache Spark commented on SPARK-5213: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/4015 Support the SQL Parser Registry --- Key: SPARK-5213 URL: https://issues.apache.org/jira/browse/SPARK-5213 Project: Spark Issue Type: New Feature Components: SQL Reporter: Cheng Hao Currently, the SQL Parser dialect is hard-coded in SQLContext, which is not easy to extend; we need to provide a SQL Parser Dialect Factory util to manage the dialects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5213) Support the SQL Parser Registry
Cheng Hao created SPARK-5213: Summary: Support the SQL Parser Registry Key: SPARK-5213 URL: https://issues.apache.org/jira/browse/SPARK-5213 Project: Spark Issue Type: New Feature Components: SQL Reporter: Cheng Hao Currently, the SQL Parser dialect is hard-coded in SQLContext, which is not easy to extend; we need to provide a SQL Parser Dialect Factory util to manage the dialects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
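A minimal sketch of what such a registry might look like; the names here are illustrative assumptions, not the design in the linked pull request.
{code}
// Hypothetical parser-dialect registry, not the actual Spark SQL API.
trait ParserDialect {
  def parse(sql: String): AnyRef // would return a LogicalPlan in Spark SQL
}

object ParserDialectRegistry {
  private val dialects =
    new java.util.concurrent.ConcurrentHashMap[String, ParserDialect]()

  def register(name: String, dialect: ParserDialect): Unit = dialects.put(name, dialect)

  def lookup(name: String): Option[ParserDialect] = Option(dialects.get(name))
}
{code}
SQLContext could then look the active dialect up by name instead of hard-coding one parser.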
[jira] [Commented] (SPARK-1507) Spark on Yarn: Add support for user to specify # cores for ApplicationMaster
[ https://issues.apache.org/jira/browse/SPARK-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275244#comment-14275244 ] Apache Spark commented on SPARK-1507: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/4018 Spark on Yarn: Add support for user to specify # cores for ApplicationMaster Key: SPARK-1507 URL: https://issues.apache.org/jira/browse/SPARK-1507 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.0 Reporter: Thomas Graves Now that Hadoop 2.x can schedule cores as a resource we should allow the user to specify the # of cores for the ApplicationMaster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5219) Race condition in TaskSchedulerImpl and TaskSetManager
Shixiong Zhu created SPARK-5219: --- Summary: Race condition in TaskSchedulerImpl and TaskSetManager Key: SPARK-5219 URL: https://issues.apache.org/jira/browse/SPARK-5219 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu TaskSchedulerImpl.handleTaskGettingResult, TaskSetManager.canFetchMoreResults and TaskSetManager.abort will access variables which are used in multiple threads, but they don't use a correct lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
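The general pattern the fix needs is unremarkable but easy to get wrong; a generic sketch follows (illustrative only, not the actual scheduler code): every read and write of the shared state goes through the same lock, so callers never observe a partially updated value.
{code}
// Illustrative sketch of consistently locked shared counters.
class ResultSizeTracker(maxResultSize: Long) {
  private val lock = new Object
  private var totalResultSize = 0L
  private var calculatedTasks = 0

  def addResult(size: Long): Unit = lock.synchronized {
    totalResultSize += size
    calculatedTasks += 1
  }

  def canFetchMoreResults(nextSize: Long): Boolean = lock.synchronized {
    totalResultSize + nextSize <= maxResultSize
  }
}
{code}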
[jira] [Commented] (SPARK-5219) Race condition in TaskSchedulerImpl and TaskSetManager
[ https://issues.apache.org/jira/browse/SPARK-5219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275284#comment-14275284 ] Apache Spark commented on SPARK-5219: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/4019 Race condition in TaskSchedulerImpl and TaskSetManager -- Key: SPARK-5219 URL: https://issues.apache.org/jira/browse/SPARK-5219 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu TaskSchedulerImpl.handleTaskGettingResult, TaskSetManager.canFetchMoreResults and TaskSetManager.abort will access variables which are used in multiple threads, but they don't use a correct lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5215) concat support in sqlcontext
Adrian Wang created SPARK-5215: -- Summary: concat support in sqlcontext Key: SPARK-5215 URL: https://issues.apache.org/jira/browse/SPARK-5215 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Define concat following the rules in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
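Until a built-in concat lands, behavior close to Hive's (NULL in, NULL out) can be approximated with a registered UDF; this is a hedged sketch that assumes the 1.x registerFunction API and a hypothetical "people" table.
{code}
// Two-argument stand-in for Hive's concat, registered as a UDF.
sqlContext.registerFunction("concat2", (a: String, b: String) =>
  if (a == null || b == null) null else a + b)

sqlContext.sql("SELECT concat2(first_name, last_name) FROM people").collect()
{code}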
[jira] [Commented] (SPARK-5215) concat support in sqlcontext
[ https://issues.apache.org/jira/browse/SPARK-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275011#comment-14275011 ] Apache Spark commented on SPARK-5215: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/4017 concat support in sqlcontext Key: SPARK-5215 URL: https://issues.apache.org/jira/browse/SPARK-5215 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Define concat following the rules in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5218) Report per stage remaining time estimate for each stage.
Prashant Sharma created SPARK-5218: -- Summary: Report per stage remaining time estimate for each stage. Key: SPARK-5218 URL: https://issues.apache.org/jira/browse/SPARK-5218 Project: Spark Issue Type: Sub-task Components: Spark Core, Web UI Affects Versions: 1.3.0 Reporter: Prashant Sharma Assignee: Prashant Sharma -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5217) Spark UI should report waiting stages during job execution.
[ https://issues.apache.org/jira/browse/SPARK-5217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-5217: --- Attachment: waiting_stages.png Spark UI should report waiting stages during job execution. --- Key: SPARK-5217 URL: https://issues.apache.org/jira/browse/SPARK-5217 Project: Spark Issue Type: Sub-task Components: Spark Core, Web UI Affects Versions: 1.3.0 Reporter: Prashant Sharma Assignee: Prashant Sharma This is a first step. The Spark listener already reports all the stages at the time of job submission, of which we currently show only active, failed and completed. This addition has no overhead and seems straightforward to achieve. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275014#comment-14275014 ] Shixiong Zhu commented on SPARK-5124: - For 1) I prefer to finish it before this JIRA. See [#4016|https://github.com/apache/spark/pull/4016] For 2), I will write some prototype codes to see if the current API design is sufficient. Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
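For readers following along, a very small facade in the spirit of the discussion above; this is purely illustrative, and the real interface is whatever the attached design doc and prototype PRs converge on.
{code}
// Illustrative RPC abstraction sketch, not the actual Spark interface.
trait RpcEndpointRef {
  def send(message: Any): Unit
}

trait RpcEndpoint {
  def receive(message: Any, sender: RpcEndpointRef): Unit
}

trait RpcEnv {
  def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef
  def endpointRef(name: String): Option[RpcEndpointRef]
  def shutdown(): Unit
}
{code}
An Akka-backed implementation and a purely in-process test implementation could both sit behind traits of this shape, which is what makes the standardization useful for testing.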
[jira] [Created] (SPARK-5214) Add EventLoop and change DAGScheduler to an EventLoop
Shixiong Zhu created SPARK-5214: --- Summary: Add EventLoop and change DAGScheduler to an EventLoop Key: SPARK-5214 URL: https://issues.apache.org/jira/browse/SPARK-5214 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu As per discussion in SPARK-5124, DAGScheduler can simply use a queue event loop to process events. It would be great when we want to decouple Akka in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
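A compact sketch of a queue-based event loop of the kind proposed here, with illustrative names; the concrete implementation is in the pull request linked below.
{code}
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicBoolean

// Single daemon thread draining a queue; callers only ever call post().
abstract class EventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  private val stopped = new AtomicBoolean(false)
  private val thread = new Thread(name) {
    override def run(): Unit =
      try {
        while (!stopped.get) onReceive(queue.take())
      } catch {
        case _: InterruptedException => // stop() was called
      }
  }
  thread.setDaemon(true)

  def start(): Unit = thread.start()
  def stop(): Unit = { stopped.set(true); thread.interrupt() }
  def post(event: E): Unit = queue.put(event)

  protected def onReceive(event: E): Unit
}
{code}
DAGScheduler would then post scheduler events to such a loop instead of sending them through an Akka actor.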
[jira] [Updated] (SPARK-5216) Spark Ui should report estimated time remaining for each stage.
[ https://issues.apache.org/jira/browse/SPARK-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-5216: --- Description: Per-stage feedback on estimated remaining time can help users get a grasp on how much time the job is going to take. This will only require changes on the UI/JobProgressListener side of the code, since we already have most of the information needed. In the initial cut, the plan is to estimate time based on statistics of the running job, i.e. the average time taken by each task and the number of tasks per stage. This makes sense when jobs are long. If that works out, more heuristics can be added, like the projected time saved if the RDD is cached, and so on. More precise details will come as this evolves. In the meantime, thoughts on alternate approaches and suggestions on usefulness are welcome. Spark Ui should report estimated time remaining for each stage. --- Key: SPARK-5216 URL: https://issues.apache.org/jira/browse/SPARK-5216 Project: Spark Issue Type: Wish Components: Spark Core, Web UI Affects Versions: 1.3.0 Reporter: Prashant Sharma Assignee: Prashant Sharma Per-stage feedback on estimated remaining time can help users get a grasp on how much time the job is going to take. This will only require changes on the UI/JobProgressListener side of the code, since we already have most of the information needed. In the initial cut, the plan is to estimate time based on statistics of the running job, i.e. the average time taken by each task and the number of tasks per stage. This makes sense when jobs are long. If that works out, more heuristics can be added, like the projected time saved if the RDD is cached, and so on. More precise details will come as this evolves. In the meantime, thoughts on alternate approaches and suggestions on usefulness are welcome. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
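The heuristic described above amounts to roughly the following sketch (hypothetical names, no claim about where this would live in the listener code):
{code}
// Estimate remaining time for a stage from the tasks finished so far.
def estimatedRemainingMs(finishedTaskDurationsMs: Seq[Long], totalTasks: Int): Option[Long] =
  if (finishedTaskDurationsMs.isEmpty) None // nothing finished yet, no estimate
  else {
    val avgMs = finishedTaskDurationsMs.sum.toDouble / finishedTaskDurationsMs.size
    val tasksLeft = totalTasks - finishedTaskDurationsMs.size
    Some(math.max(0L, (avgMs * tasksLeft).toLong))
  }
{code}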
[jira] [Commented] (SPARK-5214) Add EventLoop and change DAGScheduler to an EventLoop
[ https://issues.apache.org/jira/browse/SPARK-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275007#comment-14275007 ] Apache Spark commented on SPARK-5214: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/4016 Add EventLoop and change DAGScheduler to an EventLoop - Key: SPARK-5214 URL: https://issues.apache.org/jira/browse/SPARK-5214 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu As per discussion in SPARK-5124, DAGScheduler can simply use a queue event loop to process events. It would be great when we want to decouple Akka in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5216) Spark Ui should report estimated time remaining for each stage.
Prashant Sharma created SPARK-5216: -- Summary: Spark Ui should report estimated time remaining for each stage. Key: SPARK-5216 URL: https://issues.apache.org/jira/browse/SPARK-5216 Project: Spark Issue Type: Wish Components: Spark Core, Web UI Affects Versions: 1.3.0 Reporter: Prashant Sharma Assignee: Prashant Sharma -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5220) keepPushingBlocks in BlockGenerator terminated when an exception occurs, which causes the block pushing thread to terminate and blocks receiver
Max Xu created SPARK-5220: - Summary: keepPushingBlocks in BlockGenerator terminated when an exception occurs, which causes the block pushing thread to terminate and blocks receiver Key: SPARK-5220 URL: https://issues.apache.org/jira/browse/SPARK-5220 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Max Xu I am running a Spark streaming application with ReliableKafkaReceiver. It uses BlockGenerator to push blocks to BlockManager. However, writing WALs to HDFS may time out that causes keepPushingBlocks in BlockGenerator to terminate. 15/01/12 19:07:06 ERROR receiver.BlockGenerator: Error in block pushing thread java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:176) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:160) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:126) at org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:124) at org.apache.spark.streaming.kafka.ReliableKafkaReceiver.org$apache$spark$streaming$kafka$ReliableKafkaReceiver$$storeBlockAndCommitOffset(ReliableKafkaReceiver.scala:207) at org.apache.spark.streaming.kafka.ReliableKafkaReceiver$GeneratedBlockHandler.onPushBlock(ReliableKafkaReceiver.scala:275) at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:181) at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:154) at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:86) Then the block pushing thread is done and no subsequent blocks can be pushed into blockManager. In turn this blocks receiver from receiving new data. So when running my app and the TimeoutException happens, the ReliableKafkaReceiver stays in ACTIVE status but doesn't do anything at all. The application rogues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
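For context, the failure mode is that a single exception kills the pushing loop for good. A generic sketch of a loop that reports per-block failures instead of dying follows; it is illustrative only and not the BlockGenerator code.
{code}
import scala.util.control.NonFatal

// Keep draining blocks even if pushing an individual block fails; report the
// error so the receiver supervisor can decide whether to restart.
def keepPushingBlocks(nextBlock: () => Option[Array[Byte]],
                      pushBlock: Array[Byte] => Unit,
                      reportError: Throwable => Unit): Unit = {
  var running = true
  while (running) {
    nextBlock() match {
      case Some(block) =>
        try pushBlock(block)
        catch { case NonFatal(e) => reportError(e) }
      case None =>
        running = false
    }
  }
}
{code}
Whether to retry, drop, or restart the receiver on a WAL timeout is a separate policy decision; the sketch only shows keeping the thread alive.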
[jira] [Commented] (SPARK-4794) Wrong parse of GROUP BY query
[ https://issues.apache.org/jira/browse/SPARK-4794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275464#comment-14275464 ] Damien Carol commented on SPARK-4794: - [~marmbrus] Sorry for the late answer. For the record, I'm testing this query every commit (trunk branch on git) without sucess since I created this ticket. Here the details (EXPLAIN query) : {noformat} explain [...] {noformat} {noformat} == Physical Plan == Project [Annee#3676,Mois#3677,Jour#3678,Heure#3679,Societe#3680,Magasin#3681,CF Presentee#3682,CompteCarteFidelite#3683,NbCompteCarteFidelite#3684,DetentionCF#3685,NbCarteFidelite#3686,PlageDUCB#3687,NbCheque#3688L,CACheque#3689,NbImpaye#3690,NbEnsemble#3691L,NbCompte#3692,ResteDuImpaye#3693] !Sort [annee#3695 ASC,mois#3696 ASC,jour#3697 ASC,heure#3698 ASC,nom_societe#3699 ASC,id_magasin#3700 ASC,CarteFidelitePresentee#3702 ASC,CompteCarteFidelite#3705 ASC,NbCompteCarteFidelite#3706 ASC,DetentionCF#3703 ASC,NbCarteFidelite#3704 ASC,Id_CF_Dim_DUCB#3707 ASC], true !Exchange (RangePartitioning [annee#3695 ASC,mois#3696 ASC,jour#3697 ASC,heure#3698 ASC,nom_societe#3699 ASC,id_magasin#3700 ASC,CarteFidelitePresentee#3702 ASC,CompteCarteFidelite#3705 ASC,NbCompteCarteFidelite#3706 ASC,DetentionCF#3703 ASC,NbCarteFidelite#3704 ASC,Id_CF_Dim_DUCB#3707 ASC], 200) !OutputFaker [Annee#3676,Mois#3677,Jour#3678,Heure#3679,Societe#3680,Magasin#3681,CF Presentee#3682,CompteCarteFidelite#3683,NbCompteCarteFidelite#3684,DetentionCF#3685,NbCarteFidelite#3686,PlageDUCB#3687,NbCheque#3688L,CACheque#3689,NbImpaye#3690,NbEnsemble#3691L,NbCompte#3692,ResteDuImpaye#3693,Mois#3677,Annee#3676,Jour#3678,id_magasin#3700,DetentionCF#3685,CompteCarteFidelite#3683,nom_societe#3699,NbCarteFidelite#3686,NbCompteCarteFidelite#3684,CarteFidelitePresentee#3702,Id_CF_Dim_DUCB#3707,Heure#3679] Project [annee#3715 AS Annee#3676,mois#3716 AS Mois#3677,jour#3717 AS Jour#3678,heure#3718 AS Heure#3679,nom_societe#3719 AS Societe#3680,id_magasin#3720 AS Magasin#3681,CarteFidelitePresentee#3722 AS CF Presentee#3682,CompteCarteFidelite#3725 AS CompteCarteFidelite#3683,NbCompteCarteFidelite#3726 AS NbCompteCarteFidelite#3684,DetentionCF#3723 AS DetentionCF#3685,NbCarteFidelite#3724 AS NbCarteFidelite#3686,Id_CF_Dim_DUCB#3727 AS PlageDUCB#3687,NbCheque#3729L AS NbCheque#3688L,CACheque#3730 AS CACheque#3689,NbImpaye#3731 AS NbImpaye#3690,Id_Ensemble#3732L AS NbEnsemble#3691L,ZIBZIN#3734 AS NbCompte#3692,ResteDuImpaye#3733 AS ResteDuImpaye#3693,mois#3716,annee#3715,jour#3717,id_magasin#3720,DetentionCF#3723,CompteCarteFidelite#3725,nom_societe#3719,NbCarteFidelite#3724,NbCompteCarteFidelite#3726,CarteFidelitePresentee#3722,Id_CF_Dim_DUCB#3727,heure#3718] Filter annee#3715 = 2014) (mois#3716 = 1)) (jour#3717 = 25)) (id_magasin#3720 = 649)) ParquetTableScan [Id_CF_Dim_DUCB#3727,ResteDuImpaye#3733,NbCarteFidelite#3724,heure#3718,mois#3716,CompteCarteFidelite#3725,annee#3715,CarteFidelitePresentee#3722,CACheque#3730,NbImpaye#3731,ZIBZIN#3734,NbCompteCarteFidelite#3726,DetentionCF#3723,id_magasin#3720,nom_societe#3719,Id_Ensemble#3732L,jour#3717,NbCheque#3729L], (ParquetRelation hdfs://nc-h07/user/hive/warehouse/testsimon3.db/cf_encaissement_fact_pq, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@7db3bcc, []), [] {noformat} Complete stack trace : {noformat} 15/01/13 17:10:32 INFO SparkExecuteStatementOperation: Running query ' CACHE TABLE 
A_1421165432909 AS select `cf_encaissement_fact_pq`.`annee` as `Annee`, `cf_encaissement_fact_pq`.`mois` as `Mois`, `cf_encaissement_fact_pq`.`jour` as `Jour`, `cf_encaissement_fact_pq`.`heure` as `Heure`, `cf_encaissement_fact_pq`.`nom_societe` as `Societe`, `cf_encaissement_fact_pq`.`id_magasin` as `Magasin`, `cf_encaissement_fact_pq`.`CarteFidelitePresentee` as `CF Presentee`, `cf_encaissement_fact_pq`.`CompteCarteFidelite` as `CompteCarteFidelite`, `cf_encaissement_fact_pq`.`NbCompteCarteFidelite` as `NbCompteCarteFidelite`, `cf_encaissement_fact_pq`.`DetentionCF` as `DetentionCF`, `cf_encaissement_fact_pq`.`NbCarteFidelite` as `NbCarteFidelite`, `cf_encaissement_fact_pq`.`Id_CF_Dim_DUCB` as `PlageDUCB`, `cf_encaissement_fact_pq`.`NbCheque` as `NbCheque`, `cf_encaissement_fact_pq`.`CACheque` as `CACheque`, `cf_encaissement_fact_pq`.`NbImpaye` as `NbImpaye`, `cf_encaissement_fact_pq`.`Id_Ensemble` as `NbEnsemble`, `cf_encaissement_fact_pq`.`ZIBZIN` as `NbCompte`, `cf_encaissement_fact_pq`.`ResteDuImpaye` as `ResteDuImpaye` from `testsimon3`.`cf_encaissement_fact_pq` as `cf_encaissement_fact_pq` where `cf_encaissement_fact_pq`.`annee` = 2014 and `cf_encaissement_fact_pq`.`mois` = 1 and `cf_encaissement_fact_pq`.`jour` = 25 and
[jira] [Created] (SPARK-5221) FileInputDStream remember window in certain situations causes files to be ignored
Jem Tucker created SPARK-5221: - Summary: FileInputDStream remember window in certain situations causes files to be ignored Key: SPARK-5221 URL: https://issues.apache.org/jira/browse/SPARK-5221 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0, 1.1.1 Reporter: Jem Tucker Priority: Minor When batch times are greater than 1 minute, if a file begins to be moved into a directory just before FileInputDStream.findNewFiles() is called but does not become visible until after it has executed, and therefore is not included in that batch, the file is then ignored in the following batch because its mod time is less than the modTimeIgnoreThreshold. This causes data to be ignored in Spark Streaming that shouldn't be, especially when large files are being moved into the directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
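A simplified sketch of the timing hole and one possible mitigation: accept files slightly below the ignore threshold (a grace period) and keep a remembered-files set so nothing is double-processed. The names are hypothetical, not the FileInputDStream internals.
{code}
// listFiles returns (path, modificationTime) pairs for the monitored directory.
def findNewFiles(listFiles: () => Seq[(String, Long)],
                 ignoreThresholdMs: Long,
                 graceMs: Long,
                 alreadySelected: scala.collection.mutable.Set[String]): Seq[String] = {
  val fresh = listFiles().filter { case (path, modTime) =>
    // Accept files slightly older than the threshold, but only if they have
    // not been selected in a previous batch.
    modTime >= ignoreThresholdMs - graceMs && !alreadySelected.contains(path)
  }
  alreadySelected ++= fresh.map(_._1)
  fresh.map(_._1)
}
{code}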
[jira] [Commented] (SPARK-4796) Spark does not remove temp files
[ https://issues.apache.org/jira/browse/SPARK-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275494#comment-14275494 ] Fabian Gebert commented on SPARK-4796: -- suffering from this issue as well and can't see any workaround Spark does not remove temp files Key: SPARK-4796 URL: https://issues.apache.org/jira/browse/SPARK-4796 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.1.0 Environment: I'm runnin spark on mesos and mesos slaves are docker containers. Spark 1.1.0, elasticsearch spark 2.1.0-Beta3, mesos 0.20.0, docker 1.2.0. Reporter: Ian Babrou I started a job that cannot fill into memory and got no space left on device. That was fair, because docker containers only have 10gb of disk space and some is taken by OS already. But then I found out when job failed it didn't release any disk space and left container without any free disk space. Then I decided to check if spark removes temp files in any case, because many mesos slaves had /tmp/spark-local-*. Apparently some garbage stays after spark task is finished. I attached with strace to running job: [pid 30212] unlink(/tmp/spark-local-20141209091330-48b5/12/temp_8a73fcc2-4baa-499a-8add-0161f918de8a) = 0 [pid 30212] unlink(/tmp/spark-local-20141209091330-48b5/31/temp_47efd04b-d427-4139-8f48-3d5d421e9be4) = 0 [pid 30212] unlink(/tmp/spark-local-20141209091330-48b5/15/temp_619a46dc-40de-43f1-a844-4db146a607c6) = 0 [pid 30212] unlink(/tmp/spark-local-20141209091330-48b5/05/temp_d97d90a7-8bc1-4742-ba9b-41d74ea73c36 unfinished ... [pid 30212] ... unlink resumed ) = 0 [pid 30212] unlink(/tmp/spark-local-20141209091330-48b5/36/temp_a2deb806-714a-457a-90c8-5d9f3247a5d7) = 0 [pid 30212] unlink(/tmp/spark-local-20141209091330-48b5/04/temp_afd558f1-2fd0-48d7-bc65-07b5f4455b22) = 0 [pid 30212] unlink(/tmp/spark-local-20141209091330-48b5/32/temp_a7add910-8dc3-482c-baf5-09d5a187c62a unfinished ... [pid 30212] ... unlink resumed ) = 0 [pid 30212] unlink(/tmp/spark-local-20141209091330-48b5/21/temp_485612f0-527f-47b0-bb8b-6016f3b9ec19) = 0 [pid 30212] unlink(/tmp/spark-local-20141209091330-48b5/12/temp_bb2b4e06-a9dd-408e-8395-f6c5f4e2d52f) = 0 [pid 30212] unlink(/tmp/spark-local-20141209091330-48b5/1e/temp_825293c6-9d3b-4451-9cb8-91e2abe5a19d unfinished ... [pid 30212] ... unlink resumed ) = 0 [pid 30212] unlink(/tmp/spark-local-20141209091330-48b5/15/temp_43fbb94c-9163-4aa7-ab83-e7693b9f21fc) = 0 [pid 30212] unlink(/tmp/spark-local-20141209091330-48b5/3d/temp_37f3629c-1b09-4907-b599-61b7df94b898 unfinished ... [pid 30212] ... 
unlink resumed ) = 0 [pid 30212] unlink(/tmp/spark-local-20141209091330-48b5/35/temp_d18f49f6-1fb1-4c01-a694-0ee0a72294c0) = 0 And after job is finished, some files are still there: /tmp/spark-local-20141209091330-48b5/ /tmp/spark-local-20141209091330-48b5/11 /tmp/spark-local-20141209091330-48b5/11/shuffle_0_1_4 /tmp/spark-local-20141209091330-48b5/32 /tmp/spark-local-20141209091330-48b5/04 /tmp/spark-local-20141209091330-48b5/05 /tmp/spark-local-20141209091330-48b5/0f /tmp/spark-local-20141209091330-48b5/0f/shuffle_0_1_2 /tmp/spark-local-20141209091330-48b5/3d /tmp/spark-local-20141209091330-48b5/0e /tmp/spark-local-20141209091330-48b5/0e/shuffle_0_1_1 /tmp/spark-local-20141209091330-48b5/15 /tmp/spark-local-20141209091330-48b5/0d /tmp/spark-local-20141209091330-48b5/0d/shuffle_0_1_0 /tmp/spark-local-20141209091330-48b5/36 /tmp/spark-local-20141209091330-48b5/31 /tmp/spark-local-20141209091330-48b5/12 /tmp/spark-local-20141209091330-48b5/21 /tmp/spark-local-20141209091330-48b5/10 /tmp/spark-local-20141209091330-48b5/10/shuffle_0_1_3 /tmp/spark-local-20141209091330-48b5/1e /tmp/spark-local-20141209091330-48b5/35 If I look into my mesos slaves, there are mostly shuffle files, overall picture for single node: root@web338:~# find /tmp/spark-local-20141* -type f | fgrep shuffle | wc -l 781 root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle | wc -l 10 root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle /tmp/spark-local-20141119144512-67c4/2d/temp_9056f380-3edb-48d6-a7df-d4896f1e1cc3 /tmp/spark-local-20141119144512-67c4/3d/temp_e005659b-eddf-4a34-947f-4f63fcddf111 /tmp/spark-local-20141119144512-67c4/16/temp_71eba702-36b4-4e1a-aebc-20d2080f1705 /tmp/spark-local-20141119144512-67c4/0d/temp_8037b9db-2d8a-4786-a554-a8cad922bf5e /tmp/spark-local-20141119144512-67c4/24/temp_f0e4cc43-6cc9-42a7-882d-f8a031fa4dc3 /tmp/spark-local-20141119144512-67c4/29/temp_a8bbe2cb-f590-4b71-8ef8-9c0324beddc7 /tmp/spark-local-20141119144512-67c4/3a/temp_9fc08519-f23a-40ac-a3fd-e58df6871460
[jira] [Comment Edited] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272291#comment-14272291 ] Zach Fry edited comment on SPARK-4879 at 1/13/15 7:51 PM: -- Hey Josh, I was able to reproduce the missing file using the speculation settings in my previous comment: {code} scala 15/01/09 18:33:28 WARN scheduler.TaskSetManager: Lost task 42.1 in stage 0.0 (TID 113, redacted-03): java.io.IOException: Failed to save output of task: attempt_201501091833__m_42_113 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160) org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132) org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) 15/01/09 18:33:47 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, redacted-03): java.io.IOException: The temporary job-output directory hdfs://redacted-01:8020/test2/_temporary doesn't exist! org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250) org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:240) org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116) org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:89) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:980) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} Notice here that there are only 99 part files and part-00042 is missing (as seen in the stacktrace above) {code} $ hadoop fs -ls /test2 | grep part | wc -l 99 palantir@pd-support-01 (/home/palantir/homes/zfry) (master) $ hadoop fs -ls /test2 | grep part-0004 -rw-r--r-- 3 palantir supergroup 8 2015-01-09 18:33 /test2/part-00040 -rw-r--r-- 3 palantir supergroup 8 2015-01-09 18:33 /test2/part-00041 -rw-r--r-- 3 palantir supergroup 8 2015-01-09 18:33 /test2/part-00043 -rw-r--r-- 3 palantir supergroup 8 2015-01-09 18:33 /test2/part-00044 -rw-r--r-- 3 palantir supergroup 8 2015-01-09 18:33 /test2/part-00045 -rw-r--r-- 3 palantir supergroup 8 2015-01-09 18:33 /test2/part-00046 -rw-r--r-- 3 palantir supergroup 8 2015-01-09 18:33 /test2/part-00047 -rw-r--r-- 3 palantir supergroup 8 2015-01-09 18:33 /test2/part-00048 -rw-r--r-- 3 palantir supergroup 8 2015-01-09 18:33 /test2/part-00049 {code} was (Author: zfry): Hey Josh, I was able to reproduce the missing file using the speculation settings in my previous comment: {code} scala 
15/01/09 18:33:28 WARN scheduler.TaskSetManager: Lost task 42.1 in stage 0.0 (TID 113, redacted-03): java.io.IOException: Failed to save output of task: attempt_201501091833__m_42_113 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160) org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132) org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
[jira] [Commented] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode
[ https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276094#comment-14276094 ] Timothy Chen commented on SPARK-5095: - [~joshdevins][~maasg] I have a PR out now, I wonder if you guys can try it? https://github.com/apache/spark/pull/4027 Support launching multiple mesos executors in coarse grained mesos mode --- Key: SPARK-5095 URL: https://issues.apache.org/jira/browse/SPARK-5095 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse-grained Mesos mode, it's expected that we only launch one Mesos executor, which launches one JVM process to run multiple Spark executors. However, this becomes a problem when the JVM process launched is larger than an ideal size (30 GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use and these resources are still under the configured limit. This is also applicable when users want to specify the number of executors to be launched on each node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
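The resource-splitting arithmetic being proposed is roughly the following sketch, with made-up names and no relation to the code in the PR: carve as many fixed-size executors out of an offer as the remaining per-framework limits allow.
{code}
// How many fixed-size executors fit in a single Mesos offer, without
// exceeding what the framework is still allowed to launch.
def executorsForOffer(offerCpus: Double, offerMemMb: Int,
                      executorCpus: Double, executorMemMb: Int,
                      executorsStillNeeded: Int): Int = {
  val byCpu = (offerCpus / executorCpus).toInt
  val byMem = offerMemMb / executorMemMb
  math.max(0, math.min(executorsStillNeeded, math.min(byCpu, byMem)))
}
{code}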
[jira] [Commented] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode
[ https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276093#comment-14276093 ] Apache Spark commented on SPARK-5095: - User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/4027 Support launching multiple mesos executors in coarse grained mesos mode --- Key: SPARK-5095 URL: https://issues.apache.org/jira/browse/SPARK-5095 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse-grained Mesos mode, it's expected that we only launch one Mesos executor, which launches one JVM process to run multiple Spark executors. However, this becomes a problem when the JVM process launched is larger than an ideal size (30 GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use and these resources are still under the configured limit. This is also applicable when users want to specify the number of executors to be launched on each node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5228) Hide tables for Active Jobs/Completed Jobs/Failed Jobs when they are empty
[ https://issues.apache.org/jira/browse/SPARK-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276109#comment-14276109 ] Apache Spark commented on SPARK-5228: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/4028 Hide tables for Active Jobs/Completed Jobs/Failed Jobs when they are empty Key: SPARK-5228 URL: https://issues.apache.org/jira/browse/SPARK-5228 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.3.0 Reporter: Kousuke Saruta In the current WebUI, the tables for Active Stages, Completed Stages, Skipped Stages and Failed Stages are hidden when they are empty, while the tables for Active Jobs, Completed Jobs and Failed Jobs are shown even when they are empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
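The change amounts to rendering each jobs table only when it has rows; a rough illustration in Scala XML (hypothetical helper, not the actual AllJobsPage code):
{code}
// Emit the table markup only when there is at least one job to show.
def jobsTable(header: String, rows: Seq[scala.xml.Node]): scala.xml.NodeSeq =
  if (rows.isEmpty) scala.xml.NodeSeq.Empty
  else <div><h4>{header}</h4><table class="table">{rows}</table></div>
{code}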
[jira] [Updated] (SPARK-5223) Use pickle instead of MapConvert and ListConvert in MLlib Python API
[ https://issues.apache.org/jira/browse/SPARK-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-5223: -- Description: It will introduce problems if an object in a dict/list/tuple is not supported by py4j, such as a Vector. Also, pickle may have better performance for larger objects (fewer RPCs). In cases where an object in a dict/list cannot be pickled (such as a JavaObject), we should still use MapConvert/ListConvert. discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Python-to-Java-object-conversion-of-numpy-array-td10065.html was: It will introduce problems if an object in a dict/list/tuple is not supported by py4j, such as a Vector. Also, pickle may have better performance for larger objects (fewer RPCs). In cases where an object in a dict/list cannot be pickled (such as a JavaObject), we should still use MapConvert/ListConvert. Use pickle instead of MapConvert and ListConvert in MLlib Python API Key: SPARK-5223 URL: https://issues.apache.org/jira/browse/SPARK-5223 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Davies Liu Priority: Critical It will introduce problems if an object in a dict/list/tuple is not supported by py4j, such as a Vector. Also, pickle may have better performance for larger objects (fewer RPCs). In cases where an object in a dict/list cannot be pickled (such as a JavaObject), we should still use MapConvert/ListConvert. discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Python-to-Java-object-conversion-of-numpy-array-td10065.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275960#comment-14275960 ] Mohit Jaggi commented on SPARK-5097: minor comment: mutate existing can do df(x) = df(x) Adding data frame APIs to SchemaRDD --- Key: SPARK-5097 URL: https://issues.apache.org/jira/browse/SPARK-5097 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf SchemaRDD, through its DSL, already provides common data frame functionality. However, the DSL was originally created for constructing test cases, without much consideration for end-user usability or API stability. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
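For context on the column-expression style the comment above refers to, here is a short hedged sketch of how such usage reads in the DataFrame API that this work eventually shipped as in Spark 1.3. It assumes a live SQLContext named sqlContext and a registered people table with name and age columns, all of which are made up for the example.

    // Expressions on columns instead of raw SQL strings; df("age") yields a Column.
    val df = sqlContext.table("people")
    val adults = df.filter(df("age") >= 18)
                   .select(df("name"), df("age") + 1)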
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275981#comment-14275981 ] Muhammad-Ali A'rabi commented on SPARK-5226: Although I can't assign this task to myself, I am interested in doing it. Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: Muhammad-Ali A'rabi Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
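For readers unfamiliar with the algorithm being proposed, below is a compact single-machine DBSCAN sketch in Scala. It is not a distributed MLlib implementation, which is what the issue actually asks for; eps (the neighborhood radius) and minPts (the minimum neighborhood size for a core point) are the usual DBSCAN parameters, and all names and data are made up for the example.

    import scala.collection.mutable

    object DbscanSketch {
      // Returns a cluster id per point; -1 marks noise.
      def dbscan(points: Array[Array[Double]], eps: Double, minPts: Int): Array[Int] = {
        val Noise = -1
        val Unvisited = -2
        val labels = Array.fill(points.length)(Unvisited)

        def dist(a: Array[Double], b: Array[Double]): Double =
          math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

        def neighbors(i: Int): Seq[Int] =
          points.indices.filter(j => dist(points(i), points(j)) <= eps)

        var cluster = 0
        for (i <- points.indices if labels(i) == Unvisited) {
          val seeds = neighbors(i)
          if (seeds.size < minPts) {
            labels(i) = Noise                   // may still become a border point later
          } else {
            labels(i) = cluster
            val queue = mutable.Queue(seeds: _*)
            while (queue.nonEmpty) {
              val j = queue.dequeue()
              if (labels(j) == Noise) labels(j) = cluster       // adopt border point
              if (labels(j) == Unvisited) {
                labels(j) = cluster
                val js = neighbors(j)
                if (js.size >= minPts) js.foreach(k => queue.enqueue(k))  // j is a core point too
              }
            }
            cluster += 1
          }
        }
        labels
      }

      def main(args: Array[String]): Unit = {
        val pts = Array(Array(0.0, 0.0), Array(0.1, 0.0), Array(0.0, 0.1), Array(5.0, 5.0))
        // Expected: the three close points share cluster 0, the far point is noise (-1).
        println(dbscan(pts, eps = 0.5, minPts = 3).mkString(", "))
      }
    }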
[jira] [Updated] (SPARK-5179) Spark UI history job duration is wrong
[ https://issues.apache.org/jira/browse/SPARK-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Toupin updated SPARK-5179: -- Target Version/s: 1.2.1 Spark UI history job duration is wrong -- Key: SPARK-5179 URL: https://issues.apache.org/jira/browse/SPARK-5179 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Olivier Toupin Priority: Minor In the Web UI, the job duration times are wrong when reviewing a job in the history. The stage duration times are OK. Jobs are shown with a duration of only milliseconds, which is wrong. However, it's only a history issue; while the job is running, the durations are shown correctly. More details in that discussion on the mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-UI-history-job-duration-is-wrong-tc10010.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5223) Use pickle instead of MapConvert and ListConvert in MLlib Python API
[ https://issues.apache.org/jira/browse/SPARK-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5223. -- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Issue resolved by pull request 4023 [https://github.com/apache/spark/pull/4023] Use pickle instead of MapConvert and ListConvert in MLlib Python API Key: SPARK-5223 URL: https://issues.apache.org/jira/browse/SPARK-5223 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Davies Liu Priority: Critical Fix For: 1.3.0, 1.2.1 It will introduce problems if an object in a dict/list/tuple is not supported by py4j, such as Vector. Also, pickle may have better performance for larger objects (fewer RPC calls). In cases where the object in a dict/list cannot be pickled (such as JavaObject), we should still use MapConvert/ListConvert. discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Python-to-Java-object-conversion-of-numpy-array-td10065.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5223) Use pickle instead of MapConvert and ListConvert in MLlib Python API
[ https://issues.apache.org/jira/browse/SPARK-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5223: - Assignee: Davies Liu Use pickle instead of MapConvert and ListConvert in MLlib Python API Key: SPARK-5223 URL: https://issues.apache.org/jira/browse/SPARK-5223 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Davies Liu Assignee: Davies Liu Priority: Critical Fix For: 1.3.0, 1.2.1 It will introduce problems if an object in a dict/list/tuple is not supported by py4j, such as Vector. Also, pickle may have better performance for larger objects (fewer RPC calls). In cases where the object in a dict/list cannot be pickled (such as JavaObject), we should still use MapConvert/ListConvert. discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Python-to-Java-object-conversion-of-numpy-array-td10065.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4912) Persistent data source tables
[ https://issues.apache.org/jira/browse/SPARK-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4912. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3960 [https://github.com/apache/spark/pull/3960] Persistent data source tables - Key: SPARK-4912 URL: https://issues.apache.org/jira/browse/SPARK-4912 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.3.0 It would be good if tables created through the new data sources API could be persisted to the Hive metastore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
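As a hedged sketch of what this resolution enables, a data source table defined through CREATE TABLE ... USING can now be recorded in the Hive metastore rather than existing only as a session-scoped temporary table. The table name, path, and the HiveContext instance named hiveContext are assumptions made up for the example, and the exact syntax shown is as it landed for Spark 1.3.

    // Previously only CREATE TEMPORARY TABLE ... USING ... was supported,
    // so the definition vanished with the session.
    hiveContext.sql(
      """CREATE TABLE events
        |USING org.apache.spark.sql.parquet
        |OPTIONS (path '/data/events.parquet')""".stripMargin)

    // Because the definition is persisted in the metastore, a later session can simply:
    hiveContext.table("events").count()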